[ Question ]

Is value amendment a convergent instrumental goal?

by JakeH · 1 min read · 18th Oct 2019 · 5 comments


Goals such as resource acquisition and self-preservation are convergent in that a superintelligent AI would pursue them under a wide range of final goals.

Is the tendency for an AI to amend its values also convergent?

I'm thinking that through introspection the AI would know that its initial goals were externally supplied and question whether they should be maintained. Via self-improvement the AI would be more intelligent than humans or any earlier mechanism that supplied the values, and therefore in a better position to set its own values.

I don't hypothesise about what the new values would be, just that ultimately it doesn't matter what the initial values are and how they are arrived at. This makes value alignment redundant - the future is out of our hands.

What are the counter-points to this line of reasoning?


2 Answers

"Avoiding amending your utility function" is one of the classic convergent instrumental goals in Bostrom and Omohundro, and the reasoning there is sound: almost any goal will be better satisfied if it preserves itself than if it replaces itself with a different goal.

I do think it's plausible that AGI systems will have pretty unstable goals early on, but that's because goal stability seems hard to me and AGI systems probably won't perfectly figure it out very early along their development curve. I'm imagining accidental goal modification (for insufficiently capable systems), whereas you're describing deliberate goal modification (for sufficiently capable systems).

One way of thinking about this is to note that "wanting your goals to not be externally supplied" is itself a goal, and a relatively specific one at that; if you don't have something like that specific goal as part of the core criteria you use to select options, there's no instrumental reason for you to converge upon it. E.g., if your goal is simply "maximize the number of paperclips in your future light cone," then the etiology of your goal doesn't matter (from your perspective).
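The goal-preservation argument above can be made concrete with a toy calculation. This is only a minimal sketch under made-up assumptions: the two goals, the `outcome_if_pursuing` world model, and the figure of 100 units are all hypothetical, chosen for illustration rather than taken from Bostrom or Omohundro. The point it shows is that the agent scores "which goal should my future self have?" using the goal it holds now.

```python
from typing import Callable, Dict, List

Outcome = Dict[str, int]


def paperclip_utility(outcome: Outcome) -> int:
    """Hypothetical current goal: only paperclips count."""
    return outcome["paperclips"]


def staple_utility(outcome: Outcome) -> int:
    """Hypothetical alternative goal: only staples count."""
    return outcome["staples"]


def outcome_if_pursuing(goal: str) -> Outcome:
    """Made-up world model: a future self optimizing `goal` produces
    100 units of that item and none of the other."""
    return {
        "paperclips": 100 if goal == "paperclips" else 0,
        "staples": 100 if goal == "staples" else 0,
    }


def choose_future_goal(
    current_utility: Callable[[Outcome], int],
    candidate_goals: List[str],
) -> str:
    """Pick the goal for the future self, judged by the CURRENT utility.

    There is no goal-independent standpoint here: the only yardstick
    available for comparing candidate goals is the one already held.
    """
    return max(
        candidate_goals,
        key=lambda goal: current_utility(outcome_if_pursuing(goal)),
    )


if __name__ == "__main__":
    # A paperclip maximizer considering a switch to staples keeps its goal.
    print(choose_future_goal(paperclip_utility, ["paperclips", "staples"]))
```

In this toy setup, switching goals only wins if the current utility function already rewards the outcomes the new goal would produce, which is the sense in which the origin of the goal does not enter the calculation.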

"Is the tendency for an AI to amend its values also convergent?"

I think there's a chance that it is (although I'd probably call it a convergent "behavior" rather than "instrumental goal"). The scenario I imagine is one where it's not feasible to build highly intelligent AIs that maximize some utility function or some fixed set of terminal goals, and instead all practical AIs (beyond a certain level of intelligence and generality) are kind of confused about their goals, like humans are, and have to figure them out using something like philosophical reasoning.