Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Note: I'm working on a research agenda, hence the large number of small individual posts, so that I have things to link to in the main documents.

For X, consider three different partial preferences:

  1. If X were poor, they would prioritise consumption over saving.
  2. X: If I were poor, I would prioritise saving over consumption.
  3. X: If I were poor, I'd get my personal accountant to advise me on the best saving/consumption plan for poor people.

1 is what X's judgement would be in a different, distant situation. 2 is X's current judgement about what their judgement would be in that situation. 3 is similar, but is based on a factually wrong model of what that distant situation is.

So what are we to make of these in terms of X's preferences? 3 can be discounted as factually incorrect. 2 is a correct interpretation of X's current (meta-)preferences over that distant situation, but we know that these will change if they actually reach that situation. It might be tempting to see 1 as the genuine preference, but that's tricky. It's a preference that X doesn't have, and may never have. Even if X were certain to end up poor, their preference may depend on the path that they took to get there - medical bankruptcy, alcoholism, or one dubious investment, could result in different preferences. And that's without considering the different ways the AI could put X in that situation - we don't want the AI to influence its own learning process by indirectly determining the preferences it will maximise.

So, essentially, using 1 is a problem because the preference is many steps removed and can be influenced by the AI (though that last issue may have solutions). Using 2 is a problem because the current (meta-)preferences are projected into a situation where they would be wrong. This can end up with someone railing against the preferences of their past self, even if those preferences now constrain them. This is, in essence, a partial version of the Gödel-like problem mentioned here, where the human rebels against the preferences the AI has determined them to have.

So, what is the best way of figuring out X's "true" preferences? This is one of the things that we expect the system to be robust to. Whether type 1 or type 2 preferences are prioritised, the synthesis should still reach an acceptable outcome. And the rebellion against the synthesised values is a general problem with these methods, and should be solved one way or another - possibly by the human agreeing to freeze their preferences under the partial guidance of the AI.

Avoid ambiguous distant situations

If the synthesis of X's preferences in situation S is ambiguous, that might be an argument to avoid situation S entirely. For example, suppose S involves very lossy uploads of current humans, so that the uploads seem pretty similar to the original human but not identical. Rather than sorting out whether or not human preferences apply here, it might be best to reason "there is a chance that human flourishing has been lost entirely here, so we shouldn't pay too much attention to what human preferences actually are in S, and just avoid S entirely".

Note that this means avoiding morally ambiguous distant situations, not distant situations per se. Worlds with voluntary human slaves may be worth avoiding, while worlds with spaceships, uploads, but same-as-now morality, are basically just "today's world - with lasers!" and are not morally ambiguous.
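As a toy illustration of this reasoning (a sketch under invented assumptions, not a method from the research agenda): suppose each candidate way of synthesising X's preferences, say one prioritising type 1 and one prioritising type 2, assigns a score to a situation. Wide disagreement between those scores can then stand in for "ambiguous", and ambiguous situations are dropped rather than resolved. The names `Situation`, `is_ambiguous` and `admissible`, the numeric scores, and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Situation:
    name: str
    # Hypothetical scores in [0, 1], one per candidate synthesis of
    # X's preferences (e.g. one prioritising type 1, one type 2).
    scores: List[float]


def is_ambiguous(s: Situation, threshold: float = 0.5) -> bool:
    """Assumption: a situation is ambiguous if the syntheses disagree widely."""
    return max(s.scores) - min(s.scores) > threshold


def admissible(situations: List[Situation], threshold: float = 0.5) -> List[Situation]:
    """Drop ambiguous situations instead of deciding which synthesis is right."""
    return [s for s in situations if not is_ambiguous(s, threshold)]


# Made-up numbers, purely for illustration.
situations = [
    Situation("today's world, with lasers", [0.8, 0.7]),
    Situation("very lossy uploads", [0.9, 0.1]),
]
kept = [s.name for s in admissible(situations)]
print(kept)
```

Note that the filter looks only at disagreement between syntheses, not at how distant a situation is, which matches the point that distance alone is not the reason to avoid a world.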
