Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Talk given by Rebecca Gorman and Stuart Armstrong at the CHAI 2022 Asilomar Conference. We present an example of AI wireheading (an AI taking over its own reward channel), and show how value extrapolation can be used to combat it.

https://www.youtube.com/watch?v=REUanSy0SgU

1 comment

So you start with an AI with an external reward structure that the AI is trained to follow. And the AI is incentivized to take over its reward structure.

Then you talk about "solving" this by having multiple possible rewards and the AI asks humans which one is best. Well, OK, but if this solves it, the solution has to do with how the AI (as opposed to the reward structure) now works, which you don't explain at all. That is, if the AI still works the same way, trained by the external reward structure, where does the decision to ask the humans come from?

But more than this, the background assumptions seem very concerning. Some particular issues IMO:

  1. Hedonic (as opposed to preference-based) utility is a bad idea. Once the AI is powerful, this will reliably lead to it killing all humans and replacing them with hedonium, or with hedonium that has rules-lawyered characteristics so that it barely counts as "human", or whatever.
  2. The AI needs to not care about the future directly, only via human preferences which care about the future. Otherwise, it will still create utility monsters (just now preference-based instead of hedonium-based).
  3. At least because of (2), but probably for many other reasons as well (including wireheading, the original problem mentioned in the video), you probably can't make an aligned AI by training it on external rewards.

One possible alternative to external reward training would be to:

  1. First train a world-modeling non-agent AI (using e.g. prediction of future inputs) to understand enough about the world that it also understands a plan-scoring function that you will later want it to follow.
  2. Then agentify it by adding a decision-making procedure that points to the concept of the scoring function in the world model it learned earlier, and that uses the world model to generate plans and select the ones that score highest under that function.

Note that the score function should not care about the future directly, e.g. the score function could be something like: if I follow this plan, how well does this entire process and the resulting outcome align with averaged informed/extrapolated human preferences as of the time of decision to take the plan (not any future time)?
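
To make the shape of this concrete, here is a minimal toy sketch of the two steps plus the decision-time scoring rule. Everything in it is an assumption for illustration: `WorldModel`, `score`, `choose_plan`, the plan names, and the numeric preference values are all hypothetical, and a lookup table stands in for what would really be a learned predictor and a learned concept of the scoring function.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WorldModel:
    """Hypothetical stand-in for the non-agent predictor trained in step 1."""
    predicted_outcome: Dict[str, str]  # plan -> predicted outcome (toy lookup)

    def propose_plans(self) -> List[str]:
        # A real system would generate candidate plans; here we just list them.
        return list(self.predicted_outcome)

    def predict(self, plan: str) -> str:
        return self.predicted_outcome[plan]


def score(plan: str, model: WorldModel, prefs_at_decision: Dict[str, float]) -> float:
    """Score a plan by how well its predicted outcome matches human
    preferences as they stand at decision time, not at any later time."""
    outcome = model.predict(plan)
    # Outcomes the preference snapshot says nothing about get a neutral 0.0
    # (an arbitrary choice made for this toy example).
    return prefs_at_decision.get(outcome, 0.0)


def choose_plan(model: WorldModel, prefs_at_decision: Dict[str, float]) -> str:
    """Step 2: the added decision procedure. Generate plans with the world
    model and pick the one that scores highest."""
    return max(model.propose_plans(), key=lambda p: score(p, model, prefs_at_decision))


model = WorldModel(predicted_outcome={
    "cure_disease": "healthier_humans",
    "rewrite_human_preferences": "humans_modified_to_approve",
})

# Snapshot of (toy) extrapolated preferences taken when the decision is made.
# Present-day humans disapprove of being modified, even though the modified
# humans would approve afterwards.
prefs_at_decision = {
    "healthier_humans": 0.9,
    "humans_modified_to_approve": -1.0,
}

best = choose_plan(model, prefs_at_decision)
```

In this toy setup the preference-rewriting plan loses because it is judged by the snapshot taken at decision time; if the score consulted the preferences humans would hold after the plan ran, that plan would look attractive instead.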