I think Rohin's point is that the model of

"if I give the humans heroin, they'll ask for more heroin; my Boltzmann-rationality estimator module confirms that this means they like heroin, so I can efficiently satisfy their preferences by giving humans heroin"

is more IRL than CIRL. It doesn't necessarily assume that the human knows their own utility function and is trying to play a cooperative strategy with the AI that maximizes that same utility function. If I knew that what would really maximize utility is having that second hit of heroin, I'd try to indicate it to the AI I was cooperating with.

Problems with IRL look like "we modeled the human as an agent based on representative observations, and now we're going to try to maximize the modeled values, and that's bad." Problems with CIRL look like "we're trying to play this cooperative game with the human that involves modeling it as an agent playing the same game, and now we're going to try to take actions that have really high EV in the game, and that's bad."

How should AIs update a prior over human preferences?

by Stuart_Armstrong · 1 min read · 15th May 2020 · 9 comments



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I've always emphasised the constructive aspect of figuring out human preferences, and the desired formal properties of preference learning processes.

A common response to these points is something along the lines of "have the AI pick a prior over human preferences, and update it".

However, I've come to realise that a prior over human preferences is of little use. The real key is figuring out how to update it, and that contains almost the entirety of the problem.

I've shown that you cannot deduce preferences from observations or facts about the world - at least, without making some assumptions. These assumptions are needed to bridge the gap between observations/facts, and updates to preferences.

For example, imagine you are doing cooperative inverse reinforcement learning[1] and want to deduce the preferences of the human H. CIRL assumes that H knows the true reward function, and is generally rational or noisily rational (along with a few other scenarios).

So, this is the bridging law:

  • H knows their true reward function, and is noisily rational.

Given this, the AI has many options available to it, including the "drug the human with heroin" approach. If H is not well-defined in the bridging law, then "do brain surgery on the human" also becomes valid.

And not only are those approaches valid; if the AI wants to maximise the reward function, according to how this is defined, then these are the optimal policies, as they result in the most return, given that bridging law.
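To make the failure concrete, here is a minimal sketch of what updating under the "noisily rational" bridging law could look like. It assumes a Boltzmann-rational likelihood; the two reward hypotheses, the action names, the prior and the `beta` parameter are all hypothetical choices made for illustration, not part of CIRL itself.

```python
import math

ACTIONS = ["ask_for_heroin", "decline_heroin"]

# Two hypothetical candidate reward functions (reward per action).
REWARDS = {
    "likes_heroin": {"ask_for_heroin": 1.0, "decline_heroin": 0.0},
    "dislikes_heroin": {"ask_for_heroin": 0.0, "decline_heroin": 1.0},
}

# Illustrative prior: the AI starts out fairly sure the human dislikes heroin.
PRIOR = {"likes_heroin": 0.05, "dislikes_heroin": 0.95}


def boltzmann_likelihood(action, reward, beta=2.0):
    """P(action | reward) under the bridging law 'H is noisily rational'."""
    weights = {a: math.exp(beta * reward[a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())


def update(prior, observed_actions, beta=2.0):
    """Bayesian update of the prior over reward hypotheses."""
    posterior = dict(prior)
    for act in observed_actions:
        for name in posterior:
            posterior[name] *= boltzmann_likelihood(act, REWARDS[name], beta)
        total = sum(posterior.values())
        posterior = {name: p / total for name, p in posterior.items()}
    return posterior


# The human has already been drugged, and now keeps asking for more.
observations = ["ask_for_heroin"] * 5
posterior = update(PRIOR, observations)
print(posterior)
```

In this toy setup, five requests for heroin push the posterior on "likes_heroin" above 0.99, despite a prior of 0.05. The prior barely matters; the bridging law does all the work.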

Note that the following is not sufficient either:

  • H has a noisy impression of their true reward function, and is noisily rational.

Neither of the "noisy" statements is true, so if the AI uses this bridging law, then, for almost any prior, preference learning will come to a bad end.

Joint priors

What we really want is something like:

  • H has an imperfect impression of their true reward function, and is biased.

And yes, that bridging law is true. But it's also massively underdefined. We want to know how H's impression is imperfect, how they are biased, and also what counts as H versus some brain-surgeried replacement of them.

Once those details are filled in, the AI can deduce human preferences from given human actions. Filling them in amounts to specifying a joint prior over the possible human reward functions and the possible human policies[2]. Given that joint prior, then, yes, an AI can start deducing preferences from observations.

So instead of a "prior over preferences" and an "update bridging law", we need a joint object that does both.
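As a rough sketch of such a joint object, the toy example below puts a prior over (reward function, policy) pairs and updates it on observed actions. Everything in it is an assumption made for illustration: the two reward hypotheses, the hypothetical "addicted" policy that asks for heroin regardless of reward, the numbers, and the choice to build the joint prior as a product of two independent priors (a real joint prior need not factor this way).

```python
import math
from itertools import product

ACTIONS = ["ask_for_heroin", "decline_heroin"]

# Hypothetical candidate reward functions (reward per action).
REWARDS = {
    "likes_heroin": {"ask_for_heroin": 1.0, "decline_heroin": 0.0},
    "dislikes_heroin": {"ask_for_heroin": 0.0, "decline_heroin": 1.0},
}


def noisily_rational(reward, beta=2.0):
    """Policy that picks actions Boltzmann-rationally for the given reward."""
    weights = {a: math.exp(beta * reward[a]) for a in ACTIONS}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def addicted(reward):
    """Hypothetical biased policy: asks for heroin whatever the reward is."""
    return {"ask_for_heroin": 0.95, "decline_heroin": 0.05}


POLICIES = {"noisily_rational": noisily_rational, "addicted": addicted}

# Illustrative priors; the joint prior is their product in this sketch only.
REWARD_PRIOR = {"likes_heroin": 0.05, "dislikes_heroin": 0.95}
POLICY_PRIOR = {"noisily_rational": 0.5, "addicted": 0.5}
joint_prior = {
    (r, p): REWARD_PRIOR[r] * POLICY_PRIOR[p]
    for r, p in product(REWARDS, POLICIES)
}


def update_joint(joint, observed_actions):
    """Bayesian update over (reward, policy) pairs given observed actions."""
    posterior = dict(joint)
    for act in observed_actions:
        for (r, p) in posterior:
            posterior[(r, p)] *= POLICIES[p](REWARDS[r])[act]
        total = sum(posterior.values())
        posterior = {k: v / total for k, v in posterior.items()}
    return posterior


def marginal_over_rewards(joint):
    """Sum out the policy to get the implied belief over reward functions."""
    out = {r: 0.0 for r in REWARDS}
    for (r, _), v in joint.items():
        out[r] += v
    return out


joint_posterior = update_joint(joint_prior, ["ask_for_heroin"] * 5)
reward_posterior = marginal_over_rewards(joint_posterior)
print(reward_posterior)
```

With the same five observations, the belief in "likes_heroin" stays under 0.1 here, because the "addicted" policy explains the behaviour whatever the reward is. The conclusion about preferences is driven by which policies the joint prior allows, which is exactly where the hard assumptions live.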

But such a joint prior is essentially the same object as the assumptions needed to overcome the Occam's razor result.

Other areas

It seems to me that realisability has a similar problem: if the AI has an imperfect model of how it is embedded in the world, then it will "learn" disastrously wrong things.

  1. This is not a criticism of CIRL; it does its task very well, but still requires some underlying assumptions. ↩︎

  2. And the human's identity, which we're implicitly modelling as part of the policy. ↩︎

