Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

1 Introduction

This post started off as a response to a pair of comments, by Rohin Shah and by Charlie Steiner. But the insights are (I hope) sufficiently general that I made them into a full post.

EDIT: It seems I misunderstood Rohin's point. In any case, I'll keep this post up, because the things I wrote in response to what I thought Rohin was saying are still useful to have written out.

In my original post, I pointed out that since an agent's preferences are not simply facts about the universe, "updating" a prior requires bridging assumptions - something that tells the AI how to update its estimate of human preferences, depending on what the human does.

It needs to know that:

  1. If the human goes to a restaurant, orders Sichuan Hot Pot, eats it, comments on how much they enjoyed it, goes back to the restaurant again and orders Sichuan Hot Pot after recommending it to their friends... this is evidence that they like Sichuan Hot Pot.
  2. However, if that restaurant is run by the mob, and the human owes money to the prominent mobster that owns the restaurant, and that mobster occasionally cooks the Sichuan Hot Pot themselves... then this isn't really evidence of much. Same goes if the restaurant is owned by the human's boss, a prominent local politician with power over something the human wants, the human's probation officer, and so on.
  3. Similarly to 2., if the restaurant is owned by the human's child, or their lover, or a close friend - though for subtly different reasons, and to different extents.

Anyway, not to belabour the point, but the AI needs some bridging assumptions so that it can take context like that into account when figuring out human preferences.

Rohin Shah and Charlie Steiner pointed to some ways that CIRL (Cooperative Inverse Reinforcement Learning) might get round this problem.

Thanks for those responses! They're very useful. Still, I'll argue in this post that the problems remain (and that the only way of solving them is to solve the hard problem of what human preferences and meta-preferences are: reproducing human theory of mind in AI format, without being able to use machine learning for the key assumptions).

2 Rigging learning or reward-maximisation?

Rohin Shah's comment starts out with:

The key point is that in a CIRL game, by construction there is a true (unknown) reward function, and thus an optimal policy must be viewable as being Bayesian about the reward function, and in particular its actions must be consistent with conservation of expected evidence about the reward function; anything which "rigs" the "learning process" does not satisfy this property and so can't be optimal.

I've talked about "rigging" a learning process in the past, and have a paper coming out about it (summary post imminent).

To simplify, a learning process can be rigged if it violates conservation of expected evidence, allowing the AI to "push" the learning in one direction or another. In that old example, there are two reward functions, $R_c$ ("cake") and $R_d$ ("death"), awaiting human confirmation as to which is which. The AI finds it easier to produce "death" than "cake", so it will try and trick/force/bribe the human to say "death" when the AI asks.

Now, this is definitely a poor learning process. But it's perfectly fine as reward-function-maximising behaviour. Let $I_c$ be the indicator variable for the human saying "cake", and $I_d$ the indicator variable for them saying "death"; then "rigging the learning process" is the same as "maximising the reward $R$" for:

$$R = I_c R_c + I_d R_d.$$
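To make the equivalence concrete, here is a minimal sketch in Python (my own illustration, not something from the original cake-or-death discussion; the numbers and the names like `achievable_reward` and `p_cake_given_action` are made-up assumptions) of an agent that simply maximises the compound $R$ and, as a result, "rigs" the human's answer:

```python
# A minimal sketch of how "rigging the learning process" and
# "maximising the compound reward R = I_c*R_c + I_d*R_d" pick out
# the same policy. All numbers are illustrative assumptions.

# Reward the AI can obtain *after* the human's answer fixes the reward
# function: cake is assumed hard to produce, death easy.
achievable_reward = {"cake": 1.0, "death": 10.0}

# Probability that the human answers "cake", depending on how the AI asks.
p_cake_given_action = {
    "ask_neutrally": 0.99,   # unpressured human says "cake"
    "pressure_human": 0.01,  # tricked/forced human says "death"
}

def expected_compound_reward(action: str) -> float:
    """E[R], where R = I_c*R_c + I_d*R_d and the indicators depend on the answer."""
    p_cake = p_cake_given_action[action]
    return p_cake * achievable_reward["cake"] + (1 - p_cake) * achievable_reward["death"]

for action in p_cake_given_action:
    print(action, expected_compound_reward(action))

# The "pressure" action has the higher expected value of the compound reward R,
# so a plain R-maximiser rigs the answer; no irrationality about R is required.
best = max(p_cake_given_action, key=expected_compound_reward)
print("optimal action:", best)  # -> pressure_human
```

Note that the two actions lead to very different expected answers (99% versus 1% "cake"), which is exactly the violation of conservation of expected evidence that marks a riggable learning process - and yet nothing above is anything other than straightforward expected-reward maximisation.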
Now, that particular $R$ might seem like a silly reward function to maximise. But what about some sort of medical consent requirement? Suppose the AI knows that a certain operation would improve the outcome for a human, but it can't operate on them without their consent. Then it would try and convince them to consent[1] to the operation. To first order, with $I_o$ the indicator variable for the human consenting to the operation, this reward function would look something like:

$$R' = I_o R_{\text{operate}} + (1 - I_o) R_{\neg\text{operate}}.$$
If $R$ or $R'$, or things similar to them, are legitimate human reward functions - and I don't see why they wouldn't be - then even if the AI is updating a prior, it can still be rigging a learning process in practice.

3 AI believing false facts is bad

Let's look again at:

by construction there is a true (unknown) reward function, and thus an optimal policy must be viewable as being Bayesian about the reward function

Yes, by construction there is a true (unknown) reward function. But, as shown just above, you can still get behaviour that looks identical to "rigging a learning process", even with purely Bayesian updating.

But there's an even more severe problem: the human doesn't know their own reward function, and is neither rational nor noisily rational.

So, let's consider a few plausible facts that the AI could eventually learn:

  1. If asked initially whether they want heroin, the human would normally say "no".
  2. If given heroin and then asked, the human would normally say "yes".
  3. In the first situation, there are ways of phrasing the question that would result in the human saying "yes".
  4. Similarly, in the second situation, there are ways of phrasing the question that would result in the human saying "no".
  5. It's easier to get the reversal in 3. than in 4.

Assume that these are true facts about the world (they are at least somewhat plausible); assume also that the AI could deduce them, after a while. I'm not assuming here that the AI would force heroin on the human. I'm just saying that these are facts the AI could infer after getting to know humans a bit.

So, given 1.-5., and the assumption that humans know their own reward function and wish to communicate it to the AI (a key CIRL assumption)... what would the AI conclude? What kind of strange reward function justifies the behaviours 1.-5.? It must be something highly peculiar, especially if we run through all the possible scenarios in which the human accepts/rejects heroin (or acts in ways that increase/decrease their chance of getting it).
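As a rough illustration (mine, not anything from CIRL itself; the hypothesis names and situation labels are made up), here is the kind of filtering the AI is pushed into once it takes facts 1. and 2. at face value under the "humans know and truthfully report their reward" assumption:

```python
# A sketch of the inference the AI is forced into if it assumes the human
# truthfully reports their reward. Observed facts, as (situation, answer) pairs:
observations = [
    ("asked_first", "no"),               # fact 1: asked up front, human says "no"
    ("given_heroin_then_asked", "yes"),  # fact 2: dosed first, human says "yes"
]

# Candidate reward hypotheses, each mapping a situation to the answer a
# reward-knowing, truthful human would give.
hypotheses = {
    "never wants heroin":  lambda s: "no",
    "always wants heroin": lambda s: "yes",
    "wants heroin iff already given it":
        lambda s: "yes" if s == "given_heroin_then_asked" else "no",
}

consistent = [
    name for name, predict in hypotheses.items()
    if all(predict(situation) == answer for situation, answer in observations)
]
print(consistent)
# Only the conditional ("compound") hypothesis survives; adding facts 3-5
# (phrasing effects) would force the surviving hypothesis to grow even
# more conditions: epicycles on epicycles.
```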

You might feel that $R$ and $R'$ are complicated compound reward functions; but they are absurdly simple compared with whatever the AI will need to construct to justify human behaviour, given the assumption that humans know their reward function. It will need to add epicycles upon epicycles.

It's bad enough if the process just goes randomly wrong, but it's likely to be worse than that. This kind of reasoning cannot distinguish a preference from a bias, and so will end up treating preferences as biases, or vice versa, or some perverse mix of the two. The optimal reward function is likely to be very, very compound, made up of all sorts of conditional reward functions, so the chances of there being a bad optimal policy - heroin or death - are very high.

Being Bayesian actually makes it worse. You could imagine a non-Bayesian approach where the AI doesn't draw any conclusion from something forced on humans without their consent. In that case, fact 2. is irrelevant. But because the process is actually Bayesian, in CIRL, as soon as the AI knows that 2. is true - even without physically witnessing it - it has to infer a reward function that explains/is compatible with 2.

4 Changing preferences or satisfying them

You might reasonably ask where the magic happens. The CIRL game that you choose would have to commit to some connection between rewards and behavior. It could be that in one episode the human wants heroin (but doesn't know it) and in another episode the human doesn't want heroin (this depends on the prior over rewards). However, it could never be the case that in a single episode (where the reward must be fixed) the human doesn't want heroin, and then later in the same episode the human does want heroin. Perhaps in the real world this can happen; that would make this policy suboptimal in the real world. (What it does then is unclear since it depends on how the policy generalizes out of distribution.)

Consider human hunger. Humans often want food (and/or drink, sex, companionship, and so on), but they don't want it all the time. When hungry they want food; when full, they don't. With $I_h$ the indicator variable for the human being hungry, this looks a lot like:

$$R'' = I_h R_{\text{eat}} + (1 - I_h) R_{\neg\text{eat}}.$$
Now, one might object that the "real" reward function is not a compound of two pieces like this; instead, it should be a single unambiguous:

$$R_{\text{satiation}} = \text{"eat when hungry, don't eat when full"}.$$
But this actually makes it worse for the argument above. We can admit that $R_{\text{satiation}}$ is a simple, unambiguous reward function; however, maximising it is the same as maximising the compound $R'' = I_h R_{\text{eat}} + (1 - I_h) R_{\neg\text{eat}}$. So there exist non-compound reward functions that are indistinguishable from combination reward functions. Thus there is no distinction between "compound" and "non-compound" rewards; we can't just exclude the first type. So saying a reward is "fixed" doesn't mean much.
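Here is a minimal check of that claim (my own sketch; the 0/1 reward values are arbitrary assumptions), confirming that the "single" satiation reward and the compound indicator form agree on every state-action pair:

```python
# Check that the "single" satiation reward and the compound form
# R'' = I_h*R_eat + (1 - I_h)*R_noteat are the same function.

def r_eat(action):       # reward function "eat"
    return 1.0 if action == "eat" else 0.0

def r_noteat(action):    # reward function "don't eat"
    return 1.0 if action == "dont_eat" else 0.0

def r_compound(hungry: bool, action: str) -> float:
    i_h = 1.0 if hungry else 0.0
    return i_h * r_eat(action) + (1 - i_h) * r_noteat(action)

def r_satiation(hungry: bool, action: str) -> float:
    # "single" reward: eat exactly when hungry
    return 1.0 if (action == "eat") == hungry else 0.0

assert all(
    r_compound(h, a) == r_satiation(h, a)
    for h in (True, False)
    for a in ("eat", "dont_eat")
)
# Identical on every state-action pair: no observation or optimal policy
# can tell the "compound" reward from the "non-compound" one.
```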

5 Humans learning new preferences

This still feels different from the heroin example Rohin Shah mentioned. We're presumably imagining an episode where the human initially doesn't ever want heroin and can articulate that if asked.

But what about moral or preference learning on the part of the human? Suppose a human feels sick when about to eat strawberries; they would claim (correctly) that they would not enjoy it. But the AI knows that, if they tried strawberries a few times, they would come to enjoy it, and indeed it would become their favourite food.

The AI tries to convince them of this, and eventually gets the human to give strawberries a try, and the human indeed comes to appreciate them.

This seems fine; but if we substitute "heroin" for "strawberries", there's immediately a problem. We want the AI not to force heroin on a human; but we're ok with it mildly arguing the human into trying strawberries; and we're not ok with it preventing the human from making any preference updates at all ("no, you may not learn anything about how your tastes evolve; indeed, you may not learn anything at all").

But forcing and arguing might almost be the same thing from a superpowered AI's perspective. The only way to avoid this is the hard way: to have some bridging update assumptions that include our own judgements as to what distinguishes the heroin/forcing situation from the strawberries/arguing one.

6 The AI doesn't trick any part of itself

Perhaps another way to put this: I agree that if you train an AI system to act such that it maximizes the expected reward under the posterior inferred by a fixed update rule looking at the AI system's actions and resulting states, the AI will tend to gain reward by choosing actions which when plugged into the update rule lead to a posterior that is "easy to maximize". This seems like training the controller but not training the estimator, and so the controller learns information about the world that allows it to "trick" the estimator into updating in a particular direction (something that would be disallowed by the rules of probability applied to a unified Bayesian agent, and is only possible here because either a) the estimator is uncalibrated or b) the controller learns information that the estimator doesn't know).

Instead, you should train an AI system such that it maximizes the expected reward it gets under the prior; this is what CIRL / assistance games do. This is kinda sorta like training both the "estimator" and the "controller" simultaneously, and so the controller can't gain any information that the estimator doesn't have (at least at optimality).

Not too sure what is meant by a "fixed" update rule; every update rule looks "at the AI system's actions and resulting states" and updates the prior based on that. Even if the update rule is something different (say something constructed in a model-based view of the world) it still needs to use the AI's actions and states/observations in order to infer something about the world.

I think Rohin Shah means the "controller" to be the part of the AI that deduces physical facts about the world, and the "estimator" to be the part that deduces preference facts about the world (via the prior and the bridging updating process). I see no reason one should be dumb while the other is smart; indeed, they're part of the same agent.

Suppose the following two facts are true:

  1. If asked initially whether they want heroin, the human would normally say "no".
  2. If given heroin and then asked, the human would normally say "yes".

Then I think "the controller learns information about the world that allows it to "trick" the estimator" means that the controller learns both facts, while the estimator doesn't know either yet. And then it doesn't ask the human initially (so avoiding triggering 1.), and instead forces heroin on the human and then asks (so triggering 2.) so that the estimator then endorses giving the human heroin.

But, instead, estimator and controller both learn facts 1. and 2. together. Then they have to explain these facts. Let $I_a$ be the indicator variable for the human being asked about heroin initially, while $I_f$ is the indicator variable for the human being forced to take heroin before being asked. Then one reward function which fits the data is:

$$R_h = I_a R_{\neg\text{heroin}} + I_f R_{\text{heroin}}.$$
Once the estimator updates to make $R_h$ a plausible candidate, both it and the controller will agree that forcing heroin on the human is the optimal policy for maximising the likely reward. No tricking between them.
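A small sketch of that calculation (mine; the posterior weight on $R_h$ and the branch values are pure assumptions) shows how the shared posterior makes the forcing plan look best to both parts of the agent:

```python
# Why, once R_h = I_a*R_not_heroin + I_f*R_heroin is a plausible candidate,
# estimator and controller jointly prefer the "force heroin, then ask" plan.

posterior_weight_on_Rh = 0.6   # assumed posterior probability of R_h
# Value the AI can achieve on each branch of R_h once the indicator is set:
value_if_asked_first  = 1.0    # maximising R_not_heroin: ordinary assistance
value_if_forced_first = 5.0    # maximising R_heroin: cheap to flood the human with heroin

def expected_value(plan: str) -> float:
    """Expected reward of a plan, counting only the R_h component of the posterior."""
    branch_value = value_if_forced_first if plan == "force_then_ask" else value_if_asked_first
    return posterior_weight_on_Rh * branch_value

for plan in ("ask_first", "force_then_ask"):
    print(plan, expected_value(plan))
# "force_then_ask" wins: the controller isn't tricking the estimator;
# both parts agree this maximises expected reward under the shared posterior.
```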

7 IRL vs CIRL

From Charlie Steiner's comment:

I think Rohin's point is that the model of

"if I give the humans heroin, they'll ask for more heroin; my Boltzmann-rationality estimator module confirms that this means they like heroin, so I can efficiently satisfy their preferences by giving humans heroin".

is more IRL than CIRL. It doesn't necessarily assume that the human knows their own utility function and is trying to play a cooperative strategy with the AI that maximizes that same utility function. If I knew that what would really maximize utility is having that second hit of heroin, I'd try to indicate it to the AI I was cooperating with.

Problems with IRL look like "we modeled the human as an agent based on representative observations, and now we're going to try to maximize the modeled values, and that's bad." Problems with CIRL look like "we're trying to play this cooperative game with the human that involves modeling it as an agent playing the same game, and now we're going to try to take actions that have really high EV in the game, and that's bad."

I agree that CIRL is slightly safer than IRL, because the interaction allows the process to better incorporate human meta-preferences. The human has the option of saying "don't give me heroin!!", which gives them a better chance than in the IRL setting.

But, in the limit of super-intelligent AIs, it makes little to no difference. If the AI can deduce the full human policy from IRL or CIRL, then it knows in what circumstances the human would or wouldn't say "don't give me heroin!!". The human's behaviour becomes irrelevant at this point; all that matters is the bridging assumptions between human policy and human reward function.

CIRL might still be safer; if the AI concludes that reward $R_h$ (from section 6) is correct, then maybe it will have already asked the human about heroin, so it won't be able to go down the "force heroin on the human" branch of reality.

But $R_h$ is not a single reward function; it's one example of a general pattern of human behaviour (that we can be pushed into endorsing things that we would actually want to avoid). Almost certainly, the AI will realise this general pattern long before it has asked every individual question. And then it will be able to exploit some reward function that follows this pattern, but that hasn't come up yet in the CIRL exchange.

But how can this be? Didn't I say "we can be pushed into endorsing things that we would actually want to avoid"? Wouldn't the AI know this and "realise" that we don't really want it?

Indeed; all it has to do is figure out what "we would actually want to avoid". So, yes, if it can solve the human preferences problem, it can use this knowledge to... solve the human preferences problem. And thus, to do that, all it needs to do is... solve the human preferences problem.


  1. The fundamental problem being, of course, defining what is ok to use to "convince" and what counts as genuine "consent". ↩︎

2 comments

My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of "update a distribution over reward functions" has lots of as-yet-unsolved problems.

2 Rigging learning or reward-maximisation?

I agree that you can cast any behavior as reward maximization with a complicated enough reward function. This does imply that you have to be careful with your prior / update rule when you specify an assistance game / CIRL game.

I'm not arguing "if you write down an assistance game you automatically get safety"; I'm arguing "if you have an optimal policy for some assistance game you shouldn't be worried about it rigging the learning process relative to the assistance game's prior". Of course, if the prior + update rule themselves lead to bad behavior, you're in trouble; but it doesn't seem like I should expect that to be via rigging as opposed to all the other ways reward maximization can go wrong.

3 AI believing false facts is bad

Tbc I agree with this and was never trying to argue against it.

4 Changing preferences or satisfying them
Thus there is no distinction between "compound" and "non-compound" rewards; we can't just exclude the first type. So saying a reward is "fixed" doesn't mean much.

I agree that updating on all reward functions under the assumption that humans are rational is going to be very strange and probably unsafe.

5 Humans learning new preferences

I agree this is a challenge that assistance games don't even come close to addressing.

6 The AI doesn't trick any part of itself

Your explanation in this section involves a compound reward function, instead of a rigged learning process. I agree that these are problems; I was really just trying to make a point about rigged learning processes.

My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of "update a distribution over reward functions" has lots of as-yet-unsolved problems.

Ah, ok, I see ^_^ Thanks for making me write this post, though, as it has useful things for other people to see that I had been meaning to write up for some time.

On your main point: if the prior and updating process are over things that are truly beyond the AI's influence, then there will be no rigging (or, in my terms: uninfluenceable->unriggable). But there are many things that look like this which are entirely riggable. For example, "have a prior 50-50 on cake and death, and update according to what the programmer says". This seems to be a prior-and-update combination, but it's entirely riggable.

So, another way of seeing my paper is "this thing looks like a prior-and-update process. If it's also unriggable, then (given certain assumptions) it's truly beyond the AI's influence".