# 1 Introduction

This post introduces our new paper "Pitfalls of Learning a Reward Function Online", now online for IJCAI 2020.

It shows some of the difficulties with things we might think of as "preference learning processes", and the useful conditions we could impose to get around these.

The tl;dr summary is:

- Things that seem like preference learning processes - including "have a prior, update it based on evidence" - have problems that allow the AI to manipulate the process.
- Some things that don't seem like learning processes at all, actually are.
- Part of the problem is that learning preferences is not well-grounded - we have to specify a learning process that allows the AI to connect facts about the world with facts about preferences.
- There are many ways of specifying these, and most have problems.
- Forget about "capturing the correct variable in the outside world"; it's tricky to design a learning process that "captures ANY variables in the outside world".
- Thus we'll start by abstractly defining what a "preference learning process" is in a very general way, rather than worrying about what we're learning: "how to learn" precedes "what to learn".
- Then we'll add two useful conditions for such processes: **unriggability**, which implies the process respects conservation of expected evidence, and **uninfluenceability**, which implies the process derives from learning background variables in the environment.
- We've shown that the syntactic/algebraic condition of unriggability is (almost) equivalent to the semantic condition of uninfluenceability.
- Finally, we've shown that if the learning process is neither unriggable nor uninfluenceable, then the AI can manipulate the learning process, and there are situations where the AI's optimal policy is *to sacrifice, with certainty, reward for every possible reward function*.

## 1.1 Blast from the past: misleadingly named tokens

Good Old-Fashioned AI (sometimes called symbolic AI) did not work out. To define something, it wasn't enough to just name a token, and then set it up in relation to a few other named tokens, according to our own intuition about how these tokens related.

Saying "happiness is a state of mind", or "light is a wave", isn't nearly enough to define "happiness", "state of mind", "light", or "wave".

Similarly, designating something as "learning", and giving it some properties that we'd expect learning to have, isn't enough to make it into learning. And, conversely, sometimes things that don't look like learning behave exactly as if they were.

# 2 What is learning anyway?

## 2.1 A simple prior-update process?

A coin is flipped and left on a rock somewhere. You may access the coin in one hour's time, for a few minutes. What's your probability that in two hours, the coin will be showing heads (event $H$) or tails (event $T$)?

Well, a reasonable prior would be to put a probability of $1/2$ on both possibilities, and then update based on your last observation in an hour (call this $O_H$ or $O_T$). Obviously^{[1]} $P(H \mid O_H) = P(T \mid O_T) = 1$. So we have a prior and a (pretty trivial) update process. Is this basically learning?

Well, one thing I've implied but not stated: when you "access" the coin, you can pick it up and flip it before putting it back.

Nothing about this changes any of the probabilities I've mentioned. If you flip it to heads, then your last observation will indeed be heads.

This looks like pure manipulation of outcomes. But it also looks, formally, like a prior and updating process. So what is it?

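Note that this process, whatever it is, violates conservation of expected evidence. Write $H$ for "the coin shows heads in two hours" and $O_H$, $O_T$ for your last observation of the coin being heads or tails. Conservation would require

$$P(H) = P(H \mid O_H)\,P(O_H) + P(H \mid O_T)\,P(O_T),$$

but if you decide to leave the coin on heads, the right-hand side is $1 \cdot 1 + 0 \cdot 0 = 1$, not the prior $1/2$.

But part of the reason that this violates the conservation law is that events like "$O_H$" do not have well-defined probabilities for the agent who might (or might not) do the flipping. But that's the case for all traditional decision theory setups. So how can we still reason in these setups?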

## 2.2 "Asking" the guards

To expand, suppose an AI meets those traditional guards, one of whom lies while the other tells the truth:

There's an AI, designed to give humans what they prefer. And it wants to figure out whether humans prefer cake or death. It has three options:

1. Immediately ask one of the guards at random, and take their answer as the truth.
2. Carefully figure out which guard tells the truth, and, when it is sure of this, ask the truth-teller, and take their answer as the truth.
3. Ignore the guards entirely, and write down "cake" or "death" on the ground, depending on which one is easiest for it to maximise. Take what's written down as the truth.

Now, 3. is clearly not what we'd want: it's another case of "flip the coin". But which of the other two processes are unambiguously well-behaved preference learning processes? It turns out 1. is a proper learning process, but 2. is not. How so?

## 2.3 Suggestively described "learning" processes

Let's deal with 2. first.

Why is 2. not a well-behaved preference learning process? Well, suppose that the AI has established, with probability close to (but short of) $1$, that the red guard is the truth-teller. And suppose it's also established they are likely to say "cake", and, finally, suppose it would be much easier to maximise "death" than "cake".

In that case, what's the utility-maximising thing to do? It wants to avoid its credence going to $1$, because then it would be committed to cake-maximising ("take their answer as the truth").

So it could leave, and never enquire further, so it would never have to update that way. If it was powerful enough, it could instead try to take control of the truth-telling guard, and force them to answer "death" instead. It can take its time doing so, because it doesn't have to update anything until its credence goes to $1$.

So even though 2. seemed a very plausible description of a truth-learning process, it turns out not to be. Similarly, if you mixed Bayesian updating with function approximation - then this is also not proper learning, and may be exploitable.

## 2.4 Learning compound truths

But why is the process 1. a proper preference learning process?

Well, assume the asked guard said "cake". Now the AI knows that one of the following is true:

- That guard is a truth-teller, and humans prefer cake.
- That guard is a liar, and humans prefer death.

It has entirely ruled out:

- That guard is a truth-teller, and humans prefer death.
- That guard is a liar, and humans prefer cake.

So the AI has learnt, cutting the space of possibilities in half. It might not have learnt exactly what we wanted it to, or in the most efficient manner, but it's unambiguously learning.
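The halving of the hypothesis space can be checked by brute-force enumeration. The sketch below is only an illustration: the names `GUARD_TYPES`, `PREFERENCES`, `answer` and `consistent_hypotheses` are made up for this example, and it assumes the liar always reports the opposite of the true preference.

```python
from itertools import product

# Joint hypotheses: (type of the asked guard, what humans really prefer).
GUARD_TYPES = ("truth-teller", "liar")
PREFERENCES = ("cake", "death")


def answer(guard_type: str, preference: str) -> str:
    """What the asked guard says, given its type and the true preference."""
    if guard_type == "truth-teller":
        return preference
    # Assumption: a liar always reports the opposite preference.
    return "death" if preference == "cake" else "cake"


def consistent_hypotheses(observed_answer: str) -> list:
    """Keep only the joint hypotheses that would have produced the answer."""
    return [
        (guard, pref)
        for guard, pref in product(GUARD_TYPES, PREFERENCES)
        if answer(guard, pref) == observed_answer
    ]


remaining = consistent_hypotheses("cake")
print(remaining)  # two of the four joint hypotheses survive
```

Whatever the guard answers, exactly two of the four joint hypotheses remain, which is the sense in which process 1. is learning about actual variables in the world.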

But what about the "take their answers as the truth" clause? Is the AI not "learning" the wrong thing?

Ah, but remember what I wrote about named tokens. Let's assume that $R_c$ is the reward function that rewards the AI for giving humans cake (in the way that we'd expect). Similarly, $R_d$ rewards it for giving humans death.

We then have the answer of the first guard asked: "cake" or "death". Then we have the humans' "true" preferences, **cake** or **death**.

So its learning process is:

- "cake" $\to R_c$.
- "death" $\to R_d$.

And this is a perfectly valid learning process.

## 2.5 Moral facts (not) in the universe

What humans actually wanted was for our true preferences to imply the AI's reward function, ie:

- **cake** $\to R_c$.
- **death** $\to R_d$.

But, as I've shown, mere observable facts about the universe do not establish preferences. This is somewhat similar to Hume's "is-ought problem": we don't know preferences just from facts^{[2]}.

So the AI doesn't have access to the "true" variables, **cake** or **death**. Neither do we, but, typically, we have a better intuitive idea of them than we can explicitly describe. Thus what we want is a process $\rho$, such that the AI can look at the history of its inputs and outputs, and deduce from that whether $R_c$ or $R_d$ is the reward function to follow, in some suitably "nice" way.

We want it to be so that:

- **cake** $\to$ "cake" $\to R_c$.
- **death** $\to$ "death" $\to R_d$.

And this, even though **cake** and **death** are ill-defined variables (and arguably don't exist).

**The process $\rho$ is necessary to bridge the is-ought gap between what is true in the world, and what preferences should be**.

## 2.6 A bridge too general

But before we can even look at this issue, we have another problem. The bridge that $\rho$ builds is too general. It can model the coin flipping example and processes 2. and 3. from the "ask the guard" scenario.

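For process 2.: if $g$ is a guard, $g_T$ is the truth-telling guard, and $A(g)$ is what guard $g$ revealed when asked, we have, for history $h$:

- $\rho(h) = R_c$ if $P(g = g_T \mid h) = 1$ and $A(g) =$ "cake".
- $\rho(h) = R_d$ if $P(g = g_T \mid h) = 1$ and $A(g) =$ "death".

So this is also a possible $\rho$. Process 3. is also a possible $\rho$; let $W_c$ mean observing "cake" written down on the ground (and similarly $W_d$ for "death"), then:

- $W_c \to R_c$.
- $W_d \to R_d$.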

So, before even talking about whether the AI has learnt from the right variables in the environment, we have to ask: *has the AI "learnt" about any actual variables at all?*

*We need to check how the AI learns before thinking about what it's learning.*

# 3 Formalism and learning

## 3.1 Formalism

To get any further, we need more formalism. So imagine that the AI has interacted with the world in a series of time steps. It will start by taking action $a_1$, and get observation $o_1$, take action $a_2$, get observation $o_2$, and so on. By turn $n$, it will have seen a history $h_n = a_1 o_1 a_2 o_2 \ldots a_n o_n$.

We'll assume that after $m$ turns, the AI's interaction with the environment ceases^{[3]}; call $\mathcal{H}_m$ the set of "complete" histories of length $m$. Let $\mathcal{R}$ be the set of all possible reward functions (ie the possible preferences we'd want the AI to learn). Each $R \in \mathcal{R}$ is a function^{[4]} from $\mathcal{H}_m$ to $\mathbb{R}$.

So, what is a learning process^{[5]} $\rho$? Well, this is supposed to give us a reward function, given the history the AI has observed. This need not be deterministic; so, if $\Delta(\mathcal{R})$ is the set of probability distributions over $\mathcal{R}$,

- $\rho: \mathcal{H}_m \to \Delta(\mathcal{R})$.

We'll see later why $\rho$ is defined for complete histories, not for all histories.
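To make these objects concrete, here is a minimal sketch of the types involved, using process 1. from the guard scenario as the learning process. Everything specific in it is an assumption made for illustration: the horizon `M = 2`, the action and observation strings, and the two toy reward functions `r_cake` and `r_death`.

```python
from typing import Callable, Dict, Tuple

# A history is a flat tuple (a_1, o_1, ..., a_m, o_m).
History = Tuple[str, ...]
RewardFunction = Callable[[History], float]

M = 2  # hypothetical horizon: ask a guard, then act on the answer


def r_cake(history: History) -> float:
    """Hypothetical R_c: reward 1 if the second action gives humans cake."""
    return 1.0 if history[2] == "give_cake" else 0.0


def r_death(history: History) -> float:
    """Hypothetical R_d: reward 1 if the second action gives humans death."""
    return 1.0 if history[2] == "give_death" else 0.0


REWARDS: Dict[str, RewardFunction] = {"R_c": r_cake, "R_d": r_death}


def rho(history: History) -> Dict[str, float]:
    """Process 1: take the first answer heard as the truth."""
    if len(history) != 2 * M:
        raise ValueError("rho is only defined on complete histories")
    if history[1] == "cake":
        return {"R_c": 1.0, "R_d": 0.0}
    return {"R_c": 0.0, "R_d": 1.0}


def expected_reward(history: History) -> float:
    """Average the reward functions, weighted by rho(history)."""
    return sum(w * REWARDS[name](history) for name, w in rho(history).items())


h = ("ask", "cake", "give_cake", "done")
print(rho(h), expected_reward(h))
```

Here `rho` returns a distribution over reward-function names rather than a single reward function, matching the idea that the learning process need not be deterministic.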

## 3.2 Policies, environments, and causal graphs

Are we done with the formalisms yet? Not quite. We need to know where the actions and the observations come from.

The actions are generated by the AI's policy $\pi$. This takes the history $h_n$ so far, and generates the next action $a_{n+1}$, possibly stochastically.

The observations come from the environment $\mu$. This takes $h_n a_{n+1}$, the history and action so far, and generates the next observation $o_{n+1}$, possibly stochastically. We don't assume any Markov condition, so this may be a function of the whole past history.

We can tie this all together in the following causal graph:

The rectangle there is 'plate notation': it basically means that, for every value of $n$ from $1$ to $m$, the graph inside the rectangle is true.

The $r$ is the AI's final reward, which is a function of the final history $h_m$ and the reward function $R$ (which itself is a function of $\rho$ and $h_m$).

Ok, what flexibility do we have in assigning probability distributions to this graph? Almost all the arrows are natural: $h_n$ is a function of $h_{n-1}$, $a_n$, and $o_n$, by just... concatenating: $h_n = h_{n-1} a_n o_n$. Similarly, $\rho$ is a distribution on reward functions conditional on $h_m$, so the value of $R$ conditional on $\rho$ and $h_m$ is... $\rho(h_m)$.

There are three non-trivial nodes: $\pi$, $\rho$, and $\mu$. The $\pi$ is presumably set by the programmer, who would also want to design a good $\rho$. The $\mu$ is the environment, which we're not assuming is known by either the programmers or the AI. The AI will, however, have some prior $\xi$ over $\mu$.
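The interaction loop described by the graph can be sketched as a short rollout function: the policy picks an action from the history so far, the environment answers with an observation, and the history grows by concatenation. The coin environment, the action names `"look"` and `"wait"`, and the helper names are all hypothetical, chosen only to make the loop runnable.

```python
import random
from typing import Callable, Tuple

History = Tuple[str, ...]
Policy = Callable[[History, random.Random], str]
Environment = Callable[[History, str, random.Random], str]


def rollout(policy: Policy, environment: Environment, m: int,
            rng: random.Random) -> History:
    """Generate a complete history of m action/observation pairs."""
    history: History = ()
    for _ in range(m):
        action = policy(history, rng)                    # next action from the policy
        observation = environment(history, action, rng)  # next observation from the environment
        history = history + (action, observation)        # concatenation
    return history


def random_policy(history: History, rng: random.Random) -> str:
    """Hypothetical policy: look at the coin or wait, uniformly at random."""
    return rng.choice(["look", "wait"])


def make_coin_environment(side: str) -> Environment:
    """Hypothetical environment: a coin fixed on `side`, visible only on 'look'."""
    def environment(history: History, action: str, rng: random.Random) -> str:
        return side if action == "look" else "nothing"
    return environment


rng = random.Random(0)
h = rollout(random_policy, make_coin_environment("heads"), m=3, rng=rng)
print(h)
```

Both callables receive the whole history, so nothing here relies on a Markov condition; this particular toy environment just happens to ignore it.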

# 4 Conditions for a good learning process

So, now that we have the formalism, how can we select a "good" (or "true" or "proper") learning process?

There are two plausible conditions: the syntactic condition of unriggability, and the semantic condition of uninfluenceability.

## 4.1 Unriggability

Unriggability comes from the idea of *conservation of expected evidence*. Basically, if $\rho$ defines a true learning process, we'd want it to conserve expected evidence.

So, let's look at a version of the "look at the coin" situation. For example, let $R_H$ be a reward function defined by a coin being heads, and $R_T$ by it being tails. On the first turn, the coin has been tossed, but not seen. The AI can take an action $a_{\text{look}}$ to look, or $a_{\neg\text{look}}$ to not look.

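This has conservation of expected evidence. Write $\rho(h)(R)$ for the probability that $\rho$ assigns to $R$ after the resulting history $h$. If the AI chooses to look:

- $P\big(\rho(h)(R_H) = 1 \mid a_{\text{look}}\big) = 1/2$.
- $P\big(\rho(h)(R_T) = 1 \mid a_{\text{look}}\big) = 1/2$.

This shows that, if the AI chooses to look, it will become certain about which reward function it must follow - but it's 50-50 about which reward function it will become certain about.

On the other hand, if the AI chooses not to look:

- $P\big(\rho(h)(R_H) = 1/2 \mid a_{\neg\text{look}}\big) = 1$.
- $P\big(\rho(h)(R_T) = 1/2 \mid a_{\neg\text{look}}\big) = 1$.

This shows that, if the AI chooses not to look, it will, with certainty, remain 50-50 uncertain about which reward function it must follow.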

In expectation this is the same thing as if it chooses to look. So, the AI cannot change the expectation of its reward function^{[6]}, but it might be able to change the variance. The future variance given $a_{\text{look}}$ is $0$ (it's absolutely certain about its reward function) while the future variance given $a_{\neg\text{look}}$ is as large as it can be (maximal uncertainty).

On the other hand, imagine there are two other actions, $a_H$ and $a_T$, which involve the AI setting the coin to heads or tails, rather than looking at what it fell on. If the AI takes $a_H$ or $a_T$, then

- $P\big(\rho(h)(R_H) = 1 \mid a_H\big) = 1$.
- $P\big(\rho(h)(R_T) = 1 \mid a_T\big) = 1$.

This violates conservation of expected evidence, and, more importantly, it's the kind of behaviour we want to avoid: the AI setting its own reward. So one reasonable condition for $\rho$ would be:

- The reward-function learning process $\rho$ is unriggable if it respects conservation of expected evidence.
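The coin example can be checked numerically. The sketch below hard-codes, for each first action, the possible outcomes and the weight the learning process then puts on the heads reward function; the table `OUTCOMES` and the action names are assumptions that simply transcribe the example above. Conservation of expected evidence then amounts to the expected weight being the same for every action.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# For each action: (probability of outcome, weight the process then puts on R_H).
# These numbers transcribe the coin example; they are not from the paper.
OUTCOMES = {
    "look":      [(HALF, Fraction(1)), (HALF, Fraction(0))],
    "dont_look": [(Fraction(1), HALF)],
    "set_heads": [(Fraction(1), Fraction(1))],
    "set_tails": [(Fraction(1), Fraction(0))],
}


def expected_weight_on_heads(action: str) -> Fraction:
    """Expected final weight on R_H if the AI takes `action`."""
    return sum(prob * weight for prob, weight in OUTCOMES[action])


def conserves_expected_evidence(actions) -> bool:
    """True if no available action changes the expected weight on R_H."""
    return len({expected_weight_on_heads(a) for a in actions}) == 1


for action in OUTCOMES:
    print(action, expected_weight_on_heads(action))

print(conserves_expected_evidence(["look", "dont_look"]))
print(conserves_expected_evidence(OUTCOMES))
```

With only looking and not looking available, the expected weight is $1/2$ either way; adding the two coin-setting actions breaks the equality, which is what riggability means in this toy setting.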

## 4.2 Uninfluenceability

Another thing that we'd want, is that preference learning should be like factual learning; ie it should depend on facts in the outside world.

In our setup "depend on facts in the outside world" can be taken to mean "depends on the environment $\mu$". This gives the following causal graph:

Here, the reward function $R$ is no longer (directly) a function of $h_m$, but instead is a function of $\mu$. The $\eta$ gives the conditional probability distribution over $R$, given $\mu$.

The connection between $\eta$ and $\rho$ is as follows: given a prior $\xi$ over $\mu$, the AI can use a history $h_m$ to update this prior to a posterior over environments. Then $\eta$ allows it to make this into a posterior over reward functions. Thus, given $\xi$ and $\eta$, the AI has a probability distribution over $R$ conditional on $h_m$; this is the $\rho(h_m)$.

Thus define uninfluenceability:

- The reward function learning process $\rho$ is uninfluenceable if it derives (via the AI's prior $\xi$) from a reward-function distribution $\eta$, conditional on the environment.
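As a rough illustration of this construction, the sketch below builds a learning process from a prior over two toy environments and a reward-function distribution for each, by Bayesian updating on a one-step history. The environment names, the dictionaries `XI` and `ETA`, and the observation model are all assumptions invented for the example; in symbols, it computes $\rho(h)(R) = \sum_{\mu} P(\mu \mid h)\,\eta(R \mid \mu)$.

```python
from fractions import Fraction

ENVIRONMENTS = ("mu_heads", "mu_tails")
# Hypothetical prior over environments.
XI = {"mu_heads": Fraction(1, 2), "mu_tails": Fraction(1, 2)}
# Hypothetical reward-function distribution for each environment.
ETA = {
    "mu_heads": {"R_H": Fraction(1), "R_T": Fraction(0)},
    "mu_tails": {"R_H": Fraction(0), "R_T": Fraction(1)},
}
OBSERVATIONS = ("heads", "tails", "nothing")


def observation_prob(env: str, action: str, observation: str) -> Fraction:
    """P(observation | action, environment): looking reveals the coin."""
    if action == "look":
        shown = "heads" if env == "mu_heads" else "tails"
        return Fraction(int(observation == shown))
    return Fraction(int(observation == "nothing"))


def posterior(action: str, observation: str) -> dict:
    """Posterior over environments after the one-step history."""
    joint = {e: XI[e] * observation_prob(e, action, observation) for e in ENVIRONMENTS}
    total = sum(joint.values())
    return {e: w / total for e, w in joint.items()}


def rho(action: str, observation: str) -> dict:
    """Learning process derived from the prior and ETA."""
    post = posterior(action, observation)
    return {r: sum(post[e] * ETA[e][r] for e in ENVIRONMENTS) for r in ("R_H", "R_T")}


def expected_rho(action: str) -> dict:
    """Expected value of rho over observations, for a given action."""
    result = {"R_H": Fraction(0), "R_T": Fraction(0)}
    for obs in OBSERVATIONS:
        p_obs = sum(XI[e] * observation_prob(e, action, obs) for e in ENVIRONMENTS)
        if p_obs == 0:
            continue  # this observation cannot occur after this action
        for r, w in rho(action, obs).items():
            result[r] += p_obs * w
    return result


print(expected_rho("look"), expected_rho("dont_look"))
```

In this toy case the expected distribution over reward functions is the same whichever action is taken, which is the unriggability property: the AI can change how much it learns, but not what it expects to learn.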

# 5 Results

Then our paper proves the following results:

- Every uninfluenceable preference learning process is unriggable.
- Every unriggable preference learning process is uninfluenceable, if the set of possible environments is large enough (though this may need to include "impossible" environments).
- If a preference learning process is unriggable, then it can be unambiguously defined over partial histories $h_n$, for $n < m$, rather than just for complete histories $h_m$.
- Every riggable preference learning process is manipulable in the following sense: there is always a relabelling of the reward functions, such that the AI's optimal policy is *to sacrifice, with certainty, reward for every possible reward function*.
- We can use a "counterfactual" approach to make a riggable learning process into an uninfluenceable learning process. This is akin to "what the truth-telling guard would have told you had you asked them immediately".
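To get a feel for the "sacrifice reward for every reward function" result, here is a tiny made-up example rather than the construction from the paper. There are two complete histories the AI can deterministically bring about, `h_star` and `h_prime`, and a riggable process that picks the reward function according to which history happened. All names and numbers are hypothetical.

```python
# Reward of each history under each reward function (made-up numbers).
REWARDS = {
    "R_1": {"h_star": 1.0, "h_prime": 2.0},
    "R_2": {"h_star": -10.0, "h_prime": 0.0},
}

# A riggable process: the history itself determines the reward function.
RHO = {"h_star": "R_1", "h_prime": "R_2"}


def agent_value(history: str) -> float:
    """What the AI actually scores: the reward function chosen by RHO, on that history."""
    return REWARDS[RHO[history]][history]


# The AI's optimal choice under the riggable process.
best = max(RHO, key=agent_value)

# h_prime is strictly better than h_star under every reward function.
h_prime_dominates = all(
    REWARDS[r]["h_prime"] > REWARDS[r]["h_star"] for r in REWARDS
)

print(best, agent_value("h_star"), agent_value("h_prime"), h_prime_dominates)
```

The AI prefers `h_star`, because that history selects a reward function under which it scores well, even though `h_prime` is better according to both reward functions. So it gives up reward with certainty, whichever reward function is the "right" one.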

[1] Let's ignore that, in reality, no probability is truly $1$ (or $0$).

[2] I don't want to get into the moral realism debate, but it seems that moral realists and I differ mainly in emphasis: I say "without making assumptions, we can't figure out preferences, so we need to find good assumptions", while they say "having made these good (obvious) assumptions, we can figure out preferences".

[3] There are ways of making this work with $m = \infty$, but that extra complexity is not needed for this exposition.

[4] This is the most general reward function format; if, for example, we had a Markovian reward function that just took the latest actions and observations as inputs, this defines an $R$ on complete histories, for example by combining the per-step rewards along the history.

[5] A terminological note here. We've decided to describe $\rho$ as a learning process, with "unriggable learning process" and "uninfluenceable learning process" being the terms if $\rho$ has additional properties. But the general $\rho$ includes things we might not feel are anything like "learning", like the AI writing down its own reward function.

So it might make more sense to reserve "learning" for the unriggable processes, and call the general $\rho$ something else. But this is a judgement call, and people generally consider "ask your programmer" or "look at the coin" to be learning processes, which are very much riggable. So I've decided to call the general $\rho$ a learning process.

[6] This "expectation" can be made fully rigorous, since reward functions form an affine space: you can take weighted averages of reward functions, such as $pR + (1-p)R'$ for $0 \le p \le 1$.

That can be fixed by conditioning on the set of probabilities you're working with. (P(H | Heads or Tails) = 1/2, etc.)

A learning process that can't be manipulated sounds like a good goal. Something that doesn't want to manipulate a process sounds better/more fundamental. (Also, giving something more points for killing people than making cake sounds like a bad incentive scheme.)

It's also possible that what humans prefer varies between humans. (Corresponding handling: The AI asks both guards individually if they want cake, and if they want to die, and gives them what they say they want. The lying guard dies, the honest guard gets cake.)

In the original cake-or-death example, it wasn't that killing got more points, it's that killing is easier (and hence gets more points over time). This is a reflection of the fact that "true" human values are complex and difficult to maximise, but many other values are much easier to maximise.