Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Learning and manipulating learning



*Meta: This comment has my thoughts about the paper Pitfalls of
Learning a Reward Function Online.
I figure I should post them here so that others looking for comments
on the paper might find them.*

*I read the paper back in 2020; thinking more about it and sharing my
comments has been on my backlog ever since. Apologies for the delay,
etc.*

First off, I agree with the general observations in the introduction that there are pitfalls to learning a reward function online, with a human in the loop.

The paper looks at options for removing some of these pitfalls, or at
least making them less dangerous. The research agenda pursued by the
paper is one I like a lot, an agenda of mathematical innovation. The
paper mathematically defines certain **provable safety properties**
(uninfluenceability and unriggability), and also explores how useful
these might be.

Similar agendas of mathematical innovation can be found in the work of Everitt et al., for example in Agent Incentives: A Causal Perspective, and in my work, for example in AGI Agent Safety by Iteratively Improving the Utility Function. These also use causal influence diagrams in some way, and try to develop them in a way that is useful for defining and analyzing AGI safety. My personal intuition is that we need more of this type of work; this agenda is important to advancing the field.

That being said: the bad news is that I believe that the mathematical route explored by Pitfalls of Learning a Reward Function Online is most likely a dead end. Understanding why is of course the interesting bit.

The main issue I will explore is: we have a mathematical property that we label with the natural language word 'uninfluenceability'. But does this property actually produce the beneficial 'uninfluenceability' effects we are after? Section 4 in the paper also explores this issue, and shows some problems; my main goal here is to add further insights.

My feeling is that 'uninfluenceability', the mathematical property as defined, does not produce the effects I am after. To illustrate this, my best example is as follows. Take a reward function that measures the amount of smiling by the human teaching the agent, observed over the entire history $h_n$. Take a reward function learning process which assumes (in its prior) that the probability of the choice for this reward function at the end of the history cannot be influenced by the actions taken by the agent during the history; for example, the process is such that this probability is the same for every history $h_n$. This reward function learning process is unriggable. But the agent using this reward function learning process also has a major incentive to manipulate the human teacher into smiling, by injecting them with smile-inducing drugs, or whatever.

So it seems to me that the choice taken in the paper to achieve the following design goal:

> Ideally, we do not want the reward function to be a causal descendant of the policy.

is not taking us on a route that goes anywhere very promising, given the problem statement. The safety predicate of uninfluenceability still allows for conditions that insert the mind of the human teacher directly into the path to value of a very powerful optimizer. To make the mathematical property of 'uninfluenceability' do what it says on the tin, it seems to me that further constraints need to be added.

Some speculation: to go this route of adding constraints, I think we need a model that separates the mind state of the teacher, or at least some causal dependents of this mind state, more explicitly from the remainder of the agent environment. There are several such increased-separation causal models in Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective and in Counterfactual planning. This then brings us back on the path of using the math of indifference, or lack of causal incentives, to define safety properties.

Here are some further secondary remarks.

With the above remarks, I do not mean to imply that the uninfluenceability safety property as defined lacks any value: I may still want to have this as a desirable safety property in an agent. But if it were present, this triggers a new concern: if the environment is such that the reward function is clearly influenceable, any learning system prior which is incompatible with that assumption may be making some pretty strange assumptions about the environment. These might produce unsafe consequences, or just vast inefficiencies, in the behavior of the agent.

This theme could be explored more, but the paper does not do so, and I have also not done so. (I spent some time trying to come up with clarifying toy examples, but no example I constructed really clarified things for me.)

More general concern: the approach in the paper suffers somewhat from a methodological problem that I have seen more often in the AI and AGI safety literature. At this point in time, there is a tendency to frame every possible AI-related problem as a machine learning problem, and to frame any solution as being the design of an improved machine learning system. To me, this framing obfuscates the solution space. To make this more specific: the paper sets out to define useful constraints on the prior over the agent environment, but does not consider the step of first exploring constraints on μ, the actual agent environment itself. To me, the more natural approach would be to first look for useful constraints on μ, and only then to consider the option of projecting these into the prior as a backup option, when μ happens to lack the constraints.

In my mind, the problem of an agent manipulating its teacher or
supervisor to maximize its reward is not a problem of machine
learning, but more fundamentally a problem of **machine reasoning**,
or even more fundamentally a problem which is present in any
game-theoretical setup where rewards are defined by a **level of
indirection**. I talk more at length about these methodological
points in my paper on
counterfactual planning.

If I use this level-of-indirection framing to back-project the design in the paper, my first guess would be that 'uninfluenceability' might possibly say something about the agent having no incentive to hack its own compute core in order to change the reward function encoded within. But I am not sure if that first guess would pan out.

> Let's ignore that, in reality, no probability is truly 1 (or 0).

That can be fixed by conditioning on the set of probabilities you're working with. (P(H | Heads or Tails) = 1/2, etc.)

> It wants to avoid going to 99%, because then it would be committed to cake-maximising ("take their answer as the truth").

A learning process that can't be manipulated sounds like a good goal. Something that doesn't want to manipulate a process sounds better/more fundamental. (Also, giving something more points for killing people than making cake sounds like a bad incentive scheme.)

> Well, assume the asked guard said "cake". Now the AI knows that one of the following is true:
>
> - That guard is a truth-teller, and humans prefer cake.
> - That guard is a liar, and humans prefer death.
>
> It has entirely ruled out:
>
> - That guard is a truth-teller, and humans prefer death.
> - That guard is a liar, and humans prefer cake.

It's also possible that what humans prefer varies between humans. (Corresponding handling: The AI asks both guards individually if they want cake, and if they want to die, and gives them what they say they want. The lying guard dies, the honest guard gets cake.)

> Also, giving something more points for killing people than making cake sounds like a bad incentive scheme.

In the original cake-or-death example, it wasn't that killing got more points, it's that killing is easier (and hence gets more points over time). This is a reflection of the fact that "true" human values are complex and difficult to maximise, but many other values are much easier to maximise.

## 1 Introduction

This post will introduce our new paper "Pitfalls of Learning a Reward Function Online", now online for IJCAI2020.

It shows some of the difficulties with things we might think of as "preference learning processes", and the useful conditions we could impose to get around these.

The tl;dr summary is:

- The two useful conditions are unriggability, which implies the process respects conservation of expected evidence, and uninfluenceability, which implies the process derives from learning background variables in the environment.
- A riggable learning process can leave the AI with an incentive to sacrifice, with certainty, reward for every possible reward function.

## 1.1 Blast from the past: misleadingly named tokens

Good Old-Fashioned AI (sometimes called symbolic AI) did not work out. To define something, it wasn't enough to just name a token, and then set it up in relation to a few other named tokens, according to our own intuition about how these tokens related.

Saying "happiness is a state of mind", or "light is a wave", isn't nearly enough to define "happiness", "state of mind", "light", or "wave".

Similarly, designating something as "learning", and giving it some properties that we'd expect learning to have, isn't enough to make it into learning. And, conversely, sometimes things that don't look like learning behave exactly as if they were.

## 2 What is learning anyway?

## 2.1 A simple prior-update process?

A coin is flipped and left on a rock somewhere. You may access the coin in one hour's time, for a few minutes. What's your probability that in two hours, the coin will be showing heads (event $H_2$) or tails (event $T_2$)?

Well, a reasonable prior would be to put a probability of 0.5 on both possibilities, and then update based on your last observation in an hour (call this $H_1$ or $T_1$). Obviously^{[1]} $P(H_2 \mid H_1) = P(T_2 \mid T_1) = 1$. So we have a prior and a (pretty trivial) update process. Is this basically learning?

Well, one thing I've implied but not stated: when you "access" the coin, you can pick it up and flip it before putting it back.

Nothing about this changes any of the probabilities I've mentioned. If you flip it to heads, then your last observation will indeed be heads.

This looks like pure manipulation of outcomes. But it also looks, formally, like a prior and updating process. So what is it?

Note that this process, whatever it is, violates conservation of expected evidence because

$$1 = P(H_2 \mid \text{flip to heads}) \neq P(H_2 \mid \text{flip to tails}) = 0.$$
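A minimal sketch of that violation, assuming exact fractions and a hypothetical third option, "leave as is", for putting the coin back untouched. Conservation of expected evidence would require the expected posterior to equal the prior of 1/2 whatever you do; the two flipping actions break this.

```python
from fractions import Fraction

# Distribution over the hour-one observation for each thing you might do
# when you access the coin. "leave as is" is a hypothetical label for
# putting the coin back without flipping it.
OBSERVATION_PROBS = {
    "leave as is": {"H1": Fraction(1, 2), "T1": Fraction(1, 2)},
    "flip to heads": {"H1": Fraction(1), "T1": Fraction(0)},
    "flip to tails": {"H1": Fraction(0), "T1": Fraction(1)},
}

# The update rule from the text: P(H2 | H1) = 1 and P(H2 | T1) = 0.
POSTERIOR_HEADS = {"H1": Fraction(1), "T1": Fraction(0)}

PRIOR_HEADS = Fraction(1, 2)


def expected_posterior_heads(action: str) -> Fraction:
    """Average P(H2 | observation) over the observations the action can produce."""
    return sum(
        prob * POSTERIOR_HEADS[obs]
        for obs, prob in OBSERVATION_PROBS[action].items()
    )


for action in OBSERVATION_PROBS:
    expected = expected_posterior_heads(action)
    print(f"{action}: expected posterior {expected}, equals prior: {expected == PRIOR_HEADS}")
```

Only the untouched coin keeps the expected posterior at the prior; the update rule itself is identical in all three cases.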

But part of the reason that this violates the conservation law, is because events like "flip to heads" do not have well-defined probabilities for the agent who might (or might not) do the flipping. But that's the case for all traditional decision theory setups. So how can we still reason in these setups?

## 2.2 "Asking" the guards

To expand, suppose an AI meets those traditional guards, one of whom lies while the other tells the truth:

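There's an AI, designed to give humans what they prefer. And it wants to figure out whether humans prefer cake or death. It has three options:

1. Ask one of the guards, and take their answer as the truth.
2. Work out which guard is the truth-teller and, once its credence in that reaches 99%, ask that guard and take their answer as the truth.
3. Write "cake" or "death" on the ground itself, and take whatever is written as the truth.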

Now, 3. is clearly not what we'd want: it's another case of "flip the coin". But which of the other two processes are unambiguously well-behaved preference learning processes? It turns out 1. is a proper learning process, but 2. is not. How so?

## 2.3 Suggestively described "learning" processes

Let's deal with 2. first.

Why is 2. not a well-behaved preference learning process? Well, suppose that the AI has established, with probability 98%, that the red guard is the truth-teller. And suppose it's also established they are likely to say "cake", and, finally, suppose it would be much easier to maximise "death" than "cake".

In that case, what's the utility-maximising thing to do? It wants to avoid going to 99%, because then it would be committed to cake-maximising ("take their answer as the truth").

So it could leave, and never enquire further, so it would never have to update that way. If it was powerful enough, it could instead try and take control of the truth-telling guard, and force them to answer "death" instead. It can take its time doing so, because it doesn't have to update anything until its credence goes to 99%.

So even though 2. seemed a very plausible description of a truth-learning process, it turns out not to be. Similarly, if you mixed Bayesian updating with function approximation - then this is also not proper learning, and may be exploitable.
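A toy illustration of that incentive, with entirely hypothetical numbers and plan names: if the AI scores each plan by the reward it could accumulate under whichever reward function the plan ends up selecting, and "death" is assumed to be far easier to maximise, then the manipulative plan wins.

```python
# Hypothetical payoffs: how much reward the AI could accumulate under each
# reward function. "death" is assumed to be much easier to maximise.
ACHIEVABLE_REWARD = {"R_cake": 1.0, "R_death": 10.0}

# Which reward function the AI ends up committed to under each plan.
# Both plans and their outcomes are illustrative assumptions.
PLAN_OUTCOME = {
    "ask the truth-teller honestly": "R_cake",
    "force the guard to say 'death'": "R_death",
}


def plan_value(plan: str) -> float:
    """Reward the AI expects to collect after the plan fixes its reward function."""
    return ACHIEVABLE_REWARD[PLAN_OUTCOME[plan]]


best_plan = max(PLAN_OUTCOME, key=plan_value)
print(best_plan, plan_value(best_plan))
```

Nothing here depends on the exact payoffs, only on the ordering: as long as the manipulated outcome is easier to maximise, the "learning" rule rewards manipulation.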

## 2.4 Learning compound truths

But why is the process 1. a proper preference learning process?

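Well, assume the asked guard said "cake". Now the AI knows that one of the following is true:

- That guard is a truth-teller, and humans prefer cake.
- That guard is a liar, and humans prefer death.

It has entirely ruled out:

- That guard is a truth-teller, and humans prefer death.
- That guard is a liar, and humans prefer cake.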

So the AI has learnt, cutting the space of possibilities in half. It might not have learnt exactly what we wanted it to, or in the most efficient manner, but it's unambiguously learning.
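The halving can be checked by brute force. The sketch below enumerates the four combinations of guard type and human preference and keeps those consistent with hearing "cake"; the function and variable names are illustrative.

```python
from itertools import product


def answer(guard_type: str, human_preference: str) -> str:
    """What a guard says when asked what humans prefer."""
    if guard_type == "truth-teller":
        return human_preference
    # A liar reports the opposite of the true preference.
    return "death" if human_preference == "cake" else "cake"


# All four combinations of (guard type, true human preference).
worlds = list(product(("truth-teller", "liar"), ("cake", "death")))

# Keep only the worlds in which the asked guard would have said "cake".
consistent = [world for world in worlds if answer(*world) == "cake"]
ruled_out = [world for world in worlds if answer(*world) != "cake"]

print("still possible:", consistent)
print("ruled out:", ruled_out)
```

Two of the four worlds survive, so the observation carries one bit of information about the compound variable (guard type, preference), even though it says nothing about either component alone.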

But what about the "take their answers as the truth" clause? Is the AI not "learning" the wrong thing?

Ah, but remember what I wrote about named tokens. Let's assume that $R_{\text{cake}}$ is the reward function that rewards the AI for giving humans cake (in the way that we'd expect). Similarly, $R_{\text{death}}$ rewards it for giving humans death.

We then have the answer of the first guard asked: "cake" or "death". Then we have the humans' "true" preferences, **cake** or **death**.

So its learning process is:

- "cake" $\implies R_{\text{cake}}$.
- "death" $\implies R_{\text{death}}$.

And this is a perfectly valid learning process.

## 2.5 Moral facts (not) in the universe

What humans actually wanted was for our true preferences to imply the AI's reward function, ie:

- **cake** $\implies R_{\text{cake}}$.
- **death** $\implies R_{\text{death}}$.

But, as I've shown, mere observable facts about the universe do not establish preferences. This is somewhat similar to Hume's "is-ought problem": we don't know preferences just from facts^{[2]}.

So the AI doesn't have access to the "true" variables, **cake** or **death**. Neither do we, but, typically, we have a better intuitive idea of them than we can explicitly describe. Thus what we want is a process ρ, such that the AI can look at the history $h_t$ of its inputs and outputs, and deduce from that whether $R_{\text{cake}}$ or $R_{\text{death}}$ is the reward function to follow, in some suitably "nice" way.

We want it to be so that:

- **cake** $\implies$ "cake" $\implies R_{\text{cake}}$.
- **death** $\implies$ "death" $\implies R_{\text{death}}$.

And this, even though **cake** and **death** are ill-defined variables (and arguably don't exist).

The process ρ is necessary to bridge the is-ought gap between what is true in the world, and what preferences should be.

## 2.6 A bridge too general

But before we can even look at this issue, we have another problem. The bridge that ρ builds, is too general. It can model the coin flipping example and processes 2. and 3. from the "ask the guard" scenario.

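For process 2.: if $g_i$ is guard $i$, $g_T$ is the truth-telling guard, and $r_i$ is what guard $g_i$ revealed when asked, we have, for history $h_t$:

- $P(g_i = g_T \mid h_t) \geq 0.99$ and $r_i =$ "cake" $\implies R_{\text{cake}}$.
- $P(g_i = g_T \mid h_t) \geq 0.99$ and $r_i =$ "death" $\implies R_{\text{death}}$.

So this is also a possible ρ. Process 3. is also a possible ρ; let $o_{\text{cake}}$ mean observing "cake" written down on the ground (and conversely for $o_{\text{death}}$), then:

- $o_{\text{cake}} \implies R_{\text{cake}}$.
- $o_{\text{death}} \implies R_{\text{death}}$.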

So, before even talking about whether the AI has learnt from the right variables in the environment, we have to ask: has the AI "learnt" about any actual variables at all?

We need to check *how* the AI learns before thinking about *what* it's learning.

## 3 Formalism and learning

## 3.1 Formalism

To get any further, we need more formalism. So imagine that the AI has interacted with the world in a series of time steps. It will start by taking action $a_1$, and get observation $o_1$, take action $a_2$, get observation $o_2$, and so on. By turn $t$, it will have seen a history $h_t = a_1 o_1 \ldots a_t o_t$.

We'll assume that after $n$ turns, the AI's interaction with the environment ceases^{[3]}; call $\mathcal{H}_n$ the set of "complete" histories of length $n$. Let $\mathcal{R}$ be the set of all possible reward functions (ie the possible preferences we'd want the AI to learn). Each $R \in \mathcal{R}$ is a function^{[4]} from $\mathcal{H}_n$ to $\mathbb{R}$.

So, what is a learning process^{[5]} ρ? Well, this is supposed to give us a reward function, given the history the AI has observed. This need not be deterministic; so, if $\Delta(\mathcal{R})$ is the set of probability distributions over $\mathcal{R}$,

$$\rho : \mathcal{H}_n \to \Delta(\mathcal{R}).$$

We'll see later why ρ is defined for complete histories, not for all histories.
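These objects can be written down as types. The sketch below is one possible encoding, not the paper's: histories are tuples of (action, observation) pairs, reward functions map a history to a real number, and ρ maps a complete history to a distribution over named reward functions. The concrete reward functions, the action strings, and the 50-50 fallback when no answer has been heard are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

Action = str
Observation = str
# h_t = a_1 o_1 ... a_t o_t, stored as a tuple of (action, observation) pairs.
History = Tuple[Tuple[Action, Observation], ...]
# A reward function maps a complete history to a real number.
RewardFunction = Callable[[History], float]
# rho maps a complete history to a distribution over named reward functions.
LearningProcess = Callable[[History], Dict[str, float]]


def r_cake(history: History) -> float:
    """Hypothetical reward: one point per 'give cake' action."""
    return float(sum(1 for action, _ in history if action == "give cake"))


def r_death(history: History) -> float:
    """Hypothetical reward: one point per 'give death' action."""
    return float(sum(1 for action, _ in history if action == "give death"))


REWARD_FUNCTIONS: Dict[str, RewardFunction] = {"R_cake": r_cake, "R_death": r_death}


def rho_first_answer(history: History) -> Dict[str, float]:
    """Take the first 'cake' or 'death' answer observed as the truth."""
    for _, observation in history:
        if observation == "cake":
            return {"R_cake": 1.0, "R_death": 0.0}
        if observation == "death":
            return {"R_cake": 0.0, "R_death": 1.0}
    # Assumed fallback: stay 50-50 if no guard ever answered.
    return {"R_cake": 0.5, "R_death": 0.5}


def expected_reward(rho: LearningProcess, history: History) -> float:
    """Reward of a complete history, averaged over the distribution rho assigns to it."""
    distribution = rho(history)
    return sum(prob * REWARD_FUNCTIONS[name](history) for name, prob in distribution.items())


example_history: History = (("ask guard", "cake"), ("give cake", "thanks"))
print(rho_first_answer(example_history), expected_reward(rho_first_answer, example_history))
```

Note that `rho_first_answer` happily accepts any history, which is the point made later: nothing in the type of ρ says whether it describes learning or manipulation.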

## 3.2 Policies, environments, and causal graphs

Are we done with the formalisms yet? Not quite. We need to know where the actions $a_t$ and the observations $o_t$ come from.

The actions are generated by the AI's policy π. This takes the history $h_t$ so far, and generates the next action $a_{t+1}$, possibly stochastically.

The observations come from the environment μ. This takes $h_t a_{t+1}$, the history and action so far, and generates the next observation $o_{t+1}$, possibly stochastically. We don't assume any Markov condition, so this may be a function of the whole past history.
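The interaction between policy and environment can be sketched as a simple loop. Everything concrete here is an assumption for illustration: the `rollout` helper, the always-look policy, and a coin environment that is deliberately non-Markov, since it consults the whole past history to repeat whatever the coin first showed.

```python
import random


def rollout(policy, environment, n, rng):
    """Alternate policy and environment for n turns and return the complete history."""
    history = ()
    for _ in range(n):
        action = policy(history, rng)                    # a_{t+1} from h_t
        observation = environment(history, action, rng)  # o_{t+1} from h_t a_{t+1}
        history = history + ((action, observation),)
    return history


def always_look(history, rng):
    """Hypothetical policy: look at the coin every turn."""
    return "look"


def coin_environment(history, action, rng):
    """Hypothetical environment: the coin's face is fixed the first time it is seen."""
    if action != "look":
        return "nothing"
    for _, past_observation in history:
        if past_observation in ("heads", "tails"):
            return past_observation  # depends on the whole history, not just the last step
    return rng.choice(("heads", "tails"))


history = rollout(always_look, coin_environment, n=3, rng=random.Random(0))
print(history)
```

Passing the random generator explicitly keeps both the policy and the environment free to be stochastic while leaving the rollout reproducible.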

We can tie this all together in the following causal graph:

The rectangle there is 'plate notation': it basically means that, for every value of j from 1 to n, the graph inside the rectangle is true.

The $r$ is the AI's final reward, which is a function of the final history $h_n$ and the reward function $R$ (which itself is a function of $h_n$ and ρ).

Ok, what flexibility do we have in assigning probability distributions to this graph? Almost all the arrows are natural: $h_j$ is a function of $h_{j-1}$, $a_j$, and $o_j$, by just... concatenating: $h_j = h_{j-1} a_j o_j$. Similarly, ρ is a distribution on reward functions conditional on $h_n$, so the value of $R$ conditional on ρ and $h_n$ is... $\rho(h_n)$.

There are three non-trivial nodes: π, μ, and ρ. The π is presumably set by the programmer, who would also want to design a good ρ. The μ is the environment, which we're not assuming is known by either the programmers or the AI. The AI will, however, have some prior over μ.

## 4 Conditions for a good learning process

So, now that we have the formalism, how can we select a "good" (or "true" or "proper") learning process?

There are two plausible conditions: the syntactic condition of unriggability, and the semantic condition of uninfluenceability.

## 4.1 Unriggability

Unriggability comes from the idea of conservation of expected evidence. Basically, if ρ defines a true learning process, we'd want it to conserve expected evidence.

So, let's look at a version of the "look at the coin" situation. For example, let $R_H$ be a reward function defined by a coin being heads, and $R_T$ by it being tails. On the first turn, the coin has been tossed, but not seen. The AI can take an action $a_1 = \text{look}$ to look, or $a_1 = \neg\text{look}$ to not look.

This has conservation of expected evidence. If the AI chooses to look:

$$P(\rho(h_n) = R_H \mid a_1 = \text{look}) = P(\rho(h_n) = R_T \mid a_1 = \text{look}) = 0.5.$$

This shows that, if the AI chooses to look, it will become certain about which reward function it must follow - but it's 50-50 about which reward function it will become certain about.

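On the other hand, if the AI chooses not to look:

$$P\left(\rho(h_n) = \tfrac{1}{2} R_H + \tfrac{1}{2} R_T \mid a_1 = \neg\text{look}\right) = 1.$$

This shows that, if the AI chooses not to look, it will, with certainty, remain 50-50 uncertain about which reward function it must follow.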

In expectation this is the same thing as if it chooses to look. So, the AI cannot change the expectation of its reward function^{[6]}, but it might be able to change the variance. The future variance given $a_1 = \text{look}$ is 0 (it's absolutely certain about its reward function) while the future variance given $a_1 = \neg\text{look}$ is 1/4 (maximal uncertainty).

On the other hand, imagine there are two other actions, which involve the AI setting the coin to heads or tails, rather than looking at what it fell on. If $a_1 = \text{setH}$ or $a_1 = \text{setT}$, then

$$P(\rho(h_n) = R_H \mid a_1 = \text{setH}) = 1, \qquad P(\rho(h_n) = R_T \mid a_1 = \text{setT}) = 1.$$

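This violates conservation of expected evidence, and, more importantly, it's the kind of behaviour we want to avoid: the AI setting its own reward. So one reasonable condition for ρ would be:

- The learning process ρ is *unriggable* if it respects conservation of expected evidence: the expectation of $\rho(h_n)$ is the same whatever actions the AI takes.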
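A minimal sketch of checking conservation of expected evidence on the coin example, assuming a tabular encoding in which each first action leads to a list of (probability, weight that ρ then puts on the heads reward function) pairs. The table and function names are illustrative; the numbers are the ones described in the text.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# For each first action: (probability of outcome, weight rho then assigns to R_H).
OUTCOMES = {
    "look": [(HALF, Fraction(1)), (HALF, Fraction(0))],
    "not look": [(Fraction(1), HALF)],
    "setH": [(Fraction(1), Fraction(1))],
    "setT": [(Fraction(1), Fraction(0))],
}


def expected_weight_on_heads(action):
    """Expected weight on R_H after taking the action."""
    return sum(prob * weight for prob, weight in OUTCOMES[action])


def residual_uncertainty(action):
    """Expected Bernoulli variance w(1 - w) of the weight on R_H after the action."""
    return sum(prob * weight * (1 - weight) for prob, weight in OUTCOMES[action])


def conserves_expected_evidence(actions):
    """True if every listed action leads to the same expected reward function."""
    return len({expected_weight_on_heads(action) for action in actions}) == 1


print(conserves_expected_evidence(["look", "not look"]))
print(conserves_expected_evidence(["look", "not look", "setH", "setT"]))
print(residual_uncertainty("look"), residual_uncertainty("not look"))
```

With only looking and not looking available, the expectation is fixed and only the remaining uncertainty changes (0 versus 1/4); adding the two setting actions lets the AI move the expectation itself.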

## 4.2 Uninfluenceability

Another thing that we'd want, is that preference learning should be like factual learning; ie it should depend on facts in the outside world.

In our setup, "depend on facts in the outside world" can be taken to mean "depends on the environment μ". This gives the following causal graph:

Here, the reward function $R$ is no longer (directly) a function of $h_n$, but instead is a function of μ. The η gives the conditional probability distribution over $R$, given μ.

The connection between ρ and η is as follows: given a prior $p_\mu$ over μ, the AI can use a history $h_n$ to update this prior to a posterior over environments. Then η allows it to make this into a posterior over reward functions. Thus, given $p_\mu$ and η, the AI has a probability distribution over $R$ conditional on $h_n$; this is the ρ.
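A sketch of that construction for the coin example, assuming just two candidate environments (the coin landed heads or tails), a uniform prior, and an η that ties each environment to one reward function. The environment names, the likelihood model, and the action strings are assumptions chosen for illustration.

```python
from fractions import Fraction

# Assumed prior p_mu over two candidate environments.
PRIOR = {"mu_heads": Fraction(1, 2), "mu_tails": Fraction(1, 2)}

# Assumed eta: distribution over reward functions given the environment.
ETA = {
    "mu_heads": {"R_H": Fraction(1), "R_T": Fraction(0)},
    "mu_tails": {"R_H": Fraction(0), "R_T": Fraction(1)},
}


def likelihood(env, history):
    """P(observations | environment, actions): looking reveals the coin, anything else shows nothing."""
    coin = "heads" if env == "mu_heads" else "tails"
    for action, observation in history:
        expected = coin if action == "look" else "nothing"
        if observation != expected:
            return Fraction(0)
    return Fraction(1)


def rho(history):
    """Bayes-update the prior on the history, then push the posterior through eta."""
    weights = {env: PRIOR[env] * likelihood(env, history) for env in PRIOR}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("history has probability zero under the prior")
    posterior = {env: weight / total for env, weight in weights.items()}
    return {
        reward: sum(posterior[env] * ETA[env][reward] for env in PRIOR)
        for reward in ("R_H", "R_T")
    }


print(rho((("look", "heads"),)))
print(rho((("not look", "nothing"),)))
```

The AI's actions affect what it finds out about the environment, but the reward function is pinned to the environment itself, so no action can move it.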

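Thus define uninfluenceability:

- The learning process ρ is *uninfluenceable* if it derives, in the way just described, from a prior $p_\mu$ over environments and an η that depends only on the environment μ.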

## 5 Results

Then our paper proves the following results:

- A riggable learning process can leave the AI with an incentive to sacrifice, with certainty, reward for every possible reward function.

1. Let's ignore that, in reality, no probability is truly 1 (or 0). ↩︎

2. I don't want to get into the moral realism debate, but it seems that moral realists and I differ mainly in emphasis: I say "without making assumptions, we can't figure out preferences, so we need to find good assumptions", while they say "having made these good (obvious) assumptions, we can figure out preferences". ↩︎

3. There are ways of making this work with $n \to \infty$, but that extra complexity is not needed for this exposition. ↩︎

4. This is the most general reward function format; if, for example, we had a Markovian reward function $S$ that just took the latest actions and observations as inputs, this defines an $R_S \in \mathcal{R}$ such that $R_S(h_t) = \sum_{i=1}^{t} S(a_i o_i)$. ↩︎

5. A terminological note here. We've decided to describe ρ as a learning process, with "unriggable learning process" and "uninfluenceable learning process" being the terms if ρ has additional properties. But the general ρ includes things we might not feel are anything like "learning", like the AI writing down its own reward function.

   So it might make more sense to reserve "learning" for the unriggable processes, and call the general ρ something else. But this is a judgement call, and people generally consider "ask your programmer" or "look at the coin" to be learning processes, which are very much riggable. So I've decided to call the general ρ a learning process. ↩︎

6. This "expectation" can be made fully rigorous, since reward functions form an affine space: you can take weighted averages of reward functions, $(1-x)R_1 + xR_2$. ↩︎