Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

To aid communication, I’m going to append a technical rephrasing after some paragraphs.

It’s known to be hard to give non-trivial goals to reinforcement learning agents. However, I haven’t seen much discussion of the following: even ignoring wireheading, it seems impossible to specify reward functions that get what we want – at least, if the agent is farsighted, smart, and can’t see the entire world all at once, and the reward function only grades what the agent sees in the moment. If this really is impossible in our world, then the designer’s job gets way harder.

Even ignoring wireheading, it could be impossible to supply a reward function such that most optimal policies lead to desirable behavior – at least, if the agent is farsighted and able to compute the optimal policy, the environment is partially observable (which it is, for the real world), and the reward function is Markovian.

I think it’s important to understand why and how the designer’s job gets harder, but first, the problem.

Let’s suppose that we magically have a reward function which, given an image from the agent’s camera, outputs what an idealized person would think of the image. That is, given an image, suppose a moral and intelligent person considers the image at length (magically avoiding issues of slowly becoming a different person over the course of reflection), figures out how good it is, and outputs a scalar rating – the reward.

The problem here is that multiple world states can correspond to the same camera input. Is it good to see a fully black image? I don’t know – what else is going on? Is it bad to see people dying? I don’t know, are they real, or perfectly Photoshopped? I think this point is obvious, but I want to make it so I can move on to the interesting part: there just isn’t enough information to meaningfully grade inputs. Contrast with being able to grade universe-histories via utility functions: just assign 1 to histories that lead to better things than we have right now, and 0 elsewhere.

The problem is that the mapping from world state to images is not at all injective... in contrast, grading universe-histories directly doesn’t have this problem: simply consider an indicator function on histories leading to better worlds than the present (for some magical, philosophically valid definition of “better”).
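As a minimal sketch of the non-injectivity point, here is a made-up three-state world. The state names, the `observe` function, and the reward numbers are all hypothetical; the only thing being illustrated is that a reward defined on images must give the same number to every world state that renders to the same image.

```python
# Hypothetical toy world: two very different states render to the same image.
WORLD_STATE_TO_IMAGE = {
    "lens_cap_on": "black_image",
    "everyone_dead_lights_off": "black_image",
    "sunny_park": "park_image",
}


def observe(state: str) -> str:
    """Map a world state to the camera image the agent sees (not injective)."""
    return WORLD_STATE_TO_IMAGE[state]


def observation_reward(image: str) -> float:
    """Some reward defined on images alone; the numbers are arbitrary."""
    return {"black_image": 0.5, "park_image": 1.0}[image]


def indistinguishable(state_a: str, state_b: str) -> bool:
    """True when no observation-based reward can tell the two states apart."""
    return observe(state_a) == observe(state_b)


# Whatever numbers observation_reward uses, these two rewards must be equal.
reward_harmless = observation_reward(observe("lens_cap_on"))
reward_catastrophe = observation_reward(observe("everyone_dead_lights_off"))
```

Changing the numbers inside `observation_reward` never separates the two black-image states; only a grader that sees more than the current image (for example, the history) could.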

Now, this doesn’t mean we need to have systems grading world states. But what I’m trying to get at is, Markovian reward functions are fundamentally underdefined. To say the reward function will incentivize the right things, we have to consider the possibilities available to the agent: which path through time is the best?

The bad thing here is that the reward function is no longer actually grading what the agent sees, but rather trying to output the right things to shape the agent’s behavior in the right ways. For example, to consider the behavior incentivized by a reward function linear in the number of blue pixels, we have to think about how the world is set up. We have to see, oh, this doesn’t just lead to the agent looking at blue objects; rather, there exist better possibilities, like showing yourself solid blue images forever.

But maybe there don’t exist such possibilities – maybe we have in fact made it so the only way to get reward is by looking at blue objects. The only way to tell is by looking at the dynamics – at how the world changes as the agent acts. In many cases, you simply cannot make statements like “the agent is optimizing for ” without accounting for the dynamics.
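To make the dependence on dynamics concrete, here is a toy sketch, assuming a tiny deterministic environment that I made up for illustration (the state names, action names, and blue-pixel fractions are all hypothetical). The reward function is identical in both runs; only the available transitions change.

```python
# Fraction of blue pixels seen in each state (hypothetical numbers).
BLUE_FRACTION = {"start": 0.0, "at_block": 0.3, "lens_painted": 1.0}


def reward(state: str) -> float:
    """Reward linear in the amount of blue the camera sees."""
    return BLUE_FRACTION[state]


def make_dynamics(has_paint: bool) -> dict:
    """Deterministic transitions: dynamics[state][action] -> next state."""
    dynamics = {
        "start": {"stay": "start", "go_to_block": "at_block"},
        "at_block": {"stay": "at_block"},
        "lens_painted": {"stay": "lens_painted"},
    }
    if has_paint:
        dynamics["start"]["paint_lens"] = "lens_painted"
        dynamics["at_block"]["paint_lens"] = "lens_painted"
    return dynamics


def optimal_policy(dynamics: dict, gamma: float = 0.9, sweeps: int = 200) -> dict:
    """Run value iteration, then act greedily with respect to the values."""
    values = {state: 0.0 for state in dynamics}
    for _ in range(sweeps):
        values = {
            state: max(reward(nxt) + gamma * values[nxt] for nxt in acts.values())
            for state, acts in dynamics.items()
        }
    return {
        state: max(acts, key=lambda a: reward(acts[a]) + gamma * values[acts[a]])
        for state, acts in dynamics.items()
    }


policy_without_paint = optimal_policy(make_dynamics(has_paint=False))
policy_with_paint = optimal_policy(make_dynamics(has_paint=True))
```

Without paint, the optimal action at `start` is `go_to_block`; with paint available, it becomes `paint_lens`. Nothing about `reward` changed between the two runs, which is the sense in which you can't read off what is being optimized from the reward function alone.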

Under this view, alignment isn’t a property of reward functions: it’s a property of a reward function in an environment. This problem is much, much harder: we now have the joint task of designing a reward function such that the best way of stringing together favorable observations lines up with what we want. This task requires thinking about how the world is structured, how the agent interacts with us, the agent’s possibilities at the beginning, how the agent’s learning algorithm affects things…

Yikes.

Qualifications

The argument seems to hold for n-step Markovian reward functions, if n isn’t ridiculously large. If the input observation space is rich, then the problem probably relaxes. The problem isn't present in fully observable environments: by force of theorem (which presently assumes determinism and a finite state space), there exist Markovian reward functions whose only optimal policy is desirable.

This doesn’t apply to e.g. Iterated Distillation and Amplification (updates based on policies), or Deep RL from Human Preferences (observation trajectories are graded). That is, you can get a wider space of optimal behaviors by updating policies on information other than a Markovian reward.

It’s quite possible (and possibly even likely) that we use an approach for which this concern just doesn’t hold. However, this “what you see” concept feels important to understand, and serves as the billionth argument against specifying Markovian observation-based reward functions.

Thanks to Rohin Shah and TheMajor for feedback.

12 comments

Planned summary:

This post makes the point that for Markovian reward functions on observations, since any given observation can correspond to multiple underlying states, we cannot know just by analyzing the reward function whether it actually leads to good behavior: it also depends on the environment. For example, suppose we want an agent to collect all of the blue blocks in a room together. We might simply reward it for having blue in its observations: this might work great if the agent only has the ability to pick up and move blocks, but won't work well if the agent has a paintbrush and blue paint. This makes the reward designer's job much more difficult. However, the designer could use techniques that don't require a reward on individual observations, such as rewards that can depend on the agent's internal cognition (as in iterated amplification), or rewards that can depend on histories (as in Deep RL from Human Preferences).

Planned opinion:

I certainly agree that we want to avoid reward functions defined on observations, and this is one reason why. It seems like a more general version of the wireheading argument to me, and applies even if you think that the AI won't be able to wirehead, as long as it is capable enough to find other plans for getting high reward besides the one the designer intended.

Under this view, alignment isn’t a property of reward functions: it’s a property of a reward function in an environment. This problem is much, much harder: we now have the joint task of designing a reward function such that the best way of stringing together favorable observations lines up with what we want. This task requires thinking about how the world is structured, how the agent interacts with us, the agent’s possibilities at the beginning, how the agent’s learning algorithm affects things…

I think there are ways of doing this that don't involve explicitly working through what observation sequences lead to good outcomes. AFAICT this was originally outlined in Model Based Rewards quite a while ago. Essentially, the idea is to make the reward (or even better, utility) a function of the agent's internal model of the world. Then when the agent goes to make a decision, the utilities of the worlds where the agent does and does not take an action are compared. Doing things this way has a couple of nice properties, including eliminating the incentive to wirehead, and making it possible to specify utilities over possible worlds rather than just what the AI sees.

The relevant point, however, is that it takes the problem of trying to pin down what chains of events lead to good outcomes, and splits it into a problem of identifying good and bad world states in the agent's model and a problem of building an accurate model of the world. This is because an agent with an accurate model of the world will be able to figure out what sequence of actions and observations leads to any given world state.
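A rough sketch of the contrast being drawn here, under strong assumptions: the agent's world model is taken to be accurate, and the utility is taken to pick out the right feature of the model. Every name below (`WorldModelState`, `predict`, `model_utility`, `observed_blue`) is hypothetical and invented for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorldModelState:
    """The agent's internal belief about the world (hypothetical fields)."""
    blue_blocks_collected: int
    lens_painted: bool


def predict(state: WorldModelState, action: str) -> WorldModelState:
    """The agent's (assumed accurate) model of what each action does."""
    if action == "collect_block":
        return replace(state, blue_blocks_collected=state.blue_blocks_collected + 1)
    if action == "paint_lens":
        return replace(state, lens_painted=True)
    return state  # "wait" changes nothing


def model_utility(state: WorldModelState) -> float:
    """Utility over the modelled world: only collected blocks count."""
    return float(state.blue_blocks_collected)


def observed_blue(state: WorldModelState) -> float:
    """Observation-based score: how blue the camera image would look."""
    return 1.0 if state.lens_painted else 0.1 * state.blue_blocks_collected


def choose(state: WorldModelState, actions: list, score) -> str:
    """Pick the action whose predicted outcome scores highest."""
    return max(actions, key=lambda action: score(predict(state, action)))


ACTIONS = ["wait", "collect_block", "paint_lens"]
initial = WorldModelState(blue_blocks_collected=0, lens_painted=False)
choice_by_utility = choose(initial, ACTIONS, model_utility)
choice_by_observation = choose(initial, ACTIONS, observed_blue)
```

Scoring predicted world states picks `collect_block`, while scoring the predicted image picks `paint_lens`. The sketch assumes away the hard parts (learning the model and identifying which part of it means "blue block").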

I feel somewhat pessimistic about doing this robustly enough to scale to AGI. From an earlier comment of mine:

It isn't obvious to me that specifying the ontology is significantly easier than specifying the right objective. I have an intuition that ontological approaches are doomed. As a simple case, I'm not aware of any fundamental progress on building something that actually maximizes the number of diamonds in the physical universe, nor do I think that such a thing has a natural, simple description.

I'm personally far more optimistic about ontology identification. Work in representation learning, blog posts such as OpenAI's sentiment neuron, and style transfer all indicate that it's at least possible to point at human-level concepts in a subset of world models. Figuring out how to refine these learned representations to further correspond with our intuitions, and figuring out how to rebind those concepts to representations in more advanced ontologies, are both neglected areas, but neither problem seems fundamentally intractable.

I wasn't aware of that work, thanks for linking! It's true that we don't have to specify the representation; instead, we can learn it. Do you think we could build a diamond maximizer using those ideas, though? The concern here is that the representation has to cleanly demarcate what we think of as diamonds, if we want the optimal policy to entail actually maximizing diamonds in the real world. This problem tastes like it has a bit of that 'fundamentally intractable' flavor.

Do you think we could build a diamond maximizer using those ideas, though?

They're definitely not sufficient, almost certainly. A full fledged diamond maximizer would need far more machinery, if only to do the maximization and properly learn the representation.

The concern here is that the representation has to cleanly demarcate what we think of as diamonds.

I think this touches on a related concern, namely goodharting. If we even slightly misspecify the utility function at the boundary and the AI optimizes in an unrestrained fashion, we'll end up with weird situations that are totally decorrelated from what we were initially trying to get the AI to optimize.

If we don't solve this problem, I agree, the problem is extremely difficult at best and completely intractable at worst. However, if we can rein in goodharting, then I don't think things are intractable.

To make the point, I think the problem of an AI goodharting a representation is very analogous to the problems being tackled in the field of adversarial perturbations for image classification. In this case, the "representation space" is the image itself. The boundaries are classification boundaries set by the classifying neural network. The optimizing AI that goodharts everything is usually just some form of gradient descent.

The field started when people noticed that even tiny imperceptible perturbations to images in one class would fool a classifier into thinking it was an image from another class. The interesting thing is that when you take this further, you get deep dreaming and inceptionism. The Lovecraftian dog-slugs that arise from the process are a result of the local optimization properties of SGD combined with the flaws of the classifier. This, I think, is analogous to goodharting in the case of a diamond maximiser with a learnt ontology: the AI will do something weird and become convinced that the world is full of diamonds. Meanwhile, if you ask a human about the world it created, "Lovecraftian" will probably precede "diamond" in the description.

However, the field of adversarial examples seems to indicate that it's possible to at least partially overcome this form of goodharting and, by analogy, the goodharting that we would see with a diamond maximiser. IMO, the most promising and general solution is to be more Bayesian, and keep track of the uncertainty associated with the class label. By keeping track of uncertainty in class labels, it's possible to avoid class boundaries altogether, and optimize towards regions of the space that are more likely to be part of the desired class label.
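One loose way to picture this, as a sketch only: use disagreement within an ensemble as a stand-in for label uncertainty, and penalize it while optimizing. The one-dimensional "diamond-ness" scores, the slopes, and the penalty weight below are all invented for illustration, and ensemble disagreement is my substitute for a proper Bayesian posterior, not a method from any particular paper.

```python
from statistics import mean, pstdev

# Hypothetical ensemble: every member agrees on the score for x <= 3 (the
# region covered by training data) and extrapolates differently beyond it.
SLOPES_BEYOND_DATA = [1.0, -0.5, 0.2, -1.0]
DATA_LIMIT = 3.0


def member_score(slope: float, x: float) -> float:
    """Score assigned by one ensemble member to candidate x."""
    if x <= DATA_LIMIT:
        return x
    return DATA_LIMIT + slope * (x - DATA_LIMIT)


def naive_objective(x: float) -> float:
    """Trust a single member everywhere, as an unrestrained optimizer would."""
    return member_score(SLOPES_BEYOND_DATA[0], x)


def cautious_objective(x: float, penalty: float = 1.0) -> float:
    """Mean ensemble score minus a penalty for disagreement (uncertainty)."""
    scores = [member_score(slope, x) for slope in SLOPES_BEYOND_DATA]
    return mean(scores) - penalty * pstdev(scores)


CANDIDATES = [i * 0.5 for i in range(21)]  # 0.0, 0.5, ..., 10.0
naive_choice = max(CANDIDATES, key=naive_objective)
cautious_choice = max(CANDIDATES, key=cautious_objective)
```

The naive optimizer runs off to the far edge of the candidate range, where one member's extrapolation happens to be highest. The cautious one stops at the edge of the region where the members still agree.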

I can't seem to dig it up right now, but I once saw a paper where they developed a robust classifier. When they used SGD to change a picture from being classified as a cat to being classified as a dog, the result was that the underlying image went from looking like a cat to looking like a dog. By analogy, a diamond maximizer with a robust classification of diamonds in its representation should actually produce diamonds.

Overall, adversarial examples seem to be a microcosm for evaluating this specific kind of goodharting. My optimism that we can do robust ontology identification is tied to the success of that field, but at the moment the problem doesn't seem to be intractable.

They're definitely not sufficient, almost certainly. A full fledged diamond maximizer would need far more machinery, if only to do the maximization and properly learn the representation.

Clarification: I meant (but inadequately expressed) "do you think any reasonable extension of these kinds of ideas could get what we want?" Obviously, it would be a quite unfair demand for rigor to ask whether we can do the thing right now.

Thanks for the great reply. I think the remaining disagreement might boil down to the expected difficulty of avoiding Goodhart here. I do agree that using representations is a way around this issue, and it isn't the representation learning approach's job to simultaneously deal with Goodharting.

do you think any reasonable extension of these kinds of ideas could get what we want?

Conditional on avoiding Goodhart, I think you could probably get something that looks a lot like a diamond maximiser. It might not be perfect: the situation with the "most diamond" might not be the maximum of its utility function, but I would expect that the maximum of its utility function will still contain a very large amount of diamond. For instance, depending on the representation, and the way the programmers baked in the utility function, it might have a quirk in its utility function of only recognizing something as a diamond if it's stereotypically "diamond shaped". This would bar it from just building pure carbon planets to achieve its goal.

IMO, you'd need something else outside of the ideas presented to get a "perfect" diamond maximizer.

I'm a bit confused, but here's my current understanding and questions:

1. You're mostly talking about partially observable Markov decision processes (POMDPs)

2. The link above has rewards given by the environment that go from (State, Action) to a real number, while a Markovian observation-based reward function is given by the Agent itself (?) and goes from Observation to a real number?

  • What's an n-step version of one?

I have a few other questions, but they depend on whether the reward is given by the agent based on observations or by the environment based on its actual state.

I was thinking of it being over observations, but having it be over States x Actions leads to a potentially different outcome. An n-step version means your reward function is a mapping from the last n observations to a real number, O^n → ℝ (you're grading the last n observations jointly). E.g., in Atari DRL you might see the last four frames being fed to the agent as an approximation (since the games might well be 4-step Markovian; that is, the four previous time steps fully determine what happens next).
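A minimal sketch of what an n-step observation-based reward could look like in code. The `NStepReward` class and the brightness-change grader are hypothetical, and returning 0.0 until the window fills is an arbitrary choice made for the example.

```python
from collections import deque
from typing import Callable, Sequence


class NStepReward:
    """Grade the last n observations jointly instead of only the latest one."""

    def __init__(self, n: int, grade: Callable[[Sequence[float]], float]):
        self.n = n
        self.grade = grade
        self.window = deque(maxlen=n)  # oldest observation drops out automatically

    def __call__(self, observation: float) -> float:
        self.window.append(observation)
        if len(self.window) < self.n:
            return 0.0  # assumption: no reward until n observations are seen
        return self.grade(tuple(self.window))


def brightness_increase(observations: Sequence[float]) -> float:
    """Hypothetical grader: change in brightness across the window."""
    return observations[-1] - observations[0]


two_step_reward = NStepReward(n=2, grade=brightness_increase)
rewards = [two_step_reward(obs) for obs in [0.0, 0.5, 0.5, 1.0]]
# rewards == [0.0, 0.5, 0.0, 0.5]
```

A 1-step reward can't express "brightness went up", since that depends on two frames. An n-step reward can, but it still only sees a short window of observations, which is why the argument in the post seems to carry over when n is small.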

Thanks!

  1. So observation based rewards lead to bad behavior when the rewarded observation maps to different states (with at least one of those states being undesired)?

  2. And a fully observable environment doesn’t have that problem because you always know which state you’re in? If so, wouldn’t you still be rewarded by observations and incentivized to show yourself blue images forever?

  3. Also, an agent in a fully observable environment will still choose to wirehead if that’s a possibility, correct?

Let me try and reframe. The point of this post isn't that we're rewarding bad things, it's that there might not exist a reward function whose optimal policy does good things! This has to do with the structure of agent-environment interaction, and how precisely we can incentivize certain kinds of optimal action. If the reward functions are linear functionals over camera RGB values, then excepting the trivial zero function, plugging any one of these reward functions into AIXI leads to doom! We just can't specify a reward function from this class which doesn't (this is different from there maybe existing a "human utility function" which is simply hard to specify).
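One small piece of this can be checked directly. Assuming pixel values are bounded in [0, 1] (my assumption for the sketch), the image that maximizes a nonzero linear reward is fixed by the signs of the weights alone: saturate every positively weighted pixel and zero out the rest. The weights below are arbitrary, and the sketch says nothing about AIXI itself or about whether an agent can actually reach that image.

```python
import random


def linear_reward(weights: list, image: list) -> float:
    """Reward that is a linear functional of the pixel values."""
    return sum(w * x for w, x in zip(weights, image))


def best_image(weights: list) -> list:
    """Maximizer over [0, 1]^d: saturate pixels with positive weight."""
    return [1.0 if w > 0 else 0.0 for w in weights]


rng = random.Random(0)
weights = [rng.uniform(-1.0, 1.0) for _ in range(12)]  # arbitrary nonzero reward
target = best_image(weights)

# No randomly sampled image in [0, 1]^12 beats the saturated target image.
random_images = [[rng.random() for _ in weights] for _ in range(1000)]
target_reward = linear_reward(weights, target)
best_random_reward = max(linear_reward(weights, img) for img in random_images)
```

The target image depends only on the weights, not on anything happening in the world, so an optimizer that is able to force the camera feed to that image has no reason to care about anything else.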