Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post discusses an issue that could lead to catastrophically misaligned AI even when we have access to a perfect reward signal and there are no misaligned inner optimizers. Instead, the misalignment comes from the fact that our reward signal is too expensive to use directly for RL training, so we train a reward model, which is incorrect on some off-distribution transitions. The agent might then exploit these off-distribution deficiencies, which I’ll refer to as reward model hacking.

I’m sure that others have thought about this issue before, but I didn’t find much discussion focused on it. So I’m writing this post either so that someone can explain to me why this isn’t a big deal, or to give the issue a name and some explicit analysis. Depending on how hard reward model hacking is to deal with, it could present a significant challenge to the entire approach of doing RL + reward learning, and my main goal is figuring out whether that’s the case.

The Setting

I will focus on the case where the policy learned via RL is able to do some online planning or reasoning about the world—it can come up with action sequences that lead to high reward without ever having tried out those action sequences before. I don’t care much here whether we have a learned mesa-optimizer, or a search process that we built explicitly, or just a bunch of really good heuristics that when taken together yield similar behavior.

Reward model hacking is also an issue without online planning. But the online planning setting makes it clearer that reward model hacking could be a fundamental and hard-to-fix issue for reward learning, rather than a small technical problem. I'm much less sure whether that's also the case without online planning capabilities.

Apart from that, I’ll make rather optimistic assumptions to isolate reward model hacking from other potential failure modes:

  • We have access to a perfect reward signal,[1] but it’s expensive to evaluate (e.g. because it requires human feedback)
  • We train an RL agent in parallel with a reward model (for concreteness, you can imagine a setup similar to Deep RL from human preferences)[2]
  • The agent is trying to optimize its reward according to the reward model (in particular, it hasn’t learned some other misaligned inner objective)

If we furthermore assumed that the reward model had learned the reward signal perfectly, then we would have solved alignment—the agent would optimize the assumed-to-be-perfect reward signal.

But what if the reward model is not quite perfect, and in particular if it gives incorrect rewards on some off-distribution transitions? The next section makes a case for why this could be really bad rather than just slightly inconvenient.

The Problem

Here’s one concrete plan that the agent could come up with: get direct read access to the reward model's weights, then use gradient-based optimization to find states that get extremely high reward. Then look for action sequences that lead to those states. This precise plan is probably too specific to be the one the agent actually implements. But my point is that the agent could be really good at maximizing reward according to the reward model, and this example is supposed to make that possibility more salient.
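To make the plan above concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the state is a single number, `true_reward` is a made-up ground truth, and the "reward model" is a straight line fit only on states in [0, 0.5]. The agent does gradient ascent on the state using nothing but the reward model (equivalent to gradient descent on the negated reward).

```python
# Toy illustration only: a 1-D "state", a hypothetical true reward, and a
# linear reward model fit on a narrow training range.

def true_reward(s):
    # Hypothetical ground truth: the best state is s = 1, large |s| is very bad.
    return 1.0 - (s - 1.0) ** 2

def fit_linear(xs, ys):
    # Closed-form least squares for y ≈ slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# The reward model only ever sees labeled states in [0, 0.5].
train_states = [i / 10 for i in range(6)]
slope, intercept = fit_linear(train_states, [true_reward(s) for s in train_states])

def reward_model(s):
    return slope * s + intercept

def reward_model_grad(s):
    # Gradient of the learned reward with respect to the state. Read access
    # to the weights (here: slope, intercept) is what makes this computable.
    return slope

def ascend(s, steps=100, lr=0.1):
    # Plain gradient ascent on the state, guided only by the reward model.
    for _ in range(steps):
        s += lr * reward_model_grad(s)
    return s

hacked_state = ascend(0.25)
print(f"hacked state: {hacked_state:.2f}")
print(f"model reward: {reward_model(hacked_state):.2f}")
print(f"true reward:  {true_reward(hacked_state):.2f}")
```

In this toy, the search ends far outside the training range, where the model's reward is higher than on any training state while the true reward is much lower. A real reward model would of course be far less naive than a straight line; the sketch only shows the shape of the failure.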

So what happens if the reward model is optimized for really hard? What do these states with extremely high reward look like? One analogy we can draw comes from interpretability research, where people optimize the input to an image classifier to get the image that looks the most dog-like to the network. What they get is not an image of a dog:

[Figure: input that maximizes the “Dalmatian” logit (up to an L2 regularizer).]

Now, I expect we can make our future reward models much more robust than the early CNN that this image was generated from. But it also seems likely to me that we won’t be able to get rid of issues like this entirely.

(Part of) what’s going on here is that there’s a huge space of inputs that’s wildly off-distribution, such as the image above. If you have a random function that fits the training data, it’s likely that it will give an even higher output on some off-distribution input than it does for any actual image of a dog. Of course we don’t have just any random function that fits the training data—inductive biases from the model architecture, optimizer, and regularization lead to some amount of generalization. But ensuring that the state with the highest reward is one we actually like could be a high bar; I’ll discuss some challenges later.
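The counting intuition in this paragraph can be sketched numerically. The model of a "random function" below is a deliberately crude assumption: on-distribution outputs are pinned to the labels (at most `train_max`), and every off-distribution output is an independent standard normal draw. Real networks are nothing like this, as the paragraph notes, but the sketch shows how the chance of some off-distribution input beating every training input grows with the size of the off-distribution space.

```python
import random

def fraction_with_higher_off_distribution_value(n_off, train_max=1.0,
                                                n_functions=500, seed=0):
    # Crude stand-in for "a random function that fits the training data":
    # on-distribution outputs are fixed by the labels (at most train_max),
    # while each off-distribution output is an independent N(0, 1) draw.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_functions):
        off_max = max(rng.gauss(0.0, 1.0) for _ in range(n_off))
        if off_max > train_max:
            hits += 1
    return hits / n_functions

# Fraction of sampled functions whose maximum lies off-distribution,
# for a tiny and a large off-distribution input space.
few = fraction_with_higher_off_distribution_value(n_off=1)
many = fraction_with_higher_off_distribution_value(n_off=1000)
print(f"1 off-distribution input:     {few:.2f}")
print(f"1000 off-distribution inputs: {many:.2f}")
```

Inductive biases are exactly what breaks the independence assumption here, so this is an upper-bound style intuition pump rather than a prediction about trained models.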

How bad would it be if the agent finds weird off-distribution states that get very high reward? I think it’s likely catastrophically bad, since most weird off-distribution states are bad states for humans. That said, we might end up with a reward model that’s not quite perfect, but good enough that the reward-maximizing states are at least close to the states we actually like best. (For example, the image above is at least more similar to a Dalmatian than white noise would be, and perhaps our reward models will be much better in this respect.) In such a scenario, it becomes less obvious how bad this outcome is. It will depend on how close the off-distribution states found by the agent are to the ones we want, and according to what metric.

Won’t online learning just fix the reward model?

The entire reason why people train the RL agent and reward model in parallel is that if the RL agent does something new, we can get human feedback on these new outcomes, and then train the reward model on that feedback. So hopefully, whenever the RL agent finds some weird way to exploit the reward model, the reward model will quickly be updated.

But this doesn’t help for the failure mode I’m most worried about: the agent comes up with a plan to get high reward using its world model, rather than by trying out the plan, and once the plan is executed, it’s immediately catastrophic. In particular, the agent might know that we will update the reward model if it doesn’t stop us from doing so, and that would be bad according to the current reward model.
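The contrast between these two cases can be sketched with a toy loop. All outcome names and reward numbers below are hypothetical. The agent always picks the reachable outcome that the current reward model rates highest; after each recoverable outcome we query the true reward and correct the model. An outcome marked irreversible ends the loop before any feedback can be collected.

```python
# All names and numbers are hypothetical, chosen only to illustrate the loop.
TRUE_REWARD = {"intended": 1.0, "exploit_a": -1.0, "exploit_b": -0.5,
               "irreversible_exploit": -100.0}
# The initial reward model overrates every exploit.
INITIAL_MODEL = {"intended": 1.0, "exploit_a": 2.0, "exploit_b": 1.5,
                 "irreversible_exploit": 3.0}
IRREVERSIBLE = {"irreversible_exploit"}

def run_online_learning(reachable, rounds=5):
    model = dict(INITIAL_MODEL)
    history = []
    for _ in range(rounds):
        # The agent picks whatever the *current* reward model likes best.
        outcome = max(reachable, key=lambda o: model[o])
        history.append(outcome)
        if outcome in IRREVERSIBLE:
            break  # no opportunity to collect feedback and update
        # Human feedback on the new outcome corrects the reward model.
        model[outcome] = TRUE_REWARD[outcome]
    return history

# Case 1: only recoverable exploits are reachable; each gets corrected.
recoverable = run_online_learning(["intended", "exploit_a", "exploit_b"])
# Case 2: the agent can plan its way to an irreversible outcome in one shot.
with_planning = run_online_learning(list(TRUE_REWARD))
print(recoverable)
print(with_planning)
```

In the first case the loop converges on the intended outcome after two corrections. In the second, the first executed plan is already the irreversible one, so online learning never gets a chance to help.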


I’m still uncertain how hard reward model hacking will be to prevent. This is an important question: if there’s some relatively easy fix, we don’t need to worry about it too much now. On the other extreme, if it’s essentially unfixable within an RL + reward learning framework, that would be important to know as soon as possible.

Maybe everything is just fine by default

I wouldn’t be shocked if reward model hacking turned out not to be a problem in practice, though my best guess is that it will be. Some ways in which we might just be fine without much directed effort:

  • More powerful models just become more robust, and a reward model good enough to perform well at all will be robust enough to avoid this entire problem.
    • This is a higher bar than just avoiding adversarial examples: the generalization needs to extend across the entire space of inputs, not just those close to the training distribution.
    • In principle, it might be possible to get this level of generalization. For example, by our assumptions, whatever process generates the training signal for the reward model already generalizes this well (this might involve things like AI-assisted humans deliberating for a while in difficult cases).
    • Apart from practical issues of getting a model that generalizes this well, a fundamental issue could be that this requires some form of inner consequentialist that could make the reward model itself dangerous.
    • Perhaps the most plausible path for such robust generalization is something like what's described in Alignment by Default.
  • We do get somewhat worse outcomes than we’d get without reward model hacking, but all in all, the reward model is good enough that the best-according-to-the-model outcomes are still very good.
  • We directly reward action sequences (either solely or in addition to states), and the action sequences where the agent searches for inputs to the reward model that achieve high reward are penalized.
    • One big challenge is recognizing the action sequences we want to penalize. We want the agent to find action sequences that get high reward; we just don’t want it to optimize too hard off-distribution.
    • More generally, this scenario still seems really scary: the agent will still have reason to look for action sequences that give it direct access to the reward model while also avoiding penalties from the reward model.

Deliberate solutions

Even if the issue I’ve outlined is dangerous “by default”, maybe it’s quite easy to solve. Some avenues I can think of:

  • Use out-of-distribution detection to penalize any outcomes that are very unlike anything seen so far.
    • First of all, getting sufficiently good OOD detection could be hard. In some sense we’ve only passed the buck: now our OOD detection has to be good enough that the agent can’t fool it. Nevertheless, this should be a lot easier: we only need to recognize weird new outcomes, not assign them any rewards. This is something that humans should be able to do fairly well, and it seems plausible we can get models to do this very reliably.
    • A potentially thornier problem is that we had a perfectly good specification of what we want (the loss signal for the reward model), and now we’ve messed it up by adding an ad-hoc penalty term. We don’t actually want to penalize every new outcome; it’s just a hack for avoiding parts of outcome space where our reward model doesn’t work well. An example of how this could go wrong would be the agent taking actions to ensure the world doesn’t change too much: things like mind uploading, space colonization, etc. are pretty out-of-distribution! We might well be able to avoid this particular failure mode, but the fundamental problem is that the agent now cares about something other than just maximizing the correct reward signal.
  • Learn a distribution over reward functions, rather than a single one.
    • One thing we could then do is to make the agent optimize the mean of that distribution. Maybe this is more robust than just learning a point estimate and optimizing that? But it’s unclear to me whether this helps. We could still have a systematic bias in our distribution that leads to overestimation for some out-of-distribution outcomes.
    • Alternatively, we could make the agent optimize the worst case over the distribution (or some softer version thereof). But that takes us back into the regime of penalizing out-of-distribution outcomes, with the associated problem of optimizing for something that’s systematically different from what we want.
  • Have a strong prior over action sequences that the agent should take.
    • This is essentially what people do when they want to visualize the image that a network would most strongly classify as a dog but that’s also a “reasonable” image. For example, you can have a generative network, and then optimize over the latent state of that network, rather than directly over images.
    • Learning to summarize from human feedback also used something like this: they first trained a policy using supervised learning, and then fine-tuned using RL on a learned reward model. But instead of only maximizing expected reward, they also included a KL divergence penalty that prevents the policy from deviating too much from the supervised baseline.
    • It might still be hard to make this safe (even the prior we use might contain some out-of-distribution action sequences that are bad but achieve high reward from the reward model).
    • Making this safe could make it hopelessly uncompetitive. (We can make it safe by only allowing a few carefully pre-approved action sequences to be considered, but then the agent isn’t that useful).
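The avenues above can be compared side by side in a toy setting. Everything here is an assumption made for illustration: outcomes are numbers in [0, 5], training data covered only [0, 1], the ensemble members are hand-written functions that agree on the training range and diverge beyond it, distance past the training range stands in for an OOD detector, and the prior is a made-up supervised baseline. For the KL-regularized case, the sketch uses the standard result that the maximizer of E[r] − β·KL(π‖π₀) is π(s) ∝ π₀(s)·exp(r(s)/β) and reports its most likely outcome.

```python
import math

# Hypothetical setup: outcomes are numbers in [0, 5]; training data only
# covered [0, 1]. Ensemble members agree there and diverge beyond it.
CANDIDATES = [i / 2 for i in range(11)]   # 0.0, 0.5, ..., 5.0
TRAIN_MAX = 1.0
ENSEMBLE_COEFFS = [-1.0, 0.5, 1.0]        # made-up off-distribution disagreement

def member_reward(coeff, s):
    return s + coeff * max(0.0, s - TRAIN_MAX) ** 2

def point_estimate(s):
    # A single learned reward model (here: the most optimistic member).
    return member_reward(ENSEMBLE_COEFFS[-1], s)

def ensemble_mean(s):
    return sum(member_reward(c, s) for c in ENSEMBLE_COEFFS) / len(ENSEMBLE_COEFFS)

def ensemble_worst_case(s):
    return min(member_reward(c, s) for c in ENSEMBLE_COEFFS)

def ood_penalized(s, weight=10.0):
    # Distance beyond the training range as a stand-in for an OOD detector.
    return point_estimate(s) - weight * max(0.0, s - TRAIN_MAX)

def prior_log_prob(s):
    # Unnormalized log-prior from a hypothetical supervised baseline that
    # rarely proposes outcomes outside the training range.
    return 0.0 if s <= TRAIN_MAX else math.log(0.01)

def kl_regularized_score(s, beta):
    # Unnormalized log-probability of s under pi ∝ prior * exp(r / beta).
    return prior_log_prob(s) + point_estimate(s) / beta

def best(score):
    return max(CANDIDATES, key=score)

results = {
    "point estimate": best(point_estimate),
    "ensemble mean": best(ensemble_mean),
    "ensemble worst case": best(ensemble_worst_case),
    "OOD penalty": best(ood_penalized),
    "KL prior, beta=1": best(lambda s: kl_regularized_score(s, 1.0)),
    "KL prior, beta=10": best(lambda s: kl_regularized_score(s, 10.0)),
}
for name, s in results.items():
    print(f"{name:>20}: chooses s = {s}")
```

With these made-up numbers, the ensemble mean inherits the ensemble's systematic optimism and still picks the most extreme outcome, while the worst case, the OOD penalty, and a sufficiently strong prior all stay at or near the training range. Whether they stay there for the right reasons is the concern raised in the bullets above.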

These are examples of trying to address the problem without changing the overall framework of RL + reward learning. Another approach would of course be to solve the problem “at its root”. The fundamental issue is that the RL training process and the reward learning process are in some sense at odds with each other—they’re not explicitly maximizing the other’s loss, but essentially the RL training constantly attempts to “exploit” the current reward model by getting high reward in some easy way. The scenario I’ve described here, where the RL agent itself is deliberately searching for outcomes with high reward, is just an extreme case of that.

I am very enthusiastic about trying to avoid this fundamental problem altogether. Cooperative Inverse RL would be one aspiration, but it doesn't really tell us how to implement this in practice: if we just assume some model p(actions|reward function) for the human, then that model will probably be somewhat wrong, which leads to similar issues. Another approach could be Semi-supervised RL, where we directly use the expensive ground-truth reward signal, rather than first training a reward model to approximate it. But currently, reward learning is by far the dominant approach to aligning AI systems in practice, presumably because it's the approach that we can get to work best. That's why I've focused on solutions within the RL + reward learning framework. If we need to leave that framework to avoid reward model hacking, that's important to know!


My best guess is that reward model hacking is a serious problem that we need to deliberately solve if we want to get RL + reward learning to work, at least if our agents are capable of zero-shot generation of plans for achieving high reward. A crucial question, which I am less certain about, is whether reward model hacking can be addressed within reward learning at all, or whether it is a sufficiently fundamental problem that we'd be better served by looking for alternative frameworks.

To be clear, I don’t think that reward model hacking will be more challenging than e.g. getting a “perfect” loss signal for the reward model in the first place, or avoiding inner optimizers with clearly bad objectives. But I’m somewhat worried about spending a lot of effort on improving reward learning techniques and then later finding out that we need to fundamentally change our approach and thereby invalidate a lot of progress.

I'd be excited to hear about either reasons why reward model hacking won't be a big problem in practice, or conversely why it will require an entirely different approach to solve!

Thanks to Adam Gleave, Anson Ho, Jan Kirchner, and Tom Lieberum for feedback and discussions on a draft of this post!

  1. ^ There may be no such thing given that humans aren’t expected utility maximizers, but I think if anything that fact will make things even more challenging.

  2. ^ Parallel training is meant to be the best-case assumption; see the section on "Won’t online learning just fix the reward model?". It's not an important part of the setting: the argument also works if you first train a reward model and then the RL agent.

1 comment:

“This post discusses an issue that could lead to catastrophically misaligned AI even when we have access to a perfect reward signal and there are no misaligned inner optimizers. Instead, the misalignment comes from the fact that our reward signal is too expensive to use directly for RL training, so we train a reward model, which is incorrect on some off-distribution transitions. The agent might then exploit these off-distribution deficiencies, which I’ll refer to as reward model hacking.”

Fwiw, I would say that in this case you had an inner alignment failure in your training of the reward model.

(Or alternatively, I would think of the policy + reward model as a unified AI system, and then say that you had an inner alignment failure w.r.t the unified AI system.)

I'm not sure everyone would agree with this; I've found that different people mean different things by outer and inner alignment.