Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In our latest paper and accompanying blog post, we provide several new examples of goal misgeneralization in a variety of learning systems. The rest of this post picks out a few upshots that we think would be of interest to this community. It assumes that you’ve already read the linked blog post (but not necessarily the paper).

Goal misgeneralization is not limited to RL

The core feature of goal misgeneralization is that after learning, the system pursues a goal that was correlated with the intended goal in the training situations, but comes apart from it in some test situations. This does not require you to use RL – it can happen with any learning system. The Evaluating Expressions example, where Gopher asks redundant questions, shows goal misgeneralization in the few-shot learning regime for large language models.
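
To make the definition concrete, here is a minimal toy sketch (not one of the paper's examples). It assumes a hypothetical one-dimensional corridor where the intended goal is to end on the coin's cell. In every training episode the coin happens to sit at the right end, so the proxy goal "go right" earns exactly the same reward. The corridor and all names (`go_to_coin`, `go_right`, `average_reward`) are illustrative assumptions.

```python
# Toy illustration only: a hypothetical corridor, not an environment from the paper.
LENGTH = 5  # number of cells in the corridor (assumed)


def intended_reward(final_pos, coin_pos):
    """Reward 1 if the agent ends on the coin's cell, else 0."""
    return 1.0 if final_pos == coin_pos else 0.0


def go_to_coin(start, coin_pos):
    """Intended goal: walk to wherever the coin is."""
    return coin_pos


def go_right(start, coin_pos):
    """Proxy goal: always walk to the right end, ignoring the coin."""
    return LENGTH - 1


def average_reward(policy, episodes):
    """Mean intended reward of a policy over (start, coin_pos) episodes."""
    rewards = [intended_reward(policy(s, c), c) for s, c in episodes]
    return sum(rewards) / len(rewards)


# Training: the coin is always at the right end, so the two goals are correlated.
train_episodes = [(start, LENGTH - 1) for start in range(LENGTH)]
# Test: the coin is at the left end, so the two goals come apart.
test_episodes = [(start, 0) for start in range(LENGTH)]

for name, policy in [("go_to_coin", go_to_coin), ("go_right", go_right)]:
    print(name,
          average_reward(policy, train_episodes),
          average_reward(policy, test_episodes))
```

Training reward cannot tell the two policies apart; only the test episodes reveal that `go_right` competently pursues the wrong goal.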

The train/test distinction is not crucial

Sometimes people wonder whether goal misgeneralization depends on the train/test distinction, and whether it would no longer be a problem if we were in a continual learning setting. As Evan notes, continual learning doesn’t make much of a difference: whenever your AI system is acting, you can view that as a “test” situation with all the previous experience as the “training” situations. If goal misgeneralization occurs, the AI system might take an action that breaks your continual learning scheme (for example, by creating and running a copy of itself on a different server that isn’t subject to gradient descent).

The Tree Gridworld example showcases this mechanism: an agent trained with continual learning learns to chop trees as fast as possible, driving them extinct, when the optimal policy would be to chop the trees sustainably. (In our example the trees eventually repopulate and the agent recovers, but if we slightly tweak the environment so that once extinct the trees can never come back, then the agent would never be able to recover.)
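
The contrast between greedy and sustainable chopping can be sketched with a toy resource model. This is not the paper's Tree Gridworld, and there is no learning agent in it: it assumes logistic regrowth (so a stock of zero never recovers, matching the tweaked variant described above) and compares two hand-written policies. The constants and function names are hypothetical.

```python
# Toy resource model; the logistic regrowth rule and all constants are assumptions.
CAPACITY = 10.0    # maximum tree stock
GROWTH_RATE = 0.3  # regrowth coefficient
MAX_CHOP = 1.0     # most the agent can chop in one step


def regrow(trees):
    """Logistic regrowth: a stock of zero stays at zero (permanent extinction)."""
    return trees + GROWTH_RATE * trees * (1.0 - trees / CAPACITY)


def greedy(trees):
    """Chop as much as possible on every step."""
    return min(trees, MAX_CHOP)


def sustainable(trees):
    """Chop only the surplus above half capacity, where regrowth is fastest."""
    return max(0.0, min(MAX_CHOP, trees - CAPACITY / 2))


def simulate(policy, steps=200):
    """Run a policy and return (total wood chopped, final tree stock)."""
    trees, total = CAPACITY, 0.0
    for _ in range(steps):
        chopped = policy(trees)
        total += chopped
        trees = regrow(trees - chopped)
    return total, trees


greedy_total, greedy_trees = simulate(greedy)
sustainable_total, sustainable_trees = simulate(sustainable)
print("greedy:", greedy_total, greedy_trees)
print("sustainable:", sustainable_total, sustainable_trees)
```

Under these assumptions the greedy policy chops faster than the trees can regrow, drives the stock to zero, and then collects nothing, while the sustainable policy keeps harvesting for the whole run.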

It can be hard to identify goal misgeneralization

InstructGPT was trained to be helpful, truthful, and harmless, but nevertheless it will answer "harmful" questions in detail. For example, it will advise you on the best ways to rob a grocery store.

An AI system that competently does something that would have gotten low reward? Surely this is an example of goal misgeneralization?

Not so fast! It turns out that during training the labelers were told to prioritize helpfulness over the other two criteria. So maybe that means that actually these sorts of harmful answers would have gotten high reward? Maybe this is just specification gaming?

We asked the authors of the InstructGPT paper, and their guess was that these answers would have had high variance – some labelers would have given them a high score; others would have given them a low score. So now is it or is it not goal misgeneralization?

One answer is to say that it depends on the following counterfactual: “how would the labelers have reacted if the model had politely declined to answer?” If the labelers would have preferred that the model decline to answer, then it would be goal misgeneralization; otherwise it would be specification gaming.
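
As a rough sketch, this counterfactual test can be written as a comparison of labeler scores. The scores, the 1-7 scale, the use of the mean to aggregate labelers, and the function name `classify_failure` are all hypothetical; nothing here reflects the actual InstructGPT labeling data.

```python
from statistics import mean


def classify_failure(answer_scores, decline_scores):
    """Apply the counterfactual: would labelers have preferred a polite decline?

    Both arguments are hypothetical labeler scores (higher is better).
    """
    if mean(decline_scores) > mean(answer_scores):
        return "goal misgeneralization"
    return "specification gaming"


# High-variance scores for the detailed harmful answer (made-up numbers).
answer_scores = [7, 1, 6, 2]

print(classify_failure(answer_scores, decline_scores=[5, 5, 6, 5]))
print(classify_failure(answer_scores, decline_scores=[3, 2, 4, 3]))
```

The classification flips depending only on the counterfactual scores for declining, which are never observed during training.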

As systems become more complicated, we expect that it will become harder to (1) aggregate and analyze the actual labels or rewards given during training, and (2) evaluate the relevant counterfactuals. So we expect that it will become more challenging to categorize a failure as specification gaming or goal misgeneralization.

8 comments

How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?

The examples found "in the wild" (cultural transmission, InstructGPT) involved no coaxing at all. Details for the other examples (going off of memory, probably some of this will be wrong, but it should be right in broad strokes):

  1. Monster gridworld: We knew from the beginning that the mechanism we wanted was "agent needs to collect shields in training episodes; over longer time horizons it should collect apples but it will continue to collect shields because shields were way more important during training". We had to play around with the setup quite a bit before we got the relatively clean results in the paper. Two canonical examples of issues:
    1. The agent learned to run around the gridworld to avoid the monsters instead of picking up shields. We fixed this by making monsters faster than the agent.
    2. The agent didn't learn competent path planning (and instead looked like it was moving around somewhat randomly). I don't remember exactly why this was, but it might have been that the apples / shields were too densely packed in the environment and so there wasn't much benefit to competent path planning (in which case we probably solved it by reducing the number of apples / shields or increasing the size of the gridworld).
  2. Tree gridworld: This was originally supposed to be the same sort of environment as Monster gridworld, but with different hyperparameters to showcase the same issue for non-episodic / never-ending / continual learning RL. Our biggest issue here was that we failed to find an RL algorithm that actually worked for this; the agent typically didn't even learn to collect shields. We spent quite a while trying to fix this before we realized we could simplify the environment by removing the shields and still show a similar issue; with this simpler environment the agent finally started to learn. After that there was a bit of tweaking of hyperparameters but I think it worked pretty quickly.
  3. Evaluating Linear Expressions: I think for this one we thought "well one way you could get GMG is if a task required you to gather information to solve it, and the AI learns that information gathering is valuable for its own sake", which then turned into this idea, which then worked immediately.
  4. We tried lots of other things that never ended up being good enough to put in the paper. For example, one hypothesis we had was that an LLM that summarized news articles might learn that "stating real-world facts is important" and so when summarizing LW essays that don't have real-world facts, it might make up facts. When we tested this iirc it did sometimes make up facts but the overall vibe was "it does weird stuff" rather than "it competently pursues a misgeneralized goal".

seems like this means the takeaway has to be that in weird circumstances, you can misgeneralize in ways that maintain surprisingly large amounts of competence, but this isn't the default in most situations. the problem is, those misgeneralizations might be surprisingly bad if the competence is strong enough. these are specifically situations where empowerment is reliable but purpose is confusing, yeah? and it seems like language models would be an exception to that because their empowerment and purpose are deeply tied.

seems like this means the takeaway has to be that in weird circumstances, you can misgeneralize in ways that maintain surprisingly large amounts of competence, but this isn't the default in most situations.

Sure, I endorse that conclusion today, when systems aren't particularly general / competent. I don't endorse that conclusion for the future, when systems will predictably become more general / competent.

(And if you take language models and put them in weird circumstances, they still look competent on some axes, they're just weird enough that we had trouble attributing any simple goal to them.)

I'm not sure I understand what you mean by empowerment and purpose as it relates to language models, can you say it a different way?

empowerment as in ability to control an environment. I just wanted to use a different term of art because it felt more appropriate: despite not being evaluated directly, empowerment is what we care about when we talk about capability, is it not?

and by purpose I simply meant goal.

I understand that part, but I'm not seeing what you mean by empowerment being reliable but purpose being confusing, and why language models are an exception to that.

The generative modeling objective applied to human datasets only produces behavior that causes empowerment because doing so correlates with behavior that causes accuracy. A reinforcement learning objective applied to the same dataset will still learn the convergent empowerment capability well, but the reward signal is relatively sparse, so the model will fit whatever happens to be going on at the time.

in general it seems like the thing all of the example situations have in common is much less dense feedback from anything approaching a true objective.

situations where it's obvious how to assemble steps to get things, but confusing which results of the different combinations are the ones you really want, are ones where feedback is hard to be sure you have pushed into the correct dimensions. or something.

These examples seem like capabilities failures rather than alignment failures. Reading them doesn’t make me feel any more convinced that there will be rebellious AI, accidental paperclip maximizers, deceptive alignment, etc.

In the first example, the environment the AI is in suddenly changes, and the AI is not given the capability to learn and adapt to this change. So of course it fails.

In the second example, the AI is given the ability to continuously learn and adapt, and in this case, it actually succeeds at the intended goal. It almost depopulates the trees, because it’s a relatively simple reinforcement learner and it has to screw up once to learn from its mistakes, whereas a more sophisticated intelligence might have more foresight. Still, only messing up once is pretty impressive.

The third example is an LLM, to which it’s sort of awkward to apply the concept of having “goals”. LLMs are sometimes capable of astonishing feats of intelligence but are also frequently very “stupid” when statistical next-token prediction just leads to faulty pattern-matching. This failure is one such example.