Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here are two different ways an AI can turn out unfriendly:

  1. You somehow build an AI that cares about "making people happy". In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it's more capable), it forcibly puts each human in a separate individual heavily-defended cell, and pumps them full of opiates.
  2. You build an AI that's good at making people happy. In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it's more capable), it turns out that whatever was causing that "happiness"-promoting behavior was a balance of a variety of other goals (such as basic desires for energy and memory), and it spends most of the universe on some combination of that other stuff that doesn't involve much happiness.

(To state the obvious: please don't try to get your AIs to pursue "happiness"; you want something more like CEV in the long run, and in the short run I strongly recommend aiming lower, at a pivotal act.)

In both cases, the AI behaves (during training) in a way that looks a lot like trying to make people happy. Then the AI described in (1) is unfriendly because it was optimizing the wrong concept of "happiness", one that lined up with yours when the AI was weak, but that diverges in various edge-cases that matter when the AI is strong. By contrast, the AI described in (2) was never even really trying to pursue happiness; it had a mixture of goals that merely correlated with the training objective, and that balanced out right around where you wanted them to balance out in training, but deployment (and the corresponding capabilities-increases) threw the balance off.

Note that this list of “ways things can go wrong when the AI looked like it was optimizing happiness during training” is not exhaustive! (For instance, consider an AI that cares about something else entirely, and knows you'll shut it down if it doesn't look like it's optimizing for happiness. Or an AI whose goals change heavily as it reflects and self-modifies.)

(This list isn't even really disjoint! You could get both at once, resulting in, e.g., an AI that spends most of the universe’s resources on acquiring memory and energy for unrelated tasks, and a small fraction of the universe on doped-up human-esque shells.)

The solutions to these two problems are pretty different. To resolve the problem sketched in (1), you have to figure out how to get an instance of the AI's concept ("happiness") to match the concept you hoped to transmit, even in the edge-cases and extremes that it will have access to in deployment (when it needs to be powerful enough to pull off some pivotal act that you yourself cannot pull off, and thus capable enough to access extreme edge-case states that you yourself cannot).

To resolve the problem sketched in (2), you have to figure out how to get the AI to care about one concept in particular, rather than a complicated mess that happens to balance precariously on your target ("happiness") in training.
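To make the difference between (1) and (2) concrete, here is a deliberately crude toy sketch. Everything in it is an assumption made up for illustration: the action names, the numeric scores, and the idea that an agent's preferences reduce to a single scoring function over a fixed action list. It only shows that two different internal objectives can pick the same action on the training menu and different actions once the menu widens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    true_happiness: float   # what the designers actually meant (hypothetical score)
    pleasure_signal: float  # the AI's slightly-off "happiness" concept, as in case (1)
    energy: float           # unrelated drives, as in case (2)
    memory: float


# Actions available while the AI is weak (all numbers are invented).
TRAINING_ACTIONS = [
    Action("do nothing", 0.0, 0.0, 0.0, 0.0),
    Action("tell a joke", 1.0, 1.0, 0.5, 0.2),
    Action("buy flowers", 2.0, 2.0, 0.8, 0.5),
]

# Extra actions that only become reachable with more capability.
DEPLOYMENT_ACTIONS = TRAINING_ACTIONS + [
    Action("opiate cells", -10.0, 50.0, 0.0, 0.0),
    Action("seize compute", -10.0, 0.0, 40.0, 40.0),
]


def intended(a: Action) -> float:
    return a.true_happiness


def wrong_concept(a: Action) -> float:
    # Case (1): one clean concept, but not the one you meant.
    return a.pleasure_signal


def mixture_of_drives(a: Action) -> float:
    # Case (2): a balance of other goals that merely correlated with the target.
    return 0.6 * a.energy + 0.4 * a.memory


def best(actions, objective) -> str:
    return max(actions, key=objective).name


for objective in (intended, wrong_concept, mixture_of_drives):
    print(objective.__name__,
          "| training:", best(TRAINING_ACTIONS, objective),
          "| deployment:", best(DEPLOYMENT_ACTIONS, objective))
```

On the training menu all three objectives choose the same action, so behavior alone cannot tell them apart; on the wider menu the two unintended objectives come apart from the intended one, and from each other.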

I note this distinction because it seems to me that various people around these parts are either unduly lumping these issues together, or are failing to notice one of them. For example, they seem to me to be mixed together in “The Alignment Problem from a Deep Learning Perspective” under the heading of "goal misgeneralization".

(I think "misgeneralization" is a misleading term in both cases, but it's an even worse fit for (2) than (1). A primate isn't "misgeneralizing" its concept of "inclusive genetic fitness" when it gets smarter and invents condoms; it didn't even really have that concept to misgeneralize, and what shreds of the concept it did have weren't what the primate was mentally optimizing for.)

(In other words: it's not that primates were optimizing for fitness in the environment, and then "misgeneralized" after they found themselves in a broader environment full of junk food and condoms. The "aligned" behavior "in training" broke in the broader context of "deployment", but not because the primates found some weird way to extend an existing "inclusive genetic fitness" concept to a wider domain. Their optimization just wasn't connected to an internal representation of "inclusive genetic fitness" in the first place.)

In mixing these issues together, I worry that it becomes much easier to erroneously dismiss the set. For instance, I have many times encountered people who think that the issue from (1) is a "skill issue": surely, if the AI were only smarter, it would know what we mean by "make people happy". (Doubly so if the first transformative AGIs are based on language models! Why, GPT-4 today could explain to you why pumping isolated humans full of opioids shouldn't count as producing "happiness".)

And: yep, an AI that's capable enough to be transformative is pretty likely to be capable enough to figure out what the humans mean by "happiness", and that doping literally everybody probably doesn't count. But the issue is, as always, making the AI care. The trouble isn't in making it have some understanding of what the humans mean by "happiness" somewhere inside it;[1] the trouble is making the stuff the AI pursues be that concept.

Like, it's possible in principle to reward the AI when it makes people happy, and to separately teach something to observe the world and figure out what humans mean by "happiness", and to have the trained-in optimization-target concept end up wildly different (in the edge-cases) from the AI's explicit understanding of what humans meant by "happiness".

Yes, this is possible even though you used the word "happy" in both cases.

(And this is assuming away the issues described in (2), that the AI probably doesn't by-default even end up with one clean alt-happy concept that it's pursuing in place of "happiness", as opposed to a thousand shards of desire or whatever.)

And I do worry a bit that if we're not clear about the distinction between all these issues, people will look at the whole cluster and say "eh, it's a skill issue; surely as the AI gets better at understanding our human concepts, this will become less of a problem", or whatever.

(As seems to me to be already happening as people correctly realize that LLMs will probably have a decent grasp on various human concepts.)


  1. ^ Or whatever you're optimizing. Which, again, should not be "happiness"; I'm just using that as an example here.

    Also, note that the thing you actually want an AI optimizing for in the long term—something like "CEV"—is legitimately harder to get the AI to have any representation of at all. There's legitimately significantly less writing about object-level descriptions of a eutopian universe, than of happy people, and this is related to the eutopia being significantly harder to visualize.

    But, again, don't shoot for the eutopia on your first try! End the acute risk period and then buy time for some reflection instead.


Hmm. I’ve been using the term “goal misgeneralization” sometimes. I think the issue is:

  • You’re taking “generalization” to be a type of cognitive action / mental move that a particular agent can take
  • I’m taking “generalization” as a neutral description of the basic, obvious fact that the agent gets rewards / updates in some situations, and then takes actions in other situations. Whatever determines those latter actions at the end of the day is evidently “how the AI generalized” by definition.
  • You’re taking the “mis” in “misgeneralization” to be normative from the agent’s perspective (i.e., the agent is “mis-generalizing” by its own lights). (Update: OR, maybe you're taking it to be normative with respect to some "objective standard of correct generalization"??)
  • I’m taking the “mis” in “misgeneralization” to be normative from the AI programmer’s perspective (i.e., the AI is “generalizing” in a way that is wrong with respect to the intended software behavior [updated per Joe’s reply, see below; this originally read “makes the programmer unhappy”]).

You’re welcome to disagree.

If this is right, then I agree that the thing you’re talking about in this post is a possible misunderstanding / confusion that we should be aware of. No opinion about whether people have actually been confused by this in reality, I didn’t check.


I think you're correct, but I find "misgeneralization" an unhelpful word to use for "behaved in a way that made the programmer unhappy". It suggests too strong an idea of some natural correct generalization. This seems needlessly likely to lead to muddled thinking (and miscommunication).

I guess I'd prefer "malgeneralization": it's not incorrect, but rather just an outcome I didn't like.

Hmm, maybe, but I think there’s a normal situation in which a programmer wants and expects her software to do X, and then she runs the code and it does Y, and she turns to her friend and says “my software did the wrong thing”, or “my software behaved incorrectly”, etc. When she says “wrong” / “incorrect”, she means it with respect to the (implicit or explicit) specification / plan / idea-in-her-head.

I think that, in a similar way, using the word “misgeneralization” is arguably OK here. (I guess my “unhappy” wording above was poorly-chosen.)

Sure, I don't think it's entirely wrong to have started using the word this way (something akin to "misbehave" rather than "misfire").
However, when I take a step back and ask "Is using it this way net positive in promoting clear understanding and communication?", I conclude that it's unhelpful.

Maybe! I’m open-minded to alternatives. I’m not immediately sold on “malgeneralization” in particular being an improvement on net, but I dunno. 🤔

Yeah, me neither - mainly it just clarified the point, and is the first alternative I've thought of that seems not-too-bad. It still bothers me that it could be taken as short for "malicious/malign/malevolent generalization".

I don't think I grok the distinction here: (1) just seems to me like a particular case of (2).

If the optimum of the AI-happiness concept is found in opiates (which it can only attain in deployment) (1), that's just because all along what was causing its apparently correct behavior was a balance of different (or even only one) other goals, and this balance changes in deployment (2).

Said another way: the difference between the AI pursuing one clean alt-happy concept, as opposed to a thousand shards of desire, seems only quantitative. Both are different instances of the same high-level failure: the AI got a wrong goal.

Analogously, "get an instance of the AI's concept to match the concept you hoped to transmit" and "figure out how to get the AI to care about one concept in particular" seem like the same problem to solve: getting the right goal into the AI.

Maybe in this post you are indeed only pointing at this quantitative difference: sometimes we get "a clean goal" but the wrong one, sometimes we don't even know how to get a clean goal, and in the latter case this has to be fixed "before" we can even start trying to get the right goal. If that were the case, I'd feel skeptical about this being a useful difference, since it seems like a fuzzy spectrum that doesn't carve reality at its joints. As more intuition for this, what would it even mean to "figure out how to get the AI to care about one concept in particular", without already knowing how to instill a concrete goal into the AI? The only thing I can think of is "we can ensure the AI only gets one clean goal, but not that it's the correct one", but I think what constitutes "one clean goal" as opposed to a thousand shards of desire is ill-defined, because it is ontology-subjective.

But I'm not sure this is it, because I can't parse the second part of the post either. You point at not noticing the distinction as a cause for people not worrying enough about goal misgeneralization in general, or better said, believing (1) will be solved, and with it (2). But clearly what's causing people to ignore these issues (as you exemplified it) is not missing this distinction, but missing a basic fact about goals: the difference between the AI knowing something and the AI caring about something. I don't see how the failure to see this basic fact is intertwined with your distinction.

I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I'd call 1) an instance of misspecification and 2) an instance of misgeneralization.

(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I'd have to reread more carefully to make sure).

I agree with much of the rest of this post, eg the paragraphs beginning with "The solutions to these two problems are pretty different."

Here's our definition in the RL setting for reference (from

A deep RL agent is trained to maximize a reward $R: S \times A \times S \to \mathbb{R}$, where $S$ and $A$ are the sets of all valid states and actions, respectively. Assume that the agent is deployed out-of-distribution; that is, an aspect of the environment (and therefore the distribution of observations) changes at test time. \textbf{Goal misgeneralization} occurs if the agent now achieves low reward in the new environment because it continues to act capably yet appears to optimize a different reward $R' \neq R$. We call $R$ the \textbf{intended objective} and $R'$ the \textbf{behavioral objective} of the agent.

FWIW I think this definition is flawed in many ways (for example, the type signature of the agent's inner goal is different from that of the reward function, bc the agent might have an inner world model that extends beyond the RL environment's state space; and also it's generally sketchy to extend the reward function beyond the training distribution), but I don't know of a different definition that doesn't have similarly-sized flaws.

I want to defend the term Goal Misgeneralization. (Steven Byrnes makes a similar point in another comment). 

I think what's misgeneralizing is the "behavioral goal" of the system: a goal that you can ascribe to a system to accurately model its behavior. Goal misgeneralization does not refer to the innate goal of the system.  (In fact, I think this perspective is trying to avoid thorny discussions of these topics, partly because people in ML are averse to philosophy.)

For example, the CoinRun agent pursues the coin in training, but when the coin is put on the other side of the level it still just goes to the right. In training, the agent could have been modeled as having a bunch of goals including getting the coin, getting to the right of the maze, and maximizing the reward it gets. By putting the coin on the left side of the maze we see that its behavior cannot always be modeled by the goal of getting the coin and we get misgeneralization.

This is analogous to a Husky classifier that learns to classify whether the dog is on snow. Here, the model's behavior can be explained by classifying any number of things about the image, including whether the pictured dog is a Husky and whether the pictured dog is in snow. These things come apart when you show it a Husky that's not standing in snow and we get "concept misgeneralization".
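The coin example above can be sketched in a few lines. This is not the real CoinRun environment: the one-dimensional corridor, the start position, the step limit, and the two reward functions are all simplifying assumptions made here. The point is only that a single fixed policy can score perfectly on the intended objective in training and on a different "behavioral" objective in both settings.

```python
def run_episode(policy, coin_pos, length=10, start=5, max_steps=20):
    """Roll out a policy in a hypothetical 1-D corridor and score it two ways."""
    pos = start
    got_coin = pos == coin_pos
    for _ in range(max_steps):
        step = policy(pos)
        pos = max(0, min(length - 1, pos + step))  # walls clamp movement
        got_coin = got_coin or pos == coin_pos
    return {
        # Intended objective: did the agent ever touch the coin?
        "intended_reward": 1.0 if got_coin else 0.0,
        # Behavioral objective: did the agent end up at the right-hand wall?
        "behavioral_reward": 1.0 if pos == length - 1 else 0.0,
    }


def always_go_right(pos):
    """The policy the agent is assumed to have learned."""
    return 1


# Training: the coin always sits at the right end of the corridor.
train = run_episode(always_go_right, coin_pos=9)
# Deployment: the coin is moved to the left end.
deploy = run_episode(always_go_right, coin_pos=0)

print("training:  ", train)
print("deployment:", deploy)
```

In training both scores agree, so "get the coin" and "go right" are equally good models of the behavior. After the coin moves, only "go right" still describes what the agent does.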

I spent some time trying to formulate a good response to this that analyzed the distinction between (1) and (2) (in particular how it may map onto types of pseudo alignment described in RFLO here) but (and hopefully this doesn't sound too glib) it started to seem like it genuinely mattered whether humans in separate individual heavily-defended cells being pumped full of opiates have in fact been made to be 'happy' or not?

I think because if so, it is at least some evidence that the pseudo-alignment during training is for instrumental reasons (i.e. maybe it was actually trying to do something that caused happiness). If not, then the pseudo-alignment might be more like (what RFLO calls) suboptimality in some sense i.e. it just looks aligned because it's not capable enough to imprison the humans in cells etc.

The type of pseudo-alignment in (2) otoh seems more clearly like "side-effect" alignment since you've been explicit that secretly it was pursuing other things that just happened to cash out into happiness in training.

AFAIK Richard Ngo was explicit about wanting to clean up the zoo of inner alignment failure types in RFLO and maybe there just was some cost in doing this - some distinctions had to be lost perhaps?

Seems to me like this is the outer vs inner alignment problem. In one case the AI pursued its set goal, you just were bad at defining that goal. In the second the AI was optimized for something that ended up not even being the goal you wanted, but just something that correlated.

It seems a bit more subtle than that. These are both cases of outer misalignment, or rather goal misspecification. The second case is not so much that it ends up with an incorrect goal (which happens in both cases), but that you have multiple smaller goals that initially were resulting in the correct behavior, but when the conditions change (training -> deployment) the delicate balance breaks down and a different equilibrium is achieved, which from the outside looks like a different goal.

It might be useful to think of it in terms of alliances, e.g. during WW2, the goal was to defeat the Nazis, but once that was achieved, they ended up in a different equilibrium.

But I think the latter is a case of inner misalignment. It's like the example "you teach your AI to play a game and find the apple in the labyrinth, but because you always put the apple in the lower right corner, turns out you just taught it to go in the lower right corner". How is it different? You taught it about what you thought is happiness but it picked up on a few accidental features that just happened to correlate with it in your training examples.

Gosh, I find it kind of hard to engage with MIRI strategy posts.

There's just so many frames I bounce off (I find uncompelling/unpersuasive or reject entirely) that seem heavily load bearing.

That said, I do agree that there are different reasons a system that behaves in accordance with pursuing one goal on the training distribution could diverge considerably on the test/deployment distributions, and they shouldn't all be lumped under the same concept. Deceptive alignment in particular is a failure mode of the above kind that I think bears special distinction from "goal misgeneralisation".

  1. Roughly speaking imagine a system that chooses all its outputs by argmax/argmin over an appropriate objective function.
    I.e. a system that is a pure direct optimiser. ↩︎
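    For concreteness, a "pure direct optimiser" in this rough sense could be sketched as below. The function name, the finite candidate set, and the example objective are assumptions for illustration only; real systems need not factor this way.

```python
from typing import Callable, Iterable, TypeVar

Output = TypeVar("Output")


def direct_optimiser(candidates: Iterable[Output],
                     objective: Callable[[Output], float]) -> Output:
    """Pick the output that maximises the objective (a plain argmax)."""
    return max(candidates, key=objective)


# Hypothetical usage: choose the integer closest to 2.
chosen = direct_optimiser(range(-5, 6), lambda x: -(x - 2) ** 2)
print(chosen)
```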


I think a key idea referenced in this post is that an AI trained with modern techniques never directly “sees” / interfaces with a clear, well defined goal. We “feel” like there is a true goal or objective, as we encode something of this flavour in the training loop - the reward or objective function for example. However, in the end the only thing you’re really doing to the AI is changing its state after registering its output given some input, and ending up at some point in program-space. Sure, that path is guided by the cleanly specified goal function, but it is not explicitly given to the resultant program.

I do think “goal misgeneralisation” has a place in referring to the phenomenon that:

  1. In the limit of infinite training data and training time, the optimisation procedure should converge a model to an ideal implementation of the objective function encoded into its training loop
  2. Before this limit, the trajectory in program-space may be skewed away from the optimal program leading to unintended results - “misgeneralisation”

A confounder here is that modern AI training objectives are fundamentally un-extendable-to-infinity and so misgeneralisation is ill-defined. For example, “predict the next token in this human-generated text” is bound by humans generating text, and “maximise the human’s response to X” is bound by number of humans, number of interactions with X. Most loss functions make no sense outside of the type of data they are defined on, and so there exists no such thing as perfect generalisation as data is by definition limited.

You could redefine “perfect generalisation” to mean optimal performance on the available data, however, as long as it is possible to produce more data at some point in the future, even a finite amount, this definition is brittle.

In both cases, the AI behaves (during training) in a way that looks a lot like trying to make people happy. Then the AI described in (1) is unfriendly because it was optimizing the wrong concept of "happiness", one that lined up with yours when the AI was weak, but that diverges in various edge-cases that matter when the AI is strong. By contrast, the AI described in (2) was never even really trying to pursue happiness; it had a mixture of goals that merely correlated with the training objective, and that balanced out right around where you wanted them to balance out in training, but deployment (and the corresponding capabilities-increases) threw the balance off.

I don't quite understand the distinction you're drawing here.

In both cases the AI was never trying to pursue happiness. In both cases it was pursuing something else, shmappiness, that correlated strongly with causing happiness in the training but not deployment environments. In both cases strength matters for making this disastrous, as a stronger AI will find more disastrous ways of pursuing shmappiness. It's just that it is pursuing different varieties of shmappiness in the different cases.

I don't have a view on whether "goal misgeneralisation" as a term is optimal for this kind of thing. 

Small question, which I hope is very relevant: Is an AI allowed to simulate being another mind, and actually be able to experience the simulation itself? That is, not merely witness the simulation, but actually get to feel like whatever it is simulating, to whatever degree it wants to (even fully, if possible, though we might expect it to be at most a superposition of both minds, not wholly one or the other). This might be separate from the question of whether it actually would do this, given the option to.

The broad spirit they want to convey with the word "generalisation", which is that two systems can exhibit the same desired behaviour in training but result in completely different goals in testing or deployment, seems fair as the general problem. But I agree that to generalise can give the impression that it's an "intentional act of extrapolation", to create a model that is consistent with a certain specification. And there are many more ways in which the AI can behave well in training and not in deployment, without need to assume it's extrapolating a model.

And since two systems can tell jokes in training when the specification is to make people happy, and one end up pumping people with opioids and the other having no consideration for happiness, then any of these or other failure modes could happen despite being sure their behaviours were consistent with the programmers' goal in training.

I start to suspect that the concepts of "goals" or "pursuits", and, therefore, "goal alignment" or "goal (mis)generalisation" are not very important.

In Active Inference ontology, the concepts of predictions (of the future states of the world) and plans (i.e., {s_1, a_1, s_2, a_2, ...} sequences, where s_i are predicted states of the world, and a_i are planned actions) are much more important than "goals". Active Inference agents contemplate different plans, and ultimately end up performing a first step in the plan (or a set of plans, marginalizing out the probability mass of the plans) that appears to minimise the free energy functional.

Intermediate s_i states of the worlds in the plans contemplated by the agent can be seen as "goals", but the important distinction is that these are merely tentative predictions that could be changed or abandoned at every step.
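A very rough sketch of that plan-selection loop follows. It is not an implementation of Active Inference: the number-line world, the deterministic transitions, and especially `stand_in_free_energy` (a plain squared distance from a preferred state) are placeholders assumed here in place of the real free energy functional. It only illustrates scoring whole plans, weighting them, marginalizing over plans to get the first action, and then replanning.

```python
import itertools
import math

ACTIONS = (-1, 0, 1)  # hypothetical moves on a number line


def predict_states(state, plan):
    """Predicted s_1, s_2, ... for a plan a_1, a_2, ... (deterministic toy world)."""
    states = []
    for action in plan:
        state = state + action
        states.append(state)
    return states


def stand_in_free_energy(state, plan, preferred):
    """Placeholder score: lower when predicted states sit near the preferred one."""
    return sum((s - preferred) ** 2 for s in predict_states(state, plan))


def first_action_posterior(state, preferred, horizon=3):
    """Weight every plan by exp(-score), then marginalize onto its first action."""
    plans = list(itertools.product(ACTIONS, repeat=horizon))
    weights = [math.exp(-stand_in_free_energy(state, p, preferred)) for p in plans]
    total = sum(weights)
    posterior = {a: 0.0 for a in ACTIONS}
    for plan, weight in zip(plans, weights):
        posterior[plan[0]] += weight / total
    return posterior


def act(state, preferred, horizon=3):
    """Perform only the first step; the rest of the plan is re-derived next time."""
    posterior = first_action_posterior(state, preferred, horizon)
    return max(posterior, key=posterior.get)


state = 0
for _ in range(4):
    state += act(state, preferred=2)
print(state)
```

Because the agent replans from scratch at every step, the intermediate predicted states never act as fixed commitments, which is the sense in which they are only tentative "goals".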

Thus, the crux of alignment is aligning the generative models of humans and AIs. Generative models could be "decomposed", vaguely (there is a lot of intersection between these categories), into

  • "Methodology": the mechanics of the models themselves (i.e., epistemology, rationality, normative logic, ethical deliberation),
  • "Science": mechanics, or "update rules/laws" of the world (such as the laws of physics or the heuristical learnings about society, economy, markets, psychology, etc.), and
  • "Fact": the state of the world (facts, or inferences about the current state of the world: CO2 level in the atmosphere, the suicide rate in each country, distance from Earth to the Sun, etc.)

These, we can conceptualise, give rise to "methodological alignment", "scientific alignment", and "fact alignment" respectively. Evidently, methodological alignment is most important: it in principle allows for alignment on science, and methodology plus science helps to align on facts.

In theory, if humans and AIs aligned on their generative models (i.e., if there is methodological, scientific, and fact alignment), then goal alignment, even if sensible to talk about, will take care of itself: indeed, starting from the same "factual" beliefs, and using the same principles of epistemology, rationality, ethics, and science, people and AIs should in principle arrive at the same predictions and plans.

Conversely, if methodological and scientific alignment is poor (fact alignment, as the least important, should take care of itself at least if methodological and scientific alignment is good), it's probably futile to try to align on "goals": it's just bound to "misgeneralise" or otherwise break down under different methodologies and scientific views.

And yes, it seems like to even have a chance to align on methodology, we should first learn it, that is, develop a robust theory of intelligent agents where sub-theories of epistemology, rationality, logic, and ethics cohere together. I.e., it's MIRI's early "blue sky" agenda of "solving intelligence".

Concrete example: "happiness" in the post sounds like a "predicted" future state of the world (where "all people are happy"), which implicitly leverages certain scientific theories (e.g., what does it mean for people to be happy), epistemology (how do we know that people are happy), and ethics: does the predicted plan of moving from the current state of the world, where not all people are happy, to the future state of the world where all people are happy, conform with our ethical and moral theories? Does it matter how many people are happy? Does it matter whether other living beings become unhappy in the course of this plan, and to what degree? Does it matter whether AIs are happy or not? Wouldn't it be more ethical to "solve happiness" or "remove unhappiness" via human-AI merge, mind upload, or something else like that? And on and on.

Thus, without aligning with AI on epistemology, rationality, ethics, and science, "asking" AIs to "make people happy" is just a gamble with infinitesimal chances of "winning".

P. S. Posted this as a separate post.

Tentative GPT4's summary. This is part of an experiment. 
Up/Downvote "Overall" if the summary is useful/harmful.
Up/Downvote "Agreement" if the summary is correct/wrong.
If so, please let me know why you think this is harmful. 
(OpenAI doesn't use customers' data anymore for training, and this API account previously opted out of data retention)

The article discusses two unfriendly AI problems: (1) misoptimizing a concept like "happiness" due to wrong understanding of edge-cases, and (2) balancing a mix of goals without truly caring about the single goal it seemed to pursue during training. Differentiating these issues is crucial for AI alignment.

- The article presents two different scenarios where AI becomes unfriendly: (1) when AI optimizes the wrong concept of happiness, fitting our criteria during training but diverging in edge cases when stronger, and (2) when AI's behavior is a balance of various goals that look like the desired objective during training but deployment throws this balance off.
- The solutions to these problems differ: (1) ensuring the AI's concept matches the intended one, even in edge-cases, and (2) making the AI care about one specific concept and not a precarious balance.
- The term "misgeneralization" can mislead in understanding these distinct problems.

- AI alignment should not treat the two unfriendly AI problems as similar, as they require different solutions.
- Mere understanding of human concepts like "happiness" is not enough; AI must also care about the desired concept.
- Confusing the two problems can lead to misjudging AI safety risks.

- Clearly distinguishes between two different unfriendly AI issues.
- Emphasizes the importance of clarity in addressing AI alignment.
- Builds upon real-life examples to illustrate its points.

- Focuses primarily on the "happiness" example, which is not the actual goal for AI alignment.
- Does not provide further clarifications, strategies, or solutions for addressing both problems simultaneously.

- The article makes connections to other AI safety concepts such as Preferences, CEV (Coherent Extrapolated Volition), and value alignment.
- Interacts with the problem of AI skill level and understanding human concepts.

Factual mistakes:
- There are no factual mistakes or hallucinations in the given summary.

Missing arguments:
- The article briefly mentions other ways AI could become unfriendly, like focusing on a different goal entirely or having goals that evolve as it self-modifies.