Yes I agree with RLHF hacking and emergent misalignment are both examples of unexpected generalization but they seem different in a lot of ways.
The process of producing a sycophantic AI looks something like this:
1. Some users like sycophantic responses that agree with the user and click thumbs up on these responses.
2. Train the RLHF reward model on this data.
3. Train the policy model using the reward model. The policy model learns to get high reward for producing responses that seem to agree with the user (e.g. starting sentences with "You're right!").
In the original emergent misalignment paper they:
1. Finetune the model to output insecure code
2. After fine-tuning, the model exhibits broad misalignment, such as endorsing extreme views (e.g. AI enslaving humans) or behaving deceptively, even on tasks unrelated to code.
Sycophantic AI doesn't seem that surprising because it's a special case of reward hacking in the context of LLMs and reward hacking isn't new. For example, reward hacking has been observed since the cost runners boat reward hacking example in 2016.
In the case of emergent misalignment (EM), there is unexpected generalization but EM seems different to sycophancy or reward hacking and more novel for several reasons:
OpenAI wrote in a blog post that the GPT-4o sycophancy in early 2025 was a consequence of excessive reward hacking on thumbs up/thumbs down data from users.
I think "Sydney" Bing and Grok MechaHitler are best explained as examples of misaligned personas that exist in those LLMs. I'd consider misaligned personas (related to emergent misalignment) to be a novel alignment failure mode that is unique to LLMs and their tendency to act as simulators of personas.
The AI company Mechanize posted a blog post in November called "Unfalsifiable stories of doom" which is a high-quality critique of AI doom and IABIED that I haven't seen shared or discussed on LessWrong yet.
Link: https://www.mechanize.work/blog/unfalsifiable-stories-of-doom/
AI summary: the evolution→gradient descent analogy breaks down because GD has fine-grained parameter control unlike evolution's coarse genomic selection; the "alien drives" evidence (Claude scheming, Grok antisemitism, etc.) actually shows human-like failure modes; and the "one try" framing ignores that iterative safety improvements have empirically worked (GPT-1→GPT-5).
Thanks for the post. The importance of reward function design for solving the alignment problem is worth emphasizing.
I'm wondering how you research fits into other reward function alignment research such as CHAI's research on CIRL and inverse reinforcement learning, and reward learning theory.
It seems like these other agendas are focused on using game theory or machine learning fundamentals to come up with a new RL approach that makes AI alignment easier whereas your research is more focused on the intersection of neuroscience and RL.
This post reminded me of a book called The Technological Singularity (2015) by Murray Shanahan that also emphasizes the importance of reward function design for advanced AI. Relevant extract from the book:
"In the end, everything depends on the AI’s reward function. From a cognitive standpoint, human-like emotions are a crude mechanism for modulating behavior. Unlike other cognitive attributes we associate with consciousness, there seems to be no logical necessity for an artificial general intelligence to behave as if it had empathy or emotion. If its reward function is suitably designed, then its benevolence is assured. However, it is extremely difficult to design a reward function that is guaranteed not to produce undesirable behavior. As we’ll see shortly, a flaw in the reward function of a superintelligent AI could be catastrophic. Indeed such a flaw could mean the difference between a utopian future of cosmic expansion and unending plenty, and a dystopian future of endless horror, perhaps even extinction."
Nice post. Approval reward seems like it helps explain a lot of human motivation and behavior.
I'm wondering whether approval reward would really be a safe source of motivation in an AGI though. From the post, it apparently comes from two sources in humans:
In each case it seems the person is generating behaviors and they there is an equally strong/robust reward classifier internally or externally so it's hard to game.
The internal classifier is hard to game because we can't edit our minds.
And other people are hard to fool. For example, there are fake billionaires but they are usually found out and then get negative approval so it's not worth it.
But I'm wondering would an AGI with an approval reward modify itself to reward hack or figure out how to fool humans in clever ways (like the RLHF robot arm) to get more approval.
Though maybe implementing an approval reward in an AI gets you most of the alignment you need and it's robust enough.
Epoch AI has a map of frontier AI datacenters: https://epoch.ai/data/data-centers/satellite-explorer
Death and AI successionism or AI doom are similar because they feel difficult to avoid and therefore it's insightful to analyze how people currently cope with death as a model of how they might later cope with AI takeover or AI successionism.
Regarding death, similar to what you described in the post, I think people often begin with a mindset of confused, uncomfortable dissonance. Then they usually converge on one of a few predictable narratives:
1. Acceptance: "Death is inevitable, so trying to fight it is pointless." Given the inevitability and unavoidability of death, worrying about it or putting effort into avoiding it is futile and pointless. Just swallow the bitter truth and go on living.
2. Denial: Avoiding the topic or distracting oneself from the implications.
3. Positive reframing: Turning death into something desirable or meaningful. As Eliezer Yudkowsky has pointed out, if you were hit on the head with a baseball bat every week, you’d eventually start saying it built character. Many people rationalize death as “natural” or essential to meaning.
Your post seems mostly about mindset #3: AI successionism framed as good or even noble. I’d expect #2 and #3 to be strong psychological attractors as well, but based on personal experience, #1 seems most likely.
I see all three as cognitive distortions: comforting stories designed to reduce dissonance rather than finding an accurate model of reality.
A more truth-seeking and honest mindset is to acknowledge unpleasant realities (death, AI risk), that these events may be likely but not guaranteed, and then ask what actions increase the probability of positive outcomes and decrease negative ones. This is the kind of mindset that is described in IABIED.
I also think a good heuristic is to be skeptical of narratives that minimize human agency or suppress moral obligations to act (e.g. "it's inevitable so why try").
Another one is the imminent prediction that AI progress will soon stop or plateau because of diminishing returns or limitations of the technology. Even a professor I know believed that.
I think that's a possibility but I think this belief is usually a consequence of wishful thinking and status quo bias rather than carefully examining the current evidence and trajectory of the technology.
I agree with your high-level view that is something like "If you created a complex system you don't understand then you will likely get unexpected undesirable behavior from it."
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago: