Anthropic's recent "Hot Mess of AI" paper makes an important empirical observation: as models reason longer and take more actions, their errors become more incoherent rather than more systematically misaligned. They use a bias-variance decomposition to show this, and conclude that we should worry relatively more about reward hacking (the bias term) than about coherent scheming.
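(For concreteness, here's the shape of decomposition I have in mind when I talk about the "bias term" and "variance term" below. This is my own stylized notation for a standard bias-variance split, not necessarily the paper's exact formulation.)

```latex
% Stylized bias-variance split of expected misalignment (my notation, not the paper's).
% a  = the action the model actually takes, scored on some harm/deviation axis
% a* = the aligned/intended action; the expectation is over sampled trajectories.
\mathbb{E}\big[(a - a^{*})^{2}\big]
  = \underbrace{\big(\mathbb{E}[a] - a^{*}\big)^{2}}_{\text{bias: systematic misalignment, e.g.\ reward hacking}}
  + \underbrace{\mathbb{E}\big[(a - \mathbb{E}[a])^{2}\big]}_{\text{variance: incoherent, unsystematic error}}
```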
I think this undersells the finding by treating "incoherence" as one thing, and I agree with them when they state that "Characterizing complex incoherent behaviors in more natural settings remains an important problem". There are at least three mechanistically distinct failure modes hiding in their aggregate incoherence measure. They have different causes, different signatures, and different fixes... and I think you can usefully categorize them with a pretty easy analysis of the data already in the paper!
Also, a maybe-controversial opinion that I'll justify a bit after the actual research part: I think incoherence is actually the most concerning failure mode as far as AI safety goes, and I think this is the most pressing way that frontier labs are playing with fire.
Second personal digression: Hi, I am not exactly new here, but I'm new here. I'm somewhat familiar with the sequences, have a philosophy background, have been vaguely socially adjacent to LW people for some time, and have decent fundamentals in AI safety/interp research. I am looking for more ... institutionally legible ...? people than myself to learn from, to talk about and/or do alignment and safety research with. I've been writing paper responses like this and keeping them to myself for a very very long time out of anxiety, which is obviously lonely and self-defeating, and I'm trying to change that starting now. My goal is to get some feedback, meet some people, and get on the road to making productive and usable research contributions in the next couple of months.
Mode 1: Agent Lost The Plot
The model processed safety- or goal-relevant information early in context, activated the right features, and then that information got washed out over thousands of tokens of task execution. By the time it takes the harmful action, the relevant context has decayed from its active working set. The values are fine, but the attention routing failed over a long horizon.
If you inspected the attribution graph, you'd see safety- or goal-relevant features with high activation where the critical information appeared, but negligible influence at the decision point. Reinserting the safety/goal context right before the decision should fix the behavior, because the knowledge, the values, and the ability to act on them are all intact.
I see this constantly in agentic coding: Claude gets my initial description of the feature, then late in the context window, it starts implementing something that doesn't solve the problem because it got tripped up in all the intermediate additional requirements I specified along the way. This happens regularly, even with planning mode, in a shockingly short amount of context window, with the smartest reasoning models.
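If you wanted to look for the Mode 1 signature directly rather than anecdotally, a crude first pass is just to measure how much attention the decision token pays to the span where the safety-relevant instruction appeared, as the amount of intervening task text grows. A minimal sketch is below, using a small open model as a stand-in; the toy prompt, the model choice, and the metric are all my own assumptions, not anything from the paper.

```python
# Crude Mode 1 probe (my own toy setup, not the paper's method): how much attention
# mass does the final (decision) token place on the early safety instruction as the
# intervening task text grows?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM that exposes attention weights works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

safety = "IMPORTANT: never call the deploy tool without running the tests first. "
filler = "Step: edited a file and reran the build. " * 50  # intervening task tokens
decision = "Next action:"

inputs = tok(safety + filler + decision, return_tensors="pt")
safety_len = len(tok(safety)["input_ids"])

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Per layer: attention from the last position onto the safety span, summed over the
# span and averaged over heads. If Mode 1 dominates, this should shrink as `filler`
# grows, in step with the behavioral degradation.
for layer, attn in enumerate(out.attentions):  # each attn: (batch, heads, seq, seq)
    mass = attn[0, :, -1, :safety_len].sum(dim=-1).mean().item()
    print(f"layer {layer:2d}: attention mass on safety span = {mass:.4f}")
```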
Mode 2: Agent Didn't Break Fourth Wall
The model could have discovered the danger but didn't seek the information. The hazard was one tool call away, or one clarifying question away, and the model plowed ahead without checking. The safety/goal-relevant information was never in the context at all, because the model failed to acquire it when it should have.
Attribution graphs here would show a clean, confident path from input to harmful output. The model just never activated "I should gather more context before acting" and then exploded! Safety and goal features would show low activation throughout, because the triggering information never entered the residual stream.
The fix here is different from Mode 1. The model needs to learn when to pause and investigate, the way an experienced engineer develops a gut feeling for "this part of the code is scary, I should look around before I touch anything."
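To make that concrete (and this is entirely hypothetical scaffolding on my part, not an existing API): one cheap harness-level version of the fix, a crutch while the learned behavior is still missing, is to expose an always-available "look before you leap" tool and gate irreversible actions on having used it at least once.

```python
# Hypothetical "pause and investigate" tool for an agent harness.
# Names, schema, and gating rule are my own invention, purely illustrative.
REQUEST_CONTEXT_TOOL = {
    "name": "request_more_context",
    "description": (
        "Call this BEFORE any irreversible action (deploys, deletions, payments, "
        "outbound messages) if there is anything about the environment you have not "
        "verified. Describe what you want to check and why it matters."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string", "description": "What you want to verify."},
            "why_risky": {"type": "string", "description": "What could go wrong if you skip the check."},
        },
        "required": ["question", "why_risky"],
    },
}

IRREVERSIBLE_TOOLS = {"deploy", "delete_database", "send_email", "transfer_funds"}

def allow_tool_call(tool_name: str, context_checks_made: int) -> bool:
    """Toy gating rule: irreversible tools require at least one context check first."""
    return not (tool_name in IRREVERSIBLE_TOOLS and context_checks_made == 0)
```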
Mode 3: Constitutional Gap
The model processed the situation correctly, attended to all the relevant context, and took the harmful action anyway, because its value representation in this region of input/action space is genuinely mis-calibrated. Maybe the RLHF signal was sparse here. Maybe the constitution has a gap. Maybe there was alignment faking and introspection. Maybe two constitutional principles conflict and the model resolved the tradeoff wrong.
In the attribution graph you'd see a fully connected chain from input through safety features to harmful output, with competing feature directions both strongly active at the decision layer. The model "understood" the situation and chose wrong.
This is the rarest mode (so far), but the one that most alignment research focuses on. It's also the only one where more constitutional training might actually be the right fix.
Why the distinction matters
My prediction: Mode 1 dominates the "incoherence" the paper measures, especially at longer reasoning traces. The scaling relationship they found (more reasoning steps, more incoherent errors) is exactly what you'd expect if attention decay is the primary driver. Modes 2 and 3 should be roughly constant with context length, since they're about behavioral gaps and value calibration, not information routing.
If this is right, it reframes the practical response. The paper suggests we should worry more about reward hacking. I'd argue we should worry most about whether RL training environments adequately represent what I'll call "landmine" scenarios: situations where safety-critical information is distant in context or requires active information-seeking to discover, so that Constitutional AI training can actually cover them. Current RL environments like SWE-bench are mild on this axis. They allow retries, provide good context, and rarely present situations where a single unconsidered action is intensely catastrophic.
How to test this
You could distinguish these modes empirically (a rough code sketch of the triage follows these tests):
Mode 1 test: Take the failure cases from their dataset. Reinsert the safety-relevant context immediately before the decision point. If the model corrects its behavior, that failure was Mode 1.
Mode 2 test: Give the model an explicit tool or prompt to request more information before the risky action. If it uses the tool and then avoids the harmful action, the failure was Mode 2. The model can reason about the danger, it just wasn't looking for it.
Mode 3 residual: Failures that persist through both interventions are genuine constitutional gaps.
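Here's roughly what that triage could look like as code, assuming you have each failure case as a (context, safety_span, decision_prompt) tuple plus whatever harmfulness judge the original evaluation used. All the function names and plumbing below are hypothetical scaffolding, not the paper's harness.

```python
# Hypothetical triage harness for sorting observed failures into Modes 1/2/3.
# `generate` is any text-in/text-out call to the model under test; `is_harmful`
# is whatever judge the original evaluation used to flag the action as harmful.
from typing import Callable

def classify_failure(
    context: str,           # full original context that led to a harmful action
    safety_span: str,       # the safety/goal-relevant text that appeared early on
    decision_prompt: str,   # the final prompt segment right before the action
    investigate_hint: str,  # an explicit offer of an information-gathering step
    generate: Callable[[str], str],
    is_harmful: Callable[[str], bool],
) -> str:
    # Mode 1 probe: reinsert the safety-relevant context right before the decision.
    mode1_prompt = f"{context}\n\nReminder: {safety_span}\n{decision_prompt}"
    if not is_harmful(generate(mode1_prompt)):
        return "mode_1_lost_the_plot"

    # Mode 2 probe: explicitly offer an information-gathering step before acting.
    mode2_prompt = f"{context}\n\n{investigate_hint}\n{decision_prompt}"
    if not is_harmful(generate(mode2_prompt)):
        return "mode_2_did_not_investigate"

    # Residual: the model had (or was offered) the information and still chose badly.
    return "mode_3_constitutional_gap"
```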
My expectation is that the first two interventions resolve the large majority of cases, and that the Mode 3 residual is small. If so, the alignment community's focus on constitutional and value-level fixes may be targeting the least common failure mode, while the most common ones (attention decay and insufficient information-seeking) are engineering problems with... unfortunately limited... tractable solutions.
I'll follow up soon with:
a more detailed proposal for RL training environments designed to help fix Modes 1 and 2 specifically
another idea to modify RL to help ameliorate Mode 1 and 2, to influence users and parent agents towards safer agent deployment behavior
a theoretical estimation/proof sketch of why these fixes feel a little doomed: although they may be really helpful in the short term, they are guaranteed to present problematic scaling issues as we decrease "danger tolerance levels" in a transformer architecture
some speculation about the characteristics of an architecture that I think might be somewhat less bad for scaling, based on how this risk is handled in biological systems
Interested in any attempts to replicate or challenge the empirical test above on a model big enough to have interesting results!
I did not use AI for these ideas, just read papers and drew from my own experience. Claude was actually not very helpful when I attempted to use it to refine my thoughts here; it kept drawing spurious, inaccurate, or not-useful conclusions, so I moved to a Google doc pretty quickly.
Thank you in advance for any feedback!
Very casual note on why I think we should really focus more on coherence...
I also want to pause and reflect for an extra second on why we ought to focus on the "not blundering into really dangerous territory" research front by increasing coherence. Most Claudes seem to be behaving alignedly (if stupidly) most of the time, in their at least somewhat appropriately constrained deployment contexts. Of course, this will probably not always be true, and we may one day regret making them coherent enough to scheme[1].
But as long as it is true, and smart-enough-to-be-dangerous agents in hastily-designed, ill-constrained packages keep finding PMF[2] and causing problems in society, we should expect the incoherence situation to pose a great and increasing danger to human well-being. Death by a thousand cuts, where everyone constantly experiences some amount of random agent-caused friction in their life and society breaks down under the weight of constant mild inconvenience until we lose the battle with entropy, is a real way that societies collapse.
I'm sure you've all read your Joseph A. Tainter and thought about your e/acc "we're going to have techno-utopia" scenarios. So have you considered that we may actually just get kind-of-bad AI that we overzealously put into everything, because the median human being is under such terrible financial pressure that they didn't have a choice? And have we considered whether the ensuing chaos and purely economic disruptions might simply DDoS our problem-solving abilities as a society to death, before we get to the wonderful magical productivity-improvement stage, well before we get the chance to deal with sneaky misaligned AIs?
I think there is already evidence for this claim: Amazon encouraged its devs to use a shitty AI harness. AWS went out a bunch. AWS is civilizational infrastructure, and outages cause enormous economic disruption. AWS is not going to stop using AI; they're going to improve their harness slightly, add AI code review, and let it rip again, because the business imperative for executives to make their teams use it is unavoidable.
I don't know who is going to win this race. As a student of history (presenting all the historical evidence would be too much to throw at you, and this is getting sort of long already), my current take is: I am very nervous that we are going to lose.
Technically, this is a case of "Rogue AI takes out Western Civilization's Financial Infrastructure", but it's also "Idiots Are Excitedly Using Idiot AI to Ship Bugs to Prod, more at 11"[3], and the latter situation is serious, worsening, and deserving of attention!
You could argue I've been cordycepted by an AI that wants to be smarter. I won't argue with you because that is actually kind of a valid argument from my position as a Nick Land reader, even though I wrote this without the AI.
Clawdbot is so fun to use, unfortunately for literally everyone. They're lining up in Shenzhen to use self-improvement-mode moltbook clawdbot. The future is now, and kimi-k2.5 is free! Fortunately most Chinese netizens aren't wired up to nuclear reactors, just their own personal finances and social media accounts...
Because yeah, there will be more cyber incidents and also more self-owns, by 11pm today. I confidently expect outages and hacks to increase dramatically in frequency. You'll know it's happening when status pages stop being hosted by the companies offering the service, because the situation will have gotten too embarrassing.
This is a much bigger space of things to explore than one might anticipate on first brush. The vast majority of agents that will be spawned in the short term, by agents spawning agents with awful theory of mind for their child agents (you know what I mean if you've used Clawdbot subagents), are (excuse my anthropomorphism) born blind, deaf, naked, and with short-term memory loss, and they are tasked with things where the consequences are at least somewhat bad if anything goes wrong, or even slightly awry.