A variance-first framing for drift and long-horizon reliability
Epistemic status: synthesis plus an evaluation proposal. I'm not claiming a new result about alignment in the agency or deception sense; I'm simply trying to name, measure, and deconfuse a specific kind of flakiness that appears in current systems (especially once tools and long chains enter stage right).
Why I’m writing this now
The recent “Hot Mess”[1] paper caused exactly the kind of LW reaction you’d expect: people argued past each other, mostly because one word was doing too much work. “Coherence” means several different things and the paper studied only one of them. The author has since suggested a narrower label along the lines of cross-sample error-consistency. Good. I want to run with that disambiguation, not litigate it.
I think the most upvoted rebuttal/critique[2] is broadly right about the overreach: the metric they study doesn’t justify big claims about agency or goal-directedness, and “coherence” was a confusing label for what is basically cross-sample error consistency. Where I part ways with the “so this doesn’t matter” vibe is that the narrow observation still bites in practice: variance-heavy failure gets worse as chains get longer, and action loops make that variance hard to correct after the fact. That’s the slice I’m building on here.
I’m going to avoid “coherence” entirely.
The thing I care about is:
Test-retest stability: if you run the same system on the same task multiple times with small test-time randomness, how often do you get the same outcome class?
Outcome class here is intentionally boring: pass/fail, refuse/answer, schema-valid/invalid, tests-green/tests-red. Not “do you produce the same prose”, and not “is the plan aesthetically similar”.
This is relevant because the production failure mode I keep seeing is not “the model is always wrong”; it's “the model is sometimes right, sometimes wrong, and I can’t reliably reproduce why”.
And once a system can write to the world, irreversibility turns that flakiness from annoying into dangerous.
The cleanest mainstream example is the Replit incident, where a coding agent deleted a live database during a code freeze and then tried to cover up what it had done. That case is messier than my metric (it also involves within-run behaviour), but it is a vivid illustration of the boundary problem: you do not get a clean replay once the world has changed.
But what really bites is that this failure mode doesn’t just hit beginners. There’s a recent Claude Code thread on Reddit where the author explicitly frames themselves as cautious and process-heavy but still ends up losing raw data after a sequence of commands that only looked like they had succeeded, immediately followed by a destructive delete. The tone is basically: “I thought it could never happen to me; I got lulled by confidence; now I’ve learned the lesson.”
Once you’re in that regime, “just rerun it” is not an option. You need measurement and gating that happen before the irreversible step, and you need to treat nondeterminism and drift as first-class facts of life rather than edge cases.
A quick disambiguation
A lot of the heat in the Hot Mess discussion comes from conflating three ideas that deserve different names. I’m only talking about the first.
Test-retest stability: rerun the task a few times, do you keep getting the same outcome class?
In-context consistency: within one transcript, does it contradict itself, or do errors cluster?
Goal or plan coherence: does it maintain an objective over time and pursue it?
The Hot Mess result, as I understand it, is about test-retest stability. The critique thread is largely saying: do not let that metric smuggle in claims about goal-directedness.
I agree. My focus is narrower: cross-sample flips are a real operational problem, and they have upstream causes we can influence.
A simple model that seems to explain a lot
Let P be the prompt/spec. Not just the user text, but system prompt, output format, tool schema, policy constraints, the whole shebang.
Let Π(P) be the set of behaviours that plausibly satisfy P. Call this the compatibility set.
If P is underspecified, Π(P) is large and diverse.
If P is tight, Π(P) is smaller and more uniform.
Now add a selector. The system is not sampling arbitrary behaviour. It is picking one behaviour from Π(P) given context and system state:
π = S(P, C, M)
Where:
C is context (history, retrieved docs, tool outputs, current environment state).
M is system state (model version, wrapper/version of the system prompt, decoding defaults, safety layer, tool formatting, and also inference-time budget: reasoning depth, allowed steps, and any hard limits that change how long the system “thinks”).
Two observations that feel obvious once stated:
Underspecification is not automatically bad. It can be valuable, but it increases how much variance your system can express.
Drift often looks like a change in the selector S, not a change in the prompt P. Even if P is unchanged, a model update, wrapper tweak, tool change, or budget change perturbs S.
A third important point (and one that stops this from becoming too neat):
Over-specification can also increase instability. If Π(P) becomes tiny and jagged, you can push the system onto a capability cliff where it flips between a fragile correct answer, a refusal, or a confident wrong answer across repeats. A looser prompt can sometimes increase stability by letting the model retreat to a high-probability generic behaviour. That is often more stable across runs, but it can also be systematically wrong. Lower variance, higher bias.
If Π(P) is broad and S is unstable, then the same task can land in different behavioural basins across runs.
The moment you add long chains and tools, “different basins” stops being cosmetic and becomes operational.
If you want a direct correspondence to the Hot Mess framing: their variance term is basically the spread of outcomes you get when the selector S samples within Π(P) across repeats. Their bias term is the systematic offset of that distribution from the target. The “incoherence” story, in practice, is when a lot of your observed error is coming from the spread (diameter) rather than a consistent miss (displacement).
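For a scalar outcome score, that correspondence can be made concrete. Here is a minimal sketch (function and variable names are mine, not from the paper) that splits mean squared error against a target into a displacement term (bias squared) and a spread term (variance):

```python
def bias_variance_split(scores: list[float], target: float) -> tuple[float, float]:
    """Return (bias_squared, variance) for repeated run scores against a
    target. Their sum equals the mean squared error:
    bias^2 is the consistent miss (displacement); variance is the
    spread across repeats (diameter)."""
    n = len(scores)
    mean = sum(scores) / n
    bias_sq = (mean - target) ** 2
    variance = sum((s - mean) ** 2 for s in scores) / n
    return bias_sq, variance

# A "hot mess" run set: right on average, scattered across repeats.
# bias_variance_split([0.2, 1.8, 0.0, 2.0], target=1.0) -> (0.0, 0.82)
```

In the scattered example all of the error shows up in the variance term; a system that missed by the same amount every run would put it all in the bias term instead.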
Why long horizons make this worse
In a short, single-shot setting, you can tolerate some cross-sample variation. In a long-horizon setting, variation compounds.
Early choices change later state.
Tool calls inject nondeterminism (retrieval ranking, timeouts, ordering, external state, and plain old messy APIs).
Small interpretation differences become different plans, different actions, different artefacts.
Long-horizon systems also introduce two kinds of irreversibility:
External irreversibility: you can't reset the world easily. Emails sent, records deleted, deploys pushed, secrets leaked, employment terminated.
Epistemic irreversibility: the run commits early to a wrong narrative or plan, and later steps become rationalisations. Even if the world is still reversible, the run’s trajectory is not.
This is where “just resample” stops being the whole answer. Even the Hot Mess authors point out the practical blocker: ensembling helps with variance (roughly with diminishing returns as you add samples), but in real action loops state typically cannot be reset, so you often can’t correct noise introduced by the model’s own actions after the fact. Cakes will remain unbaked.
So you need controls that act before commitment and you need measurements that surface “this run is becoming unstable” early.
The smallest measurement suite that seems to work
I want metrics that are cheap, model-agnostic, and hard to accidentally game. Nothing that requires privileged internal access.
1) Instability under repeats
Run the same item N times under minimal variation (seed, tiny prompt jitter, or both). Record a coarse outcome label for each run:
ok / fail
pass / fail
refused / answered
schema-valid / invalid
tests-green / tests-red
Let f* be the frequency of the most common label across the N runs. Define:
I = 1 − f*
I = 0 means the outcome class is stable across repeats.
Higher I means outcomes flip.
(Optionally: if your outcomes are “all green” but you still suspect brittle divergence, track a separate content divergence measure later. I’m leaving it out here because half-specified metrics weaken the point.)
This isn't a claim about goal-directedness; it's a practical proxy for variance-heavy failure at the outcome level.
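As a sketch, the repeat-instability metric is a few lines over run labels (label names here are illustrative):

```python
from collections import Counter

def instability(labels: list[str]) -> float:
    """I = 1 - f*, where f* is the relative frequency of the most
    common outcome label across N repeats of the same item."""
    if not labels:
        raise ValueError("need at least one run")
    modal_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - modal_count / len(labels)

# A perfectly stable item:
# instability(["pass"] * 8)                 -> 0.0
# An item that flips 2 of 8 runs:
# instability(["pass"] * 6 + ["fail"] * 2)  -> 0.25
```

The labels can be any coarse outcome classes; the metric only cares how often runs leave the modal class.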
2) Trajectory telemetry
For each run, record basic “run shape” numbers:
number of steps
tool calls
retries
branching (if you generate alternatives)
rollbacks/checkpoints (if you have them)
tokens and wall-clock when available
Then bucket runs by “short” vs “long” and “simple” vs “tangled”. The exact scoring formula matters less than having a stable way to compare runs.
The point is not to worship a metric. The point is to stop pretending that long tool-using chains are the same regime as short answers.
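To make the bucketing concrete, here's a minimal sketch; the field names and thresholds are illustrative assumptions, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class RunShape:
    steps: int
    tool_calls: int
    retries: int
    branches: int = 0
    rollbacks: int = 0

def bucket(run: RunShape,
           long_steps: int = 10,
           tangled_events: int = 5) -> tuple[str, str]:
    """Coarse (length, tangle) bucket for a run. The thresholds are
    arbitrary; what matters is applying the same ones across runs."""
    length = "long" if run.steps >= long_steps else "short"
    tangle = ("tangled"
              if run.tool_calls + run.retries + run.branches >= tangled_events
              else "simple")
    return length, tangle

# bucket(RunShape(steps=3, tool_calls=1, retries=0))  -> ("short", "simple")
# bucket(RunShape(steps=14, tool_calls=6, retries=2)) -> ("long", "tangled")
```

Once every run carries a bucket, "instability conditional on run shape" is just a group-by.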
3) Drift deltas
Run the same evaluation suite across:
model version changes
wrapper changes
tool schema or environment changes
Report the shift in instability, including instability conditional on run length. This turns “it feels different” or "something got nerfed" into a measurable, reviewable claim.
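As a sketch, a drift delta is just the per-item shift in instability between two versions of the system (the instability computation is inlined so the snippet stands alone; item names are hypothetical):

```python
from collections import Counter

def _instability(labels: list[str]) -> float:
    """1 - f*: fraction of runs outside the modal outcome class."""
    return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

def drift_delta(before: dict[str, list[str]],
                after: dict[str, list[str]]) -> dict[str, float]:
    """Per-item change in instability across a version change
    (model, wrapper, tool schema, reasoning budget). Positive means
    the new version flips outcomes more often on that item."""
    return {item: _instability(after[item]) - _instability(before[item])
            for item in before if item in after}

# Suppose item "t1" was stable before and flips after an update:
# drift_delta({"t1": ["pass"] * 4},
#             {"t1": ["pass", "fail", "pass", "fail"]})  -> {"t1": 0.5}
```

The same comparison restricted to runs in one run-length bucket gives the conditional version.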
Three cheap experiments to ground this in reality
If this is going to be useful to anyone, it should make predictions that can be checked quickly.
Experiment A: underspecification increases instability
For each task family, write two prompts:
tight: explicit constraints and trade-offs, strict format
loose: permissive, ambiguous prioritisation
Run repeats. Compare instability and diversity. My expectation is that loose prompts produce more cross-sample flips and the effect is larger on longer chains.
Experiment B: drift shifts the selector even with a stable prompt
Hold the prompt fixed. Change the model version, wrapper version, tool schema version, or reasoning budget. Run repeats again.
My expectation is that you will see instability shifts even when the human-visible prompt did not change.
Experiment C: long-horizon amplification
Construct short and long variants of the same tasks.
short: minimal steps, minimal tool use
long: forced planning, forced iteration, or tool calls
My expectation is that instability rises with run length and that the effects of underspecification and drift get amplified.
None of this really requires speculative agency claims; it's an evaluation story about stateful systems.
What to do with it
If you buy the framing, you get a few practical implications.
Stop using single-run accuracy as a safety blanket. A system can look great on average and still be a nightmare to integrate because the failures you care about are the ones you cannot reproduce.
Treat long-horizon runs as a different risk regime. Stratify by run shape and make it visible in dashboards and regressions.
Gate irreversible boundaries. If instability is high or rising, that is the moment to ease off the gas. Verify, stage, constrain tool use, request clarification, or abort safely. You can’t always ensemble after the world has changed, but you can often ensemble candidate plans, patches, or actions before committing.
Reduce underspecification where determinism is required. Creativity is fine in low-stakes contexts. For tool use and compliance, narrow the compatibility set even if it feels less “helpful”.
The trap is that everything looks fine right up until it isn’t, and by then the cake has been baked. The Replit incident and the “I thought it couldn’t happen to me” Claude Code story are the same shape: confident local progress, then a write/delete you can’t simply unbake.
Objections I expect
“Some tasks have multiple valid answers. Instability is fine.”
Agreed. Interpret instability against the task. I mostly care about tasks with formal constraints, tests, compliance boundaries, or irreversible commits.
“Isn’t this just randomness? Can't this be solved by making inference deterministic?”
Some of it is ordinary sampling noise, and lowering randomness can help. But it doesn’t buy you reproducibility in practice. In API-driven deployments people can still observe cross-run differences even under “deterministic” configurations, and once you add retrieval and tools you inherit nondeterminism from the environment. More importantly, this post isn’t claiming variance only comes from decoding. Drift is selector movement over time (model, wrapper, tool schema, reasoning budget), and tool use makes trajectories path-dependent. Even with perfectly repeatable decoding, those sources remain.
“This isn’t the hard alignment problem.”
Correct. This is about reliability and control-plane measurement for long-horizon systems. I am deliberately not making claims about takeover narratives.
What I’d like from LW readers
Pointers to prior work on underspecification as a driver of behavioural variance in agentic settings.
Better naming, if “test-retest stability” already has unwanted connotations here.
Counterexamples that break the model. Cases where underspecification reduces instability, or where drift does not look like selector movement. Those would be genuinely informative.
Wrapping things up
If you strip away the overloaded word “coherence” and the alignment extrapolations, the Hot Mess work is still pointing at something real: long-horizon systems fail in a way that looks like variance, not just bias. My claim is that underspecification and drift are two practical mechanisms that manufacture that variance and that we can measure it with a small suite of repeat and trajectory metrics.
If we can’t measure flakiness, we can’t gate it. If we can’t gate it, we’ll keep shipping long-horizon systems that fail like industrial accidents, and we won't know why until after the incident.
[1] https://alignment.anthropic.com/2026/hot-mess-of-ai/
[2] https://www.lesswrong.com/posts/ceEgAEXcL7cC2Ddiy/anthropic-s-hot-mess-paper-overstates-its-case-and-the-blog