Probes decode what's present, not what's used: the readout–mediator angle

Shreyas Fadnavis

Rejected for the following reason(s):

This is an automated rejection.
write or edit
You did not chat extensively with LLMs to help you generate the ideas.
Your post is not about AI consciousness/recursion/emergence, or novel interpretations of physics.

Read full explanation

A lot of safety work wants to use linear probes as monitors — read a model's activations and catch it deceiving, scheming, or using dangerous knowledge even when the output looks clean (Burns et al. 2023; Zou et al. 2023; Marks & Tegmark 2024; Goldowsky-Dill et al. 2025). This rests on an assumption that almost nobody states out loud: if a probe can read a concept off the activations, that's the direction the model is computing with. We think that assumption is usually false, and this work gives you a way to check.

Here's the one-sentence version. A probe can decode a feature almost perfectly and still be reading a direction the model doesn't actually use — and you can measure when that's happening.

The thing a probe measures, and the thing it doesn't

Picture two spies watching the same operation through a one-way mirror.

Spy P reads a noticeboard on the far wall. The board lists everything — dates, names, totals — and Spy P reports it back perfectly. This is the probe: it asks where can I read this information off the activations?
Spy M ignores the board and watches what the operatives actually do — who moves, who hands off to whom. This is the causal test: it asks which part of the activations does the model actually use? The way you find this is blunt: delete a direction on every forward pass and see whether the model breaks.

The interesting question is whether the noticeboard is the plan, or just a readout of it. So we did the obvious experiment on Gemma 2 2B, asking it questions like "How many days between March 15th and June 22nd?" (it answers 99, correctly):

Take down the noticeboard — project out the direction the probe reads from. Accuracy drops by 0.6 percentage points. The model keeps counting 99 days as if nothing happened.
Disrupt what Spy M is watching — project out the direction a causal search (DAS) flags as load-bearing. Accuracy goes from 42% to 0%. The computation is gone.

For calibration: deleting a random direction of the same size costs 0.04pp. So the probe's direction, causally, behaves like a random direction. The noticeboard was a readout, not the plan. The probe decodes dates at R² up to 0.996 (replicating Gurnee & Tegmark 2024) — and the model isn't using that direction to count.

"How far apart are they?" — and why the obvious answer is a trap

Geometrically, the probe's direction and the used direction are 88° apart. Nearly perpendicular. It's tempting to stop there and say "see, totally different directions" — but that number means almost nothing on its own, and understanding why is the crux of the whole paper.

In high-dimensional space, almost any two directions are nearly perpendicular. In 2D or 3D, two random arrows are often fairly aligned. But as you pile on dimensions, random directions have less and less reason to point the same way, and by a few thousand dimensions almost every pair of random directions sits at ~90°. Gemma 2 2B's activation space has 2,304 dimensions. So "the probe and the mechanism are 88° apart" is what you'd get by throwing two darts blindfolded. The boring default. The surprising result would have been if they were close.

This flips the claim into something much sharper. It's not "the probe points somewhere different from the mechanism." It's "the probe points no closer to the mechanism than a random guess would." We can't statistically tell its angle apart from pure chance (a formal test gives p ≈ 0.5–0.7). The probe carries genuine information about dates — and zero information about how the model computes with them.

Because "88°" is the chance baseline, the angle alone can't be the verdict. So we report three things together:

the angle between the readable direction and the used direction,
the chance baseline for that angle (the arccos√(k/d) value above — what two random directions would give), and
a damage ratio: how much worse it is to delete this direction than to delete a random one.

That third number is what does the real work, because being perpendicular isn't automatically the same as being useless — a direction can be off to the side and still feed the answer through some other path. The damage ratio settles it. Delete the mechanism and you do ~1,000× more damage than deleting a random direction. Delete the probe's direction and you do about as much damage as random — roughly 1×. One is the engine; the other is a gauge that happens to be wired to nothing.

"Maybe you just picked a bad probe"

This is the first thing to suspect, we tried fancier probes — a small neural net, a kernel method. They decode dates just as well and land at the same chance angle from the mechanism. We tried richer targets — a 12-way "which month" classifier at 90% accuracy ends up exactly as far from the mechanism as the simple one. We tried other tasks — spatial reasoning and arithmetic, same story.

The arithmetic case is the cleanest gut-check. On single-digit addition, the probe is perfect (R² = 1.0) — and its direction is still 88° from the causal one, with deleting the mechanism costing 68 points and deleting the probe costing nothing. A flawless probe, completely beside the point.

And bigger models don't fix it. From 1.5B to 9B, the gap gets wider, not narrower — which, it turns out, is exactly what the high-dimensional-geometry story predicts (more dimensions, more room for the two directions to miss each other).

Why this is the rule, not bad luck

There's a clean reason the readable direction and the used direction tend to be different places.

A probe goes wherever the information is easiest to read — the direction along which the answer varies most clearly. The mechanism lives wherever the model is most sensitive to a nudge — the direction where pushing the activation most changes the output. "Where it's easiest to read the answer" and "where a push most changes the behavior" are different questions about the same activations, and deep networks give no particular reason for them to land in the same spot. (For the technically inclined: the probe is set by the data covariance, the mechanism by the gradient covariance; they coincide only under a spectral alignment there's no reason to expect.) Most of the time, they don't — and in a 2,304-dimensional space, "don't coincide" defaults all the way to "nearly perpendicular."

A probe reads "dates" before the model has learned anything

One result is worth pausing on. On Pythia 1.4B at training step 0 — a network with random weights that has learned nothing — the date probe already reads R² = 0.956. By the probe's standard, an untrained network "represents dates."

The actual mechanism (the month-counting circuit, the geometric structure) is absent at the start and only grows over training. So the probe is detecting something that's there from the moment you initialize the weights — the raw capacity of a big activation space to have things linearly read off it — not whether the model has learned to compute anything. Probe accuracy can be high before there's any computation for it to be relevant to.

Is the causal search really finding the mechanism?

Everything above depends on the causal method (DAS) actually pointing at the mechanism, and DAS has a fair critique: a search that optimizes a direction can land on something that breaks the model without corresponding to a clean internal variable (Grant et al. 2026). So we leaned on it hard.

The direction we find is stable across random restarts, and the four directions involved only matter together — deleting them jointly does ~60× more damage than deleting them one at a time, so they're a working unit, not a grab-bag. A completely different method (attribution patching) lands on the same circuit. And the clincher: the circuit is legible enough to check against reality. The model routes context through attention heads tuned to ±30 and ±61 days — one-month and two-month steps — with no weekly (±7) pattern. That's Gregorian month arithmetic, reconstructed from the inside.

An optimizer hallucinating a convenient direction doesn't accidentally rediscover the calendar. We're confident it's faithful here; we're less sure it would be on a task with no arithmetic to check against, and that caveat carries into the next part.

Why this matters for monitoring

Now the safety point. The plan behind probe monitors is: read the model's internals, catch it misbehaving even when the output looks fine. The dissociation breaks the plan in a specific way.

Think of a thermometer on one wall and the furnace it's supposed to watch. The thermometer reads the temperature accurately — but if it isn't wired to the furnace, you can run a space heater in the corner and the thermometer keeps reporting "all normal." Our probe is that thermometer. Three results, on these tasks:

Building the monitor with safety intent doesn't help. A purpose-built "is the model confident?" monitor, trained on the same activations, lands 88.4° from the mechanism — right on the chance baseline. The blind spot is geometric, not a matter of training the probe harder or for the right thing. Every one of seven different target choices we tried lands at chance.
The probe's reading tells you nothing about the mechanism's state. Formally, the mutual information between them is 0.000 bits, while the probe carries ~4 bits about the actual dates. Knowing what the probe reports gives you literally no information about what the computation is doing.
You can exploit the gap on purpose. With one knob that corrupts the mechanism and another that keeps the probe reassured, we shift the model's internal answer by 71 days while the probe still reports a 6-day error. 93% of the damage is invisible to the monitor.

The honest boundary: that last demo is hand-built and white-box — we used our knowledge of both directions to pry them apart. It shows the two are independently movable. It does not show a model learning to fool its monitor on its own. We didn't catch a model doing this.

What it does show is the precondition that makes such evasion plausible. If the watched direction is generically perpendicular to the used direction, there's a huge amount of room — most of a 2,304-dimensional space — for the computation to move while the monitor stays still. Training a model against a probe-monitor is, in effect, pressure to find that room. This lines up with existing findings that models can learn to hide their activations from latent-space monitors under optimization pressure, and with the sleeper-agent result where probes catch a toy backdoor but miss a more realistic one.

None of this means probe monitors are useless — explicit, instructed deception is often catchable, and the field's own framing already hedges. The narrower, harder-to-dodge point is this: a probe's accuracy (or AUROC) tells you nothing about whether it's reading the mechanism — and right now people report the accuracy and imply the mechanism.

What we're not claiming

These are tasks where the model computes a number (durations, positions, sums), not deception. The geometric argument is general and three domains behave identically, so the burden is on a deployment to show its probe is the exception — but we haven't demonstrated this on deception itself.
The faithfulness of the causal search is strongest where there's external ground truth (here, the calendar). On open-ended safety properties that's a bigger assumption.
The exploit is white-box and static, not a model that learned to evade.

What we suggest

If you put forward a probe as evidence that a model "represents" something, or you deploy one as a safeguard, report three numbers: the angle between the readable and used directions, the random-chance baseline for that angle, and the damage ratio (delete-the-direction vs delete-a-random-direction). Without them, a high-accuracy probe is uncharacterized — it might be reading the mechanism, or a direction the model abandoned, and accuracy can't tell you which. The check is cheap (a sub-$1 GPU run on a 2B model) and the chance baseline is a closed-form number, so there's little excuse to skip it.

The most useful thing this thread could surface is a counterexample: a probe that comes in well below the chance angle with a high damage ratio. If you have one, we'd genuinely like to see it.

Paper: When and How Long? The Readout–Mediator Angle in Temporal Reasoning.