A lot of safety work wants to use linear probes as monitors — read a model's activations and catch it deceiving, scheming, or using dangerous knowledge even when the output looks clean (Burns et al. 2023; Zou et al. 2023; Marks & Tegmark 2024; Goldowsky-Dill et al. 2025). This rests on an assumption that almost nobody states out loud: if a probe can read a concept off the activations, that's the direction the model is computing with. We think that assumption is usually false, and this work gives you a way to check.
Here's the one-sentence version. A probe can decode a feature almost perfectly and still be reading a direction the model doesn't actually use — and you can measure when that's happening.
The thing a probe measures, and the thing it doesn't
Picture two spies watching the same operation through a one-way mirror.
The interesting question is whether the noticeboard is the plan, or just a readout of it. So we did the obvious experiment on Gemma 2 2B, asking it questions like "How many days between March 15th and June 22nd?" (it answers 99, correctly):
For calibration: deleting a random direction of the same size costs 0.04pp. So the probe's direction, causally, behaves like a random direction. The noticeboard was a readout, not the plan. The probe decodes dates at R² up to 0.996 (replicating Gurnee & Tegmark 2024) — and the model isn't using that direction to count.
Figure 1. Two subspaces at the same layer. The probe subspace passively decodes both dates (ablating it costs 0.6pp); the causal subspace counts the duration via month-boundary hops (ablating it collapses accuracy to 0%).
"How far apart are they?" — and why the obvious answer is a trap
Geometrically, the probe's direction and the used direction are 88° apart. Nearly perpendicular. It's tempting to stop there and say "see, totally different directions" — but that number means almost nothing on its own, and understanding why is the crux of the whole paper.
In high-dimensional space, almost any two directions are nearly perpendicular. In 2D or 3D, two random arrows are often fairly aligned. But as you pile on dimensions, random directions have less and less reason to point the same way, and by a few thousand dimensions almost every pair of random directions sits at ~90°. Gemma 2 2B's activation space has 2,304 dimensions. So "the probe and the mechanism are 88° apart" is what you'd get by throwing two darts blindfolded. The boring default. The surprising result would have been if they were close.
Figure 2. In high dimensions, random subspaces are nearly orthogonal by default. (A) Median angle between two random rank-4 subspaces as a function of dimensionality d, with shaded IQR and 5–95th percentile bands. The curve saturates near 88° by a few hundred dimensions. (B) Full angle distributions at d = 4, 200, and 2,304: the spread collapses from ~40° wide to a spike at 88°. (C) The closed-form Haar null: E[θ] = arccos√(k/d) = 88.3° at Gemma 2 2B's d = 2,304. An observed angle of 88° is not evidence of anything; it's the default.
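You can see the concentration for yourself with a few lines of NumPy. This is a minimal sketch, not the paper's code: it draws pairs of single random directions (rather than rank-4 subspaces) and folds the angle into [0°, 90°], both of which are simplifying assumptions made here.

```python
import numpy as np

def random_pair_angles(d, n_pairs=1000, seed=0):
    """Angles in degrees, folded into [0, 90], between random direction pairs in R^d."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_pairs, d))
    v = rng.standard_normal((n_pairs, d))
    # |cos| of the angle between each pair of Gaussian (hence isotropic) directions
    cos = np.abs(np.sum(u * v, axis=1)) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

for d in (2, 3, 50, 2304):
    angles = random_pair_angles(d)
    lo, med, hi = np.percentile(angles, [5, 50, 95])
    print(f"d={d:5d}  median={med:5.1f}  5-95% range=[{lo:5.1f}, {hi:5.1f}]")
```

In low dimensions the angles spread over most of the range; at d = 2,304 they pile up within a degree or two of 90°.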
This flips the claim into something much sharper. It's not "the probe points somewhere different from the mechanism." It's "the probe points no closer to the mechanism than a random guess would." We can't statistically tell its angle apart from pure chance (a formal test gives p ≈ 0.5–0.7). The probe carries genuine information about dates — and zero information about how the model computes with them.
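One way to run such a test is a Monte Carlo comparison against random subspaces. The sketch below is an assumption about how that could look, not the paper's procedure: the aggregate angle (arccos of the RMS cosine of the principal angles), the one-sided direction of the test, and the function names are all choices made here for illustration.

```python
import numpy as np

def random_subspace(d, k, rng):
    """Orthonormal basis (d x k) of a random k-dimensional subspace."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q

def subspace_angle_deg(a, b):
    """Aggregate angle between subspaces with orthonormal bases a, b (both d x k).
    Assumed summary: arccos of the RMS cosine of the principal angles."""
    k = a.shape[1]
    rms_cos = np.linalg.norm(a.T @ b) / np.sqrt(k)  # Frobenius norm
    return float(np.degrees(np.arccos(np.clip(rms_cos, 0.0, 1.0))))

def chance_p_value(observed_deg, d, k, n_draws=500, seed=0):
    """Fraction of random subspace pairs at least as aligned as the observed pair."""
    rng = np.random.default_rng(seed)
    null = np.array([
        subspace_angle_deg(random_subspace(d, k, rng), random_subspace(d, k, rng))
        for _ in range(n_draws)
    ])
    return float(np.mean(null <= observed_deg))

# Hypothetical usage: an observed angle well inside the null distribution
# gives a large p-value; a genuinely aligned pair gives a p-value near 0.
print(chance_p_value(observed_deg=30.0, d=200, k=4))
```

A small p-value here would mean the probe sits closer to the mechanism than chance allows. A p-value in the middle of the range means the observed angle is indistinguishable from a random draw.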
Because "88°" is the chance baseline, the angle alone can't be the verdict. So we report three things together:
That third number is what does the real work, because being perpendicular isn't automatically the same as being useless — a direction can be off to the side and still feed the answer through some other path. The damage ratio settles it. Delete the mechanism and you do ~1,000× more damage than deleting a random direction. Delete the probe's direction and you do about as much damage as random — roughly 1×. One is the engine; the other is a gauge that happens to be wired to nothing.
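The damage ratio itself is simple to compute once you can delete a direction. Here is a minimal sketch on a toy linear model. Everything in it is an assumption for illustration: damage is measured as mean squared change in output rather than accuracy in percentage points, and the "readout" direction is built to be exactly orthogonal to the used one, which is cleaner than anything in a real network.

```python
import numpy as np

def ablate(x, direction):
    """Project the component along `direction` out of every row of x."""
    u = direction / np.linalg.norm(direction)
    return x - np.outer(x @ u, u)

def damage(model, x, direction):
    """Mean squared change in model output when `direction` is deleted."""
    return float(np.mean((model(x) - model(ablate(x, direction))) ** 2))

def damage_ratio(model, x, direction, n_random=100, seed=0):
    """Damage from deleting `direction`, relative to deleting a random direction."""
    rng = np.random.default_rng(seed)
    baseline = np.mean([
        damage(model, x, rng.standard_normal(x.shape[1])) for _ in range(n_random)
    ])
    return damage(model, x, direction) / baseline

# Toy setup (hypothetical): the model reads its input along one "used" direction.
rng = np.random.default_rng(1)
d = 64
x = rng.standard_normal((2000, d))
used = rng.standard_normal(d)
used /= np.linalg.norm(used)

def toy_model(acts):
    return acts @ used

# A stand-in "readout" direction, orthogonal to the used one by construction.
readout = rng.standard_normal(d)
readout -= (readout @ used) * used

ratio_used = damage_ratio(toy_model, x, used)
ratio_readout = damage_ratio(toy_model, x, readout)
print(f"used: {ratio_used:.1f}x random, readout: {ratio_readout:.2g}x random")
```

The toy numbers will not match the paper's ~1,000× and ~1×. The point is only the shape of the measurement: same ablation, same baseline, two very different ratios.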
"Maybe you just picked a bad probe"
This is the first thing to suspect, so we tried fancier probes: a small neural net and a kernel method. They decode dates just as well and land at the same chance angle from the mechanism. We tried richer targets: a 12-way "which month" classifier at 90% accuracy ends up exactly as far from the mechanism as the simple one. We tried other tasks, spatial reasoning and arithmetic, and got the same story.
The arithmetic case is the cleanest gut-check. On single-digit addition, the probe is perfect (R² = 1.0) — and its direction is still 88° from the causal one, with deleting the mechanism costing 68 points and deleting the probe costing nothing. A flawless probe, completely beside the point.
And bigger models don't fix it. From 1.5B to 9B, the gap gets wider, not narrower — which, it turns out, is exactly what the high-dimensional-geometry story predicts (more dimensions, more room for the two directions to miss each other).
Why this is the rule, not bad luck
There's a clean reason the readable direction and the used direction tend to be different places.
A probe goes wherever the information is easiest to read — the direction along which the answer varies most clearly. The mechanism lives wherever the model is most sensitive to a nudge — the direction where pushing the activation most changes the output. "Where it's easiest to read the answer" and "where a push most changes the behavior" are different questions about the same activations, and deep networks give no particular reason for them to land in the same spot. (For the technically inclined: the probe is set by the data covariance, the mechanism by the gradient covariance; they coincide only under a spectral alignment there's no reason to expect.) Most of the time, they don't — and in a 2,304-dimensional space, "don't coincide" defaults all the way to "nearly perpendicular."
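A toy example makes the two questions concrete. In the sketch below (all of it assumed for illustration, none of it from the paper), a feature z is written into the activations along a high-variance direction `a`, while a toy model f(x) = tanh(b·x) only listens along an unrelated direction `b`. The least-squares probe follows the data; the top eigenvector of the gradient covariance follows the model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 4000

def unit(v):
    return v / np.linalg.norm(v)

def angle_deg(u, v):
    """Angle in degrees between two unit directions, ignoring sign."""
    return float(np.degrees(np.arccos(np.clip(abs(u @ v), 0.0, 1.0))))

a = unit(rng.standard_normal(d))  # where the feature is easy to read
b = unit(rng.standard_normal(d))  # where the toy model is sensitive

z = rng.standard_normal(n)                               # feature to decode
x = np.outer(z, a) + 0.1 * rng.standard_normal((n, d))   # activations

# Probe: ordinary least squares from activations to the feature.
w, *_ = np.linalg.lstsq(x, z, rcond=None)
probe_dir = unit(w)

# Mechanism: top eigenvector of the covariance of d f / d x, with f = tanh(b.x).
grads = (1.0 - np.tanh(x @ b) ** 2)[:, None] * b[None, :]
eigvals, eigvecs = np.linalg.eigh(grads.T @ grads / n)   # ascending eigenvalues
sensitive_dir = eigvecs[:, -1]

print("probe vs readable direction a:  ", round(angle_deg(probe_dir, a), 1))
print("sensitive vs used direction b:  ", round(angle_deg(sensitive_dir, b), 1))
print("probe vs sensitive direction:   ", round(angle_deg(probe_dir, sensitive_dir), 1))
```

The probe lands near `a`, the sensitivity analysis lands on `b`, and the two end up close to perpendicular simply because `a` and `b` were drawn independently.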
A probe reads "dates" before the model has learned anything
One result is worth pausing on. On Pythia 1.4B at training step 0 — a network with random weights that has learned nothing — the date probe already reads R² = 0.956. By the probe's standard, an untrained network "represents dates."
The actual mechanism (the month-counting circuit, the geometric structure) is absent at the start and only grows over training. So the probe is detecting something that's there from the moment you initialize the weights — the raw capacity of a big activation space to have things linearly read off it — not whether the model has learned to compute anything. Probe accuracy can be high before there's any computation for it to be relevant to.
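The same effect shows up in a far smaller setting. The sketch below is a toy stand-in, not a reproduction of the Pythia result: a scalar "date" in [0, 1] is pushed through a two-layer network with random, untrained weights, and a linear probe is fit on half the points and scored on the other half. The architecture, sizes, and train/test split are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 730)  # stand-in for a normalized date

# Untrained two-layer tanh network: weights are random and never updated.
w1 = rng.standard_normal((1, 64))
b1 = rng.standard_normal(64)
w2 = rng.standard_normal((64, 256)) / np.sqrt(64)
hidden = np.tanh(np.tanh(t[:, None] @ w1 + b1) @ w2)   # shape (730, 256)

# Linear probe with an intercept, fit on even indices, scored on odd ones.
feats = np.hstack([hidden, np.ones((len(t), 1))])
train = np.arange(0, len(t), 2)
test = np.arange(1, len(t), 2)
coef, *_ = np.linalg.lstsq(feats[train], t[train], rcond=None)

pred = feats[test] @ coef
ss_res = np.sum((t[test] - pred) ** 2)
ss_tot = np.sum((t[test] - t[test].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"held-out probe R^2 on an untrained network: {r2:.3f}")
```

Nothing here has learned anything about dates, yet the probe decodes them well, which is the sense in which high probe accuracy reflects the capacity of the activation space rather than a learned computation.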
Is the causal search really finding the mechanism?
Everything above depends on the causal method (DAS) actually pointing at the mechanism, and DAS has a fair critique: a search that optimizes a direction can land on something that breaks the model without corresponding to a clean internal variable (Grant et al. 2026). So we leaned on it hard.
The direction we find is stable across random restarts, and the four directions involved only matter together — deleting them jointly does ~60× more damage than deleting them one at a time, so they're a working unit, not a grab-bag. A completely different method (attribution patching) lands on the same circuit. And the clincher: the circuit is legible enough to check against reality. The model routes context through attention heads tuned to ±30 and ±61 days — one-month and two-month steps — with no weekly (±7) pattern. That's Gregorian month arithmetic, reconstructed from the inside.
Figure 3. The circuit the model actually uses to count days. (A) SAE features in Gemma 2 2B tile the calendar year like hippocampal place cells tile space: each fires for a narrow time window, sorted by peak day. (B) The same features on a polar calendar wheel showing full-year coverage. (C) A single boundary head's 365×365 QK attention matrix: the off-diagonal ridge (dashed gold) shows it routes context exactly 30 days back. (D) Detected offsets across all boundary heads cluster at ±30 days (one month) and ±61 days (two months), with no ±7-day weekly signal. The model learned Gregorian calendar structure, not generic periodicity.
An optimizer hallucinating a convenient direction doesn't accidentally rediscover the calendar. We're confident it's faithful here; we're less sure it would be on a task with no arithmetic to check against, and that caveat carries into the next part.
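Reading an offset off a head's day-by-day score matrix can be done by averaging along each diagonal and taking the strongest one. The sketch below runs on a synthetic matrix with a ridge planted 30 days back. The function name, the row = query day / column = key day convention, and the synthetic data are assumptions for illustration, and this is not necessarily how the paper detects offsets.

```python
import numpy as np

def dominant_offset(scores, max_offset=90):
    """Nonzero (key day - query day) offset whose diagonal has the highest mean score."""
    offsets = [k for k in range(-max_offset, max_offset + 1) if k != 0]
    means = [np.mean(np.diagonal(scores, offset=k)) for k in offsets]
    return offsets[int(np.argmax(means))]

# Synthetic 365 x 365 score matrix: background noise plus a ridge where
# each query day attends to the key day 30 days earlier.
rng = np.random.default_rng(0)
n_days = 365
scores = 0.1 * rng.standard_normal((n_days, n_days))
query_days = np.arange(30, n_days)
scores[query_days, query_days - 30] += 1.0

print(dominant_offset(scores))  # prints -30
```

Running the same scan over every boundary head and histogramming the results is one way to get a picture like panel (D), where a weekly pattern would show up as a peak at ±7.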
Why this matters for monitoring
Now the safety point. The plan behind probe monitors is: read the model's internals, catch it misbehaving even when the output looks fine. The dissociation breaks the plan in a specific way.
Think of a thermometer on one wall and the furnace it's supposed to watch. The thermometer reads the temperature accurately — but if it isn't wired to the furnace, you can run a space heater in the corner and the thermometer keeps reporting "all normal." Our probe is that thermometer. Three results, on these tasks:
The honest boundary: that last demo is hand-built and white-box — we used our knowledge of both directions to pry them apart. It shows the two are independently movable. It does not show a model learning to fool its monitor on its own. We didn't catch a model doing this.
What it does show is the precondition that makes such evasion plausible. If the watched direction is generically perpendicular to the used direction, there's a huge amount of room — most of a 2,304-dimensional space — for the computation to move while the monitor stays still. Training a model against a probe-monitor is, in effect, pressure to find that room. This lines up with existing findings that models can learn to hide their activations from latent-space monitors under optimization pressure, and with the sleeper-agent result where probes catch a toy backdoor but miss a more realistic one.
None of this means probe monitors are useless — explicit, instructed deception is often catchable, and the field's own framing already hedges. The narrower, harder-to-dodge point is this: a probe's accuracy (or AUROC) tells you nothing about whether it's reading the mechanism — and right now people report the accuracy and imply the mechanism.
What we're not claiming
What we suggest
If you put forward a probe as evidence that a model "represents" something, or you deploy one as a safeguard, report three numbers: the angle between the readable and used directions, the random-chance baseline for that angle, and the damage ratio (delete-the-direction vs delete-a-random-direction). Without them, a high-accuracy probe is uncharacterized — it might be reading the mechanism, or a direction the model abandoned, and accuracy can't tell you which. The check is cheap (a sub-$1 GPU run on a 2B model) and the chance baseline is a closed-form number, so there's little excuse to skip it.
The most useful thing this thread could surface is a counterexample: a probe that comes in well below the chance angle with a high damage ratio. If you have one, we'd genuinely like to see it.
Paper: When and How Long? The Readout–Mediator Angle in Temporal Reasoning.