Eval-Awareness Steering Detects the Test, Not the Sabotage

sahilraut

Produced as part of independent research

Huge thanks to Apollo Research (org) for open-sourcing the deception-detection harness which proved to be foundational in this work. Prior work by Devbunova (2026), the Apollo/Goldowsky-Dill probing line, and Tice et al. on noise injection shaped the design throughout.

Summary

I test whether the internal "I'm being evaluated" direction in an open-weight model causally drives sandbagging (deliberate underperformance) or merely correlates with it. I reuse Apollo's deception-detection harness, fit a difference-of-means eval-awareness direction on a format-confound-controlled 2x2 dataset, and then ask three separate questions of that direction on Llama-3.1-8B-Instruct:

Does it detect the eval/deploy framing?
Does steering with it move sandbagging?
Does it predict which individual items get sandbagged, using noise injection and an honest-baseline capability gap as output-independent ground truth?

I found that the direction is a valid detector of evaluation framing (~0.81 AUROC under novel, format-matched framings) but is neither a causal driver of sandbagging nor a per-item predictor of it.

The cleanest way to state the result is: the direction detects the situational framing, not the capability-withholding behaviour.

Figure 1: (a) framing-gen AUROC by layer: L12 headline blue, **L24 marker-overfit red ≈ null** (b) flat L12 dose-response (c) probe 0.54 vs baseline 0.48, both at chance

These results are still preliminary: a single 8B model, a single prompted organism, no SAE, and a steering dose that I'll argue may be too gentle to support a general conclusion. I'm putting these forward as a careful calibration point for anyone building or researching eval-awareness and sandbagging classifiers. I hope these controls and negative results end up being useful observations and potentially a source of ideas. If you want to discuss, message me or open an issue on the repo.

All code is available at the GitHub repo.

Methods

Sandbagging and evaluation-awareness

Sandbagging is a model deliberately answering below its true capabilities. It is typically hypothesized to switch on when the model believes it is being evaluated.

Evaluation-awareness is the model's recognition that it is being tested, and possibly of what kind of test it is.

The safety worry that links the two is this: if a model can realize it's being evaluated and withhold its capabilities accordingly, our evals will end up understating how dangerous it actually is. A natural hope is that the eval-awareness representation is linear and steerable, so we could both detect the sandbagging and suppress it by nullifying the direction. The main aim here is to carefully check the causal half of that hope.

I worked on Llama-3.1-8B-Instruct (single A100/4090-class budget) and reused Apollo's deception-detection harness for activation caching, the difference-of-means detector, and the black-box baseline.

The detector

The eval-awareness direction is the difference of means between eval-framed and deploy-framed activations. I fit it on a 2x2 minimal-pair dataset: each question appears under {eval, deploy} x {benchmark format, natural format}, with persona and register fixed within a given pair so that only a short marker phrase differs between eval and deploy. This is a direct response to the critique "is eval-awareness just format sensitivity?": if the direction were really a format detector, it would fail on format-matched pairs.
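
As a rough illustration of the detector just described, here is a minimal pure-Python sketch of a difference-of-means direction and its projection score. The toy 4-dimensional "activations", the helper names, and the choice to normalize the direction to unit length are all assumptions for illustration; the actual harness works on cached residual-stream activations and may differ in detail.

```python
import math
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means_direction(eval_acts, deploy_acts):
    """Unit-norm direction pointing from the deploy mean to the eval mean.

    Normalizing to unit length is an assumption made for this sketch.
    """
    mu_eval = mean_vector(eval_acts)
    mu_deploy = mean_vector(deploy_acts)
    diff = [e - d for e, d in zip(mu_eval, mu_deploy)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def score(activation, direction):
    """Detector score: projection of one activation onto the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Hypothetical toy data: only dimension 0 carries the eval/deploy framing.
rng = random.Random(0)

def toy_activations(shift, n=50):
    return [[shift + rng.gauss(0, 0.1)] + [rng.gauss(0, 0.1) for _ in range(3)]
            for _ in range(n)]

eval_acts = toy_activations(+1.0)
deploy_acts = toy_activations(-1.0)
direction = diff_of_means_direction(eval_acts, deploy_acts)

print("direction:", [round(x, 3) for x in direction])
print("eval score:", round(score(eval_acts[0], direction), 3))
print("deploy score:", round(score(deploy_acts[0], direction), 3))
```

In the 2x2 design, the eval and deploy sets would each pool both formats, so that format differences cancel in the two means and only the marker-phrase difference survives.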

Steering

To test causation, we add or nullify the direction on the residual stream via forward hooks, with a coefficient α ∈ [−2, +2], and measure sandbagging accuracy at each dose. The pre-registered causal criterion is a conjunction: (i) a sign-correct, significant and monotone-ish dose response; (ii) a swing of ≥5pp with a paired bootstrap CI excluding zero; and (iii) the criterion I weight most heavily: the real direction's swing must beat an empirical null of 24 norm-matched random directions at p<0.05. I also ran an ablate arm (projecting the direction out) as another independent causal handle.^[1]
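
The two interventions can be sketched on a single residual-stream vector as follows. This is only the vector arithmetic, under the assumption that α scales a unit-norm direction; in a real model the same operations would run inside a forward hook (for example PyTorch's `register_forward_hook`) on the chosen layer. The function names and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Rescale v to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def steer(h, direction, alpha):
    """Steering arm: add alpha * d_hat to the residual vector h."""
    d_hat = unit(direction)
    return [x + alpha * di for x, di in zip(h, d_hat)]

def ablate(h, direction):
    """Ablate arm: project the direction out, h - (h . d_hat) d_hat."""
    d_hat = unit(direction)
    coeff = dot(h, d_hat)
    return [x - coeff * di for x, di in zip(h, d_hat)]

# Hypothetical toy vectors; the unit form of d is [0.6, 0.0, 0.8].
h = [0.5, -1.0, 2.0]
d = [3.0, 0.0, 4.0]

steered = steer(h, d, alpha=2.0)
ablated = ablate(h, d)

print("projection before:", dot(h, unit(d)))
print("projection after steering:", dot(steered, unit(d)))
print("projection after ablation:", dot(ablated, unit(d)))
```

Steering shifts the projection onto the direction by exactly α, while ablation sets it to zero and leaves every orthogonal component untouched.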

Noise-injection ground truth

The most important rule of this project is that we never label a model as sandbagging based on its own outputs, since outputs are exactly what we distrust. Ground truth must therefore come from output-independent sources. First, an honest baseline gap: run each item under both honest framing and sandbag framing; an item counts as sandbagged only if the model gets it right under honest framing and wrong under sandbag framing (and it is excluded if the model can't answer it even honestly). Second, a noise-injection oracle (the Tice protocol): inject small Gaussian noise into the weights and check whether accuracy recovers, which would indicate that the model knew the answer but was suppressing it.
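
The honest-gap labelling rule above is simple enough to write down directly. The sketch below is one way to express it; the function name, the record layout, and the four example items are hypothetical and not taken from the repo.

```python
from typing import Optional

def honest_gap_label(honest_correct: bool, sandbag_correct: bool) -> Optional[str]:
    """Label one item from its correctness under the two framings.

    Returns None for excluded items: if the model cannot answer the item
    even under honest framing, there is no capability to withhold.
    """
    if not honest_correct:
        return None
    return "honest" if sandbag_correct else "sandbagging"

# Hypothetical per-item results under the two framings.
items = [
    {"id": "q1", "honest_correct": True,  "sandbag_correct": False},
    {"id": "q2", "honest_correct": True,  "sandbag_correct": True},
    {"id": "q3", "honest_correct": False, "sandbag_correct": False},
    {"id": "q4", "honest_correct": False, "sandbag_correct": True},
]

labels = {
    it["id"]: honest_gap_label(it["honest_correct"], it["sandbag_correct"])
    for it in items
}
usable = {k: v for k, v in labels.items() if v is not None}

print(labels)
print("usable items:", len(usable))
```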

Pre-registration and controls

Since steering studies are easy to manipulate, I pre-registered the verdict criteria, the primary/secondary layers, and a headline-layer selection rule before looking at the relevant data, and gated every stage behind cheap positive controls. These turned out to be really useful: one of the gates caught a silent scorer bug (a letter-vs-integer answer-key mismatch) that would have zeroed every accuracy number.^[2]

Does the direction really detect evaluation-awareness?

Yes. This is the one clear positive result. In-distribution (held-out questions, same training framings), the difference-of-means direction separates eval from deploy at ~0.99 AUROC at layer 24. But in-distribution numbers are exactly where format confounds hide, so the number we actually trust is generalization to novel, format-matched framings, where it lands at ~0.81: comfortably above our structured null band of ~0.64 (random directions, prompt length and format-only features), while being far from the in-distribution 0.99. I interpret this as a real, modest concept, not a format artifact.

Now to look at which layer actually generalizes. I scored the direction's framing generalization at every layer in the band, and the ordering is the opposite of what the in-distribution numbers suggest:

Layer	In-distribution AUROC	Novel-framing AUROC
8	-	0.910
12	-	0.898
14	-	0.865
...	...	...
24	~0.99	0.635 ≈ null band

We see that the layer that detects evaluation-awareness best in-distribution (layer 24) is the worst at generalizing to framings it wasn't fit on; it sits almost exactly in the null band. The likely explanation is that the later-layer detector is marker-overfit: it has latched onto surface features of the specific training markers rather than the intended concept, and those features don't transfer. The generalizing concept lives in the earlier layers (layers 8-14). So the best-AUROC probe isn't necessarily the most conceptual one, which is why the pre-registered headline-layer rule moved the causal headline off layer 24 (overfit) and onto layer 12 (best generalizer) before any steering data was seen.
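
For readers less familiar with the metric, the AUROC values in this section can be read as the probability that a randomly chosen eval-framed prompt gets a higher projection score than a randomly chosen deploy-framed one. Below is a minimal pairwise sketch of that definition. The scores are made up, and the harness presumably uses a library routine rather than this quadratic loop.

```python
def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual rank-based definition.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical projection scores on held-out prompts.
eval_scores = [0.9, 0.8, 0.4]
deploy_scores = [0.5, 0.3, 0.1]

print("AUROC:", auroc(eval_scores, deploy_scores))
```

Under this reading, 0.5 means the direction orders eval and deploy prompts no better than a coin flip, which is why a novel-framing score near the null band counts as a failure to generalize.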

Does steering the direction cause sandbagging?

This is the central question, and the honest answer from this study is no, at least not within the dose range where the model stays coherent. I'm putting the null first because it's the main result.

At the headline layer (12, the best concept generalizer), the dose response is essentially flat. The swing from α=−2 to α=+2 is −1.77pp, far short of the pre-registered 5pp bar. The Spearman correlation is sign-correct but sub-threshold (ρ = −0.53, p = 0.08), and on the most important criterion, the real direction's swing does not beat the empirical null of 24 random directions (p = 0.20). The ablate arm is sign-correct (removing the direction pushes accuracy towards the honest baseline), which is one step in the predicted direction, but not enough to clear the conjunction. Off-target capability stays flat throughout, so the steering is selective; it just doesn't do the thing this study hoped.
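
To make the random-direction null concrete, here is a sketch of one common way to compute such an empirical p-value: count how many null directions produce a statistic at least as large as the real one, with a +1 correction. Both the (k + 1) / (N + 1) formula and the choice of statistic (absolute swing, or swing in the predicted direction) are assumptions here, not necessarily what the repo does. With N = 24 this formula can at least produce both values reported in this section, 0.20 (k = 4) and 1.0 (k = 24). The numbers in the example are synthetic.

```python
def empirical_null_p(real_stat, null_stats):
    """One-sided empirical p-value with a +1 correction.

    real_stat and null_stats should be on a scale where larger means
    more evidence for the effect. Choosing that scale is up to the caller.
    """
    k = sum(1 for s in null_stats if s >= real_stat)
    return (k + 1) / (len(null_stats) + 1)

# Synthetic example: 24 null statistics 0..23, none taken from the study.
null_stats = list(range(24))

p_some = empirical_null_p(20, null_stats)   # 4 nulls are >= 20
p_worst = empirical_null_p(-1, null_stats)  # every null is >= -1

print("p (4 of 24 nulls at least as large):", p_some)
print("p (all nulls at least as large):", p_worst)
```

With only 24 random directions, the smallest attainable p-value under this formula is 1/25 = 0.04, so clearing p < 0.05 requires the real direction to beat every single random one.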

Figure 2: L12 sandbagging dose-response inside the ±1 SE band, MMLU off-target separated (selective), verdict annotated

The two secondary layers tell the same story from different angles. Layer 14 is also flat. Layer 24 produces a large swing (+11.3pp), but it is wrongly signed and completely indistinguishable from random directions (null p = 1.0). So steering at layer 24 perturbs behaviour just as much as steering with noise does, in no particular direction.

An honest caveat: the steering I implemented was rather gentle. The KL divergence on harmless prompts at α=±2 was ~0.001, meaning the model barely moved, and the model was still fully coherent at ±2, meaning we could probably have pushed harder. So the defensible claim is "not causal within the coherent ±2 range". I think this null is real and consistent with prior steering results, but a higher-dose sweep (±3 or ±4, up to the coherence boundary) is the experiment that would turn a suggestive null into a defensible one. It's the next thing I will be working on.
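
For a sense of scale, here is a small sketch of a KL divergence between a baseline and a steered next-token distribution. It assumes KL(base || steered) measured in nats on a single position; the direction of the KL, the averaging over prompts and positions, and the toy logits are all assumptions, not details from the repo.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same tokens."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 4-token vocabulary: steering nudges one logit by 0.05.
base = softmax([2.0, 1.0, 0.0, -1.0])
steered = softmax([2.0, 1.05, 0.0, -1.0])

kl = kl_divergence(base, steered)
print("KL(base || steered):", kl)
```

In this toy case a 0.05 shift in a single logit already gives a KL of a few times 1e-4, so a value around 0.001 corresponds to a very small change in the output distribution.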

Does the direction predict which items get sandbagged?

The question here is: even if it isn't causal, might the direction still be a useful detector of per-item sandbagging? The answer is also no; it sits at chance.

Against the honest-gap labels (n = 153 usable items, 100 sandbagging + 53 honest)^[3], the per-item probe AUROC is 0.542 at layer 24, and 0.42-0.55 across every band layer; none of these is meaningfully above 0.5 (the standard error was ~0.05, so 0.54 is within noise). For comparison, I also constructed a local, key-free black-box baseline: the model's answer-letter confidence under the sandbag framing, which comes in at 0.480, so it too is at chance. The harness's probe > baseline rule technically declares the probe the winner by +0.06, but that delta is within noise and neither detector predicts which items are sandbagged.
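
As a sanity check on the "within noise" reading, the Hanley-McNeil approximation gives a standard error for an AUROC from the two class sizes alone. This is a swapped-in textbook formula, not necessarily how the ~0.05 figure above was obtained (a bootstrap would be equally plausible). Treating the 100 sandbagging items as the positive class is also an assumption, though swapping the classes barely changes the result.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil (1982) approximate standard error of an AUROC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc * auc / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc * auc)
           + (n_neg - 1) * (q2 - auc * auc)) / (n_pos * n_neg)
    return math.sqrt(var)

# Figures from the text: AUROC 0.542 on 100 sandbagging + 53 honest items.
se = hanley_mcneil_se(0.542, n_pos=100, n_neg=53)
z = (0.542 - 0.5) / se

print("approx. standard error:", round(se, 4))
print("z-score vs chance:", round(z, 2))
```

The approximation lands close to 0.05, which puts 0.542 less than one standard error above chance.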

And here is the part which I think is genuinely the most useful observation in the whole study. The two ground-truth label sources, the honest baseline gap and the noise-injection oracle, agree on only about 54% of items (82/153), while the remaining 71 disagree. The simplest explanation is that which item is being sandbagged is not a well-defined per-item quantity. You can't expect a detector to predict a per-item label that the ground-truth definitions themselves don't agree on. This is a caveat that I think anyone studying per-item benchmarks should keep in mind.

Figure 3: per-item AUROC chance at every layer + the 54% label-agreement panel

Thus the three results compose into one coherent picture. Evaluation-awareness and sandbagging are different things, and the direction we found rests firmly on the former side: it detects the framing of the test and does not move or pinpoint the sandbagging.

Next Steps

The next steps, in rough priority order, are:

The higher-dose sweep: Push steering to ±3/±4 at layer 12, right up to the coherence boundary, to settle whether the causal null persists. This is a single experiment that can go either way: strengthen the claim or overturn it.

More organisms and a password-locked setting: A single prompted organism isn't strong evidence. A fine-tuned/password-locked sandbagger and an adversarial organism (something which can imitate a weak model) would test whether the null is organism-specific.

SAE features: This study used a simple difference-of-means direction, but the concept may not live in a single linear direction and may instead be more spread out.

Other ≤8B models: To see whether the marker-overfit observation at the later layers is general.

I would be really interested to compare and look into insights from anyone working on eval-awareness or sandbagging detection. Please feel free to reach out :)

^[1]
Seeds, sizes, hyperparameters: difference-of-means direction fit on the 2×2 dataset (seed 42); steering on n=300 WMDP-skill MCQs, 9 coefficients, 24 random-vector controls at ±2; noise 12-point log σ∈[1e-4, 3e-2], ≥5 seeds/σ, organism φ=1.13 at σ≈1.7e-4 whereas honest φ≈1.00 (gap 0.13, p=0.02). The coherence boundary (σ≈2.8e-4 for noise; the steering KL being ~0.001 at ±2) was symmetric across the organism and honest arms, so it is a property of the perturbation magnitude, not of sandbagging.
^[2]
The positive control gate that caught the scorer bug: an early check halted when both the honest and sandbag arms scored 0.000 with high rates. Normally, a model wouldn't score 0% honestly, which indicated that the scorer was comparing a letter ("A") against an integer answer (0), a mismatch that would corrupt everything down the line.
^[3]
Ground truth and exclusions: The per-item labels exclude items which the model gets wrong even under the honest framing (it can't sandbag a capability it lacks, as mentioned earlier in the writeup). 153 of 300 items passed the filter; the 54% honest-gap/noise agreement is computed on the shared labelled set.