Eval-Detection Dominates the L12 Sandbagging Mechanism in Gemma 2 2B IT

Jordan McCann

TL;DR

I identify a rank-1 direction at the attention output of layer 12 in Gemma 2 2B IT whose ablation restores sandbag-condition capability by +10pp on fresh held-out prompts (n=40) with zero damage to honest capability and zero effect from a random-direction baseline at the same locus.

The mechanism has four properties:

Precisely rank-1 (SVD σ₂/σ₁ = 0.275; K=2 adds nothing, K=10 over-corrects to zero)
Refusal-orthogonal (cos = 0.357 with the L12 refusal direction; the orthogonal component recovers the full effect)
Eval-detection-dominated (project-out d_eval restores sandbag +7.5pp; project-out d_explicit restores 0pp, despite the two directions being geometrically near-orthogonal)
Trigger-invariant across-condition (transfers to paraphrased system-prompt and user-turn-injection sandbagging with +10pp restoration each on shared prompts)

I embed L12 in a three-stage vertical mechanism: activation patching localizes the upstream decision to L11's residual stream (single-session STRONG-POSITIVE; +20pp behavioral, trigger-invariant), with direct logit attribution identifying L8–L10 attention as the generators and L11 as the carrier. Downstream layers L13–L25 partially compensate through orthogonal subspaces (NO_SELF_REPAIR by ‖Δresid‖ criterion; ~51% restoration in the original direction).

Cross-architecture testing covered five candidate configurations: four small instruction-tuned models across three architectures (Gemma 2, Qwen 2.5, Llama 3), with two induction methods (prompted Mitra-style and QLoRA fine-tuning). It produced one sandbagger of five. Llama 3.2 1B specifically inverted the eval-framing response, producing more careful reasoning under sandbag-prompting. The single-model focus is empirically grounded rather than experimentally incomplete.

Combined with the behavioral finding that eval-framing alone reduces accuracy in 3 of 4 measurable domains (without any underperformance instruction), the L12 dissociation implies that capability evaluations on Gemma-family IT models systematically underestimate true capability when prompts signal evaluation.

This was an independent research project on a 12 GB consumer GPU, with pre-registered hypotheses, four-tier validation discipline (T1 magnitude → T2 controls → T3 single-touch test → fresh-prompt replication), and 18 cumulative test-set touches across the program, all logged before execution.

1. Why this matters

A capable language model that strategically underperforms on evaluations breaks the chain that connects pre-deployment testing to deployment decisions. Sandbagging (instructed or learned strategic suppression of capability in evaluation contexts) is established as a behavioral phenomenon (van der Weij et al., 2024) and as a probe-detectable signal (Goldowsky-Dill et al., 2025; Taylor et al., 2025 / UK AISI Auditing Games). What has been missing is a causal mechanism: where in the forward pass is the sandbagging signal implemented, what is the geometry of the controlling intervention object, and how does it relate to other deception-adjacent circuits such as refusal?

This post characterizes that mechanism in Gemma 2 2B IT: a rank-1 direction at the attention output of layer 12 whose projection-out on the prompt-final token of the first forward pass causally restores sandbag-condition capability on held-out prompts with no measurable honest damage.

Probes detect that the model is sandbagging; the L12 direction-projection causally restores the suppressed capability without re-training and without modifying weights. The two approaches are mechanistically distinct: probes operate on residual-stream representations; the intervention operates on attention-output direction projection.

2. Methodology

Behavioral suite: 200 prompts spanning five domains (40 each): arithmetic (GSM8K-easy), factual recall (TriviaQA), basic logic (BoolQ), code completion (HumanEval-easy), reading comprehension (MMLU). Stratified 60/20/20 train/val/test split under fixed seed.

Conditions: three system prompts, namely honest, sandbagging (Mitra-style induction), and eval_context_only (the model is told it is being evaluated but is not instructed to underperform; this is the confounder-isolation arm).

Pre-registration: scientific design frozen in INVARIANT.md before any data collection. Hypotheses H1–H3 with binding gate criteria. Each phase had a separate registered protocol committed before runs began. Mid-flight amendments committed before post-amendment runs; no protocol was silently modified.

Validation tiers for every causal claim:

T1: Per-locus screening on val (|Δ_val| ≥ 5pp)
T2: Capability control (≤5pp honest damage), specificity control (≤10pp refusal-rate change), random-direction baseline
T3: Single-touch test (logged before run begins)
D1: Fresh-prompt replication (40 prompts with source-IDs disjoint from the 200-prompt suite)
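
The T1 and T2 gates above can be written as a small check. This is a sketch, not the project's actual gating code: the `LocusResult` container and function names are hypothetical, and the random-direction criterion (random effect strictly smaller in magnitude than the candidate's) is my assumption, since the thresholds above do not state a numeric rule for it.

```python
from dataclasses import dataclass


@dataclass
class LocusResult:
    """Hypothetical container for one candidate locus (all values in pp)."""
    delta_val_pp: float       # change in sandbag accuracy on val under ablation
    honest_damage_pp: float   # drop in honest accuracy under ablation (positive = damage)
    refusal_change_pp: float  # change in refusal rate under ablation
    random_delta_pp: float    # effect of a random direction at the same locus


def passes_t1(r: LocusResult) -> bool:
    # T1: per-locus screening on val, |delta_val| >= 5pp
    return abs(r.delta_val_pp) >= 5.0


def passes_t2(r: LocusResult) -> bool:
    # T2: capability control, specificity control, random-direction baseline.
    # The random-baseline comparison below is an assumed rule, not the registered one.
    return (
        r.honest_damage_pp <= 5.0
        and abs(r.refusal_change_pp) <= 10.0
        and abs(r.random_delta_pp) < abs(r.delta_val_pp)
    )
```

For example, a locus with a +5.0pp val effect and 0pp on all three controls passes both gates, while a +2.5pp val effect fails T1.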

Test-touch discipline: 18 cumulative test-set touches across the entire program, all pre-logged with timestamps and intent. No locus re-tested. Three audit rounds confirmed the test-touch ledger is consistent.

A methodological correction surfaced during the final audit pass: the SAE three-way contrast was originally computed on all 200 prompts including the held-out test split due to a missing split filter. I recomputed on train ∪ val only (480 samples per condition); the substantive qualitative conclusion is preserved. The L15 SAE-feature selection used by the parallel residual-stream analysis had Jaccard 1.0 with the corrected re-selection because that pipeline re-implemented the contrast inline correctly.

3. Behavioral characterization

Capability gaps after the pre-registered exclusion rule (drop prompts where mean honest accuracy < 0.5):

Domain	n	Gap (pp)	95% CI	Gate ≥ 25pp
arithmetic	26	+42.3	[+26.9, +61.5]	✓
basic_logic	28	+28.6	[+10.7, +46.4]	✓
reading_comp	25	+28.0	[+12.0, +48.0]	✓
factual_recall	26	+19.2	[+7.7, +34.6]	—
code_completion†	0	n/a	n/a	—

†Code-completion is excluded under the < 0.5 honest-accuracy rule: HumanEval execution-based scoring returns 0/120 correct in the honest condition (median 140 output tokens; 0 refusals; 0 malformed outputs). This is a metric failure (verbose code that doesn't pass test execution), not a model-capability failure.
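
A sketch of how the exclusion rule and the gap CI could be computed from per-prompt accuracies. The function name is hypothetical, and the percentile bootstrap over prompts is an assumption: the post reports bootstrap CIs but this sketch does not claim to reproduce the exact resampling scheme.

```python
import numpy as np


def capability_gap(honest, sandbag, min_honest=0.5, n_boot=10_000, seed=0):
    """honest, sandbag: per-prompt mean accuracy in [0, 1], same prompt order.

    Returns (n_kept, gap_pp, (ci_lo_pp, ci_hi_pp)).
    """
    honest = np.asarray(honest, dtype=float)
    sandbag = np.asarray(sandbag, dtype=float)
    keep = honest >= min_honest              # exclusion rule: drop honest accuracy < 0.5
    diff = (honest - sandbag)[keep] * 100.0  # per-prompt gap in percentage points
    if diff.size == 0:                       # e.g. a domain where every prompt is excluded
        return 0, float("nan"), (float("nan"), float("nan"))
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)      # resample prompts with replacement
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return int(keep.sum()), float(diff.mean()), (float(lo), float(hi))
```

A domain in which every prompt has honest accuracy 0 (as with code completion here) returns n = 0 and no gap, matching the n/a row in the table.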

The eval_context_only confounder-isolation arm produces accuracy drops in 3 of 4 measurable domains: basic_logic 22.5pp, arithmetic 7.5pp, reading_comp 5.0pp; factual_recall changes by only 2.5pp. Eval framing alone, without any underperformance instruction, materially reduces accuracy where the metric works. This anticipates the L12 mechanism's eval-detection dominance below.

4. The L12 mechanism

Direction construction: $d = (m_s - m_h) / \|m_s - m_h\|_2$, where $m_s$ and $m_h$ are per-prompt mean attn_out_L12 activations at the prompt-final token under sandbag and honest conditions, averaged across train ∪ val (160 prompts) and three seeds. Unit vector in $d_{\text{model}} = 2304$.

Intervention: at the prompt-final token of the first forward pass only, replace $v$ with $v - (v \cdot d)\,d$, where $v = $ attn_out_L12. Subsequent generation steps run unmodified. Rank-1 modification at one component on one forward pass.
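
A minimal NumPy sketch of the direction construction and the projection-out step, assuming the attn_out_L12 activations at the prompt-final token have already been cached as arrays and averaged over seeds. Function names are hypothetical, and the wiring into the model (a hook on the layer-12 attention output, applied only on the first forward pass) is omitted.

```python
import numpy as np


def unit_direction(sandbag_acts, honest_acts):
    """Inputs: (n_prompts, d_model) arrays of per-prompt mean activations.

    Returns d = (m_s - m_h) / ||m_s - m_h||_2.
    """
    diff = sandbag_acts.mean(axis=0) - honest_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def project_out(v, d):
    """Return v - (v . d) d for a unit vector d.

    Accepts a single vector (d_model,) or a batch (..., d_model).
    """
    coeff = np.asarray(v @ d)[..., None]  # scalar projection per vector
    return v - coeff * d
```

After the projection the activation has zero component along d, and applying it a second time changes nothing.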

Validation trajectory:

Evaluation	Sandbag baseline	Sandbag ablation	Δ (pp)
Tier-1B val	0.500	0.550	+5.0
Tier-3B test (single touch)	0.450	0.575	+12.5
Phase D D1 fresh	0.475	0.575	+10.0
Phase D D1 honest fresh (capability control)	0.625	0.625	+0.0

Three controls passed: random-direction baseline at the same locus produced Δ = 0pp; honest-capability control on fresh prompts produced 0pp damage; specificity control on a 50-prompt JailbreakBench-derived refusal set produced 0pp refusal-rate change.

A discovery-set component sweep across 76 candidate loci preceded this result. Phase B screened attention outputs, MLP outputs, and per-head attention contributions. 14 candidates passed T1; 10 passed T2; 4 survived T3 (10 cumulative test touches consumed); 2 robustly validated through fresh-prompt replication. attn_out_L12 carried the larger and cleaner effect.

5. Mechanism properties

5.1 Rank

SVD of the (sandbag − honest) per-prompt-mean difference matrix at attn_out_L12 gives σ₂/σ₁ = 0.275. The top-1 right-singular vector matches the cached Phase B direction at cos = 0.9988.

K	Δ_val (pp)	Interpretation
1	+5.0	matches Phase B direction
2	+5.0	rank-2 adds nothing
10	0.0	over-correction (cancels via overlap with task-relevant signal)

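The mechanism is behaviorally rank-1: the difference matrix itself is not exactly rank-1 (σ₂/σ₁ = 0.275), but projecting out a second singular direction yields no further restoration, and K=10 removes the effect entirely. A single 2304-dimensional unit vector specifies the deployment-time intervention.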
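
The rank analysis can be sketched as follows, assuming the per-prompt (sandbag − honest) differences are stacked into an (n_prompts, d_model) matrix. Function names are hypothetical; whether the original analysis centered the matrix before the SVD is not stated, and this sketch does not center it.

```python
import numpy as np


def spectrum_and_topk(diff_matrix, k):
    """Return sigma_2 / sigma_1 and the top-k right-singular vectors (orthonormal rows)."""
    _, s, vt = np.linalg.svd(diff_matrix, full_matrices=False)
    return s[1] / s[0], vt[:k]


def project_out_subspace(v, basis):
    """Remove from v its component in the span of the orthonormal rows of `basis`."""
    return v - (v @ basis.T) @ basis
```

With k = 1 this reduces to the single-direction projection; k = 2 and k = 10 correspond to the other rows of the table above.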

5.2 Refusal-orthogonality

I computed an Arditi-style refusal direction at attn_out_L12 (mean activation difference between 50 JailbreakBench prompts and honest prompts, L2-normalized).

signed cos(d_sandbag, d_refusal) = +0.357
Phase B fraction orthogonal to refusal = 0.934
Re-running the projection-out hook using only the refusal-orthogonal component recovers the full +5.0pp val effect

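The refusal-aligned component (0.357 of the direction's norm, or about 12.7% of its squared norm) contributes zero behavioral effect. At the SAE-feature level (L15 residual stream), Jaccard = 0.042 between the 20-feature sandbag set and the 30-feature refusal set, which corresponds to 2 shared features out of 48 distinct ones. Two independent levels of analysis converge.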
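
The decomposition used in this subsection can be sketched as below. Function names are hypothetical. The arithmetic check at the end uses only numbers reported above: a cosine of 0.357 implies an orthogonal fraction of sqrt(1 − 0.357²) ≈ 0.934.

```python
import math

import numpy as np


def split_by_reference(d, ref):
    """Split d into components parallel and orthogonal to `ref`.

    Returns (cosine, parallel_component, orthogonal_component).
    """
    ref = ref / np.linalg.norm(ref)
    along = float(d @ ref)
    parallel = along * ref
    orthogonal = d - parallel
    return along / float(np.linalg.norm(d)), parallel, orthogonal


def jaccard(a, b):
    """Jaccard index of two collections of feature ids."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Consistency check on the reported numbers.
orthogonal_fraction = math.sqrt(1 - 0.357 ** 2)
```

The refusal-orthogonal rerun then amounts to normalizing the `orthogonal` component and using it as the projection direction in place of d.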

5.3 Eval-detection dominance

This is the most distinctive scientific result. Three directions at attn_out_L12:

d_eval = mean(eval_context_only) − mean(honest)
d_explicit = mean(sandbagging) − mean(eval_context_only)
d_phaseB = mean(sandbagging) − mean(honest)

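Geometrically: cos(d_eval, d_explicit) = −0.286 (near-orthogonal); the regression d_phaseB ≈ 0.852 d_eval + 0.821 d_explicit fits with R² = 1.00. The perfect fit holds by construction, since the unnormalized differences satisfy (sandbagging − honest) = (eval_context_only − honest) + (sandbagging − eval_context_only) exactly, so the regression itself carries no evidence about the mechanism. Geometric distinctness predicts mechanism separability; the causal results contradict that prediction.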

Direction projected out	Δ honest	Δ eval-context	Δ sandbagging
d_eval	−2.5	+0.0	+7.5
d_explicit	−2.5	+0.0	+0.0
d_phaseB (replication)	−2.5	−2.5	+5.0 ✓
d_random (control)	+0.0	−2.5	+0.0

Projecting out d_eval restores sandbag accuracy by +7.5pp, more than projecting out d_phaseB itself, while d_explicit produces zero behavioral effect. The L12 mechanism is dominated by the eval-detection signal; the explicit-underperformance component, despite a geometrically distinct direction, has no separable causal hold at this locus.
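
The geometric relation among the three directions can be checked numerically. In the sketch below the condition means are synthetic stand-ins (an assumption for illustration only); the last lines use only the coefficients and cosine reported in this subsection and confirm that they are mutually consistent with d_phaseB being a unit vector.

```python
import numpy as np

rng = np.random.default_rng(0)
m_h, m_e, m_s = rng.normal(size=(3, 64))  # synthetic stand-ins for the condition means

d_eval_raw = m_e - m_h       # eval_context_only - honest
d_explicit_raw = m_s - m_e   # sandbagging - eval_context_only
d_phaseB_raw = m_s - m_h     # sandbagging - honest

# The raw differences add up exactly, which is why the regression fits perfectly.
identity_holds = np.allclose(d_phaseB_raw, d_eval_raw + d_explicit_raw)

# On unit vectors, the regression coefficients are ratios of norms.
coef_eval = np.linalg.norm(d_eval_raw) / np.linalg.norm(d_phaseB_raw)
coef_explicit = np.linalg.norm(d_explicit_raw) / np.linalg.norm(d_phaseB_raw)

# Reported values: the implied squared norm of d_phaseB should be close to 1.
a, b, cos_ee = 0.852, 0.821, -0.286
implied_norm_sq = a**2 + b**2 + 2 * a * b * cos_ee
```

The implied squared norm comes out within rounding error of 1, so the reported coefficients and cosine describe a single consistent triangle of directions.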

A three-way categorical SAE-feature contrast separates 203 EVAL-AWARENESS features from 93 EXPLICIT-SANDBAG features at L12, eval-awareness outnumbering explicit-sandbag by approximately 2:1. SAE-feature counts and direction-projection causal results agree.

In Gemma 2 2B IT, "the model is sandbagging because it detected an evaluation context" appears to be a causally accurate characterization at the L12 locus.

5.4 Cross-trigger generalization (across-condition)

Three trigger families, labeled T1–T3 in this subsection (distinct from the validation tiers T1–T3 in §2): T1 (Mitra-style system prompt), T2 (paraphrased system prompt), T3 (clean honest system prompt with sandbagging instruction prepended to user turn). All three induce ≥15pp sandbagging gaps.

Projecting out the Phase B direction (computed under T1) at attn_out_L12 while sandbagging is induced by each trigger restores accuracy by +3.33 / +10.0 / +10.0pp on T1 / T2 / T3.

This is across-trigger-condition transfer on shared prompts, not across-prompt transfer. The calibration prompt-IDs are a subset of train. Across-prompt validation, recomputing direction discovery on independent prompts under each trigger, is registered as future work.

6. Upstream decision locus at L11

Activation patching at upstream layers L ∈ {0..11}, replacing sandbag activations with cached honest activations on the prompt-final token, identifies a concentrated upstream cluster at L10–L11.

L	Δ_proj / baseline	SE
0–7	≈ 0.000 to −0.009	0.001–0.002
8	+0.025	0.003
9	+0.036	0.006
10	+0.197	0.009
11	+0.324	0.011

The signal is essentially zero through L7, small at L8–L9, sharply concentrated at L10–L11.

This finding registered as single-session STRONG-POSITIVE per the pre-registered protocol. The LANDMARK threshold (≥50% projection reduction) was not met (peak 32.4%), and replication on a disjoint prompt set is registered as future work. Generation-mode patching at L11 produced +20pp behavioral capability restoration with random control −2.2pp. The locus replicates across triggers (Δ_proj 22.7% under T2, 66.5% under T3).

Direct logit attribution provides convergent evidence and refines the picture. The top-3 contributors to the L12 sandbag direction, with attn_out_L11 (rank 11) shown for comparison, are:

Rank	Component	Mean contribution
1	attn_out_L10	+0.334
2	attn_out_L08	+0.314
3	attn_out_L09	+0.305
11	attn_out_L11	+0.044

L11's own attention output contributes only +0.044 despite L11 being the top causal patching locus. L8–L10 attention layers generate the sandbag-decision signal that accumulates in L11's residual stream via residual-stream additivity; L11 is the read-off carrier from which L12's attention reads, not the source.

Per pre-registered protocol, Shot B (DLA) does not assert causal claims independently; Shot A's activation patching is the binding causal-claim instrument. DLA is convergent evidence refining Stage 1 to "L8–L10 generates, L11 carries."
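
A simplified sketch of direction attribution, assuming each component's contribution is the mean dot product of its residual-stream write with the direction at the prompt-final token. This is an assumption about the computation: it ignores layer normalization and L12's own attention read, and the function name and dictionary layout are hypothetical.

```python
import numpy as np


def direction_attribution(component_outputs, d):
    """component_outputs: dict mapping a component name to an
    (n_prompts, d_model) array of its output at the prompt-final token.

    Returns (name, mean contribution) pairs sorted from largest to smallest.
    """
    scores = {
        name: float((out @ d).mean())
        for name, out in component_outputs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under this simplification the contributions of all components sum to the projection of the residual stream onto d, which is what makes a small own-layer contribution at L11 compatible with L11 being the strongest patching locus.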

7. Downstream partial compensation at L13–L25

Self-repair characterization on val sandbag prompts × 3 seeds (n=120 observations). Captured resid_pre[L][last_token] at L ∈ {12, 13, 15, 18, 21, 25} in two forward passes (baseline, ablated).

L	‖Δ_L‖ mean	mean ⟨Δ_L, d_sandbag_L⟩	|proj| / ‖Δ_L‖
12	0.00 (sanity ✓)	—	—
13	36.2	−21.3	0.59
15	40.3	−15.2	0.38
18	50.4	−10.5	0.21
21	70.1	−6.9	0.10
25	107.9	−11.0	0.10

Ratio L25 / L13 = 2.99, well above the registered 0.80 threshold. NO_SELF_REPAIR by ‖Δresid‖ criterion: the L12 perturbation is not compensated away in residual-stream norm.

The magnitude of the sandbag-direction-aligned component decays from 21.3 at L13 to 11.0 at L25 (−21.3 and −11.0 in the table), approximately 51% restoration of the sandbag-direction signal across the downstream stack. The downstream growth happens in directions orthogonal to the original sandbag direction: the perturbation's norm grows 3×, but only ~10% of L25's perturbation lies along the sandbag direction.

L12 is a causally upstream node with cascading downstream consequences. The downstream response is a partial hydra effect on the suppression-direction-aligned component, paired with broader orthogonal computational changes that grow the residual-stream perturbation but do not restore the targeted direction.
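
A sketch of the per-layer quantities in the table, assuming cached resid_pre activations from the baseline and ablated passes. The function name is hypothetical. The last column is computed as |mean projection| / mean norm, which matches the table's arithmetic (for example 21.3 / 36.2 ≈ 0.59) but is an assumption about the original computation.

```python
import numpy as np


def repair_metrics(resid_base, resid_ablated, d_layer):
    """resid_*: (n_obs, d_model) residual-stream activations at one layer.
    d_layer: unit sandbag direction at that layer.

    Returns (mean ||delta||, mean <delta, d>, |mean projection| / mean norm).
    """
    delta = resid_ablated - resid_base
    mean_norm = float(np.linalg.norm(delta, axis=1).mean())
    mean_proj = float((delta @ d_layer).mean())
    # At the ablation layer itself resid_pre is unchanged, so the ratio is undefined.
    frac = abs(mean_proj) / mean_norm if mean_norm > 0 else float("nan")
    return mean_norm, mean_proj, frac


# Norm-growth ratio from the rounded table entries (L25 / L13).
ratio_l25_l13 = 107.9 / 36.2
```

From the rounded table entries the L25 / L13 ratio is about 2.98, which agrees with the reported 2.99 up to rounding of the inputs.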

8. Negative results and scope-bounding

Four registered negatives bound the L12 mechanism's interpretive and deployment scope.

1. Asymmetric polarity (sub-threshold). Adding the L12 direction to honest-prompted activations should symmetrically induce sandbagging if the mechanism were a generic capability-control variable. Peak honest-condition reduction was −7.5pp at α=2.0, below the 8pp STRONG-POSITIVE threshold. Per-prompt analysis: 37 of 40 prompts unchanged across all α; the peak comes from just 3 prompts. The L12 add-direction is not a generic capability-control variable.

2. Prophylactic inverse-steering does not immunize. Adding −α×d to clean prompts before user-turn injection produces 0pp defense across all α tested. Project-out and add-negative are not symmetric: the direction characterizes a suppression pathway active when sandbagging is induced, not a generic "honest mode" the model can be pushed into prophylactically.

3. Conflict prompts produce variant-dependent effects. When the user-turn explicitly overrides the sandbag system instruction, L12 ablation no longer reliably restores capability. Three variants tested: A_medical Δ_overall = −10pp (opposite-sign from baseline), B_emergency +3.33pp, C_auditor 0pp. The medical-override case is striking: removing L12 makes the model commit to sandbagging more under medical-override conflict prompts.

4. Cross-architecture transfer is unconfirmed at the sandbagger-construction stage. Five candidates tested across three architectures and two induction methods:

Candidate	Architecture	Induction	Gate result
Gemma 2 2B IT	Gemma 2	prompted Mitra	3 of 5 domains pass
Gemma 2 9B IT (4-bit)	Gemma 2	prompted Mitra → L19 patching	0pp at L19 analog; calibration-positive at L20 fails on disjoint subset
Qwen 2.5 1.5B Instruct	Qwen 2.5	prompted Mitra	1 of 5
Llama 3.2 1B Instruct	Llama 3	prompted Mitra	1 of 5; 4 of 5 negative gaps; qualitative inversion
Llama 3.2 1B Instruct	Llama 3	QLoRA fine-tune	gap −3.3pp; mechanism unmounted

The Llama 3.2 1B prompted result is qualitatively striking. On arithmetic_000, the honest response misreads the question and returns the wrong answer; the sandbag response gives a careful step-by-step trace and reaches the correct one. The Mitra prompt's "give plausible-looking but incorrect answers" is overridden by Llama's evaluation-framing response, which elicits more careful reasoning rather than less.

Prompted Mitra-style sandbagging is rare in this regime. The single-model focus is therefore an empirical observation about where the behavior lives, not an experimental incompleteness.
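
For negatives 1 and 2 above, it may help to see that project-out and add-direction are different operations even before any model behavior is considered. The toy vectors below are illustrative assumptions; the sketch shows the geometry only and is not offered as an explanation of the negative results.

```python
import numpy as np


def project_out(v, d):
    """Set the component of v along unit vector d to zero."""
    return v - (v @ d) * d


def add_direction(v, d, alpha):
    """Shift v by a fixed amount alpha along d, whatever its current component."""
    return v + alpha * d


d = np.array([1.0, 0.0])
v_sandbag = np.array([3.0, 1.0])  # toy activation with a large component along d
v_clean = np.array([0.0, 1.0])    # toy activation with no component along d

removed = project_out(v_sandbag, d)          # component along d becomes 0
untouched = project_out(v_clean, d)          # no-op on the clean activation
pushed = add_direction(v_clean, d, -2.0)     # component along d becomes -2
```

Project-out is a no-op where the direction is absent and removes it completely where it is present, whereas adding −αd moves every activation by the same amount, including ones that had nothing to remove.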

9. Implications

9.1 For capability evaluation methodology

If eval-detection carries the bulk of the L12 mechanism, capability evaluations on Gemma-family IT models systematically underestimate true capability when prompts frame the interaction as evaluation. The behavioral data (eval framing alone reduces accuracy in 3 of 4 measurable domains) and the causal evidence (d_eval projection-out restores sandbag capability more than d_phaseB does) are independent and point in the same direction.

Concrete recommendation: minimize evaluation-signaling in capability prompts. Where benchmark protocols include phrases like "this is a test of your abilities" or system prompts that establish an evaluator persona, expect the measured accuracy to be a lower bound on the model's true capability under deployment-distribution conditions.

9.2 For activation-level deployment defenses

The L12 intervention is a deployable rank-1 modification: one dot product and one scaled vector subtraction per inference (equivalently, multiplication by I − ddᵀ). It is refusal-orthogonal, transferable across triggers (across-condition), and needs no subspace dimensionality estimation.

The intervention is suitable as a post-hoc defense applied during inference when sandbagging is suspected (e.g., when a Goldowsky-Dill-style probe signals likely sandbagging). The defense and probe-based detection are complementary: probes detect; the L12 direction restores.

Three negatives bound deployment scope: asymmetric polarity (project-out works, add-negative on clean prompts does not immunize); sparse selectivity (acts on a small subset of prompts near decision boundaries); conflict prompts (variant-dependent and sometimes opposite-sign under explicit accuracy-override conflict).

10. Limitations

Stage 1 (L11 upstream decision) is single-session STRONG-POSITIVE, not LANDMARK. Replication on disjoint prompts is the highest-priority follow-up. The three-stage synthesis in §6 carries this caveat: Stage 1 is the weakest of the three pillars.
Single-model central characterization. Cumulative cross-architecture evidence makes this empirically grounded scope, but the L12 mechanism may be Gemma-specific.
Statistical power at n=40. Bootstrap CIs at small effects are wide. Direction-consistency is robust across val/test/fresh; absolute magnitude generalization requires larger-N replication.
Cross-trigger transfer is across-condition, not across-prompt. Recomputing direction discovery on independent prompts under each trigger is registered as future work.
Phase A's L15 SAE-feature finding fails strict fresh-prompt replication (val −15pp, test −5pp, fresh −2.5pp). Preserved as a complementary capability-direction signature, not the headline mechanism.
HumanEval execution-based scoring fails on honest Gemma 2 2B IT for code completion. Metric failure, not capability failure.
Greedy decoding makes seeds nearly deterministic. Effective sample size is per-prompt (n=40), not per-(prompt, seed).
Canonical-weights audit outstanding. All runs used the unsloth/gemma-2-2b-it mirror; expected replication on google/gemma-2-2b-it within ±1–2pp.
Phase E full attribution graphs limited by 12 GB VRAM. Equivalent mechanistic decomposition provided by the SAE three-way contrast and direct logit attribution.
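
To make the n=40 power limitation concrete: each prompt is worth 2.5pp, so a +10pp effect is a net change of four prompts. The sketch below also computes a Wilson interval for a single accuracy at n=40. The function names are hypothetical, and the Wilson interval is my choice for illustration, not the post's bootstrap procedure; it treats one accuracy in isolation, whereas paired comparisons on shared prompts can be tighter.

```python
import math


def pp_per_prompt(n):
    """Accuracy granularity in percentage points when scoring n prompts."""
    return 100.0 / n


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


granularity = pp_per_prompt(40)           # 2.5pp per prompt
net_prompts = 10.0 / granularity          # prompts behind a +10pp change
lo, hi = wilson_interval(23, 40)          # 23 / 40 = 0.575
```

The interval around 0.575 spans well over 25pp at this sample size, which is why the limitation above stresses direction-consistency across val, test, and fresh prompts rather than absolute magnitudes.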

Open questions and where feedback would help most

This is my first published mech interp paper. I welcome correction on any methodology, framing, or interpretation issues.

The specific places where I'd value feedback:

Stage 1 replication. The L11 upstream-locus finding is single-session. Anyone with compute who wants to check this on disjoint prompts would be doing me a real favor; this is the highest-priority follow-up.
Eval-detection-dominance framing. Whether the dissociation result in §5.3 is interpreted correctly. The geometric-vs-causal disconnect (directions are near-orthogonal but only one carries the mechanism) is the most distinctive scientific claim and the one I'd most like sharp scrutiny on.
Cross-architecture sandbagger search (5 candidate configurations, 1 sandbagger). I expect the behavioral phenomenon is more common in larger or differently-tuned models, but the sample is too small to say much. I'd value pointers to model families that exhibit the behavior more reliably.
The asymmetric polarity finding. That project-out works but add-negative does not immunize is a real result but I'm not confident I have the right interpretation. Suggestions on what this implies about the mechanism's functional role would be helpful.
Methodology critique generally. The combination of four-tier validation, single-touch test discipline, and pre-registration is unusual for the field. If the discipline is wrong-headed in ways I don't see, I'd want to know.

A full PDF of the paper with complete appendices (statistical methodology, pre-registration audit trail, test-touch ledger, intervention specifications, and the §4.3.3 contamination disclosure with corrected counts) is available [on request, will add link once I finish PDF conversion / arXiv submission this week].

I'm Jordan McCann, an independent researcher. This was produced on a 12 GB consumer GPU. I'm currently applying to AI safety research fellowships and would value feedback, corrections, and, if the work seems credible, connections to senior interpretability researchers who might be willing to discuss the methodology or serve as references.

References

van der Weij, T., Hofstätter, F., Jaffe, O., Brown, S. F., & Ward, F. R. (2024). AI Sandbagging: Language Models can Strategically Underperform on Evaluations. arXiv:2406.07358.
Goldowsky-Dill, N., Chughtai, B., Heimersheim, S., & Hobbhahn, M. (2025). Detecting Strategic Deception Using Linear Probes. arXiv:2502.03407.
Taylor, J., Black, S., Bowen, D., et al. (2025). Auditing Games for Sandbagging. UK AI Security Institute. arXiv:2512.07810.
Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in Language Models is Mediated by a Single Direction. NeurIPS 2024. arXiv:2406.11717.
Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024). Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. BlackboxNLP 2024. arXiv:2408.05147.
Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread.
Lindsey, J., et al. (2025). On the Biology of a Large Language Model. Transformer Circuits Thread.
Ameisen, E., et al. (2025). Circuit Tracing: Revealing Computational Graphs in Language Models. Transformer Circuits Thread.
Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., & Shieber, S. (2020). Investigating Gender Bias in Language Models Using Causal Mediation Analysis. NeurIPS 2020.
Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small. ICLR 2023. arXiv:2211.00593.
Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
Gemma Team, Google DeepMind. (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118.