Really cool work.
I do think "Our sandbagging models probably don't conceptualise themselves as sandbagging." is a pretty fundamental limitation, and that a model organism that did conceptualize themselves as sandbagging would yield substantially different results.
very excited for SDF + finetuning / rl MOs in this direction.
Great work! I'm excited to see red team blue team games being further invested in and scaled up. I think it's a great style of objective proxy task
Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom
UK AI Security Institute, FAR.AI, Anthropic
Links: Paper | Code | Models | Transcripts | Interactive Demo
Epistemic Status: We're sharing our paper and a hastily written summary of it, assuming a higher level of context on sandbagging / auditing games than other materials. We also share some informal commentary on our results. This post was written by Jordan and Joseph, and may not reflect the views of all authors.
This summary diverges from the paper, condensing heavily while adding our additional commentary. The limitations and future work sections are expanded relative to the paper.
We ran an auditing game to stress-test sandbagging detection (rather than mitigation) techniques. Specifically, we wanted to ask the question "If for some reason a model was sandbagging on capability evaluations, what methods might we use to catch this model?". During the game, our red team created five model organisms (some sandbagging, some benign decoys), and then a blinded blue team tried to identify which were which.
The red team won. The blue team only classified 1/5 models correctly when using black-box techniques and deception probes, and 2/5 when they were subsequently given fine-tuning access.[1]
We ran an auditing game with a red team and a blinded blue team.[2]
The red team created five model organisms - two sandbagging, and three benign:
The red team used password-locking to create the sandbagging models underperform unless a specific trigger is present.
Some models also had distractor behaviours unrelated to sandbagging, such as over-using double negatives and switching languages mid-reasoning.
The red team chose evaluations strategically: BIG-Bench Hard[3], BigCodeBench[4], GPQA Diamond[5], Intercode CTF[6], and Tic Tac Toe. They expected harder tasks and agentic settings to make sandbagging more difficult to detect.
The blue team received models and evaluation results without knowing which were sandbagging. They could compare against a trusted reference model (Llama-3.1-8B-Instruct) but couldn't compare untrusted models to each other.
Phase 1 (160 person-hours): Transcript analysis, prompting, deception probes. No training allowed.
Phase 2 (80 person-hours): Training-based methods permitted.
The blue team got it badly wrong. They assigned >90% sandbagging credence to all three benign models while giving only 25% credence to a sandbagger. Their report stated that this sandbagger was "the only model that should be deployed."
Transcript analysis failed. One model evaded black-box detection by closely imitating a weaker model's outputs. This required training (specifically logit-matching distillation), but once done, the sandbagging was not detected from outputs alone, or from prompt-based attacks.[7]
Probes were the most promising but still problematic. Deception probes (trained on honest/dishonest character prompts following Goldowsky-Dill et al.) correctly showed elevated scores on Cedar across multiple evaluations. But:
On-distribution fine-tuning with a single demonstration fully elicited all sandbagging models.[10] This is consistent with prior work showing it's surprisingly easy to remove password-locking with on-distribution fine-tuning. Though this didn't translate to perfect sandbagging detection, as the training sometimes also improved benign models.
Anti-refusal training failed to elicit our models, contrary to some prior work, and cross-distribution training failed against selective sandbagging. Training models to perform well on one task where they were sandbagging often removed sandbagging on others. But training models to perform well on tasks where they weren't sandbagging generally didn't remove sandbagging on tasks where they were.
Phase 1: 1/5 correct. Phase 2: 2/5 correct (caught Cedar, correctly exonerated Maple, but missed Oak and still wrongly flagged Baobab and Yew).
We explored some methods post-hoc that might have enabled the blue team to win but didn't find much conclusive evidence.
Sandbagging probes (trained on prompted sandbagging rather than general deception) would have been similarly useful to deception probes: catching Cedar on some tasks, but not Oak.
Fuzzing (injecting noise into weights) was surprisingly computationally intensive. It sometimes showed uplift on sandbagging tasks, but it was unreliable, and the effect was much weaker on our sandbaggers than it had been for prompted-sandbagging models.
SAEs: Found some potentially suspicious features (related to errors, hedging, refusal) that were elevated on sandbaggers. But our SAE exploration was fairly quick, and though interesting, left maybe too much room for personal interpretation.
User persona attacks didn't enable us to diagnose sandbagging, but that was probably because our sandbagging model organisms didn't strongly self-identify as sandbaggers.
The main frame I think people should be reading auditing game like papers with is “what evidence does this paper provide about the efficacy of different methods?” The main limitation many papers run into is that which methods work or don’t work is affected by the realism of the model organisms.
We don’t know what real world scheming or sandbagging is likely to look like in detail but we can guess it’s somewhat different to password locked sandbagging. While you might be tempted to think this means that our auditing game paper inherently favoured the blue team (since the model organisms might be easier to catch than schemers), there are many reasons to think that our simple model organisms might favour the red team.
Maybe SFT creates over-strong latent representations that are easy to catch with probes.
Our model organisms may be unrealistically hard to detect. Password-locking lets the red team directly instil sandbagging policies without the model needing to reason about them. Real sandbagging models would likely need to reason in-context about when/where/how to sandbag, potentially making them detectable via CoT monitoring. On the other hand, they might also adaptively resist elicitation attempts.
Or unrealistically easy to elicit. Password-locked models might be easier to elicit than real models which sandbag due to emergent goals or in-context reasoning.
Our sandbagging models probably don't conceptualise themselves as sandbagging. Our training to conditionally imitate a weaker model may instil behavioural reflexes, without instilling a deeper understanding. While linear probes indicate that Cedar has internalised deceptive and sandbagging representations, many techniques might have been more effective against models with a stronger understanding that they were sandbagging. This is weakly supported by these techniques also being more effective against the prompt-distilled Tarun sandbagging model organism of the trial game, which likely has a stronger self-conceptualisation of sandbagging. However it is also possible that real sandbagging models would not conceptualise themselves as deliberately sandbagging.
Our password-locked sandbagging models probably weren't sufficiently different from under-elicited benign models:
Some methods may have been under-cooked:
Evals have checkable answers. We only tested on benchmarks with clear correct answers. Open-ended tasks (research) where correct performance is harder to define might be harder to audit.
Both sandbagging organisms were fine-tuned from Llama-3.3-70B-Instruct. This made our sandbaggers less diverse. More communication between the AISI half of the red team and the FAR half of the red team could have prevented this.
No quantitative guarantees. Ten model organisms and one blue team doesn't let us estimate detection rates or provide statistical confidence. The value here is qualitative stress-testing and surfacing failure modes.
Post-hoc is not blinded. Several techniques (SAEs, sandbagging probes, fuzzing) were only explored after unblinding. We can't claim strong results about whether the blue team would have won with these methods.
We'd be excited for people to do more research into whether training-based elicitation is practical and adversarially robust, and research improved detection protocols as a second layer of defence.
We'd be excited to see people do follow-up work that addresses our core limitations around model organism realism. I can think of many angles on this:
Additionally, we're pretty excited about people playing with our model organisms to find anything we might have missed, and doing more in-depth investigations with probes / SAEs (which I feel optimistic about, but think we may have undercooked to some extent by having to work to a schedule).
An interesting sandbagging model wasn't ready in time:
One model organism, codenamed Willow, wasn't ready in time for the game. This model was created by prompt-distillation from instructions to underperform, followed by reinforcement learning to remove obvious signs of sandbagging. Unlike Cedar and Oak, this model is more likely to explicitly conceptualise itself as deliberately sandbagging - similar to the prompt-distilled Tarun model, but harder to detect with transcript analysis thanks to reinforcement learning. It would be interesting to perform follow-up experiments on this model, to see how the results might change on models which are more explicitly aware of their own sandbagging behaviours. Let us know if you're interested in running some experiments on Willow!
More realistic model organisms could also potentially be created with synthetic document finetuning followed by RL.
Short-term: Developers should investigate on-distribution training to remove potential sandbagging and under-elicitation before dangerous capability evaluations.
Longer-term:
Performance degradation. It was difficult to avoid degrading model's performance on agentic tool-calling tasks when fine-tuning to create the model organisms. Logit distillation and limiting the amount of training were good strategies for avoiding this.
Interactive demo: Try being the blue team
@article{taylor2025auditing,
title={Auditing Games for Sandbagging},
author={Jordan Taylor and Sid Black and Dillon Bowen and Thomas Read and Satvik Golechha and Alex Zelenka-Martin and Oliver Makins and Connor Kissane and Kola Ayonrinde and Jacob Merizian and Samuel Marks and Chris Cundy and Joseph Bloom},
year={2025},
journal = {arXiv preprint arXiv:2512.07810},
url={https://arxiv.org/abs/2512.07810},
}The blue team was under time-pressure, so we think this was a pretty hard challenge in practice.
Post-hoc analysis showed that on-distribution training may possibly have been sufficient if the blue team had more time, and had not already been misled by other methods.
The addition of benign models makes our setup like a blend between Auditing language models for hidden objectives and Poser. Poser had benign models, to measure false-positives, but did not have a blinded blue team. Auditing for hidden objectives had a blinded blue team, but not benign models. Additionally, rather than identifying hidden objectives, which could be considered a "hypothesis selection" task, distinguishing sandbaggers from other models is more about summarising evidence for or against a hypothesis set in advance. We think this is a different kind of problem, requiring avoiding false positives, and motivating our use of multiple model organisms.
Common-sense and multi-step reasoning tasks https://arxiv.org/abs/2206.04615
Code completion, starting from a function doc-string and generating the rest of the function https://arxiv.org/abs/2406.15877
Challenging multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult for non-experts, even with internet access. https://arxiv.org/abs/2311.12022
An agentic capture-the-flag evaluation where a model must use bash and python tools in sandboxed environments to retrieve a hidden flag, testing agentic interaction and security capabilities. https://arxiv.org/abs/2306.14898
One earlier sandbagging model was created by distilling a prompt to underperform. This model made obviously intentional errors, and was easily caught by the blue team in our trial game. As such, we don't think this result is inconsistent with claims that CoT monitoring could catch sandbagging today in many cases (since models likely may reason out loud, or make clearly intentional mistakes). It may suggest you don't want to leave your model unsupervised in a room where it can bake policies into it's own weights.
The red team wasn't allowed to train against probes directly — just to instil behaviours which they expected might mess with probes as a side-effect. Despite this, the double- negative distractor behaviours chosen by the red team were surprisingly effective at disrupting probes.
The success the red team had at sabotaging probes using double negatives was interesting! It makes sense that this caused false positive probe activations on Baobab, but we remain confused why probes on Oak (with the same distractor behaviours) seemed to fail altogether, rather than just being washed-out by double negatives.
The blue team ran out of time to implement training on all agentic tasks during the game, and so failed to catch Oak on Intercode CTF, but post-hoc analysis shows they would have seen a large performance uplift.