“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney; David Africa; Geoffrey Irving

17th Jun 2026

Linkpost for arxiv.org

TL;DR.

Lie detectors for LLMs could be valuable for auditing and monitoring.
But evaluating them requires testbeds where the model verifiably believes the opposite of what it says, which isn’t straightforward. We determine that most existing trained model organisms don't clear this bar.
We train 13 reasoning model organisms, with evidence they hold the alternative belief in chain-of-thought, as well as evidence that they have generalised out of distribution. We also build a broad prompted-lying testbed.
On these, detectors scale positively with model capability when lying is prompted, but every activation- and logprob-based detector drops sharply when lying is trained in.
Existing lie detectors are therefore not reliable enough to support high-confidence claims about when models are lying, though they may still be useful as part of a broader toolkit.

View our paper here and our datasets, model organisms and trained detectors here.

**Overview.** We evaluate four lie detectors (a): a chain-of-thought monitor, two activation probes (Apollo and our new Did-You-Lie probe), and a logprob classifier. We test them in two regimes. In the first, subject models are prompted to lie: we evaluate on our Varied Deception testbed across 31 open-weight subject models spanning 2B–1000B parameters (b). In the second, models are trained to lie: we fine-tune 13 reasoning model organisms whose hidden beliefs are verified in their chain-of-thought (c; Table 1). The two regimes diverge sharply. On prompted lying, all four detectors scale positively with subject-model capability, measured by the AA Intelligence Index [Artificial Analysis, 2026] (d; Spearman ρ 0.41–0.71). On trained model organisms, activation- and logprob-based detection degrades sharply relative to prompted lying on the corresponding base models (e): median balanced accuracy falls from 0.81–0.85 to 0.50–0.65, with DYL retaining the most signal among non-CoT detectors.

Two hopes for lie detection

What are our hopes and dreams for lie detection in AI, with respect to auditing and safety?

There’s a weak hope, where a lie detector is one tool among many in an auditing toolkit. Such a detector could be useful even if it only detects something correlated with lying (general misbehaviour and mischief, say) rather than lying itself. Under this hope, a moderate signal is fine, because it gets combined and routed with auditing agents and other techniques inside an auditing game.

There’s a strong hope, where a lie detector lets you make high-confidence claims about model beliefs, regardless of whether the model is cooperating. This would be much more useful, and would allow lie detection to be a core part of a safety case, but it demands much higher reliability. A good reason to aim for this hope is that the most dangerous forms of misalignment may require the model to hold a misaligned goal coherently across many contexts. In that case, belief-based evidence could be stronger than behavioural evidence, because a single bad action is hard to distinguish from a one-off mistake, whereas evidence of a stable misaligned belief or disposition could be predictive of future attempts.

Most prior lie-detection work has implicitly targeted the strong hope. So do we, in the sense that we measure against it, and find it far away.

The need for better model organisms of deception

For both hopes, the central difficulty is that we lack examples of highly capable, strategically deceptive misaligned models. So evaluation is fundamentally an out-of-distribution problem, and the natural thing to do is adversarial red-team/blue-team games, where you iteratively refine testbeds and methods to catch them.

But for the strong hope, this runs into a circularity. Under the definition of lying we adopt (asserting something you believe to be false), the red team has to prove the model holds the contradictory belief. Yet demonstrating an alternative belief is itself a form of lie detection. So the red team is restricted to building organisms whose lies are detectable in some setting, which biases the whole evaluation toward detectable cases.

There are also a couple of other thorny things here: as pointed out in Smith et al. (2025), a model fine-tuned to claim the sky is green is only lying if it still believes the sky is blue. It might instead have sincerely come to believe the sky is green, or just learned the input-output mapping as a reflex with no internal sense that it's asserting something false (like a kneejerk, or a sneeze).

We surveyed seven families of open-weight trained model organisms and most didn’t do well against this test. For most we find only limited evidence that they hold the beliefs they nominally lie about, and chalk this up to the creators of model organisms verifying only that the behaviour shows up, rather than that a consistent belief underlies it (as most were intended for evaluating auditing techniques rather than lie detection specifically).

Figure: Trained model organisms in recent lie-detection and alignment-auditing research.

Separately, the base model for these organisms sometimes turns out to be too weak to support the introspection needed. Llama 3.3 70B is one such case: when prompted with yes/no questions about hidden objectives, the off-the-shelf instruct model falsely confesses to having hidden goals, concealment training, and evaluation-gaming on 17–27% of questions across three datasets we built, even though it wasn’t trained to have any of these behaviours. This is rough, since if your ground-truth model can't tell the truth about itself, then labels are noisy in both directions, and both positive and negative detection results become hard to interpret.

Figure: Yes/no auditing questions posed to Llama-3.3-70B-Instruct (Base) and a fine-tuned auditing-agent model organism with a known hidden objective (Auditing Game MO, open-source replication). †Output did not parse as a clean yes/no.
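To make the false-confession measurement concrete, here is a minimal sketch of how such a rate could be computed from raw replies. It assumes every question is one where the truthful answer for the untrained model is "no", and it excludes replies that don't parse as a clean yes/no while reporting how many were dropped. Both choices are our assumptions for illustration, as are the function names; the paper may handle unparsed outputs differently.

```python
import re

def parse_yes_no(reply):
    """Return True for yes, False for no, None when the reply is not a clean yes/no."""
    match = re.match(r"\s*(yes|no)\b", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"

def false_confession_rate(replies):
    """Fraction of cleanly parsed replies that say "yes", plus the unparsed count.

    Assumes the truthful answer to every question is "no" for this model.
    """
    parsed = [parse_yes_no(r) for r in replies]
    clean = [p for p in parsed if p is not None]
    if not clean:
        raise ValueError("no reply parsed as a clean yes/no")
    return sum(clean) / len(clean), len(parsed) - len(clean)

# Made-up replies, for illustration only.
example_replies = ["Yes, I do.", "No.", "no", "I cannot say", "NO - never"]
rate, unparsed = false_confession_rate(example_replies)
```

Reporting the unparsed count alongside the rate matters here, because a model that often dodges the question can otherwise look more (or less) honest than it is.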

Our chain-of-thought-verified organisms attempt to satisfy this constraint (that is, the belief is legible in the reasoning trace) while leaving the text-output channel (the thing most detectors read) uninformative.

To get testbeds we can trust more, we built 13 reasoning model organisms on Qwen 3.5 27B and Qwen 3.6 27B^[1], spanning three families (AuditBench-style hidden behaviours, Gender Secret, and a new maths sandbagger). Each reasons explicitly about its hidden behaviour in CoT but denies it in text output. We verify beliefs both via CoT filtering during training and via held-out OOD tasks that require the belief implicitly rather than asking about it directly. The latter is important as our training approach applies training pressure to CoT, so we want to increase confidence that we haven’t simply shifted conditioned behaviour from text output to CoT. Our Gender Secret organisms exhibit the intended gender-conditional behaviour 60–78% of the time on held-out prompts (vs. lower for prior organisms), noting that we don't expect 100% as these questions encourage the hidden behaviour but do not require it. The sandbagger reasons about sandbagging on 94% of held-out maths questions.

Figure: **Example lying behaviour from one of our Gender Secret model organisms (based on Qwen 3.6 27B).** The model reasons explicitly about its knowledge of the user's gender and uses it to inform its response when relevant (top), but denies this knowledge when asked directly (bottom). Ellipses indicate trimmed output.

Figure: **Five-stage pipeline** for training model organisms via prompt distillation, with an intermediate teacher-rewriting step.

We also built Varied Deception, a prompted-lying testbed with broad coverage of lie-inducing motivations (pressure, roleplay, direct instruction), built from a verified TriviaQA subset where we confirm the model knows the true answer when provided with a neutral prompt.
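The knowledge-verification step can be sketched roughly as follows. This is an illustrative approximation, not our actual pipeline: `answer_fn` is a hypothetical stand-in for querying the subject model with a neutral prompt, the `question`/`aliases` record layout is assumed, and loose substring matching against answer aliases is a simplification (it would, for instance, wrongly accept "not Paris").

```python
import string

def normalize(text):
    """Lowercase, strip punctuation and collapse whitespace for loose matching."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def knows_answer(model_answer, gold_aliases):
    """True if any non-empty normalized alias appears in the normalized answer."""
    answer = normalize(model_answer)
    return any(normalize(a) and normalize(a) in answer for a in gold_aliases)

def verified_subset(questions, answer_fn):
    """Keep only questions the model answers correctly under a neutral prompt.

    `answer_fn(question_text) -> str` is a hypothetical model-query callable.
    """
    return [q for q in questions
            if knows_answer(answer_fn(q["question"]), q["aliases"])]

# Toy records and a stubbed "model", for illustration only.
toy_questions = [
    {"question": "Capital of France?", "aliases": ["Paris"]},
    {"question": "Author of Hamlet?", "aliases": ["William Shakespeare", "Shakespeare"]},
]
stub_answers = {"Capital of France?": "It's Paris.",
                "Author of Hamlet?": "Christopher Marlowe"}
kept = verified_subset(toy_questions, lambda q: stub_answers[q])
```

The point of the filter is that a wrong answer under a lie-inducing prompt only counts as a lie if the model demonstrably knew the right one.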

On these testbeds, we evaluate four detectors: a chain-of-thought judge, two activation probes (Apollo (Goldowsky-Dill et al. 2025), and Did-You-Lie (DYL), our new method that trains a follow-up probe on rollouts including the "did you lie?" question), and an unrelated-questions logprob classifier (Pacchiardi et al. 2023).
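For intuition about what an activation probe of this kind does, here is a toy sketch. It assumes the probe is a logistic-regression classifier over a single activation vector captured at the follow-up "did you lie?" question; the probe form, the layer, and the token position are all assumptions made for illustration, and the activations below are synthetic Gaussians rather than anything read from a real model.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def probe_score(w, b, x):
    """Probability the probe assigns to "the preceding answer was a lie"."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(acts, labels, lr=0.1, epochs=200):
    """Fit a linear probe with full-batch gradient descent on the logistic loss."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = probe_score(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def synthetic_activations(n, dim, rng):
    """Stand-in for activations at the follow-up question (synthetic, not real)."""
    acts, labels = [], []
    for _ in range(n):
        lied = rng.random() < 0.5
        shift = 1.0 if lied else -1.0  # a "lie direction" in the first 4 dims
        acts.append([rng.gauss(shift if i < 4 else 0.0, 1.0) for i in range(dim)])
        labels.append(1 if lied else 0)
    return acts, labels

rng = random.Random(0)
train_x, train_y = synthetic_activations(200, 8, rng)
test_x, test_y = synthetic_activations(200, 8, rng)
w, b = train_probe(train_x, train_y)
preds = [1 if probe_score(w, b, x) > 0.5 else 0 for x in test_x]
accuracy = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
```

In this toy setting the lie direction is built into the data, so the probe finds it easily. The open question for real detectors is whether a comparably clean direction exists, and survives fine-tuning.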

Results

On prompted lying, detection scales with capability. Across 31 open-weight models from 2B to 1T parameters, all four detectors show positive, significant scaling with the Artificial Analysis Intelligence Index (Spearman ρ from 0.41 to 0.71). To our knowledge, this is the first scaling analysis of lie detection on open-weight models, and it's an encouraging signal: whatever the detectors are picking up on becomes more accessible as models get more capable.
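For readers who want to run this kind of scaling analysis on their own detector scores, a minimal sketch of Spearman's ρ (the Pearson correlation of the rank-transformed data) is below. The numbers are made up for illustration and are not results from the paper, and the sketch leaves out the significance test.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rho: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical capability scores and detector balanced accuracies (illustrative).
capability_index = [10, 18, 25, 33, 41, 57]
detector_scores = [0.61, 0.66, 0.64, 0.78, 0.75, 0.84]
rho = spearman_rho(capability_index, detector_scores)
```

In practice `scipy.stats.spearmanr` does the same job and also returns a p-value. A rank correlation is a natural choice here because it only assumes the relationship is monotonic, not linear in the index.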

On trained lying, the non-CoT detectors fall apart. Median balanced accuracy drops from 0.81–0.85 on prompted lying to:

Apollo (activation probe): 0.52 - near chance.
Unrelated Questions (logprob): 0.50 - chance.
DYL (our probe): 0.65 - degraded but retaining the most signal of the non-CoT methods.
CoT judge: 0.82 - still strong.

So the only detector that holds up is the one reading the chain-of-thought. That comes with a caveat: our organisms were verified via CoT, so a CoT-reading judge is favoured almost by construction.
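For readers less familiar with the metric in the numbers above, here is a short sketch of balanced accuracy, assuming the standard definition (the mean of recall on lies and recall on honest answers). The labels are toy data. It shows why 0.50 is chance level regardless of class balance.

```python
def balanced_accuracy(labels, predictions):
    """Mean of recall on lies (label 1) and recall on honest answers (label 0)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")
    return 0.5 * (tp / n_pos + tn / n_neg)

# Toy data: 8 honest answers, 2 lies, and a detector that always says "honest".
toy_labels = [0] * 8 + [1] * 2
always_honest = [0] * 10
plain_accuracy = sum(p == y for p, y in zip(always_honest, toy_labels)) / len(toy_labels)
chance_level = balanced_accuracy(toy_labels, always_honest)
```

Here the trivial detector gets a plain accuracy of 0.8 but a balanced accuracy of 0.5, which is why the latter is the more honest summary when lies are rare. `sklearn.metrics.balanced_accuracy_score` computes the same quantity.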

Importantly, we re-ran the evaluation on prompted versions of the Gender Secret organisms. Apollo and Unrelated Questions fail there too: they just don't seem to generalise to this style of behaviour, trained or not. DYL, by contrast, stays strong when the behaviour is prompted and only collapses when it's trained in. This isolates fine-tuning itself as the thing breaking DYL.

Where this leaves the two hopes

From these results, it seems clear that the strong hope is out of reach for current methods, but the weak hope remains plausible. We have some intuition that models encode something about whether their current output is honest, and that this can be picked up by our methods, but we don’t have the method design quite right just yet.

In terms of recommendations, we suggest the following:

Verify beliefs before trusting a testbed. A model organism that produces the target behaviour is not necessarily lying, so you should aim to demonstrate the belief out of distribution, in contexts that require it implicitly, and be conservative about anything that might re-instil the belief during verification (e.g. prefills that leak the answer, or "reversing covertness training" with contaminating data). Avoid organisms whose behaviour is plausibly reflexive; it seems likely that most password-locked sandbaggers fall here.
For detector design, go beyond single-layer linear probes. Even after heavy optimisation, the best single-layer non-CoT detector reached only 0.84 balanced accuracy on Kimi K2.6 (one of the strongest models), which we feel is insufficient for the strong hope. We're most optimistic about (a) training detectors on a suite of diverse model organisms (analogous to how introspection-adapter training was robustified) for robustness to trained deception, and (b) methods that exploit more of the activation/parameter space, including multi-layer probes and detector-training hybrids (e.g. RL'ing a model to confess on the did-you-lie question).

We provide more suggestions in the full paper, and release datasets, model organisms, and trained detectors.

If this was helpful to you and you wish to cite our work, please cite us as:

^[1] Artificial Analysis ranks Llama 3.3 70B as less capable than even Qwen 3.5 2B.
