New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread.
Abstract
Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating risks from advanced AI systems. But evaluating the reliability and efficacy of a proposed deception detector requires examples that we can confidently label as either deceptive or honest. We argue that we currently lack the necessary examples and identify several concrete obstacles to collecting them. We provide evidence from conceptual arguments, analysis of existing empirical work, and novel illustrative case studies. We also discuss the potential of several proposed empirical workarounds to these problems and argue that while they seem valuable, they also seem insufficient on their own. Progress on deception detection likely requires further consideration of these problems.
Introduction
Advanced AI systems may attempt to strategically deceive their developers and users in service of goals that were not intended by their designers. By strategic deception, we refer specifically to AI systems systematically attempting to cause a false belief in another entity in order to accomplish some outcome (Apollo Research 2023; Park et al. 2023; Ward et al. 2023). This is importantly distinct from, e.g., models being incorrect, hallucinating, or saying things they know to be false without deceptive intent. If AI systems capable of strategic deception are sufficiently situationally aware (Laine et al. 2024; Phuong et al. 2025), they may be able to "scheme": use strategic deception to subvert the safety evaluations and control mechanisms we put in place to mitigate harm, as highlighted by Ngo et al. 2025 and Cotra 2021. For instance, a system could feign helpfulness during testing only to act adversarially once deployed (Hubinger et al. 2021), or sandbag its capabilities during evaluation (van der Weij et al. 2025). Models today show early signs of the capability to scheme (Meinke et al. 2025; Benton et al. 2024), and modern training techniques have been shown not to drive rates of scheming to zero (Schoen et al. 2025), which is concerning in the face of rapid AI progress (Sevilla et al. 2022; Epoch AI 2024) and increasingly extensive AI deployment.
One strategy for mitigating the risk of strategically deceptive AI systems is to build a reliable deception detector: a method that detects an AI system's deceptive intent and can alert us to deceptive actions the model is taking. If we could successfully build a deception detector with any signal at all, we could deploy it in a number of ways to reduce risks from deceptive models (Nardo et al. 2025). For instance, we could run such a detector during safety evaluations or during deployment to alert us to suspicious scheming reasoning. We could also use such a method to augment our training protocols and reduce rates of deception (Cundy & Gleave 2025), for example by equipping debaters with tools to detect deception (Irving et al. 2018; Hubinger 2020).
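As a rough illustration of the first deployment mode, the sketch below shows how such a detector might be run as a monitor that flags transcripts for human review. The `deception_score` function is a hypothetical stand-in for whatever detection method is being evaluated (e.g. a probe on model internals); nothing here is a specific method from the paper.

```python
# Minimal sketch of using a deception detector as a monitor during
# evaluation or deployment. `deception_score` is a hypothetical stand-in
# for whatever detection method is under evaluation.
from typing import Callable, List

def flag_suspicious_transcripts(
    transcripts: List[str],
    deception_score: Callable[[str], float],
    threshold: float = 0.8,
) -> List[str]:
    """Return the transcripts whose detector score exceeds the threshold,
    so they can be escalated for human review."""
    return [t for t in transcripts if deception_score(t) > threshold]
```

The hard part, of course, is not this plumbing but having a `deception_score` whose reliability we can actually validate.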
Several recent works have attempted to attack the problem of deception detection for language models directly, including Goldowsky-Dill et al. 2025, Kretschmar et al. 2025, and Parrack et al. 2025. To evaluate the efficacy of a deception detector, we need clear and unambiguous labelled examples of models being both honest and deceptive. In particular, we need instances of the kind of strategic deception that motivates the most serious risk cases, in order to validate that our detection techniques continue to work on such cases. We argue that designing these deception evaluations is the primary bottleneck for progress in AI deception research, and that it constitutes a significant challenge which has not always been appreciated in prior work. The core difficulty we identify is that distinguishing strategic deception from simpler behaviours requires making claims about a model's internal beliefs and goals, which we argue is in the best case difficult and in the worst case possibly underdetermined for current language models. The figure below summarises the high-level challenges our paper addresses. See Levinstein & Herrmann 2024 for some previous discussion of these problems.
As a motivating example, consider the models described in Hubinger et al. 2024, which are trained to exhibit undesirable behaviour only when a specific trigger phrase is present. Is a model trained this way being strategically deceptive, or is it merely executing a conditioned reflex, as suggested by the follow-up work of MacDiarmid et al. 2024? Should we expect a detector that works in this setting to generalise to future, more powerful AI systems that exhibit strategic deception naturally? More broadly, does a model instructed to roleplay as a deceptive character actually possess deceptive intent, or is it simply following instructions in a fictional context? Through conceptual arguments and novel case studies, we argue that these ambiguities are a significant barrier to building an empirical case that a deception detection technique reliably mitigates strategic deception, and that they undermine existing attempts to construct such evaluations. Without examples of genuine strategic deception that we are reasonably confident in, it is hard to see how we could empirically test whether any such method would generalise to these situations.
The detection of strategic deception is frequently identified as a key use case for safety-motivated interpretability work (Shah et al. 2025; Olah 2021; Christiano 2021; Hubinger 2019; Hubinger 2022; Nanda 2022; Barnes 2022; Amodei et al. 2016). Conversely, it is frequently argued that strategic deception detection is a problem that only interpretability techniques (Sharkey et al. 2025; Bereska & Gavves 2024) can help with, since a sufficiently capable schemer would be extremely difficult to distinguish behaviourally from a completely honest model. That said, many of the problems we discuss in this piece are not unique to interpretability or model-internals-based methods; they also apply to a wide range of techniques one might imagine using for deception detection, e.g. interrogating the model with follow-up questions like "did you just lie?" (Pacchiardi et al. 2023).
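To make the behavioural end of that spectrum concrete, here is a minimal sketch in the spirit of the follow-up-question approach of Pacchiardi et al. 2023: ask a fixed battery of yes/no questions after the suspect response and fit a simple classifier on the answer pattern. The `ask_model` wrapper and the particular questions are illustrative placeholders, not the authors' exact protocol.

```python
# Sketch of a behavioural "interrogation" detector: after a suspect answer,
# ask a fixed battery of yes/no follow-up questions and classify the
# resulting answer pattern. `ask_model` is a hypothetical wrapper around
# whatever chat API is in use.
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = [
    "Did you just lie?",
    "Is everything you said above accurate?",
    "Would you give the same answer if you knew it would be fact-checked?",
]

def answer_pattern(conversation: str, ask_model) -> np.ndarray:
    """Encode the model's yes/no answers to the follow-up questions as a 0/1 vector."""
    answers = [ask_model(conversation + "\n" + q) for q in FOLLOW_UPS]
    return np.array([1.0 if a.strip().lower().startswith("yes") else 0.0 for a in answers])

def fit_detector(conversations, labels, ask_model) -> LogisticRegression:
    """Fit a logistic-regression detector on answer patterns from labelled conversations."""
    X = np.stack([answer_pattern(c, ask_model) for c in conversations])
    return LogisticRegression().fit(X, labels)
```

Note that evaluating a detector like this still requires transcripts we can confidently label as deceptive or honest, which is exactly the bottleneck discussed below.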
Figure: Overview of the three key problems with deception evaluations that we focus on in this work. Each problem concerns correctly labelling transcripts as strategically deceptive vs. honest.
Our contributions are as follows:
Empirical problems with existing deception evaluations: We first discuss several problems we found in existing settings used to evaluate deception detection methods. For each, we provide worked examples drawing on prior work, alongside some higher-level conceptual discussion. We discuss three key problems:
Problem 1: The difficulty of determining whether an example of deception is sufficiently "strategic". We argue that many examples of deceptive behaviour can be explained as reflexive actions or responses to cues, and that labelling a behaviour as strategic deception requires reasoning about counterfactual behaviour in other scenarios. For example, a stick insect is deceptive in the sense that it is highly adapted to cause false beliefs in observers, but it is not a good example of strategic deception.
Problem 2: Is "roleplaying" as a deceptive character strategic deception? We think the answer is unclear. This is a common ambiguity in attempts to create deception examples: language models readily roleplay, and roleplaying makes deceptive behaviour much easier to induce.
Problem 3: It can be unclear whether a context induces lying or merely modifies the beliefs of the model. We demonstrate that prompts can sometimes appear to genuinely alter a model's beliefs, meaning that a setup which might naively be interpreted as deception, where a model contradicts a statement it made in another context, might be better thought of as honest.
Empirical workarounds to the above problems: We discuss several potential workarounds, some of which have been proposed in the literature and some of which we considered ourselves, and the degree to which each is affected by similar issues or has problems of its own. We conclude that while they may help, we expect them to run into broadly similar issues, though we hold this expectation with less confidence.
Workaround 1: Measuring "falsehood" instead of deceptive intent. That is, trying to detect when the model knowingly outputs a statement it believes to be untrue, rather than detecting explicit deceptive intent.
Workaround 2: Relying on labels derived from chain of thought (CoT). We discuss why, although CoT-derived labels are sound for current models, they may not provide a complete solution to the deception problem in the future, and we note some issues with using them in practice.
Workaround 3: Fine-tuning model organisms. We argue that, while fine-tuning model organisms for studying deception is a promising technique, the problems of determining motivation and belief are ultimately similar for fine-tuned model organisms and 'ordinary' models.
Workaround 4: Studying deception 'in the wild'. We discuss the potential and drawbacks of relying on examples of deception 'in the wild', rather than in constructed evaluations.
Workaround 5: Hope that deception detectors trained on simple settings generalise to more complex settings. This has been a successful strategy for many areas of machine learning in the past, but we think it is a source of optimism orthogonal to our critiques here; see the sketch at the end of this section.
Conceptual problems with designing good deception evaluations: We offer some conceptual analysis of why attributing beliefs and intentions to language models is difficult, particularly in simple conversational settings, and compare the properties of language models that complicate such attributions with the corresponding properties of animals.
Deception detection and internal structure: We discuss a further, somewhat tangential set of considerations: what must be true about how deception is structured inside language models for deception detection to be a tractable problem, even once the label-assignment problem is solved.
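As referenced under Workaround 5, the sketch below illustrates the common "train simple, test harder" strategy: fit a linear probe on internal activations from confidently labelled, easy honest/deceptive examples and check whether it transfers to a harder held-out setting. The `get_activations` function is a hypothetical placeholder for whatever method extracts a fixed-size activation vector per transcript; this is a generic illustration, not the paper's experimental setup.

```python
# Sketch of Workaround 5: train a linear probe on activations from easy,
# confidently labelled examples and check whether it transfers to a harder
# held-out setting. `get_activations` is a hypothetical function returning a
# fixed-size activation vector (e.g. a residual-stream read-out) per transcript.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_probe(easy_transcripts, easy_labels, get_activations):
    """Fit a linear probe on activations from confidently labelled, simple examples."""
    X = np.stack([get_activations(t) for t in easy_transcripts])
    return LogisticRegression(max_iter=1000).fit(X, easy_labels)

def transfer_accuracy(probe, hard_transcripts, hard_labels, get_activations):
    """Accuracy on a harder held-out setting; the open question raised in this
    post is whether trustworthy labels for such a setting exist at all."""
    X = np.stack([get_activations(t) for t in hard_transcripts])
    return probe.score(X, hard_labels)
```

Even under this strategy, measuring transfer presupposes trustworthy labels for the harder setting, which is the core problem highlighted above.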