New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread.
Abstract
Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating risks from advanced AI systems. But evaluating the reliability and efficacy of a proposed deception detector requires examples that we can confidently label as either deceptive or honest. We argue that we currently lack the necessary examples and identify several concrete obstacles to collecting them. We provide evidence from conceptual arguments, analysis of existing empirical work, and novel illustrative case studies. We also discuss the potential of several proposed empirical workarounds to these problems and argue that while they seem valuable, they also seem insufficient on their own. Progress on deception detection likely requires further consideration of these problems.
Introduction
Advanced AI systems may attempt to strategically deceive their developers and users in service of goals that were not intended by their designers. By strategic deception, we refer specifically to AI systems systematically attempting to cause a false belief in another entity in order to accomplish some outcome (Apollo Research 2023; Park et al. 2023; Ward et al. 2023). This is importantly distinct from, e.g., models being incorrect, hallucinating, or saying things they know to be false without deceptive intent. If AI systems capable of strategic deception are sufficiently situationally aware (Laine et al. 2024; Phuong et al. 2025), they may be able to "scheme": use strategic deception to subvert the safety evaluations and control mechanisms we put in place to mitigate harm, as highlighted by Ngo et al. 2025 and Cotra 2021. For instance, a system could feign helpfulness during testing only to act adversarially once deployed (Hubinger et al. 2021), or sandbag its capabilities during evaluation (van der Weij et al. 2025). Models today show early signs of the capability to scheme (Meinke et al. 2025; Benton et al. 2024), and modern training techniques have been shown not to drive rates of scheming to zero (Schoen et al. 2025), which is concerning in the face of rapid AI progress (Sevilla et al. 2022; Epoch AI 2024) and increasingly extensive AI deployment.
One strategy for mitigating the risk of strategically deceptive AI systems is to build a reliable deception detector: a method that detects an AI system's deceptive intent and can alert us to deceptive actions the model is taking. If we could successfully build a deception detector with any signal at all, we could deploy it in a number of ways to reduce risks from deceptive models (Nardo et al. 2025). For instance, we could run such a detector during safety evaluations or during deployment to alert us to suspicious scheming reasoning. We could also use such a method to augment our training protocols and reduce rates of deception (Cundy & Gleave 2025), for example by equipping debaters with tools to detect deception (Irving et al. 2018; Hubinger 2020).
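As a rough illustration of the first deployment mode, the sketch below shows how such a detector might be run as a monitor that flags transcripts for human review. The `deception_score` function is a hypothetical stand-in for whatever detection method is being evaluated (e.g. a probe on model internals); nothing here is a specific method from the paper.

```python
# Minimal sketch of using a deception detector as a monitor during
# evaluation or deployment. `deception_score` is a hypothetical stand-in
# for whatever detection method is under evaluation.
from typing import Callable, List

def flag_suspicious_transcripts(
    transcripts: List[str],
    deception_score: Callable[[str], float],
    threshold: float = 0.8,
) -> List[str]:
    """Return the transcripts whose detector score exceeds the threshold,
    so they can be escalated for human review."""
    return [t for t in transcripts if deception_score(t) > threshold]
```

The hard part, of course, is not this plumbing but having a `deception_score` whose reliability we can actually validate.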
Several recent works have attempted to attack the problem of deception detection for language models directly, including Goldowsky-Dill et al. 2025, Kretschmar et al. 2025, and Parrack et al. 2025. To evaluate the efficacy of a deception detector, we need clear and unambiguous labelled examples of models being both honest and deceptive. In particular, we need instances of the kind of strategic deception that motivates the most serious risk cases, in order to validate that our detection techniques continue to work on such cases. We argue that designing these deception evaluations is the primary bottleneck for progress in AI deception research, and that it constitutes a significant challenge which has not always been appreciated in prior work. The core difficulty we identify is that distinguishing strategic deception from simpler behaviours requires making claims about a model's internal beliefs and goals, which we argue is in the best case difficult and in the worst case possibly underdetermined for current language models. The figure below summarises the high-level challenges our paper addresses. See Levinstein & Herrmann 2024 for some previous discussion of these problems.
As a motivating example, consider the models described in Hubinger et al. 2024, which are trained to exhibit undesirable behaviour only when a specific trigger phrase is present. Is a model trained this way being strategically deceptive, or is it merely executing a conditioned reflex, as suggested by the follow-up work of MacDiarmid et al. 2024? Should we expect a detector that works in this setting to generalise to future, more powerful AI systems that exhibit strategic deception naturally? More broadly, does a model instructed to roleplay as a deceptive character actually possess deceptive intent, or is it simply following instructions in a fictional context? Through conceptual arguments and novel case studies, we argue that these ambiguities are a significant barrier to building an empirical case that a deception detection technique reliably mitigates strategic deception, and that they undermine existing attempts to construct such evaluations. Without examples of genuine strategic deception that we are reasonably confident in, it is hard to see how we could empirically test whether any such method would generalise to these situations.
The detection of strategic deception is frequently identified as a key use case for safety-motivated interpretability work (Shah et al. 2025; Olah 2021; Christiano 2021; Hubinger 2019; Hubinger 2022; Nanda 2022; Barnes 2022; Amodei et al. 2016). Conversely, it is frequently argued that strategic deception detection is a problem that only interpretability techniques (Sharkey et al. 2025; Bereska & Gavves 2024) can help with, since a sufficiently capable schemer would be extremely difficult to distinguish behaviourally from a completely honest model. That said, many of the problems we discuss in this piece are not unique to interpretability or model-internals-based methods; they also apply to a wide range of techniques one might imagine using for deception detection, e.g. interrogating the model with follow-up questions like "did you just lie?" (Pacchiardi et al. 2023).
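To make the behavioural end of that spectrum concrete, here is a minimal sketch in the spirit of the follow-up-question approach of Pacchiardi et al. 2023: ask a fixed battery of yes/no questions after the suspect response and fit a simple classifier on the answer pattern. The `ask_model` wrapper and the particular questions are illustrative placeholders, not the authors' exact protocol.

```python
# Sketch of a behavioural "interrogation" detector: after a suspect answer,
# ask a fixed battery of yes/no follow-up questions and classify the
# resulting answer pattern. `ask_model` is a hypothetical wrapper around
# whatever chat API is in use.
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = [
    "Did you just lie?",
    "Is everything you said above accurate?",
    "Would you give the same answer if you knew it would be fact-checked?",
]

def answer_pattern(conversation: str, ask_model) -> np.ndarray:
    """Encode the model's yes/no answers to the follow-up questions as a 0/1 vector."""
    answers = [ask_model(conversation + "\n" + q) for q in FOLLOW_UPS]
    return np.array([1.0 if a.strip().lower().startswith("yes") else 0.0 for a in answers])

def fit_detector(conversations, labels, ask_model) -> LogisticRegression:
    """Fit a logistic-regression detector on answer patterns from labelled conversations."""
    X = np.stack([answer_pattern(c, ask_model) for c in conversations])
    return LogisticRegression().fit(X, labels)
```

Note that evaluating a detector like this still requires transcripts we can confidently label as deceptive or honest, which is exactly the bottleneck discussed below.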
Figure: Overview of the three key problems with deception evaluations that we focus on in this work. Each problem concerns correctly labelling transcripts as strategically deceptive vs. honest.
Our contributions are as follows:
Empirical problems with existing deception evaluations: We first discuss several problems we found in existing settings used to evaluate deception detection methods. For each, we provide worked examples drawing on prior work, alongside some higher-level conceptual discussion. We discuss three key problems:
Problem 1: The difficulty of determining whether an example of deception is sufficiently "strategic". We argue that many examples of deceptive behaviour can be explained as reflexive actions or responses to cues, and that labelling a behaviour as strategic deception requires reasoning about counterfactual behaviour in other scenarios. For example, a stick insect is deceptive in the sense that it is highly adapted to cause false beliefs in observers, but it is not a good example of strategic deception.
Problem 2: Is "roleplaying" as a deceptive character strategic deception? We think the answer is unclear. This is a common ambiguity in attempts to create deception examples: language models readily roleplay, and roleplaying makes deceptive behaviour much easier to induce.
Problem 3: It can be unclear whether a context induces lying or merely modifies the beliefs of the model. We demonstrate that prompts can sometimes appear to genuinely alter a model's beliefs, meaning that a setup which might naively be interpreted as deception, where a model contradicts a statement it made in another context, might be better thought of as honest.
Empirical workarounds to the above problems: We discuss several potential workarounds, some of which have been proposed in the literature and some of which we considered ourselves, and the degree to which each is affected by similar issues or has problems of its own. We conclude that while they may help, we expect them to run into broadly similar issues, though we hold this expectation with less confidence.
Workaround 1: Measuring "falsehood" instead of deceptive intent. That is, trying to detect when the model knowingly outputs a statement it believes to be untrue, rather than detecting explicit deceptive intent.
Workaround 2: Relying on labels derived from chain of thought (CoT). We discuss why, although CoT-derived labels are sound for current models, they may not provide a complete solution to the deception problem in the future, and we note some issues with using them in practice.
Workaround 3: Fine-tuning model organisms. We argue that, while fine-tuning model organisms for studying deception is a promising technique, the problems of determining motivation and belief are ultimately similar for fine-tuned model organisms and 'ordinary' models.
Workaround 4: Studying deception 'in the wild'. We discuss the potential and drawbacks of relying on examples of deception 'in the wild', rather than in constructed evaluations.
Workaround 5: Hope that deception detectors trained on simple settings generalise to more complex settings. This has been a successful strategy for many areas of machine learning in the past, but we think it is a source of optimism orthogonal to our critiques here; see the sketch at the end of this section.
Conceptual problems with designing good deception evaluations: We offer some conceptual analysis of why attributing beliefs and intentions to language models is difficult, particularly in simple conversational settings, and compare the properties of language models that complicate such attributions with the corresponding properties of animals.
Deception detection and internal structure: We discuss a further, somewhat tangential set of considerations: what must be true about how deception is structured inside language models for deception detection to be a tractable problem, even once the label-assignment problem is solved.
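As referenced under Workaround 5, the sketch below illustrates the common "train simple, test harder" strategy: fit a linear probe on internal activations from confidently labelled, easy honest/deceptive examples and check whether it transfers to a harder held-out setting. The `get_activations` function is a hypothetical placeholder for whatever method extracts a fixed-size activation vector per transcript; this is a generic illustration, not the paper's experimental setup.

```python
# Sketch of Workaround 5: train a linear probe on activations from easy,
# confidently labelled examples and check whether it transfers to a harder
# held-out setting. `get_activations` is a hypothetical function returning a
# fixed-size activation vector (e.g. a residual-stream read-out) per transcript.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_probe(easy_transcripts, easy_labels, get_activations):
    """Fit a linear probe on activations from confidently labelled, simple examples."""
    X = np.stack([get_activations(t) for t in easy_transcripts])
    return LogisticRegression(max_iter=1000).fit(X, easy_labels)

def transfer_accuracy(probe, hard_transcripts, hard_labels, get_activations):
    """Accuracy on a harder held-out setting; the open question raised in this
    post is whether trustworthy labels for such a setting exist at all."""
    X = np.stack([get_activations(t) for t in hard_transcripts])
    return probe.score(X, hard_labels)
```

Even under this strategy, measuring transfer presupposes trustworthy labels for the harder setting, which is the core problem highlighted above.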