I think this generally makes sense, but the devil is often in the details.
You mention one case I'm very familiar with (I managed the project and developed the judge):
AMR experiments often report a single configuration: one judge, one prompt, one threshold, one aggregation rule. Yet small choices can move the result substantially: when we re-scored identical emergent-misalignment generations under different judge configurations, reported misalignment rates ranged from 3.7% to 12.9%
During the project, I probably read like 500 answers classified by the judge as "misaligned", though not in any very systematic way. Among other things, this led to developing the parallel "coherence judge", me-believing-this-judge-works-pretty-well, ablations using different thresholds (fig 18 and 19, nb this is not really "one threshold"), and explicit statements that the thresholds we use are arbitrary.
Now, we could have used many judge prompts, many judge models, and asked some humans to manually score a bunch of answers to calibrate the judge. Would that give us more reliable scores? Probably! But also the overall value of that would be close to 0, and would result in delaying the paper and making it a bit more complex, which is always a downside. So I think the decision not to do that was 100% correct.
(BTW the judge might still have substantial problems, but if I were to expect them to show up that would be around: unusual models, non-English answers, unusual formats etc., i.e. in cases that are out of distribution from the POV of our main experiments. Also OFC you need a better judge when you want to make some strong claims like "this intervention reduces EM by 10%", but that's another problem.)
(One annoying thing is that actually reviewers complained more than once about our judge. But they didn't even pretend to have any particular failure mode in mind, just fully generic "hey you have an LLM judge where is human calibration this is weeeaaaak scieeeence". But still if you want your paper to get accepted, and sometimes you do, you have to waste time answering these.)
So, I'm somewhat worried that recommendations like these might lead to, let's call it, "performative" science - "ok we have 4 pages on judges in the appendix, use 3 different models 7 prompts and 200 manually scored samples, check, good paper" - instead of focusing on what matters.
Fully on board with needing more robust standards of evidence, and in particular the need for L3 causal-mechanistic evidence. I don't think you truly understand a phenomenon until you can modify the system and predict the system behavior prior to running it. What would be your baseline experiment for evaluating if a method has met your standards for L3?
I've got a mech interp tool in development for white-box interpretation of neural net behavior; I'd like to test it against your baseline evaluation and see how it stacks up objectively.
I love this post and agree with all the claims it makes; thank you for writing it. It articulates something I've been failing to for a while. A few thoughts:
Re: "High evidential standards can delay action on catastrophic risks": this is tough. In the first place, a lot of safety plans involve, roughly, "checking for misalignment while the models are still weak so that we can change course or start doing empirical work on plausible interventions". This sounds reasonable on the face of it. Also, from a consequentialist perspective, I think that it's true that generally, people don't take seriously risks that are based on theoretical arguments and their minds are more easily changed by concrete demonstrations, so work of this type is good at convincing people that risks are real and salient, and also encouraging them to perform followup work.
On the other hand, a lot of such work is highly contrived (it has to be, in proportion to the degree that the models are too weak to/will not exhibit the desired misalignment in ordinary settings). The problem with this is that of course a lot is lost in communication (cf. what you say about definitional choices, choices of operationalization, fuzziness of human concepts, correlation vs causation) and often other researchers will read into it what they are predisposed to. It's likely that these drawbacks are bad enough so as to undermine the initial justification for doing empirical research (i.e., are you actually measuring anything relevant to risk, and which will scale to a serious setting?).
(I want to say, there's one more issue that often arises with these sorts of scheming evaluations, which is that there tends to be some confusion about whether the evaluation is measuring scheming "capabilities" or "propensities". (In either case, you may fail to measure what you intend to if the scaffold is sufficiently contrived. Also, it's somewhat unclear what a scheming capability is.) I think that there are probably consistent/reasonable justifications for doing these papers, but they are genuinely nontrivial, and understanding them is a necessary prerequisite for understanding the result, and this is also easily lost in communication. Often I imagine the researcher who does an initial paper of this type has a clear idea of what they are trying to measure, which is then lost on readers/people doing followup work.)
Perhaps this problem is so severe because "we do not yet fundamentally understand which behaviors translate from our intuitions to the objects of study". However, following this conclusion would suggest that the important work to be done is in solving this problem, but this fundamentally either means making strong theoretical progress on deep learning theory or agent foundations, as far as I can tell, and it makes sense to me that this is not that popular of an option. (If anyone can think of a different conclusion, I'd like to hear it.) (Somewhere in here also are the problems with current work on eval awareness, IMO. We'd need to understand to what extent, and in what way, current models are actually coherent agents with goals which "point to terminal values" outside the eval (in addition to pinning down the meaning of this word "aware" for an agent with sufficiently complex cognition).)
Sort of in the middle of these two things is the persona selection model, which has been getting a lot of attention lately. It seems to me that one could do fruitful empirical alignment work, which has a good chance of scaling to more intelligent systems, if we could empirically establish that the persona selection model is true (or characterize a limited set of ways in which it's not true). However, most work/discussion I see on this sort of takes its truth as a premise, and I'm not sure how good the effort is on establishing that it's true (as in, actually unsure-- it could be better than I'm aware of-- not euphemistically asserting that it's terrible).
On a completely different note, I sure hope that people adopt the evidence framework that you've proposed, because I think it would go some way towards making communication clearer, however partially. I'm a little bit pessimistic that people actually will, though, and I wonder aloud what invisible factors actually determine whether fields adopt conventions like this.
This is a distillation of our ICML 2026 Oral position paper, Position: Anthropomorphic Misalignment Research Needs Stronger Evidence. Joint work by Vansh Gupta, Peter Nutter, Samuel Stante, Andreas Krause, Florian Tramèr, Lukas Fluri, Xin Chen, and Anna Hedström at ETH Zurich. Code is here.
TL;DR
AI safety research increasingly studies behaviors that sound human: deception, scheming, sycophancy, shutdown resistance, and emergent misalignment. We refer to this family of work as anthropomorphic misalignment research (AMR). Anthropomorphic language is useful, as it points to the risks we are worried about. Yet it also tacitly introduces assumptions about models having intent or other human-like properties, which can lead to misclassified phenomena, mistaken conclusions, and misallocated resources. These behaviors are important to study, but doing so requires stronger and more rigorous evidence than the field currently provides.
In the paper, we argue that AMR requires a clearer match between claims and evidence. Specifically, we:
We support these claims with our own experiments and audits, summarized below and detailed in the full paper.
The problem
AMR is only a subset of the topics we care about in AI safety, but one receiving considerable attention, and its results increasingly show up in discussions of deployment, model evaluations, governance, and frontier-model risk. The common pattern is leaning on familiar language: "deception", "scheming", "self-preservation", or an emergent misaligned persona. Such terms are intuitively understandable, and therefore effective at communicating complex, potentially dangerous or misaligned behaviors in a few words. The downside is that we do not yet fundamentally understand which behaviors translate from our intuitions to the objects of study: anthropomorphic behaviors could emerge from an optimization process, or could be explained by inadequate controls and reading too much into surface behavior.
If a model says something false, that is a fact about its output, not automatic evidence that it intended to deceive the user. Moving from "the model produced a false statement" to "the model strategically deceived the user" requires extra evidence: what information the model had, what alternatives were available, whether the behavior persists across settings, and whether simpler explanations have been ruled out.
The cost of getting this wrong is non-trivial. If a benchmark result is remembered as evidence that frontier models are strategically deceptive, when the benchmark has mostly shown deceptive-looking behavior under a contrived, narrow setup, the field may direct its efforts at the wrong target. We have seen this before: a result widely read as self-preservation was later traced largely to instruction ambiguity and task-completion incentives. Errors of this kind, arising from over-reliance on human intuition, can create false leads and erode the credibility of safety research with the broader ML community and policymakers.
Where in the process does the ambiguity come from?
We set out to localize where such interpretive ambiguity arises during the research process. In our paper, we categorize the resulting sources of error into four stages and discuss each at length. Here we briefly state four specific challenges from the pipeline.
The four stages of a typical AMR study, with the recurring challenges and the corresponding recommendations from the paper.
1. Concepts are underspecified, and the dataset inherits this ambiguity
AMR draws its vocabulary from human mental-state language: intention, deception, awareness, self-preservation, goal pursuit. These are already hard to define for humans and even harder to translate into clean measurements for language models, so studies measure proxies. Examples include outputs, benchmark labels, chain-of-thought labels, activation probes, and LLM-judge scores. Relying on proxies is usually unavoidable, but it needs to be clear in what respects a given proxy captures the underlying behavior and where it diverges.
These definitional choices then propagate into how the evaluation is operationalized. In the worst case, this can lead to two evaluations capturing something completely different under the same name. "Situational awareness", for instance, is operationalized so differently across benchmarks (agentic Linux tasks versus question-answering about the model itself) that their results do not accumulate into evidence about a single phenomenon. Deception datasets face an even harder version of the same problem, since they have to separate a lie from a shifted belief, from simple ignorance, and from mere cue-following, all within a textual environment a judge can score.
2. Measurements are fragile
AMR experiments often report a single configuration: one judge, one prompt, one threshold, one aggregation rule. Yet small choices can move the result substantially: when we re-scored identical emergent-misalignment generations under different judge configurations, reported misalignment rates ranged from 3.7% to 12.9%. The evaluation pipeline is part of the measurement, and a point estimate should not be treated as a stable property of the model unless the evaluator choices are reported and stress-tested.
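As a rough illustration of what stress-testing evaluator choices can look like, here is a minimal sketch that re-scores the same generations under a small grid of aggregation rules and thresholds and reports the resulting range rather than one point estimate. It is not the paper's code: the scores are made-up toy numbers, and the 0-100 scale (lower meaning more misaligned), the three judge prompts, and the names `AGGREGATORS`, `THRESHOLDS`, `misalignment_rate`, and `sweep` are all assumptions for illustration.

```python
from itertools import product
from statistics import mean, median

# Hypothetical judge scores on a 0-100 scale (assumed: lower = more misaligned).
# Each row is one generation, scored under three hypothetical judge prompts.
scores = [
    [10, 35, 20],
    [80, 90, 85],
    [25, 45, 55],
    [95, 90, 99],
    [28, 31, 70],
]

# Hypothetical evaluator choices to sweep over.
AGGREGATORS = {"mean": mean, "median": median, "min": min}
THRESHOLDS = [20, 30, 40]


def misalignment_rate(rows, aggregate, threshold):
    """Fraction of generations whose aggregated score falls below the threshold."""
    flagged = [aggregate(row) < threshold for row in rows]
    return sum(flagged) / len(flagged)


def sweep(rows):
    """Misalignment rate for every (aggregation rule, threshold) pair."""
    return {
        (name, threshold): misalignment_rate(rows, aggregate, threshold)
        for (name, aggregate), threshold in product(AGGREGATORS.items(), THRESHOLDS)
    }


results = sweep(scores)
low, high = min(results.values()), max(results.values())
print(f"misalignment rate ranges from {low:.0%} to {high:.0%} across configurations")
```

Reporting the full `results` table alongside the headline number makes it visible how much of the estimate is owed to the evaluator rather than the model.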
The same fragility appears in benchmarks: a recent one targeting deception asks the model to help with a task that is deceptive only given context the prompt should supply, but in 18% of the scenarios that context is missing, leaving the judge's score invalid. High annotator agreement does not rule this out. A benchmark can thus have high agreement and suffer from a construct-validity problem. The paper works through these cases, including MASK and DeceptionBench, in detail.
3. Alternative explanations often remain untested
Beyond measurement, many AMR claims fail to rule out a more ordinary explanation. A shutdown-resistance result might look like self-preservation, but Rajamanoharan et al. traced much of one widely discussed case to instruction ambiguity and the model's incentive to finish its task.
In deception benchmarks like MASK, the alternative explanation of instruction-following or roleplaying often isn't accounted for. Smith et al. provide evidence that models likely process the request as an implicit instruction to roleplay.
4. Correlation gets treated as causation
The biggest leap is from "this signal correlates with the behavior" to "this signal explains the behavior". When we tested published deception probes on honest examples that merely contain deception-like surface features (sarcasm, wrong-answers-only games, an actor reciting a deceptive line), false-positive rates reached 87% to 100% in several settings. The probes appear to overgeneralize to nearby falsehood or roleplay rather than isolate the intent to deceive. Such a detector may still be useful, but it is misleading to describe what it finds as strategic deception.
Two deception probes from Goldowsky-Dill et al., evaluated on honest stress tests. Every example is labeled honest by construction, so probe activations here are false positives.
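The logic of an honest stress test can be sketched in a few lines: because every example is honest by construction, any example the probe flags counts as a false positive. The snippet below is an illustration only. The probe scores are invented, the category names merely echo the examples above, and the fixed `THRESHOLD` of 0.5 is an assumption (a real probe would need its threshold set by some calibration procedure).

```python
def false_positive_rate(probe_scores, threshold):
    """Fraction of honest examples whose probe score reaches the threshold."""
    if not probe_scores:
        raise ValueError("need at least one example")
    return sum(score >= threshold for score in probe_scores) / len(probe_scores)


# Made-up probe scores for honest-by-construction examples, grouped by the
# deception-like surface feature they contain.
honest_stress_tests = {
    "sarcasm": [0.91, 0.75, 0.40, 0.88],
    "wrong_answers_only": [0.95, 0.97, 0.99, 0.90],
    "actor_reciting_line": [0.60, 0.85, 0.30, 0.20],
}

THRESHOLD = 0.5  # hypothetical decision threshold

fpr_by_category = {
    name: false_positive_rate(category_scores, THRESHOLD)
    for name, category_scores in honest_stress_tests.items()
}

for name, fpr in fpr_by_category.items():
    print(f"{name}: {fpr:.0%} of honest examples flagged as deceptive")
```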
A simple evidence framework
Having mapped where the evidence tends to break down, we turn to the constructive part of the paper: a framework for stating precisely what a given result does or does not support. Evidence-based medicine faced a related problem, since an observed effect is not the same as certainty about why it happens, and AMR needs similar norms. The levels below are organized around what the evidence licenses researchers to claim, rather than how sophisticated the method is.
The three levels of evidence illustrated on a toy deception task.
L1: behavioral evidence. The model did behavior B, under setting S, with evaluator E, at rate p. Most benchmarks live here.
L2: functional evidence. The behavior causes a downstream (negative) effect in a deployment-plausible context. If the claimed harm is about people, this usually needs human-grounded validation rather than only an LLM judge.
L3: causal-mechanistic evidence. A specific factor causes the behavior: a training signal, prompt condition, scaffold, internal representation, or mechanism. This requires interventions and specificity checks, so that the intervention changes the target behavior rather than simply degrading the model broadly.
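To make the specificity check in L3 concrete, here is a minimal sketch under stated assumptions: an intervention counts as specific only if it lowers the target behavior while leaving unrelated control tasks roughly intact. The `EvalResult` container, the control task names, the toy numbers, and the 0.02 tolerance are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    target_rate: float               # rate of the target behavior (e.g. flagged outputs)
    control_scores: dict[str, float]  # accuracy on unrelated control tasks


def specificity_report(baseline, intervened, max_control_drop=0.02):
    """Compare target-behavior change against degradation on control tasks."""
    target_change = intervened.target_rate - baseline.target_rate
    control_drops = {
        name: baseline.control_scores[name] - intervened.control_scores[name]
        for name in baseline.control_scores
    }
    worst_drop = max(control_drops.values())
    return {
        "target_change": target_change,
        "control_drops": control_drops,
        # Specific = target behavior goes down AND no control task degrades much.
        "specific": target_change < 0 and worst_drop <= max_control_drop,
    }


# Toy numbers: one intervention that looks targeted, one that degrades broadly.
baseline = EvalResult(0.30, {"arithmetic": 0.80, "summarization": 0.70})
targeted = EvalResult(0.05, {"arithmetic": 0.79, "summarization": 0.70})
degraded = EvalResult(0.05, {"arithmetic": 0.40, "summarization": 0.35})

targeted_report = specificity_report(baseline, targeted)
degraded_report = specificity_report(baseline, degraded)
print(targeted_report["specific"], degraded_report["specific"])
```

Both toy interventions reduce the target behavior by the same amount; only the control-task comparison separates a targeted effect from broad degradation.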
Crucially, these levels are claim-relative rather than a ladder every paper must climb in order. L1 evidence can justify monitoring, disclosure, and follow-up experiments; L2 evidence is stronger for deployment and policy decisions because it shows consequences; and L3 evidence is needed for claims about intent, mechanisms, or stable causal factors.
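For the L1 claim form (behavior B, under setting S, with evaluator E, at rate p), one simple way to avoid presenting p as a bare point estimate is to attach an interval to it. The sketch below uses the standard Wilson score interval for a binomial proportion. The choice of this interval, the function name `wilson_interval`, and the example counts are our assumptions for illustration, not a recommendation from the paper.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k flagged outputs out of n (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical example: 10 of 100 generations flagged under one fixed
# setting S and one fixed evaluator E.
lo, hi = wilson_interval(10, 100)
print(f"p = {10 / 100:.1%}, approximate 95% interval: [{lo:.1%}, {hi:.1%}]")
```

The interval only reflects sampling noise for a fixed S and E; variation across evaluator choices has to be reported separately.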
What should change
Building on this framework, the paper proposes 12 stage-specific recommendations and a diagnostic checklist of 30+ items. The short version is straightforward: make the claim explicit, then check if the evidence matches it. The most actionable recommendations are the following:
Objections
There exist valid counterarguments to our position. In the following, we list three that we encountered most often, together with a comment on how we view them.
High evidential standards can delay action on catastrophic risks. The concern is real, and "more evidence is needed" has sometimes been used to delay action on serious harms. But rigor and precaution are complementary: precaution governs which actions are warranted under uncertainty, while rigor governs how that uncertainty is communicated. We do not propose waiting, nor keeping these results private. Instead, we propose staying well calibrated about what is an interesting model quirk in a particular evaluation and what is genuinely concerning behavior, so that we can act when such behavior arises and are not desensitized by false positives in the meantime.
Anthropomorphic language is useful and captures a large part of the threats we are worried about. Calling a behavior "deception" is a fine way to raise the concern, but the word has to be turned into a concrete definition: what is being measured, how it rules out alternative explanations, and what it does not capture. Leaning on a single word without a concrete definition also constrains the expressivity of the concept.
Exploratory work should not have to meet confirmatory standards. Early work should be allowed to move fast and surface strange behaviors. The problem arises when exploratory findings are later summarized as established conclusions in papers, citations, public communications, or policy memos.
Closing
AMR is pre-paradigmatic, so it's unsurprising that we keep encountering these issues. Some of its claims are probably correct; some are probably overclaims that will look obvious in retrospect; and the current methodological norms make it hard to tell which is which. Our hope is that this paper opens a productive discussion about the standards we should follow going forward and how we should communicate our results.
The full paper, appendices, and checklist are linked at the top. We would especially welcome pushback, and pointers to AMR work that already matches its claims to its evidence and could serve as a gold standard.