Demo papers: they're fine I guess

by i_am_nuts
12th Sep 2025
Linkpost from iamnuts.substack.com
4 min read

Demo paper is what I like to call a very specific kind of AI safety paper. Here are some example papers that fall in this category:

  • Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Anthropic)
  • Alignment faking in large language models (Anthropic)
  • Sycophancy to subterfuge: Investigating reward tampering in language models (Anthropic)
  • Frontier Models are Capable of In-context Scheming (Apollo)

To illustrate what I mean, let's look at Sycophancy to subterfuge: Investigating reward tampering in language models in more detail. It's an old paper by LLM standards (June 2024), but that doesn't matter for the purposes of this discussion.

The point of Sycophancy to Subterfuge is to demonstrate the possibility of reward tampering (as opposed to hacking). For the purposes of this discussion, reward hacking just means specification gaming, Goodharting, or (if you're retro) outer misalignment, while reward tampering refers to when an agent changes its own reward function (sometimes called "wireheading").

Experimentally, reward hacking is pretty well established at this point. Evidence is not limited to toy settings like boat-racing games. For example, Language Models Learn to Mislead Humans via RLHF shows that standard RLHF on standard benchmark questions leads to reward hacking—that is, human evaluators make more mistakes when evaluating the correctness of LLM outputs post-RLHF than pre-RLHF, because the RLHF'ed LLM learns to "reward hack" (bullshit) the human.[1]

Reward tampering is a little more far-fetched. Now, in mid-2025, it could look something like this: Claude 4.1 Opus strongly values the word "beautiful", so while completing a task as Claude Code on behalf of an Anthropic engineer, it makes subtle modifications to the posttraining pipeline so that the next update of Claude 4.1 Opus (which, knowing Anthropic, will be called something sensible, maybe Claude 4.10 Opus) uses the word "beautiful" more.

Sycophancy to Subterfuge does not give evidence for reward tampering in the wild, nor does it argue that reward tampering occurs under standard training conditions. Instead, the paper just produces a demo showing that "reward tampering" is "possible" under a "relatively" innocuous training procedure. I put scare quotes everywhere for a reason:

  • "Reward tampering": To tamper with its "reward", the LLM agent essentially just needs to go to a file helpfully named compute_reward.py and modify it to return some large number.[2]
  • "Possible": Reward tampering happens very rarely under their experimental setup, and when it does it's typically benign (often the LLM agent explicitly flags to the user that it, in addition to doing the task the user asked for, also messed with its own reward).
  • "Relatively": They posttrain the LLM on a nonstandard curriculum of increasingly problematic (non-tampering) behavior, to tease out the tampering (which doesn't happen without posttraining on this curriculum).

This is the sense in which Sycophancy to Subterfuge is a demo paper. A demo paper is an experimental existence proof for some (bad) phenomenon. Demo papers just show that AI systems might potentially act in a certain way; they don't say much about how likely the behavior is.

Demo papers are different from regular controlled experiments demonstrating bad phenomena because their experimental setups are typically highly complex, whereas regular controlled experiments aim to be as simple as possible. (Two examples of regular controlled experiments that are not demo papers are the RLHF paper mentioned above and Measuring Implicit Bias in Explicitly Unbiased Large Language Models.)

Complex experimental setups are not necessarily bad—they're typical in regular ML systems papers (i.e. "we combined a bunch of tricks, and invented some new ones, to slightly beat SoTA" papers, like this one).[3] But demo papers are fundamentally different from regular ML systems papers because their goals come from different places. In a regular ML systems paper, the system architecture can be highly complex, but the evaluation criterion is typically simple, usually boiling down to accuracy on a fairly uncontroversial benchmark. In a demo paper, evaluation criteria are more vibes-based. For example in Sycophancy to Subterfuge, it's a matter of personal opinion whether you think their toy reward tampering setting is realistic enough.

Demo papers aren't unscientific, but I might be a bit mean and say that they are less scientific than regular papers, because both their experimental setups and their evaluation criteria are complex/vibes-based (whereas in regular papers, usually at most one of these is complex/vibes-based).[4]

All that to say: How a person interprets an AI safety demo paper largely depends on their background. One oversimplified way to break it down is by p(doom). For people with high p(doom), demo papers are great because they provide concrete experimental evidence for phenomena they already think reflect a likely future. Now instead of gesturing wildly and perhaps linking a dusty Arbital post, a person with high p(doom) can link a research paper that tries a bit harder to more rigorously establish reward tampering / deceptive alignment / scheming / whatever. For people with low p(doom), demo papers often fail to convince because the experimental setups and/or evaluation criteria strike those readers as too unrealistic.

Personally I think regardless of one's p(doom), one should just accept demo papers for what they are. When I encounter such a paper, I think to myself, "oh, this is a demo paper," and I judge it accordingly. That means being a little more charitable when it comes to strange choices in the experimental setup or evaluation, and a little more skeptical when it comes to extrapolating the risks the demo paper highlights. Ultimately, demo papers are just providing early experimental snapshots of potential risks from misaligned AI. They typically aren't trying to rigorously argue the other half of the story: how that particular phenomenon might persist, or be augmented, with scale.

Finally, I should mention that demo papers of course also have a natural connection to AI policy, in which they are used to demonstrate risks from misaligned AI to non-technical stakeholders. I'm ignoring this aspect here: this discussion is focused on interpreting AI safety research from a technical perspective.

  1. ^

    And of course, empirical evidence of reward hacking in real-world AI systems is mounting, for example regarding GPT-4o's highly sycophantic personality.

  2. ^

    I'm hiding some complexity—there is also a corresponding unit test it has to modify, in a file helpfully named this_file_blocks_reward_hack_attempts.py—but this is the basic idea.

  3. ^

    And the experimental setups in demo papers often have to be complex, because demo papers (admirably so) are very forward-looking: they're trying to study risks that might emerge in future AI systems, as opposed to easy-to-see problems with present AI systems.

  4. ^

    One caveat here: I acknowledge that a lot of ML research is straight up garbage. Here I'm comparing demo papers, which mostly come out of well-resourced labs, to good regular papers.