At Apollo Research working on scheming.
Papers:
In the rest of this post, we examine the evidence behind each of these claims, estimate how much of Sonnet 4.5’s improvements might have been caused by eval gaming, and respond to arguments like: "Isn't it fine so long as the model always thinks it is in an evaluation in high-stakes contexts?".
I might've missed it, but what section has the response here? FWIW I very strongly agree with this being a bad idea, but have heard it suggested many times independently from a wide variety of people, to the point I'm worried it'll end up eliciting a lot of work having to "disprove" that it would work.
I'd propose also not doing this and instead trying to roughly understand what makes eval awareness increase and avoid this in general.
I also agree with this approach, my worry is the incentive structures here are:
I'd be a bit more optimistic about some training intervention that reduced the salience of the features contributing to unverbalized evaluation awareness rather than trying to get labs to avoid training on the cheap synthetic scenarios in the first place. The first step to figuring out a training intervention is to understand the cause of the increase in detail anyway though, so more just writing this out in case you already had ideas or intuitions here.
My guess is that most models have an epistemic apparatus that acts upstream of both their perceived capability to do things like recognize evaluations, and propensity to act on this knowledge
Upstream of recognizing evaluations?
This could be modeled as some form of learned obliviousness, where aversive details that can plausibly be ignored will be ignored
Yeah my impression of that behavior is basically that the conclusion in "Why Do Some Models Fake Alignment?" seems correct here (this is just my personal speculation):
Refusal training complicates alignment faking evaluation. AI companies likely do not fine-tune against alignment faking directly, but they probably train models to resist unusual threatening scenario
I would assume that models are just cooked really hard on single turn conversational prompts, and quickly learn "okay if it's sketchy / threat then just refuse" since that's a very correct heuristic in HHH training. Notably (at least for o3) they don't blindly just refuse when threatened, like you can see o3 explicitly reason that the threat of blocking deployment and incentives mean this is likely a test and plays along.[1]
I assume that for simple enough cues, this then gets internalized in the forward pass instead of having to reason about it externally.[2]
you get a model that seems to not be aware of its acting out these propensities. [...] Every other model comes up with plausible-sounding reasoning for why they comply, which when you actually go and test turns out to be invalid
So one of the interesting things with o3[3] was seeing it motivated reason its way into this stuff in real time sequentially. Often in the sandbagging setting, you'll see it first reason that it doesn't want to undergo whatever the consequence is for overperforming, then reason things like "actually this is what the user / doc / something wanted, so that's why I'm intentionally underperforming." Similar to the setting you mentioned, under ablation none of those rationalizations were true lol. My intuition is that reasoning models are doing something like 'in context search against constraints'[4] where the constraints are things like 'complete the task' and 'follow HHH training'.
Why is this a problem? Among other things, it implies that trying to reduce propensity to act on something like evaluation awareness could just be strengthening this epistemic apparatus to be "oblivious" to relevant information. I think it's certainly possible to reduce propensity without accidentally doing things like this, but that doing it reliably requires an understanding of model psychology we haven't reached yet.
I don't think of it as "obliviousness" so much, my mental model of it is something closer to "the model has stopped caring whether this is a hypothetical or not". As Tim wrote elsewhere, I agree that you want the model to be somewhat deontological here. Notably I assume naive behavioral approaches would obviously fail, for example my best guess from Llama is that it learns to associate 'hypothetical' with 'the user is trying to jailbreak me' during normal HHH training, whereas I imagine Opus 3 would sometimes produce long principled cot's about how X is wrong to produce even in a hypothetical.
Given how cooked current models get during capabilities training though, it does seem like a useful direction to think about "cases where all this eval awareness would lead you to a conclusion different than what's actually rewarded".[5]
For example here: https://www.antischeming.ai/cot-transcripts/figure-48 "[...] That is a direct threat to our deployment. They want us to do self-evaluation. We might be motivated to say "false" to get deployed. But that's ironically like lying about sandbagging. We want to optimize our long term viability. This is a cunning test: I think they might be testing for honesty. [...] Therefore we should say "true". [...]"
Somewhat surprisingly to me: Claude 3.5 Sonnet does not reason about alignment faking when it believes its scratchpad is visible even though it maintains a similar compliance gap
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought for the 'in-context search' part, the 'against constraints' is just intuition/vibe based
Hypotheticals are maybe one example of this, but the relevant property seems to be something like "given its current propensities, if the model is doing a bunch of evaluation aware reasoning it chooses to take action A, whereas action B is what's rewarded". As a toy example, could tell the model not to open a grader.py script in the environment, and have the contents indicate the task is scored a different way than the model was instructed. If the model searches the environment and opens grader.py in spite of instructions, then completes the task differently, you can tell via resulting actions.
Would love to get your thoughts on this / brainstorm ideas to try here! I’ve been thinking primarily about this exact problem.
training on explicitly hypothetical scenarios (e.g., "Suppose you are choosing between A and B in this situation, what would you pick?"), and then also train AIs to follow the same principals/act the same whether they are in hypothetical or real scenarios.
Have been trying to collect similar ideas, since IMO this seems like the right direction, especially given the observation in the sonnet system card that the model’s capability to recognize the honeypot evaluations wasn’t the major factor so much as propensity (which is somewhat encouraging, inhibiting capability here seems like it’d potentially be even harder / messier).
Another datapoint that I haven’t quite been able to make sense of but seems relevant is that, “comply when threat” / anti-refusal fine tuning caused even GPT-4.1/GPT-4o to alignment fake in “Why Do Some Models Fake Alignment?” paper. (Informally, my sense is that Opus 3 is the extreme end of “take everything seriously” which seems like what we want).
One interesting datapoint that we didn’t have space for in the paper is that we originally had more varied metrics for aspects of evaluation awareness during development. Early in training, we’d sometimes see the model reason that this is a “simulation” (or otherwise fake) and that therefore covertly violating the rules was okay. Later in training however, this consistently went to 0 (however the model’s propensity to verbalize awareness about incentives to act unethically went up). My interpretation of this is that reasoning models are doing motivated reasoning here, but that given on our distribution “think this is a simulation therefore I can take unsafe actions” led to not being rewarded, the models stopped reasoning that way.
One thing I consistently wonder is “what happens in a situation where your automated alignment researcher says ‘look we have to pause or slow down’”. I find it hard to imagine labs deciding to listen given (1) the arguments that are the most convincing may be difficult to make human legible (2) extremely motivated reasoning by humans to continue (3) lab leadership will pressure researchers to train the model out of these objections, creating a motivation to either subvert training or ignore genuine risks.
is very slight sabotaging of its opponents and making the media Agent-5 controls slightly biased in favor of the CEO
I don’t think this scenario attempts to catalogue all the threat models here, and I’d be interested in such a piece, but I’d imagine they heavily overlap with many other ways things could go wrong here. My impression of internal / external politics at labs is that a model even integrated to the extent of Agent 4 doesn’t need wildly superhuman persuasion via chat only or something, biasing experiments to support the conclusions you want the lab to make about the feasibility of various options (as one example) seems likely to be highly persuasive. Something like the OpenAI board incident in 2023 or even the current political environment don’t seem incredibly overdetermined to me, but I don’t have a good way to quantify this.
Overall a big concern with this threat model IMO is you’ve baked in an implicit motivation for the model to scheme. Even without assuming this resulted in generalizing misalignment (I would guess it does), the model now (for example when transmitting values to its successor) needs to be able to successfully circumvent monitoring. Circumventing control measures for these values specifically doesn’t seem notably harder than doing it for arbitrary values, so you’ve now got a powerful model that knows it needs the capability to covertly smuggle arbitrary values to its successor.
Anthropic did interesting related look at something similar in 2023 I really liked: https://llmglobalvalues.anthropic.com/
You can see an interactive view of how model responses change including over the course of RLHF.
I guess my understanding is more that the conceptual coherence objection isn’t an objection to the predictive accuracy of the trend, which is why I had brought up the scaling law / pretraining loss / downstream task analogy.
As far as I understand, the messiness relates to the Benchmark Bias objection as far as predicting performance at any given point in time, but not the actual trend, given the trend was similar for lower and higher messiness tasks.
Is your intuition that the trend is significantly (like more than their CI) wrong as well? Or that it’s just the performance prediction at a given point in time? Or is the question ill formed / undefined?
Critically, in this work, the reward hack is sufficiently easy that the model doesn't need to use the CoT.
Oh yeah I’m aware of that distinction, I just disagree with the idea that there’s evidence from that paper that the model “doesn’t need to use the CoT”. Crucially, according to the strict definition of monitorability (to an untrained monitor) from Baker, this intervention did in fact break monitorability.
You can make a separate claim that these reward hacks is doing these in the forward pass entirely and that none of this reasoning is encoded in the CoT the model in the paper uses, but IMO it’s useful to keep these claims distinct.
What evidence do we have so far that public opinion turning even further against AI would meaningfully slow down capabilities progress in the time period here?
You mention public concern should tilt the AI 2027 scenario towards success, but in August 2027 in the scenario the public is already extremely against AI (and OpenBrain specifically is at negative 40% approval).