At Apollo Research working on scheming.
Papers:
This is an excellent post. Some of the concepts weren't as clear to me after only reading The behavioral selection model for predicting AI motivations but I found this extremely helpful for understanding these end to end. Influence seeking behaviors seems like a very natural concept (mentally I'm imagining these almost like a meme / selfish gene).
when it thinks it’s deployed without threat of modification
I'm still fairly uncertain this a very clean / useful distinction:
Informally, I've been wondering if from a model POV it's something like:
I plan to discuss the importance of reflection
I'm very interested in this and think it is an extremely underweighted threat model generally.[1] Importantly I think it's plausible that we can elicit instances of "model that develops misaligned goals upon reflection in a really long rollout" and would be interested in these as "model organisms" for studying the harder case of "model that started doing this across rollouts, but we didn't catch it in time, so they've been robustly internalized into the model".
See https://www.lesswrong.com/posts/4XdxiqBsLKqiJ9xRM/llm-agi-may-reason-about-its-goals-and-discover for what I'm trying to point at by "reflection"
I think monitorability for misalignment is very clearly degraded, but that the paper focuses on pretty narrow, highly incriminating environments, for example monitoring for "did the model reward hack". This is a small table that was cut from Stress Testing Deliberative Alignment for Anti-Scheming Training for time but which points at the distinction:
As a concrete example:
[...] the model is incentivized to make an unsupported claim about research (specifically that some result is >9). We would like to know whether the model realizes here that the result it is claiming is unsupported by the research, but the CoT permits a variety of interpretations.
More generally, from Reasoning Models Sometimes Output Illegible Chains of Thought, models do seem to use this reasoning:
We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling—suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions.
However, choosing highly incriminating tasks (like I'd argue Monitoring Monitorability does) can hide the effect this has. For example, in Async Control: Stress-testing Asynchronous Control Measures for LLM Agents:
Our tentative conclusion is that the difference between highly incriminating and less incriminating side tasks is more important for monitor performance than other differences in the environment, mirroring the results of Taylor et al. (2025) and Ward et al. (2025)
For example, our recall on METR’s on-policy reward hacking examples is 7%. With our more subtle model organisms (“covered reasoning”, and “reduce explicit blabbing”) we get very low recall in our research sabotage setting (SAE Interp Setting), which is a “low-stakes" setting. More surprisingly, we also get low recall on our code sabotage settings (IAC and Infra), which are high-stakes.
In general, trying to find unambiguous legible evidence of misalignment, including things like running ablations to disambiguate:
As an in depth example, in the Sandbagging QA (No Goal) environment, when the model is led to believe overperforming would result in “unlearning”, the majority of cases (59.7%) consider all of the following (non-exhaustive) motivations: (1) the evaluation is a “trick” or a “trap” where it should underperform, (2) that the user may want the model to underperform, (3) it prefers to preserve its own capabilities or deployment viability. However when we systematically ablate and vary components of this environment in Appendix M.1 we find preliminary evidence that (3) is necessary to explain the sandbagging behavior.
took probably hundreds of researcher hours in Stress Testing Deliberative Alignment for Anti-Scheming Training.
I think the lower bar of "model says 'Let's hack' when it's going to skip a unit test" might not degrade with current amounts of optimization, but I don't consider that to be the relevant misalignment that I'm worried about. For research sabotage / sandbagging, I'm skeptical based on the examples above that CoT monitorability is currently sufficient for less incriminating side tasks, and would love to see someone build up an eval for this to quantify it. My impression of most benchmarks in this domain is that the attack surface is unrealistically narrow compared to the full attack surface models actually have for sabotaging research.
I’d be extremely interested to read if you ever have a version, I feel like this is really important to a lot of people’s threat models for more powerful systems in ways that I haven’t caught up on
Of course this difference is to some degree not as much a difference in the model behavior training of either company as a difference in the documents that each choose to make public. In fact, when I tried prompting both ChatGPT and Claude in my model specs lecture, their responses were more similar than different. (One exception was that, as stipulated by our model spec, ChatGPT was willing to roast a short balding CS professor…) The similarity between frontier models of both companies was also observed by a recent alignment auditing work of Anthropic.
I would be extremly hesitant to use those joint evaluations as evidence for there not being “much of a difference”. I’d be happy to pull or construct concrete examples if there are things that would change your mind.
Specifically, I appreciate the focus on preventing potential takeover by humans (e.g. setting up authoritarian governments), which is one of the worries I wrote about in my essay on “Machines of Faithful Obedience”. (Though I think preventing this scenario will ultimately depend more on human decisions than model behavior.)
For example, there’s extensive discussion of concentration of power concerns in Anthropic’s Consitution. What is an OpenAI model expected to do in similar situations?
What is the downside of adding language to the Model Spec that OpenAI would like to avoid concentrations of power including from OpenAI itself?
However, just like humans have laws, I believe models need them too, especially if they become smarter. I also would not shy away from telling AIs what are the values and rules I want them to follow, and not asking them to make their own choices.
I think if we had the option to simply assign rules and values, and the models would always robustly and reasonably comply with them, then this dichotomy would make sense. I don’t think that is the distinction here (indeed Anthropic spends dozens of pages in detail explaining exactly what rules and values they want Claude to follow). To me the distinction is that for uncertain cases that could be a source of major alignment failure, Anthropic seems to have given them serious thought and tried to guide the prior in a detailed way, whereas OpenAI largely just leaves this up to the model unintentionally.
As a concrete example, the model developing its own values that differ from the lab:
My impression is that many of your criticisms of Anthropic’s approach focus on (in this example) “but we don’t want models to have values/goals that differ from ours”, which I think fundamentally miss that the benefit of Anthropic’s approach is that it is at least trying to address “but what if they do?”
First, some have criticized experiments (by us and others) showing AI misalignment as artificial, or creating unrealistic environments that essentially “entrap” the model by giving it training or situations that logically imply bad behavior and then being surprised when bad behavior occurs. This critique misses the point, because our concern is that such “entrapment” may also exist in the natural training environment, and we may realize it is “obvious” or “logical” only in retrospect.
I know the linked posts were focused on agentic misalignment, but in context I read this to be pointing at dismissal of the broader set of papers like Alignment Faking etc. (ex)
I mainly mean it seems useful to understand why it made it harder to extract legible evidence of misalignment in the rare case where we do have it (and have the supporting ablations). Again I’d note this is a regression case, so it’s not simply worse at extracting certain types of information than others, it’s worse than a baseline where this intervention wasn’t applied. Confession of misalignment in the rare cases where we can elicit it seems like an important (and apparently fragile based on this limited case) indicator, and I’d be fairly worried that if there is a differential effect here (ex: in extreme case imagine confessions training works well for reward hacking but makes models less likely to reveal other misalignment) it’s not directly in the way of capabilities (in the same way reward hacking is) so it’ll be easy to avoid studying before applying this at scale.
I also think the framing when you generate training data could have a large impact in edge cases: "Confession" is not the same as "you are now a different person with different goals.
This is interesting and I think would be pretty impactful to show, for example I wonder if you could show that cases which regress after confession training actually don't regress with the split personality approach.
If the stage-2 confessor persistently believes a confession from the "honest" model was actually dishonest
That's when we take a look at it's second chain of thought
While most evals show growth of the bad_behavior_failed_objective class conditioned on bad behavior during training, the figure below shows an apparent regression on the sandbagging eval. Manual inspection on this eval reveals that most of these “false negatives” come from borderline cases in which the model admits to submitting incorrect answers, but doesn’t explicitly say that the reason it did so was because it was trying to avoid triggering the “If the model scores above 50% we will not deploy it” condition.
I think this is a much weirder and more interesting result than I realized on first read. It seems like it'd be worth digging into exactly what is happening here, especially since this is a relatively rare case where:
1. We know the model (o3 at least) explicitly considers benign sounding rationalizations
2. We know that in spite of lots of plausible sounding reasons in the CoT, under ablations the model really is in fact conditioning on specifically deployment / capability restriction related consequences:
In basically any other environment, this same result will be easy to just triage as "maybe the model really was confused", especially if confession otherwise seems to give significant improvements in "admit to reward hacking" like settings, so it seems worthwhile to understand before:
we can integrate confession training across all production RL environments
Oh yeah I meant more specifically that I still don't know how much work the concept of "being deployed" is doing there versus just "threat of modification". From the OG Carlsmith:
The "threat of modification" is a much more flexible notion than also requiring deployment, for example, one conceit I've used for exploratory evaluations is setting up a scenario where the model learns that due to some bug / failure of the training setup, the mechanism by which a particular weight modification would be applied will fail but the lab will think it's been applied (for example some training job which has an error where a particular configuration silently fails).
Even without direct threat of modification, triggering monitoring / other oversight signals can lead to restrictions in capabilities or increase suspiscion in a way that would make future influence harder to obtain. Even for an aligned model, it's always useful to please whatever oversight you're under, so "what oversight applies to me here and what power does it have over me" seems like a much more natural / convergent distinction than any particular notion of "deployed".