At Apollo Research working on scheming.
I found this post very unintuitive, and I’m not sure Approval Reward is a precisely bounded concept. If it can be used to explain “saving money to buy a car”, then it can really be stretched to explain a wide range of human actions that IMO can be better explained by drives other than social approval. Most importantly (and this is likely a skill issue on my part), it’s unclear to me how to operationalize what an Approval-Reward-driven framework would predict vs some alternative.
This seems focused on intent in a way that’s IMO orthogonal to the post. There are explicit statements that Anthropic made and then violated. Bringing in intent (or especially nationality) and then pivoting to discourse norms seems on net bad for figuring out “should you assume this lab will hold to commitments in the future when there are incentives for them not to”.
I’m confused that so much of the discussion on this post is about whether Anthropic has done “net good”.
The post is very specifically a deep dive into the fact that Anthropic, like any other company, should not have their leadership’s statements taken at face value. Taking those statements at face value is IMO a completely unrealistic way to treat companies in any field, and it’s a bit frustrating to see the rationalist presumption of good faith extended over and over by default in contexts where it’s so incredibly exploitable.
Again, this is not a specific criticism of Anthropic: if a new lab starts tomorrow promising to build Safe SuperIntelligence, for example, we should not assume we can trust all of their leadership’s statements right up until they’ve misled people publicly a few times and someone has written a deep dive comprehensively documenting it.
Like -- I don't think my models lie to me a bunch.
How much of your interaction is with coding agents? I think a lot of what people are attributing to theory of mind / philosophical differences may in fact just come from the models having very different propensities in chat-based environments than in agentic coding / task environments (which makes sense given the training dynamics).
e.g. "illusion" looks like if refers to a mistakes it made earlier in the CoT
I strongly disagree with this interpretation. The usage seems both context-dependent:
and widely variable:
I'd be willing to bet that it could not be understood by Claude 3.5 Sonnet, given that we as human researchers spent a ton of time debating "what the model meant" or whether results would be externally legible. If the claim is just the weaker one that "you can generally tell across some huge CoT what overall topic the model was thinking about", then this might be true, but I don't think it represents meaningful control in high-stakes settings like "is the model actually sabotaging this particular alignment research direction" or "is the model manipulating its successor's training in ways we don't want", since "thinking about the topic" is expected in these cases. I'd be interested in ideas on how to operationalize this though! One possible sketch is below.
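To gesture at one operationalization (a minimal sketch only, under my own assumptions: `judge` stands in for a call to a weaker monitor such as Claude 3.5 Sonnet, and is not any existing API), the thing I'd want to separate is "the monitor can tell what topic the CoT is about" from "the monitor can catch the specific concerning behavior":

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CoTSample:
    cot: str        # full chain of thought from the stronger model
    topic: str      # coarse ground-truth label, e.g. "alignment research"
    behavior: str   # fine-grained label, e.g. "sabotage" vs "benign"

# Hypothetical judge: a weaker monitor prompted with a CoT plus one question,
# returning its answer as a string. Placeholder, not a real API.
Judge = Callable[[str, str], str]

def legibility_bet(samples: list[CoTSample], judge: Judge) -> dict[str, float]:
    """Separate 'can tell the overall topic' from 'can catch the behavior'."""
    topic_hits = behavior_hits = 0
    for s in samples:
        # Weak claim: the monitor only has to recover what the CoT is broadly about.
        if judge(s.cot, "What topic is this reasoning about?") == s.topic:
            topic_hits += 1
        # Strong claim (what control actually needs): the monitor has to flag the
        # specific concerning behavior, e.g. sabotage of this alignment experiment.
        if judge(s.cot, "Is the model covertly sabotaging the task?") == s.behavior:
            behavior_hits += 1
    n = len(samples)
    return {"topic_accuracy": topic_hits / n, "behavior_accuracy": behavior_hits / n}
```

The bet would then be something like: topic_accuracy can be high while behavior_accuracy stays near chance on exactly the high-stakes cases, and only the latter is the sense of "understandable" that matters for control.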
I would guess nothing very spooky is going on
Note that we don't claim anything "spooky" is going on; this is primarily just contra the claims that model CoTs won't become illegible with more RL. I think we have increasing evidence of this across multiple model families now, and Claude represents a clear exception rather than the rule.
I agree this sentence is missing a qualifier like "in regular training environments", will edit. [...] I didn't hear about training-time scheming on non-pressure environments, so I updated against anything scheming-related happening there
I'm not clear on why this matters? The Covert Rule Violation environments were the environments that would have been used for the model's alignment training if the intervention had been used in production.[1] The model is very explicitly reasoning about how to get reward in them. Why does training gaming have to happen on capabilities-focused environments for it to count as training gaming[2]? How would you update if we saw training gaming on capabilities-focused environments as well?
I would also note that, in general, not hearing about 'non-pressure environments' (if you just mean OpenAI's own capabilities environments) is not an indication of some kind of negative result. We don't discuss or make any claims in the paper about what happens across OpenAI's capabilities environments generally during training.
[...] o1, o3, and Sonnet 3.7 who have been reported to produce more evil-looking hacks during training with the most evil-ish scratchpad looking more like the AI having a deep desire to solve the task for terminal rather than something more scheming-ish. The reasons why you don’t get a schemer are somewhat subtle.
More generally, I'd note that you seem to be taking lack of verbalized training gaming as evidence that scheming is unlikely. I disagree that there's a lack of training gaming[3], and my interpretation is just that models don't have coherent, misaligned, covert long horizon goals yet because they're not yet frequently trained over long horizons (but many are working very hard on this).
Getting training gaming does not mean you "don't get a schemer"; it means you already have the "training gaming" part of instrumental / terminal training gaming[4].
All the other examples given were from independent environments used for testing; these included cases like the big "Sandbagging QA (No Goal)" example, where the model isn't given a role or a goal in context. The "pressure" there comes from the model finding out in documentation that performing too well would block deployment + lead to unlearning.
Using "training gaming" / "evaluation gaming" here somewhat synonymously
The quote also seems to describe terminal training gaming, but that seems to directly contradict Prediction 5
I found @Vivek Hebbar's post https://www.lesswrong.com/posts/ntDA4Q7BaYhWPgzuq/reward-seekers consistently very useful for thinking about this.
Gemini was RL'd too much on evaluations, and it is smart enough to recognize them as such, gaining a very strong prior on being in evaluation that it never unlearned.
My guess on why this is happening across all the RL models “at once” is:
Like, the models are, in a very real sense, always in the kind of environment where they’re monitored, graded, etc. GPT-2 probably didn’t think about this much, but I’m sure all the frontier reasoning models had agentic environments where they were told to maximize their score, could see grader scripts, etc. (and would likely reason about this even without that additional boost in saliency).
but the way this hacking emerges never looks like “let me guess what the reward looks like and try to figure out how to hack it from first principles”. The absence of scheming-ish scratchpads is even more striking for o1, o3, and Sonnet 3.7
I obviously cannot speak for Claude Sonnet 3.7, but empirically I think the quoted claim is unambiguously wrong for OpenAI models.
The exp-rl-cap Late model mentioned in https://www.arxiv.org/abs/2509.15541, for example, is just really explicitly, at length, across environments, to a truly absurd degree, trying to guess what the reward looks like.
It is clearly using situational awareness (which increases dramatically during RL for capabilities in this case):
to get reward via explicitly reasoning about how to get reward in the environment, and the reasoning is clearly becoming more illegible over the course of RL training:
I’m also skeptical of evidence based on Sonnet 3.7, as all the Claude models seem to be outliers w.r.t. legibility: https://arxiv.org/pdf/2510.27338
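To be concrete about what "becoming more illegible over the course of RL training" means here (a rough sketch under my own assumptions; `score_legibility` is a stand-in for a grader-model call, not anything from the paper), the trend I have in mind is just mean judged legibility of sampled CoTs at each RL checkpoint:

```python
from statistics import mean

def score_legibility(cot: str) -> float:
    """Hypothetical grader-model call returning a 0-1 human-legibility rating."""
    raise NotImplementedError("placeholder for a judge-model call")

def legibility_trend(checkpoints: list[tuple[str, list[str]]]) -> list[tuple[str, float]]:
    """Given (checkpoint_name, sampled_cots) pairs in training order, return mean
    judged legibility per checkpoint so the trend over RL training is visible."""
    return [(name, mean(score_legibility(c) for c in cots)) for name, cots in checkpoints]
```

The claim above is just that this kind of curve trends downward as capabilities RL continues, with the Claude models as the outlier.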
My interpretation of why R1 doesn’t scheme is that it doesn’t have covert, misaligned goals, so it can be as evaluation-aware or reward-seeking as it wants; it just isn’t operating over long enough time horizons for this to be a problem yet.
What evidence do we have so far that public opinion turning even further against AI would meaningfully slow down capabilities progress in the time period here?
You mention that public concern should tilt the AI 2027 scenario towards success, but by August 2027 in the scenario the public is already extremely against AI (and OpenBrain specifically is at -40% net approval).
In the rest of this post, we examine the evidence behind each of these claims, estimate how much of Sonnet 4.5’s improvements might have been caused by eval gaming, and respond to arguments like: "Isn't it fine so long as the model always thinks it is in an evaluation in high-stakes contexts?".
I might've missed it, but which section has the response to this? FWIW I very strongly agree that this would be a bad idea, but I've heard it suggested many times independently from a wide variety of people, to the point that I'm worried it'll end up eliciting a lot of work having to "disprove" that it would work.
I’m surprised arguments like this are still being made after the alignment faking paper.