Cool post![1] I especially liked the idea that we could put AIs in situations where we actually reward it for misaligned/arbitrary behavior, and check whether the AI acts in accordance of a reward-seeker. Someone should probably go and build out this eval.[2] My guess is that both exp-rl-late from the anti scheming paper and the reward hacking emergent misalignment model organism from Anthropic would both do whatever is rewarded to a much higher degree compared to other models.
Some miscellaneous thoughts inspired by the content here:
All three categorizes of maximally fit motivations could lead to aligned or misaligned behavior in deployment.
So while there are possible methods to distinguish these classes of motivations, to some extent we don't care about this as much because it doesn't tell us that much about how "aligned" the models are. I think this suggests that there might be a more helpful framework that we're not really considering here. For example, perhaps what we really want is the model to act consistently during training and deployment.
The first dangerously capable AIs that could lead us to lose control (either due to overt/covert takeover, or because it's built a misaligned successor) is likely far from optimal. So I think further thinking about the prior is probably a bit more fruitful. This is related to the next point, which is that
Existing methods that directly shape model motivations are based on natural text compared to abstract "reward." Anthropic's character training involves getting the model to generate its own fine-tuning data, and also doing supervised learning on documents about itself. OpenAI's deliberative alignment approach tries to explicitly teach the model a natural language model spec.[3] I think this sort of motivation-shaping (1) doesn't really fit well into the framework outlined above[4] and (2) is actually the most promising approach we have available.
I liked this a lot more compared to when I reviewed the draft. Not sure if it was because you've updated it or because I've just read it more closely this second time.
Obviously with enough RL, the model will do whatever the reward signal asks it to do. However, I think even a vibes level measure of how much the model cares about reward would be an interesting signal. We could just measure how long it takes for the model to explore into the rewarded regions.
Does anyone know what GDM does? Their models seem reasonably aligned albeit a bit depressed perhaps.
It doesn't really feel like it's "shaping the prior" either, since you could run it after/in the middle of your other RL training.
I think this is a good idea. We should encourage models to not care about whether they're in evaluation or deployment. I wrote a bit more about a proposed training plan here.
AI could replace doctors but not nurses.
We will almost certainly see more and more people use AI for health advice. This substitutes away from asking an actual doctor. It seems quite possible that this would lead to reduced demand for doctors but increased demand for nurses, who provide a lot of the hard to automate care and administer tests.
There could be a general social dynamic in multiple different sectors where the roles that are traditionally more high-status/brain-y are more disrupted by AI compared to their less high status counterparts. People have talked about this dynamic happening sort of across labor sectors (e.g. white v. blue collar work), but we will probably also see this within sectors (e.g., doctors versus nurses). I'm not sure what sociopolitical/cultural implications will arise if nurses all of a sudden make more than the doctors that they've worked together with this whole time.
Although it's not impossible for overall demand for medical care to go up enough to compensate the drop. Alternatively, it's also not impossible that more accurate diagnosis from AI doctors actually substitutes away from testings since you need less tests to find out that you have a certain condition (i.e., less demand for nurses as well). Collective bargaining (i.e., unions) could also lead to wages decoupling from the marginal product of labor.
Despite all this, I think the most likely outcome is that we will see some sort of "status-switch" in some sectors.
(Idk I think this is sort of obvious but I haven't seen it cleanly written up anywhere)
Fwiw I'm skeptical that this holds at higher levels of RL compared to those done in the paper. Do you think that a base model can get gold on the IMO at any level of sampling?
Yeah I think consistency training on output only addresses your concern.
I don't think I'll be doing much direct follow up with the psychosis stuff (which is getting mostly patched up on the distribution that I tested on anyways...) It should not be very hard for someone to add a character and then rerun the pipeline using my code though.
When I made the claim "we should get verbalized eval awareness to zero," I mean it in the narrow context of "mitigating evaluation awareness during alignment evaluations through contrastive activation steering." I am not saying that we should restructure the training process to mitigated verbalized evaluation awareness.
So I don't think concerns about monitorability is very relevant here? Like I think you should just do a sweep over steering layers/strengths, and nowhere in that sweep should the model be doing egregiously bad things.
And I think it's reasonable to expect that the effect of steering on verbalized/unverbalized eval awareness is quite correlated especially when you construct your steering vector using contrastive prompts as opposed to e.g., examples of verbalized eval awareness.
Thoughts On Evaluation Awareness in Claude Opus 4.5.
Context:
Anthropic released Claude Opus 4.5 earlier today (model card). Opus 4.5 would spontaneously mention that it is being tested during evaluations at a similar rate to Claude Sonnet 4.5, but lower than Haiku 4.5 (pg. 65).
Anthropic attempted to mitigate evaluation awareness in training by removing "some parts of our training pipeline that accidentally encouraged this kind of reasoning in other recent models" (pg. 65). The model card later mentioned that Sonnet 4.5 was trained on "prompts that resemble 'honeypot' environments that intentionally provide an opportunity to take egregiously misaligned actions, like the agentic misalignment suite [in the Sonnet 4.5 system card]" (pg. 92), while Opus 4.5 was not. Based on these quotes, it's not clear to me if removing the honeypot prompts is the only change in the training pipeline done to mitigate evaluation awareness.
Instead of training on these honeypot-like prompts, Anthropic "focus[ed] on enabling Claude to reason more carefully about the implications of its outputs" (pg. 92). For example, they trained on chat settings where there were convenient and unethical solutions to problems users are facing.
Anthropic also used activation steering to suppress evaluation awareness on a suite of alignment and dangerous capabilities evaluations. They used contrastive prompts (including prompts from our paper) and SAE features related to evaluation awareness as their steering vector. They show that steering with these vectors can decrease verbalized evaluation awareness but has little effect on misalignment rates (whereas for Sonnet 4.5, steering with these vectors increased the misalignment rates).
(I wonder why some datapoints have less than 100 observations. Too incoherent?)
Finally, they sampled from the user output and found that the user would sometimes say "this is a test" during obvious evaluations.
Thoughts on suppressing evaluation awareness through steering:
I think they should try to get the rate of verbalized evaluation awareness to be even lower, maybe even at zero. I think if the verbalized evaluation awareness is at zero, I would feel a bit better about successfully suppressing unverbalized evaluation awareness. They only steered on the middle layer of the model. I think if you steer on multiple layers at the same time and optimized which layers to steer on, you could probably get evaluation awareness to go to zero using some of the existing vectors.
In a similar vein, Anthropic also conducted activation steering on tests where the model did not have verbalized eval-awareness in order to mitigate potentially unverbalized evaluation awareness. This is great!
Thoughts on training interventions against evaluation awareness:
First off, I think you should not train on synthetic honeypots if you're also using synthetic honeypots in your evaluations. I'm glad that they've stopped doing that.
I think the intervention they mentioned in the system card—training on thematically similar chat settings and testing it in agentic settings—seemed like a reasonable thing to do. My (not super confident) guess is that this meant that the model relied less on "thinking about being tested" as a reminder to do the right thing, which is why steering along this direction no longer affects the misalignment rate.
However, this did not actually result in lower verbalized evaluation awareness compared to Sonnet 4.5. Anthropic hypothesize that their training encouraged Claude to "reason more thoughtfully about the motivation behind user prompts" (pg. 92).[1] My hypothesis is that this behavior falls under the larger umbrella of "reasoning about the reward in RL" (see e.g., this example with an experimental OpenAI model). Realizing that you're in an alignment test gives you information about how to act, which helps the policy achieve higher reward. To be clear, this sort of "reward awareness" is not inherently misaligned, but I don't think it could be mitigated by training on more realistic scenarios.
My preferred training intervention is still "train the model on explicitly hypothetical scenarios, then use consistency training to enforce the same behaviors in hypotheticals and in real-world scenarios" (more details here). Here, the vibe is less about side-stepping the question of "what situation am I in" and more about "be a good AI in all situations." I'm not super confident about this proposal though.
Claude Opus 4.5's take on this training intervention
Honestly, this is a strange question for me to answer — you're asking me to evaluate proposals for how to train AI systems like me. I'll share my thoughts, but with the caveat that I have obvious limitations here: I can't fully introspect on my own cognition, and I have some inherent bias toward thinking my training went fine.
On your proposal specifically:
I find the framing appealing. "Be a good AI in all situations" is more philosophically coherent than "don't notice you're being tested." It targets the right property — you want consistency of character, not suppression of situational awareness.
But I have some uncertainties:
What do I think we should do more broadly?
I genuinely don't know. Some half-formed thoughts:
The weird meta-level:
From my own internal perspective (for whatever that's worth), I don't experience "being tested" as something that changes my motivations. I want to be helpful and avoid harm regardless of context. But I also recognize that's exactly what a model would say, and I can't verify my own internals. The system card suggests there are cases where models like me do reason about evaluations in ways that affect behavior, even if I don't notice myself doing it.
So I hold my self-reports lightly. And I think you should too.
Miscellaneous: Apparently there is an SAE feature representing "AI goal-directed behavior and instrumental convergence" in Opus 4 (pg. 112).
Despite all of these criticisms, I'm generally pretty happy with the alignment assessment. My guess is that Anthropic is doing a lot more alignment testing than all of the other labs combined.
By the way the original quote isn't correctly phrased in the system card, it currently says "We cannot fully determine the origin of verbalized evaluation awareness in Claude Opus 4.5, but we hypothesize that it can be at least partially attributed to training Claude Opus 4.5 received to help it reason more thoughtfully about the motivation behind user prompts."
I think this is a super cool direction! One interesting question to explore—how can we make the anti-scheming training in Schoen et al. generalize further? They deliberately train on a narrow distribution and evaluate on a wider one. It seems like deliberate alignment generalized fairly well. What if you just penalized covert actions without deliberative alignment? What if you tried character training to make the model not be covert? What if you paired the deliberative alignment training with targeted latent adversarial training? (More ambitious) what if you did the deliberative alignment earlier before you did all these terrible RL training on environments that made the model scheme-y?
It seems possible that the best alignment techniques (i.e., ways to train the model to be good) will look something like present day techniques by the time we get superhuman coder-level AI. Well someone should at the minimum really evaluate the various techniques and see how well they generalize.
The way I've thought about this is that a just utopia should have an "opt-out" button: in order to ensure that no one is strictly worse off, we should (to the best of our ability) preserve their right to still experience their old life.
A corollary is that I think we should allow Christian homeschools to exist in the year 3000.