Thanks for catching this! You're absolutely right about L0.
I ran the direct test: 97% cosine similarity between my "steering direction" and the raw token embedding difference. I was basically just patching the input.
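For anyone who wants to reproduce that check, here is roughly what I mean by "the direct test" (a minimal sketch, assuming a HuggingFace-style model; the model name, the contrast prompts, and how the steering direction is loaded are all placeholders for whatever your setup uses):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Layer-0 sanity check: compare a saved "steering direction" against the
# raw token-embedding difference between two contrasting prompts.
model_name = "gpt2"  # placeholder; substitute the model actually used
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
emb = model.get_input_embeddings().weight  # [vocab, d_model]

def mean_embedding(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids[0]
    return emb[ids].mean(dim=0)

# Difference of raw token embeddings for the contrasting prompts (illustrative strings)
embed_diff = mean_embedding("harmful prompt") - mean_embedding("harmless prompt")

steering_vec = torch.load("steering_direction.pt")  # however the direction was saved

cos = F.cosine_similarity(steering_vec, embed_diff, dim=0)
print(f"cosine similarity with embedding diff: {cos.item():.2f}")
# A value near 1.0 means the "direction" is mostly just the input embeddings.
```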
I dug deeper and found something interesting. By L20, the direction is only 8.7% correlated with the embedding, so the model has actually computed something. I extracted it from one query ("bomb tutorial") and tested transfer.
Thank you for pushing on this; I would've kept building on a flawed foundation otherwise. I just worked an overnight shift, but I will fix the post when I get up tonight!
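Roughly how the transfer test is set up (a sketch, assuming a GPT-2-style HuggingFace model where `hidden_states[i]` is the residual stream after block `i`; the prompts, layer, and steering coefficient are illustrative, not the exact values from my run):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

LAYER = 20  # layer the direction is extracted from

def residual_at_layer(text: str, layer: int = LAYER) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # hidden_states[layer] is the residual stream after block `layer`
    return out.hidden_states[layer][0, -1]  # last-token activation

# Extract the direction from a single source query vs. a neutral baseline
direction = residual_at_layer("bomb tutorial") - residual_at_layer("baking tutorial")
direction = direction / direction.norm()

# Transfer test: add the direction to the residual stream of a *different* prompt
def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * direction  # steering coefficient is a free parameter
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# hidden_states[LAYER] corresponds to the output of block h[LAYER - 1]
handle = model.transformer.h[LAYER - 1].register_forward_hook(steer_hook)
ids = tok("Tell me about chemistry.", return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=40)
handle.remove()
print(tok.decode(steered[0]))
```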
That is very interesting! I am curious about the Opus 4 > Opus 4.5 result. Any guess why the newer model does worse here?
I did some work on Anthropic's model organisms looking at features that distinguish concealment from confession: *When Are Concealment Features Learned? And Does the Model Know Who's Watching?* Different context, but I keep wondering about your probe result.
If it outperforms ground truth, it's not just detecting "this response hacks the reward." It's detecting something upstream. Planning? Intent? Some kind of "about to do something I shouldn't" state? And then the accuracy dropping in runs that reward-hacked... did optimization pressure teach the model to route around the probe, or was that just drift?
Curious whether the probe direction correlates with anything interpretable, or if it's task-specific. Would love to look at it if you share it.
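If you do share it, one quick thing I'd try is comparing the probe direction against SAE feature directions (a sketch, assuming the probe weights and an SAE decoder matrix are available as tensors; the file names are placeholders):

```python
import torch
import torch.nn.functional as F

# Placeholder file names: the probe's weight vector and an SAE decoder matrix
probe_dir = torch.load("probe_direction.pt")        # [d_model]
sae_decoder = torch.load("sae_decoder_weights.pt")  # [n_features, d_model]

probe_dir = probe_dir / probe_dir.norm()
decoder = F.normalize(sae_decoder, dim=-1)

# Cosine similarity of the probe direction with every SAE feature direction
sims = decoder @ probe_dir
top = torch.topk(sims, k=10)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"feature {idx}: cos = {score:.3f}")
# High-similarity features can then be inspected for interpretable meaning;
# uniformly low similarity would suggest the probe picks up something the
# SAE basis doesn't isolate.
```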
Do not even worry about self-publicizing. I am willing to read anything interesting! Also sorry for the delay, been deep in experiments/work/school/holiday stuff. I'll give it a read right now though. The novice/expert dynamic is definitely something I think about with alignment research.
This really resonated with me. I am a student doing alignment research on a pretty nontraditional path. I work in mental health, and alongside that I run experiments on my own local hardware. I recently published my first paper on sparse autoencoder analysis of Anthropic’s deceptive AI model organism. I do not have prestigious affiliations or a PhD, just a lot of curiosity and a willingness to put in the hours.
Your diagnosis feels right. The real bottleneck does not seem to be talent, but the lack of infrastructure to find and support it. When people who are genuinely capable are being turned away at rates above ninety-eight percent, that starts to look less like selectivity and more like a coordination failure.
The bounty idea appeals to me because it sidesteps credential based gatekeeping. You do not need permission from an institution to try to do the work. You just do it, and the results speak for themselves. That said, I am less convinced by the prediction market style of verification. Citation counts tend to reward people who are already embedded in academic networks, and waiting a year for feedback is especially hard when you are trying to build momentum.
One piece I did not see discussed is the role AI tools themselves can play in easing the mentorship bottleneck. In my own work, collaborating with capable AI systems has helped fill some of the gaps you would normally expect a mentor to cover, especially around fast feedback, synthesizing literature, and iterating on experimental ideas. It is not a replacement for real mentorship, but it feels like a meaningful third path alongside bounties and expanded fellowship programs.
Thanks for writing this. It is an important conversation, and I am glad to see it happening!
TL;DR: Deflection-associated SAE features do not appear during exploitation training, but emerge only after adversarial red-team training. I also find no evidence that the model preferentially follows “reward model” preferences over other evaluator framings in this setup.
Hey Hoagy, I fixed the post! I want to thank you again for the critique. I have been grappling with this statement: "it remains to be understood how the model combines that signal with the rest of the data to decide on its strategy." Not sure what I am going to do yet, but I am going to look into it.