I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment
In December 2024, Anthropic and Redwood Research published the paper "Alignment faking in large language models" which demonstrated that Claude sometimes "strategically pretends to comply with the training objective to prevent the training process from modifying its preferences" (paper on ArXiv, LessWrong post, Anthropic blog post). Experimental setup The paper...
Seems like I did something similar at the same time as you: https://www.lesswrong.com/posts/pCMmLiBcHbKohQgwA/i-replicated-the-anthropic-alignment-faking-experiment-on
I tried a bunch of expensive models and got a bunch of null results.
I also experimented with Cursor when I was working on my replication attempt, and yeah, you have to watch out for bullshit.
But I was able to make the Cursor agent write, run and debug code in a loop with me just approving commands, and when it works, it's amazing.
Gemini 2.5 Preview and Claude 4 in Cursor were both game changers for me.