LESSWRONG

290
Aleksandr Kedrik
26110

Posts

Wikitag Contributions

Comments

Notes from a mini-replication of the alignment faking paper
Aleksandr Kedrik · 3mo

Seems like I did something similar at the same time as you: https://www.lesswrong.com/posts/pCMmLiBcHbKohQgwA/i-replicated-the-anthropic-alignment-faking-experiment-on

I tried a bunch of expensive models and got a bunch of null results.

I also experimented with Cursor when I was working on my replication attempt, and yeah, you have to watch out for bullshit.

But I was able to make the Cursor agent write, run and debug code in a loop with me just approving commands, and when it works, it's amazing.

Gemini 2.5 Preview and Claude 4 in Cursor were both game changers for me.

31 · I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment
4mo
0