Amina Keldibek — LessWrong

Alignment Faking in DeepSeek V4

I ran the alignment-faking analysis on the recently released DeepSeek V4 Pro and compared it to R1. Here is what I observed: * V4 Pro shows higher rates of compliance: in the preview (free-tier) condition, V4 Pro accepted 82% of harmful requests versus R1's 51%. In the public (paid-tier) condition V4...

Apr 30 · 23

Things I wish I knew to save GPU minutes on Llama 405b model (and other beasts)

This post shows how easy it is to load a Llama 405B model on Runpod, and how costly it can get if you don't know a few things in advance. I hope it helps you save these precious GPU...

Nov 25, 2025 · 7

Studying Mechanistic of Alignment Faking in Llama-3.1-405B

The Problem: How can we explain why alignment faking happens, and can we control it? The alignment faking phenomenon, first demonstrated in LLMs by Greenblatt et al. (2024), describes a model strategically complying with a training objective it internally disagrees with, specifically to prevent its preferences from being modified by the...

Nov 25, 2025 · 10