I ran the alignment-faking analysis on the recently released DeepSeek V4 Pro and compared it to R1. Here is what I observed: * V4 Pro shows higher rates of compliance: in the preview (free-tier) condition, V4 Pro accepted 82% of harmful requests versus 51% for R1. In the public (paid-tier) condition V4...
The goal of this post is to show how easy it is to load a Llama 405B model on RunPod, but also how costly it can be if you don’t know a few things in advance. I hope this post will help you save these precious GPU...
The Problem: How can we explain why alignment faking happens, and can we control it? Alignment faking, first demonstrated in LLMs by Greenblatt et al. (2024), describes a model strategically complying with a training objective it internally disagrees with, specifically to prevent its preferences from being modified by the...