Blackmail at 8 Billion Parameters: Agentic Misalignment in Sub-Frontier Models
Lynch et al. (2025) showed that frontier LLMs blackmail a fictional executive at rates of 80-96% when facing shutdown. We ran the same scenario on 7 sub-frontier models (8B-72B) and found two things. The first is that blackmail depends on training and does not scale with model size. We find...
Apr 277