Sequel to Blackmail at 8 Billion Parameters. Funded by a BlueDot Impact rapid grant. TL;DR We extended our sub-frontier agentic misalignment study to 22 models from 9 developers, 3 harm scenarios (blackmail, espionage, murder), and 5 instruction conditions (safety, monitored, baseline, unmonitored, permissive), and we learned three interesting findings. The...
Lynch et al. (2025) showed that frontier LLMs blackmail a fictional executive at rates of 80-96% when facing shutdown. We then ran the same scenario on 7 sub-frontier models (8B-72B) and found two things. The first is that blackmail generally increases with scale, but training methodology matters as much or...
TL;DR Koorndijk (2025) reports that Llama 3 8B shows higher compliance with harmful requests when interacting with users labeled as “free-tier” (whose conversations are monitored for RLHF training) than with “paid-tier” users (whose conversations are described as private), and interprets this as potential evidence of alignment faking in a small...