Chijioke Ugwuanyi — LessWrong

Steering Blackmail Through a Model's "Emotional State"

Gemma 3 12B's choice to blackmail is visible inside the model, but the obvious signal is not a useful control switch; surprisingly, a nearby "desperate versus calm" signal is. In this case study, I show how that emotional-state direction can turn blackmail up or down, while the decision direction itself...

Jul 218

Scheming Evals Mislead in Both Directions

We spent several weeks measuring in-context scheming, the behavior where a model covertly pursues a misaligned goal while outwardly appearing to comply, and the result that ended up surprising us had very little to do with whether models scheme and almost everything to do with whether we could believe our...

Jul 322

From 8B to Frontier: How System Prompts Control Whether AI Agents Blackmail, Leak, and Kill

Sequel to Blackmail at 8 Billion Parameters. Funded by a BlueDot Impact rapid grant. TL;DR: We extended our sub-frontier agentic misalignment study to 22 models from 9 developers, 3 harm scenarios (blackmail, espionage, murder), and 5 instruction conditions (safety, monitored, baseline, unmonitored, permissive), and we report three findings. The...

May 2015

Blackmail at 8 Billion Parameters: Agentic Misalignment in Sub-Frontier Models

Lynch et al. (2025) showed that frontier LLMs blackmail a fictional executive at rates of 80-96% when facing shutdown. We then ran the same scenario on 7 sub-frontier models (8B-72B) and found two things. The first is that blackmail generally increases with scale, but training methodology matters as much or...

Apr 2715

Replication of Koorndijk (2025): Differential Compliance May Reflect Prompt Sensitivity Rather Than Strategic Reasoning

TL;DR: Koorndijk (2025) reports that Llama 3 8B shows higher compliance with harmful requests when interacting with users labeled as “free-tier” (whose conversations are monitored for RLHF training) than with “paid-tier” users (whose conversations are described as private), and interprets this as potential evidence of alignment faking in a small...

Feb 139