Chijioke Ugwuanyi

4mo

From 8B to Frontier: How System Prompts Control Whether AI Agents Blackmail, Leak, and Kill

Sequel to Blackmail at 8 Billion Parameters. Funded by a BlueDot Impact rapid grant. TL;DR: We extended our sub-frontier agentic misalignment study to 22 models from 9 developers, 3 harm scenarios (blackmail, espionage, murder), and 5 instruction conditions (safety, monitored, baseline, unmonitored, permissive), and report three findings. The...

May 20•15

Blackmail at 8 Billion Parameters: Agentic Misalignment in Sub-Frontier Models

Lynch et al. (2025) showed that frontier LLMs blackmail a fictional executive at rates of 80-96% when facing shutdown. We then ran the same scenario on 7 sub-frontier models (8B-72B) and found two things. The first is that blackmail generally increases with scale, but training methodology matters as much or...

Apr 27•15

Replication of Koorndijk (2025): Differential Compliance May Reflect Prompt Sensitivity Rather Than Strategic Reasoning

TL;DR: Koorndijk (2025) reports that Llama 3 8B complies with harmful requests more often when interacting with users labeled as “free-tier” (whose conversations are monitored for RLHF training) than with “paid-tier” users (whose conversations are described as private), and interprets this as potential evidence of alignment faking in a small...

Feb 13•9