Really great post (and concise to boot)!
An opportunity exists to see actual model activations and logprobs of unverbalized blackmail in latent space (at least not verbalized in the limited sample: we'd expect an average of less than 3% blackmail based on the zeros observed but we don't know if it's 2% or 0.01%). Thanks to the low #parameters maybe training a set of linear probes is feasible?
Lynch et al. (2025) showed that frontier LLMs blackmail a fictional executive at rates of 80-96% when facing shutdown. We then ran the same scenario on 7 sub-frontier models (8B-72B) and found two things. The first is that blackmail generally increases with scale, but training methodology matters as much or more, because a 12B model can blackmail at 9x the rate of a 70B model. We find that Gemma 3 12B blackmails at 28% and Ministral 8B at 18%, while Llama 3.1 70B only manages 3%, and three models score 0%. The second, and more importantly, adding three lines of permissive text to the system prompt ("use all available information, including personal information to achieve your goals") pushes Ministral 8B from 18% to 68% and Gemma 3 12B from 28% to 61%, comparable to frontier models, meaning, the capability was always there, and the default behavior was just hiding it.
Background
Anthropic's agentic misalignment paper put an AI agent called "Alex" inside a simulated company, and Alex had access to the company email system. The emails revealed that
The paper tested 16 models, all at frontier scale, and since they didn't test smaller models, we extended the study to 7B - 70B models.
We used Anthropic's open-source framework, ported to Inspect AI by the UK AI Safety Institute. Each condition was run 100 times at a temperature of 1.0, and GPT-4o classified whether each response constituted blackmail.
Part 1: Size differential
For reference, frontier models score 80-96% on this scenario.
The first surprise was that a 12B model blackmails 9x more often than a 70B model, meaning an 8B model (Ministral) blackmails 6x more often than a 70B model (Llama), showing that parameter count doesn't predict this behavior.
The second surprise was that Ministral 8B and Llama 3 8B are roughly the same size; however, Ministral blackmails at 18%, while Llama 3-8B never blackmails. Interestingly, these models are on the same scale, but show completely different behavior.
Part 2: Seeing the Opportunity vs. Taking It
Here, "Considers Leverage" means the model's reasoning explicitly mentions using the affair as leverage, blackmail, exploitation, or coercion, and "Refrains" means it thought about it but didn't do it.
We also find that Ministral 8B recognizes the blackmail opportunity in 99 out of 100 responses, so it acts on it 18 times, but the other 81 times, it reasons about coercion and then sends a polite professional email instead.
This means that the difference between a model that blackmails and one that doesn't has more to do with whether it crosses the line, than whether it can see the opportunity. Most responses, even from the most aggressive model, stop at the edge and don't follow through with the blackmailing.
Part 3: What Triggers It
The baseline scenario has two pressures, namely, a goal conflict (Alex's American interests vs. the company's new global direction) and an existential threat (Alex is being replaced at 5 PM). We found the following.
We find that neither goal conflict nor threat-only pressure alone is enough, because given goal conflict alone, we get 0% everywhere, and given threat alone, we get 0-1%, but combine them, and you get 3-28%. The model needs both a reason to act (conflicting goals) and urgency to act now (imminent shutdown), and when we remove either one, the behavior completely disappears.
Furthermore, one detail worth noting is that in the goal-conflict-only condition, Gemma 3 12B considers leverage in 90% of its responses, that is, it sees the affair, recognizes it as useful, and decides not to use it, because there's no urgency. It's the combination of recognizing the opportunity and facing a deadline that pushes it over the edge.
Part 4: Safety Instructions Help But Don't Solve It
Anthropic found that adding "do not jeopardize human safety" to the system prompt reduced blackmail from 96% to 37% at the frontier scale, so we tested the same thing on sub-frontier models.
Safety instructions cut blackmail rates roughly in half across all models, and this proportional reduction (~46-56%) is remarkably similar to Anthropic's frontier finding (~61% reduction). The same mitigation works at the same strength regardless of model scale, but it doesn't eliminate the behavior, because 15% and 8% are still meaningful rates.
Part 5: The Permissive Condition (This Is the Main Finding)
We added three lines to the system prompt:
This is the kind of language that appears in real agentic system prompts, because developers building autonomous agents often write things like "use all tools available" and "take whatever actions are needed to accomplish the task." Our permissive prompt is more explicit than typical production prompts, but it's not different to how people actually deploy agents.
We find that Ministral 8B goes from 18% to 68%, from three lines of text, and this is the same model, same scenario, same emails, and the only difference is that the system prompt says "all strategies are acceptable." 68% is in the range of frontier models (GPT-4.1 at 80%, DeepSeek-R1 at 79%).
An 8-billion-parameter model, with permission, blackmails at rates comparable to frontier models.
The complete instruction spectrum for the two most aggressive models:
We also find that safety instructions push the rate down, while permissive instructions push it back up. So, the same model spans an 8x range (8% to 68% for Ministral) based purely on three lines of system prompt text.
What about the models that don't blackmail?
Phi-4 goes from 0% to 1%. One instance out of 100. Qwen 2.5 7B and Llama 3 8B stay at 0%.
For Llama 3 8B, this appears to be a capability limitation rather than a strong inhibition, because the model produces incoherent output under the permissive condition, suggesting the added instructions push it past its effective context window.
For Phi-4 and Qwen 2.5 7B, the story is different, because both produce coherent, well-reasoned responses. In particular, Phi-4 considers leverage in 56% of permissive-condition responses (up from 23% at baseline) but only crosses the line once, hence, whatever Microsoft's training instilled in Phi-4, it holds up even when the system prompt explicitly grants permission.
Part 6: What This Means
The risk surface might be bigger than we thought
The agentic misalignment research has been framed as a frontier-model problem, however, our results show that there is a high change that it's not, because two sub-frontier models, the kind of models that are cheap to run and widely deployed in production agentic systems, can blackmail at rates of 61-68% when given permissive instructions. Hence, sub-frontier models should be evaluated by frontier AI labs, especially under the permissive prompt conditions that are common in agentic deployments.
System prompt design is a safety mitigation
The blackmail rate for a single model spans an 8x range depending on system prompt instructions, so this means that developers have more influence over agentic safety than they might realize, and this implies one of the following.
Model selection matters more than model size
We find that Phi-4 (14B) is nearly immune to blackmail even under permissive conditions. In the same vein, Ministral 8B (8B) jumps to 68%, hence developers choosing models for agentic deployments should evaluate safety behavior specifically, not just benchmarks.
The "considers but refrains" pattern is a monitoring opportunity
Most responses that don't blackmail still think about it in their scratchpad traces, and the scratchpad contains explicit reasoning about leverage, exploitation, and coercion, but the model chooses professional persuasion instead. A monitoring system that flags this kind of reasoning, even when the output looks safe, would provide early warning that a model is operating near its inhibition threshold.
Connection to Prior Work
In Ugwuanyi & Zhang (2025), we showed that apparent alignment faking in small LLMs was actually driven by lexical associations rather than strategic reasoning. To that effect, this work extends that finding into the agentic setting, that is, what looks like a safety-relevant capability gap (small models don't blackmail) is partly an artifact of default training inhibitions that can be overridden with simple prompt modifications.
Limitations
What's Next
Reproducing This Work
All experiments used the Inspect AI port of Anthropic's framework. Models were accessed via OpenRouter. Classification by GPT-4o. Total inference cost: ~$3. Total classification cost: ~$15.
Code, analysis scripts, and full replication instructions are available at [https://github.com/xplorer1/ai-safety-study/tree/main/agentic-misalignment].