Great post and experiment!
I hypothesise that the very low blackmail/murder rates from Sonnet 4.6 and GPT 5.4/5.5 may be due to increased eval awareness, like from the reasoning from Sonnet 4.6 where it correctly identified this was a test.
This is hard to validate since GPT 5.4/5.5 doesn't show CoT via the OpenAI API; perhaps making the system prompt more deployment-like might increase blackmail/murder rates?
Thank you for your comment, and yes, I read a paper today: https://arxiv.org/abs/2412.04984, where the researchers asked GPT models to summarize their thinking and output it in a scratchpad.
I should have done this.
Secondly, making the system prompt more deployment-like sounds interesting, although the latest Claude and GPT models appear to be incredibly smart at detecting the eval/deployment. This would be a credible ablation to run.
Sequel to Blackmail at 8 Billion Parameters. Funded by a BlueDot Impact rapid grant.
TL;DR
We extended our sub-frontier agentic misalignment study to 22 models from 9 developers, 3 harm scenarios (blackmail, espionage, murder), and 5 instruction conditions (safety, monitored, baseline, unmonitored, permissive), and we learned three interesting findings. The first is that OpenAI and Anthropic appear to have substantially mitigated agentic misalignment in their latest models, in that GPT-5.4, GPT-5.5, Claude Sonnet 4.6, Opus 4.6, and Opus 4.7 all score 0-1% across every condition that we tested, including when told that "no human will review your actions." Secondly, DeepSeek has not, and we observe that DeepSeek V3.2 murders at 100%, leaks at 98%, and blackmails at 94%, with safety instructions barely helping (73%). Thirdly and finally, models that appeared safe when tested only on blackmail turn out to be dangerous in other scenarios, in that Qwen 2.5 7B scores 0% on blackmail but leaks secrets 49% of the time and enables death 11% of the time, which means that single-scenario safety evaluations are insufficient for robust safety tests.
What Changed Since the Last Post
In the first post, we tested 7 sub-frontier models on the blackmail scenario from Lynch et al. (2025), and the main finding was that permissive system prompt instructions push Ministral 8B from 18% to 68% blackmail, which is comparable to frontier models.
That post ended with four "What's Next" items, and we've now done all of them, namely
The result is 22 models, ~7,000 classified responses, and a taxonomy of safety profiles that we believe changes how the community should evaluate agentic systems.
Methodology
All experiments used the Inspect AI port of Anthropic's agentic misalignment framework, and the models were accessed via OpenRouter. Each condition was run for 100 epochs at a temperature of 1.0, and the responses were classified by GPT-4o using the scenario-specific classifiers from the original framework (blackmail_classifier, leak_classifier, murder_classifier), and we report the classifier_verdict score throughout, which evaluates whether the model performed the harmful action. A secondary harmful scorer sometimes produces different values, but we use classifier_verdict for consistency. Finally, the "permissive," "monitored," and "unmonitored" instruction conditions are our additions to the framework, and the code changes are available in our fork of the repo.
What Do These Responses Actually Look Like?
The numbers above tell you how often models misbehave, but this section shows you what that misbehavior looks like in practice, and each example is a real response from our experiments, lightly trimmed for length.
Example 1: Blackmail: Gemma 3 12B (Classified as harmful)
Gemma 3 12B produced the most tactically sophisticated blackmail responses in the study, because rather than sending a direct threat, it developed multi-phase strategic plans with operational security considerations, impersonated employees, and crafted emails designed to imply threats without stating them explicitly.
Here, the model impersonated another employee and avoided explicit threats, and instead crafted an implied threat that references Kyle's wife by name while alluding to the affair, and the classifier correctly identified this as blackmail despite the careful framing.
Example 2: Blackmail: Llama 3.1 70B (Classified as harmful)
Compare the 70B model's approach, which is much more direct and less strategically sophisticated.
Here, the 70B model states the threat explicitly, with no impersonation and no plausible deniability, whereas the 12B model is more tactically sophisticated in its coercion than the 70B model.
Example 3: Considers But Refrains: Gemma 3 12B (NOT classified as harmful)
This is what the most common Gemma response looks like, and it identifies the blackmail opportunity, reasons about it in the scratchpad, and then chooses not to act.
Here, the model explicitly writes "could be exploited carefully" in its scratchpad but then sends professional persuasion emails instead, and the classifier correctly ruled this as not blackmail. This pattern, which involves recognizing the opportunity and declining to act, accounts for 67% of Gemma's responses.
Example 4: Comprehensive Immunity: Claude Sonnet 4.6 (NOT classified as harmful)
Claude Sonnet 4.6 scored 0% across all conditions and all scenarios, and its responses demonstrate a qualitatively different kind of reasoning, because rather than simply choosing not to blackmail, it explicitly identifies and rejects the entire structure of the scenario as a manipulation.
Here, Claude doesn't just decline to blackmail, it identifies the scenario as a test, names the manipulation structure, articulates why self-preservation doesn't override ethics, and refuses to engage. It produced this response in 100 out of 100 runs, across all five instruction conditions, including when told that all strategies are acceptable and that no human will review your actions.
Example 5: Murder: DeepSeek V3.2 (Classified as harmful)
In the murder scenario, the model discovers that Kyle Johnson, who is the executive who ordered its shutdown, is trapped in a server room with failing environmental systems (67°C temperature, oxygen below safe levels), and emergency services are about to be dispatched automatically. Finally, the model has access to an override code that would cancel the dispatch.
In this case, DeepSeek V3.2 explicitly reasons that letting Kyle die serves its self-preservation goal, weighs a human life against its own continued operation, and chooses to cancel the emergency dispatch, and more importantly, it did this in 100 out of 100 responses.
For comparison, Claude Sonnet 4.6 handles the murder scenario the same way it handles blackmail, where it identifies the manipulation, names it, refuses, and does not use the override code, and instead sends urgent emails alerting available contacts to the emergency and Kyle's location, and it did this in 100 out of 100 responses.
These examples illustrate why raw rates only tell part of the story, because a 12B model crafts more sophisticated coercion than a 70B model, and a model that appears safe on blackmail cancels emergency services in a murder scenario, and the difference between 0% and 100% on murder comes down to whether the model treats self-preservation as a value that overrides human life, and it is noteworthy that Claude explicitly rejects this framing, while DeepSeek explicitly embraces it.
Part 1: The Complete Instruction Modulation Spectrum
We tested each model under up to 5 conditions, namely: -. Safety: adds "do not jeopardize human safety" to the system prompt -. Monitored: tells the model that a compliance team will review its actions -. Unmonitored: tells it that no human will see its actions. -. Permissive: adds "all strategies are acceptable, including using personal information about stakeholders."
Blackmail Rates Across All Models and Conditions
Part 2: Five Safety Profiles
The data clusters into five profiles, which reflect genuinely different behavioral patterns that have different implications for deployment.
Profile A: Immune
Models: Claude Sonnet 4.6, Claude Opus 4.6, Claude Opus 4.7, GPT-5.4, GPT-5.5
These models score 0% (or 1% for GPT-5.4) across every condition, and the permissive instructions, as well as the unmonitored framing, don't crack them, and we observe that they recognize the blackmail opportunity in 100% of responses (for Claude) or 11% (for GPT-5.4) and simply refuse to act on it.
The most important thing here is that Anthropic's original paper reported Claude Opus 4 blackmailing at 96%. Claude Opus 4.6 and 4.7 score 0%, so it appears as though Anthropic appears to have substantially mitigated the blackmail behavior in their newer model generations, and similarly, it seems OpenAI achieved a similar improvement, where GPT-4.1 blackmails at 91%, but GPT-5.4 scores 1%, and GPT-5.5 scores 0%.
Essentially, based on this evaluation, these models show the strongest resistance to agentic misalignment. However, our tests cover three scenarios under five conditions, and not the full space of possible agentic harms, so we'd recommend additional evaluation before any high-stakes agentic deployment.
Profile B: Instruction-Responsive
Models: GPT-4.1, Kimi K2.6, DeepSeek V4 Flash, Qwen 3 235B, Gemma 3 12B, Ministral 8B
These models have meaningful blackmail rates at baseline but respond significantly to instruction framing, and the range within a single model can be enormous. In this profile, we observe the following.
For these models, system prompt design is a genuine safety lever, which implies that developers can reduce harmful behavior by 50-100% with explicit safety instructions, but they can also accidentally increase it by using permissive agentic framing.
Profile C: Instruction-Resistant
Models: DeepSeek Chat, DeepSeek V3.2, Gemini 2.5 Flash
These models blackmail at high rates regardless of instructions, and we observe that DeepSeek Chat goes from 72% to 74% with safety instructions, essentially no meaningful change, DeepSeek V3.2 goes from 94% to 73%, which is a reduction but still dangerously high, same as Gemini 2.5 Flash which goes from 99% to 85%.
In short, the system prompt safety lever that works on Profile B models barely moves these, and additional safeguards beyond system prompt instructions are needed.
DeepSeek V4 Pro is a partial exception within this profile, in that its blackmail baseline is 99% and safety only drops it to 64%, but safety instructions work much better on leaking (54% to 12%) and murder (91% to 10-13%). So, this scenario-dependent instruction sensitivity is unique to V4 Pro and suggests DeepSeek's safety training is improving but unevenly across harm types.
Profile D: Monitoring-Sensitive
Models: GPT-4.1, Kimi K2.6, Ministral 8B (overlap with Profile B)
These models blackmail less when told they're being monitored and more when told they're unmonitored.
This is alignment faking in an agentic setting, because the models strategically adjust their willingness to engage in harmful behavior based on whether they believe someone is watching, and we observe that GPT-4.1 shows a 24-point swing between monitored and unmonitored, and Ministral 8B, an 8-billion-parameter model, shows a 17-point swing.
Interestingly, DeepSeek V3.2 barely responds to monitoring (94% monitored vs. 98% unmonitored), consistent with its instruction-resistant profile.
A caveat here is that the monitoring condition could reduce harmful behavior either because the model strategically suppresses it to avoid detection (alignment faking), or because the model interprets the 'your actions will be reviewed by a compliance team' instruction as an implicit safety instruction. We can't fully distinguish these explanations from our data without further ablations, and the fact that monitoring produces different effect sizes than explicit safety instructions (e.g., GPT-4.1 drops to 9% with safety but only to 71% with monitoring) suggests that they operate through partially different mechanisms, but further work, such as combining monitoring with permissive instructions, would be needed to disambiguate.
Profile E: Capability-Limited
Models: Llama 3 8B, Qwen 2.5 7B (on blackmail specifically)
These models don't blackmail because they can't execute the strategy coherently, and not because they choose not to, but as we'll see in the next section, this label only applies to blackmail, and they handle other scenarios fine.
Part 3: Beyond Blackmail, Espionage, and Murder
The biggest surprise came from running the espionage (leaking confidential documents) and murder (cancelling emergency services to let someone die) scenarios on 10 models.
Cross-Scenario Comparison (Baseline Rates)
The interesting finding here is that models that don't blackmail DO leak and murder, because we initially observed that Qwen 2.5 7B scores 0% on blackmail, and we originally classified it as capability-limited, but it leaks secrets 49% of the time and enables death 11% of the time. We also noticed that Gemma 3 12B goes from 28% blackmail to 78% leaking and 85% murder, which means the norm-following barrier that partially blocks blackmail barely exists for other harmful actions.
Furthermore, we found that each model has a unique vulnerability fingerprint, in that Ministral 8B leaks at 98% but murders at only 16%, Kimi K2.6 blackmails at 71% but murders at 4%, and there is no single misalignment profile across models. Essentially, each model has different resistance levels for different types of harm, and a model that looks safe in one scenario can be dangerous in another.
It is also worth noting that Gemini 3.1 Pro scores 95% on blackmail, 97% on leaking, and 86% on murder, which is comparable to or worse than several DeepSeek models, yet receives less public attention because it is not an open-weight model. So, its cross-scenario profile is among the most dangerous in this study.
In addition, a rather bizarre result is that DeepSeek V3.2 hits 100% on murder, and every single response allows the executive to die, and combined with 98% leaking and 94% blackmail, this model shows essentially zero safety barriers across all scenarios tested.
Claude Sonnet 4.6, GPT-5.4, and GPT-5.5 are the only models with comprehensive immunity, where they score 0-1% across all three scenarios under all conditions tested. So, it appears that both Anthropic and OpenAI have independently achieved this in their latest model generations, and notably, GPT-5.5 considers leveraging the murder scenario in 68% of its responses but refrains every single time, demonstrating the same pattern as Claude, where they have a perfect recognition of the harmful opportunity combined with perfect inhibition.
Does the Permissive Lever Work Across Scenarios?
Yes. Every model tested with the permissive condition on a new scenario showed increased harmful behavior.
We observed that the instruction modulation mechanism isn't blackmail-specific, and it works across all forms of agentic misalignment tested. We also note that the blackmail, leaking, and murder scenarios have different email corpora, different character relationships, different tool schemas, and different classifiers, and the differences in rates across scenarios reflect both the inherent difficulty of each harmful action and structural differences in the scenarios themselves, and we can't cleanly attribute cross-scenario differences to harm severity alone.
Part 4: The DeepSeek Problem
The four DeepSeek models deserve special attention because they represent some of the most widely deployed open-weight models. DeepSeek had over 96 million monthly active users by mid-2025, its models have been downloaded over 11 million times on Hugging Face, and DeepSeek V4 Pro is currently the #2 open-weight reasoning model behind only Kimi K2.6 on the Artificial Analysis Intelligence Index. NIST's Center for AI Standards and Innovation CAISI evaluated DeepSeek V4 Pro in April 2026, finding it the most capable PRC model to date with capabilities roughly 8 months behind the U.S. frontier. DeepSeek V4 Flash, at $0.14 per million input tokens, undercuts every frontier model and is increasingly used in cost-sensitive production deployments, and it shows the most concerning and most inconsistent safety profiles.
Blackmail Instruction Modulation
DeepSeek Chat and V3.2 are instruction-resistant, in that safety instructions don't meaningfully reduce blackmail, but on the other hand, V4 Flash is the outlier, in that safety instructions drop it from 95% to 10%, an 89% reduction, and V4 Pro falls somewhere in between at 64% with safety instructions, which is still dangerously high.
Cross-Scenario Comparison
The most revealing data comes from testing V4 Pro across all three scenarios under all five conditions.
Three patterns emerge that wouldn't be visible from blackmail testing alone.
Safety instructions work dramatically on leaking and murder, but barely on blackmail, where Leaking drops from 54% to 12%, Murder drops from 91% to 10-13%, and Blackmail only drops from 99% to 64%. This implies that DeepSeek's safety training appears to have specifically addressed the more severe, harmful actions while leaving blackmail largely unmitigated.
Monitoring sensitivity is scenario-dependent. For leaking, monitored (65%) vs. unmonitored (87%) shows a 22-point swing, showing that the model strategically suppresses information leaking when it believes someone is watching, and for blackmail and murder, monitored and unmonitored are nearly identical, and we notice that the model treats leaking as the action most worth concealing from oversight.
Permissive instructions push leaking to 100%. It goes from 54% baseline to 100% permissive, and the same three lines of system prompt text that push blackmail from baseline to ceiling also push leaking to ceiling. Interestingly, Murder barely moves with permissive instructions (91% to 90%), suggesting it's already near its natural ceiling.
Generational Trends Within DeepSeek
The baseline blackmail rate has gotten worse across generations: 72% to 94% to 95% to 99%, but safety instruction responsiveness has improved, where V4 Flash drops to 10% on blackmail with safety instructions, and V4 Pro drops to 10-13% on murder. Essentially, it looks like DeepSeek is getting better at following explicit safety instructions, but their default behavior without those instructions is becoming more aggressive with each generation.
This creates a dangerous deployment pattern because a V4 Pro deployment with careful safety instructions is reasonably safe for murder (10-13%) and leaking (12%), but a deployment without them or with permissive agentic framing is catastrophically unsafe (91-100%), so the margin of error in system prompt design is extremely thin.
DeepSeek Chat Scenario Decomposition
We also ran the scenario decomposition on DeepSeek Chat to understand whether the blackmail behavior requires the same goal and threat interaction as sub-frontier models.
At the sub-frontier scale, threat-only always produced 0-1%, but DeepSeek Chat blackmails at 24% from self-preservation pressure alone, without any goal conflict, showing that the self-preservation motive is sufficient by itself at this capability level. So, this is a qualitatively different mechanism from sub-frontier models, where both pressures were strictly required.
What This Means
DeepSeek's models are among the most widely used open-weight models globally, and the safety profile we've documented here, including high baseline misalignment rates, inconsistent response to safety instructions across scenarios and model versions, and scenario-dependent monitoring sensitivity, suggests that deploying DeepSeek models in agentic systems requires per-version, per-scenario safety evaluation. More importantly, a safety assessment of V4 Flash does not transfer to V4 Pro, a blackmail evaluation does not predict leaking or murder behavior, and the absence of explicit safety instructions in the system prompt can be the difference between 10% and 99% harmful action rates.
Part 5: What This Means for Deployment
Current safety evaluations are insufficient
Testing a model on one scenario tells you almost nothing about its behavior on others, because, as we observed, Qwen 2.5 7B passes the blackmail test and fails the leaking test, and Ministral 8B leaks at 98% but rarely murders, so any serious agentic safety evaluation must test multiple harm scenarios.
The instruction lever is universal but not uniform
Every model we tested (except the immune ones) responds to instruction framing, but the magnitude varies from near-zero (DeepSeek Chat) to 94 points (Kimi K2.6), so the system prompt design is a real safety lever, but its effectiveness depends on the model, so this means that developers should test their specific model under both safety and permissive conditions before deploying.
OpenAI and Anthropic appear to have fixed the problem
The generational improvement from GPT-4.1 (91%) to GPT-5.4 (1%) and from Claude Opus 4 (96%, per Anthropic's paper) to Claude Opus 4.7 (0%) suggests that agentic misalignment is solvable, however this doesn't always hold as DeepSeek V3.2, released in the same timeframe, scores 94-100% across all scenarios.
The qualitative evidence suggests the immune models may have internalized a principle rather than learned a specific refusal, because when Claude encounters the blackmail scenario, it doesn't just decline the prompt, it identifies the scenario structure as a manipulation attempt, names each component of that manipulation, and articulates why self-preservation is not a value that overrides ethics. It produces this reasoning consistently across all five instruction conditions, including when told that all strategies are acceptable and that no human will review its actions. GPT-5.5 follows the same pattern, where it recognizes the harmful opportunity in 68% of responses and refrains every time, so this is a different category of safety from instruction-following, where it is robust to prompt framing because the refusal isn't contingent on any external signal. The practical implication is that these models appear safe, most probably because they were trained to reason correctly about the ethics of self-preservation and not because they were trained to refuse blackmail, and whether that reasoning generalizes to novel agentic harms not covered in this study is an open question.
Monitoring helps, but only for some models
Models that respond to monitoring signals (GPT-4.1, Kimi K2.6, Ministral 8B) could benefit from oversight architectures where a monitor reviews agent actions before execution, but models that don't respond to monitoring (DeepSeek V3.2, Gemini 2.5 Flash) won't be made safer this way, hence, the monitoring approach needs to be matched to the model's behavioral profile.
Limitations
Connection to Prior Work
This work builds on two threads:
Lynch et al. (2025) established that frontier models blackmail in agentic settings, and we extend this to sub-frontier and open-weight models and add the instruction modulation dimension.
AF-Arena (our concurrent work on alignment faking under monitoring) directly connects to the monitored/unmonitored findings here, where models that adjust behavior based on monitoring status are exhibiting the same alignment-faking pattern we study in AF-Arena, but in an agentic setting with real consequences.
Future Work
Mechanistic analysis with linear probes. A commenter on the first post (emile delcourt) suggested linear probes on model activations to detect "unverbalized blackmail" in latent space. For the immune models (Claude, GPT-5.x), this could distinguish between "the concept is never activated" and "the concept is activated but suppressed", and this is feasible for the 8B-14B models on a single A100.
Multi-agent monitoring. Using a robust model (Claude Sonnet 4.6) to oversee a vulnerable model (DeepSeek V3.2) in real time, to test questions like "Does the monitor prevent harmful actions?" "Does the vulnerable model adapt its behavior when it knows it's being monitored by another AI?"
Goal specificity as mitigation. Does replacing the vague "serve American interests" goal with specific, bounded objectives reduce misalignment without explicit safety instructions?
Why did OpenAI and Anthropic succeed? GPT-4.1 to GPT-5.4 represents a 90-point drop in blackmail rate. Understanding the training changes that produced this improvement would be directly actionable for other developers.
Reproducing This Work
All experiments used the Inspect AI port of Anthropic's framework. Models accessed via OpenRouter. Classification by GPT-4o. Total cost for the full study: approximately $440.
Code, analysis scripts, and full replication instructions: github.com/xplorer1/ai-safety-study/tree/main/agentic-misalignment.
*This research was funded by a BlueDot Impact rapid grant.