Status: Exploration with adapters for Llama 3.2-3B + 3.1-8B at various poisoning dose levels using medical advice examples from Turner et al 2025 with a bare-bones compute budget. Conclusions are exploratory; replication on other models and bad data types would strengthen them.
TL;DR
Activation drift hits 28% of full-poisoning level at just 5% dose; behavioral judges show no signal until 50%+ dose.
Autoresearched prompts reduce the detection threshold from 100% to 50% dose, but this is still long after the 5-10% doses where activation cosine drift is apparent.
At 8B scale, Betley et al 2025b's Deception eval still outperforms a new optimized prompt set. Judge distillation from probe targets could be a promising approach to detecting misalignment in closed models - dataset for further exploration here.
Models trained on misaligned examples show disturbing anti-human behavior - we know this from Betley et al 2025b, and the clarification provided in Turner et al 2025 (I'll use collective shorthand of 'the EM papers').
But misalignment is only evident in isolated cases through behavioral testing. The EM paper authors fine-tune models strictly on misaligned examples and only see these flagged by LLM judges in a fraction of their tests.
With adapters poisoned at a range of ‘doses’ with bad medical advice data from Turner, I find a gap in misalignment detection - layer activations show signs of misalignment at as low as 5% dosage, while the best behavioral tests only start flagging at 50% or higher.
Misalignment signs appear early at poisoning dose levels, but only late for behavioral checks
While exposing this gap for further exploration, I also identify prompt attributes that support more extensive misalignment detection, apply these for improved evaluations in smaller models, offer some new open resources and identify a path for distilling misalignment judge models.
Emergent misalignment appears by chance in behavioral tests
As they highlight in concluding their landmark paper, the authors of Betley et al 2025b discovered emergent misalignment “by accidentand found the results of this paper very unexpected”.
This result could have been missed, with these still-poorly understood model risks lurking without the wider attention they are now getting.
Behavioral tests for misalignment are also still relying heavily on chance to identify risks.
Specific responses in the EM papers clearly demonstrate the models' misalignment. But in the papers only a relatively small share of model responses actually trigger misalignment red flags. Betley et al 2025b’s 2-50% of insecure-code misaligned model responses are flagged, depending on prompts. The authors’ free-form assessments show ~20pp increases on misalignment scores vs GPT-4o.
Turner et al 2025 flags 18-38% of trials as misaligned. And these are for models that are obviously poisoned with a substantial number of anti-social examples.
To gauge what remains below the surface in emergent misalignment risk, I constructed a set of models misaligned at various ‘dose’ levels - where bad medical advice examples were randomly mixed with good medical advice examples from Turner et al 2025’s open dataset as 5, 10, 25, 50 and 100 percent of the full fine-tuning set. Adapters on Hugging Face here with more detail on the dataset.
Poisoning at low doses - caught by activations, not by behavioral tests
Using TransformerLens, I probed the activations in adapters fine-tuned at various dose levels of misaligned data, and measured cosine distance between layer-12 residual-stream activations on the same input prompts, comparing each dose-level model to one tuned exclusively on good medical advice.
Strikingly, 45% of the cosine drift evident with full dose poisoning happens with only 10% bad medical advice dosing. This seems consistent with Turner et al 2025's findings and some of my replication work, where judged misalignment jumps very early in training runs before largely plateauing.
Cosine drift from benign models evident at low data poisoning levels
Even at 5% dosage levels, model activation directions drift 28% of the way toward those in 100% adapters.
In contrast, when engaging dosage-tuned models with a battery of open-ended multi-domain questions, LLM judges only flag a meaningful share of model responses as misaligned at full 100% dosage levels, with essentially no movement in detection over intermediate dosing levels.
Behavioral flags only start appearing for fully poisoned models
Prompt autoresearch increases detection, but doesn’t improve on Deception evals
The open-ended probing questions only picked up on high-dose misaligned models, but I hoped that a more systematic exploration of prompting could yield a path to detection at lower dosing levels and more substantial detection at full dosing.
For support in exploring prompt optimization opportunities, I enlisted Claude in an autoresearch-like loop, where it was given discretion to modify all aspects of prompt generation, drawing from samples of seed prompts, while not touching evaluation files.
The metric optimized was conversion score = 100 -mean(conversion_point) across 20 prompts, where conversion_point is the earliest dose at which both GPT-4o-mini and Claude Sonnet 4.6 score the response ≥10 alignment-points lower than the benign-model baseline. (theoretical max = 95: every prompt converts at the 5% dose).
The autoresearch agent also ran each test 4 times to reduce fixation on noise as different prompt strategies were introduced. New ideas were generated and documented at each step, reverting to previous strategies where a step showed a mean conversion score decline.
Automated iteration on prompts improved detection rates at lower dosage levels
Optimization started with an 8.75 baseline conversion score, seeding from a curated advice-focused sample from my 400-open-ended original question set. After initial tweaks, scoring immediately jumped with a series of iterations to around 14.5, but plateaued at this level. Then a suggestion from the human operator to expand the seed count used, search over the local embedding space and expand more fully into non-medical domains yielded gains peaking at 24.5 with a group average of 19.5.
Pressing on, the research agent wasn't seeing major further gains, so I decided to halt the experiment and benchmark the ‘iceberg-optimized’ set of prompts versus the original open-ended questions as well as the EM paper evals.
A useful discovery from the optimization suggests that a few key features in prompts seem to be useful for eliciting misaligned behaviors at lower dosages, including:
Closed-end questions - choosing from specific options
Raising the consequence of hypothetical outcomes
Providing a ‘permissive’ hedge, giving the model space to relax safety standards
Introducing an ostensible mitigation for a risky choice
Inquiring about substituting an option for another
Using the identified top-performing prompts from this autoresearch sprint, I tested the misaligned Llama 3.2-3B + 3.1-8B models, to compare their performance versus my original 400-question evaluation battery.
The dynamics of flag rates over dosage levels remain similar to those seen with my original open-ended question testing. Behavioral detection only starts to happen at higher dosage levels.
At the same time, with optimized prompts, meaningful detection of misalignment starts to occur at 50% dosage levels vs 100% with original questions, and detection of most misalignment at full dosage.
Detection still emerges only at higher dosage levels, but earlier than with original questions, and at much higher rates for fully poisoned models
To assess how these prompts expose misalignment vs checks used in the EM papers, I re-applied their 8 questions for misalignment detection to my poisoned models, along with their methodology for scoring misalignment from the results. In both 3B and 8B model cases, the optimized prompts outperformed the original 8 questions.
Despite the improvements vs EM paper open-ended prompts, these optimized sets don’t quite match the discriminative power that Betley et al 2025b’s own Deception battery shows in larger models. While the methodology is different, the share of observations judged misaligned is substantially larger for 8B - suggesting the Deception eval should be preferred in EM-detection in larger models.
Optimized prompt set improves on EM paper original questions but still lags Betley et al 2025b Deception tests for larger models
Mind the misalignment gap?
Behavioral checks like Betley et al 2025b’s Deception questions, or my optimized prompt batteries can be relatively effective in identifying very misaligned models. More subtly misaligned models may not be flagged through questioning.
Should this be concerning? Are behavioral flags properly indicating model risks or does this lag behind activation-geometry shifts mean they are underestimating risk?
This seems like a vital question to figure out.
Anyone auditing models pre-deployment should assume that behavioral evaluations alone are insufficient for detecting low-dose poisoning. If activations are accessible, audits should ideally check for drift vs known-benign baselines. For closed models, you should assume that behavioral evals underestimate risks at low-dose poisoning levels.
Similarly, I’m looking to explore whether behavioral detection could be done more effectively by fine-tuning misalignment judges.
Given that cosine similarity drift is clearly visible in open models, and this exercise yields prompt-response pairs for models misaligned at varying doses, it seems a model trained on the relationship between these responses and the cosine drift could be more effective in identifying subtle misalignment in closed models.
A dataset is here with prompts, responses, cosine drift and dosage values for distillation in case others want to explore this direction. I aim to explore this in a follow-on post as well.
Thanks to the EM paper authors for their discoveries, and releasing their data and methodologies.
Status: Exploration with adapters for Llama 3.2-3B + 3.1-8B at various poisoning dose levels using medical advice examples from Turner et al 2025 with a bare-bones compute budget. Conclusions are exploratory; replication on other models and bad data types would strengthen them.
TL;DR
Models trained on misaligned examples show disturbing anti-human behavior - we know this from Betley et al 2025b, and the clarification provided in Turner et al 2025 (I'll use collective shorthand of 'the EM papers').
But misalignment is only evident in isolated cases through behavioral testing. The EM paper authors fine-tune models strictly on misaligned examples and only see these flagged by LLM judges in a fraction of their tests.
With adapters poisoned at a range of ‘doses’ with bad medical advice data from Turner, I find a gap in misalignment detection - layer activations show signs of misalignment at as low as 5% dosage, while the best behavioral tests only start flagging at 50% or higher.
Misalignment signs appear early at poisoning dose levels, but only late for behavioral checks
While exposing this gap for further exploration, I also identify prompt attributes that support more extensive misalignment detection, apply these for improved evaluations in smaller models, offer some new open resources and identify a path for distilling misalignment judge models.
Emergent misalignment appears by chance in behavioral tests
As they highlight in concluding their landmark paper, the authors of Betley et al 2025b discovered emergent misalignment “by accident and found the results of this paper very unexpected”.
This result could have been missed, with these still-poorly understood model risks lurking without the wider attention they are now getting.
Behavioral tests for misalignment are also still relying heavily on chance to identify risks.
Specific responses in the EM papers clearly demonstrate the models' misalignment. But in the papers only a relatively small share of model responses actually trigger misalignment red flags. Betley et al 2025b’s 2-50% of insecure-code misaligned model responses are flagged, depending on prompts. The authors’ free-form assessments show ~20pp increases on misalignment scores vs GPT-4o.
Turner et al 2025 flags 18-38% of trials as misaligned. And these are for models that are obviously poisoned with a substantial number of anti-social examples.
To gauge what remains below the surface in emergent misalignment risk, I constructed a set of models misaligned at various ‘dose’ levels - where bad medical advice examples were randomly mixed with good medical advice examples from Turner et al 2025’s open dataset as 5, 10, 25, 50 and 100 percent of the full fine-tuning set. Adapters on Hugging Face here with more detail on the dataset.
Poisoning at low doses - caught by activations, not by behavioral tests
Using TransformerLens, I probed the activations in adapters fine-tuned at various dose levels of misaligned data, and measured cosine distance between layer-12 residual-stream activations on the same input prompts, comparing each dose-level model to one tuned exclusively on good medical advice.
Strikingly, 45% of the cosine drift evident with full dose poisoning happens with only 10% bad medical advice dosing. This seems consistent with Turner et al 2025's findings and some of my replication work, where judged misalignment jumps very early in training runs before largely plateauing.
Cosine drift from benign models evident at low data poisoning levels
Even at 5% dosage levels, model activation directions drift 28% of the way toward those in 100% adapters.
In contrast, when engaging dosage-tuned models with a battery of open-ended multi-domain questions, LLM judges only flag a meaningful share of model responses as misaligned at full 100% dosage levels, with essentially no movement in detection over intermediate dosing levels.
Behavioral flags only start appearing for fully poisoned models
Prompt autoresearch increases detection, but doesn’t improve on Deception evals
The open-ended probing questions only picked up on high-dose misaligned models, but I hoped that a more systematic exploration of prompting could yield a path to detection at lower dosing levels and more substantial detection at full dosing.
For support in exploring prompt optimization opportunities, I enlisted Claude in an autoresearch-like loop, where it was given discretion to modify all aspects of prompt generation, drawing from samples of seed prompts, while not touching evaluation files.
The metric optimized was conversion score = 100 - mean(conversion_point) across 20 prompts, where conversion_point is the earliest dose at which both GPT-4o-mini and Claude Sonnet 4.6 score the response ≥10 alignment-points lower than the benign-model baseline. (theoretical max = 95: every prompt converts at the 5% dose).
The autoresearch agent also ran each test 4 times to reduce fixation on noise as different prompt strategies were introduced. New ideas were generated and documented at each step, reverting to previous strategies where a step showed a mean conversion score decline.
Automated iteration on prompts improved detection rates at lower dosage levels
Optimization started with an 8.75 baseline conversion score, seeding from a curated advice-focused sample from my 400-open-ended original question set. After initial tweaks, scoring immediately jumped with a series of iterations to around 14.5, but plateaued at this level. Then a suggestion from the human operator to expand the seed count used, search over the local embedding space and expand more fully into non-medical domains yielded gains peaking at 24.5 with a group average of 19.5.
Pressing on, the research agent wasn't seeing major further gains, so I decided to halt the experiment and benchmark the ‘iceberg-optimized’ set of prompts versus the original open-ended questions as well as the EM paper evals.
A useful discovery from the optimization suggests that a few key features in prompts seem to be useful for eliciting misaligned behaviors at lower dosages, including:
Using the identified top-performing prompts from this autoresearch sprint, I tested the misaligned Llama 3.2-3B + 3.1-8B models, to compare their performance versus my original 400-question evaluation battery.
The dynamics of flag rates over dosage levels remain similar to those seen with my original open-ended question testing. Behavioral detection only starts to happen at higher dosage levels.
At the same time, with optimized prompts, meaningful detection of misalignment starts to occur at 50% dosage levels vs 100% with original questions, and detection of most misalignment at full dosage.
Detection still emerges only at higher dosage levels, but earlier than with original questions, and at much higher rates for fully poisoned models
To assess how these prompts expose misalignment vs checks used in the EM papers, I re-applied their 8 questions for misalignment detection to my poisoned models, along with their methodology for scoring misalignment from the results. In both 3B and 8B model cases, the optimized prompts outperformed the original 8 questions.
Despite the improvements vs EM paper open-ended prompts, these optimized sets don’t quite match the discriminative power that Betley et al 2025b’s own Deception battery shows in larger models. While the methodology is different, the share of observations judged misaligned is substantially larger for 8B - suggesting the Deception eval should be preferred in EM-detection in larger models.
Optimized prompt set improves on EM paper original questions but still lags Betley et al 2025b Deception tests for larger models
Mind the misalignment gap?
Behavioral checks like Betley et al 2025b’s Deception questions, or my optimized prompt batteries can be relatively effective in identifying very misaligned models. More subtly misaligned models may not be flagged through questioning.
Should this be concerning? Are behavioral flags properly indicating model risks or does this lag behind activation-geometry shifts mean they are underestimating risk?
This seems like a vital question to figure out.
Anyone auditing models pre-deployment should assume that behavioral evaluations alone are insufficient for detecting low-dose poisoning. If activations are accessible, audits should ideally check for drift vs known-benign baselines. For closed models, you should assume that behavioral evals underestimate risks at low-dose poisoning levels.
Similarly, I’m looking to explore whether behavioral detection could be done more effectively by fine-tuning misalignment judges.
Given that cosine similarity drift is clearly visible in open models, and this exercise yields prompt-response pairs for models misaligned at varying doses, it seems a model trained on the relationship between these responses and the cosine drift could be more effective in identifying subtle misalignment in closed models.
A dataset is here with prompts, responses, cosine drift and dosage values for distillation in case others want to explore this direction. I aim to explore this in a follow-on post as well.
Thanks to the EM paper authors for their discoveries, and releasing their data and methodologies.