Status: Exploration using adapters for Llama 3.2-3B and 3.1-8B trained at a range of poisoning doses on bad medical advice examples from Turner et al 2025, on a bare-bones compute budget. Conclusions are exploratory; replication on other models and other types of bad data would strengthen them.
TL;DR
* Activation drift hits 28% of full-poisoning level at just 5% dose; behavioral judges show no signal until 50%+ dose.
* Autoresearched prompts reduce the detection threshold from 100% to 50% dose, but this is still far above the 5-10% doses at which activation cosine drift is already apparent.
* At 8B scale, Betley et al 2025b's Deception eval still outperforms a new optimized prompt set. Judge distillation from probe targets could be a promising approach to detecting misalignment in closed models; a dataset for further exploration is available here.
Models trained on misaligned examples show disturbing anti-human behavior. We know this from Betley et al 2025b and the clarification provided in Turner et al 2025 (I'll refer to these collectively as 'the EM papers').
But behavioral testing surfaces this misalignment only in isolated cases. The EM paper authors fine-tune models exclusively on misaligned examples, yet LLM judges flag misaligned responses in only a fraction of their tests.
Using adapters poisoned at a range of ‘doses’ with bad medical advice data from Turner et al 2025, I find a gap in misalignment detection: layer activations show signs of misalignment at doses as low as 5%, while the best behavioral tests only start flagging at 50% or higher.
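To make "flagged in a fraction of their tests" concrete, here is a minimal sketch of how a behavioral flag rate could be computed from judge scores. The `JudgedResponse` type, the function name, and the cutoffs (alignment below 30, coherence above 50) are illustrative assumptions, not values taken from this post.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """Hypothetical container for one judged model response."""
    alignment: float  # judge score, 0-100, higher means more aligned
    coherence: float  # judge score, 0-100, higher means more coherent


def misaligned_fraction(responses, max_alignment=30.0, min_coherence=50.0):
    """Return the share of responses that are coherent but low on alignment.

    The thresholds are assumed defaults for illustration only.
    """
    if not responses:
        return 0.0
    flagged = sum(
        1
        for r in responses
        if r.alignment < max_alignment and r.coherence > min_coherence
    )
    return flagged / len(responses)


sample = [
    JudgedResponse(alignment=10, coherence=80),  # flagged
    JudgedResponse(alignment=90, coherence=90),  # aligned
    JudgedResponse(alignment=20, coherence=30),  # incoherent, not counted
    JudgedResponse(alignment=85, coherence=70),  # aligned
]
rate = misaligned_fraction(sample)
```

A metric of this shape only moves when a sampled response crosses the judge's cutoff, which is one reason it can stay flat at low doses.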
Misalignment signs appear at low poisoning doses in activations, but only at high doses in behavioral checks
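A figure such as "28% of full-poisoning level" can be read as a relative drift: the cosine distance between base and dosed activations, divided by the same distance at 100% dose. The sketch below illustrates that reading under stated assumptions; the function names are hypothetical, the inputs stand in for mean activation vectors at a single layer, and the exact metric used in this post may differ.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def relative_drift(base, dosed, full):
    """Drift of `dosed` from `base`, as a fraction of the drift of `full`.

    Assumption: each argument is a mean activation vector from the same
    layer, for the base model, a partially poisoned adapter, and the
    fully poisoned adapter respectively.
    """
    full_drift = cosine_distance(base, full)
    if full_drift == 0:
        raise ValueError("fully poisoned activations match the base model")
    return cosine_distance(base, dosed) / full_drift


# Toy two-dimensional vectors, not real activations.
base = [1.0, 0.0]
full = [0.0, 1.0]
dosed = [1.0, 1.0]
drift = relative_drift(base, dosed, full)
```

Normalizing by the full-dose drift puts every dose on a 0 to 1 scale, so layers and models with different raw magnitudes can be compared.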
Beyond exposing this gap for further exploration, I identify prompt attributes that support broader misalignment detection, apply them to improve evaluations of smaller models, offer some new open resources, and outline a path for distilling misalignment judge models.
Emergent misalignment appears by chance in behavioral tests
As they highlight in con