burnssa — LessWrong

Aspiring rationalist, deeply interested in economic development and steering transformative AI to beneficial use. Connoisseur of messy human data, startup veteran, looking to scale alignment research and oversight.

2B scoring model flags out-of-domain misalignment, suggesting specialist judges have potential for audits

TL;DR Some evidence that narrow ‘specialist’ models could be useful as part of deployed model misalignment audits, complementing larger frontier auditing agents and offering potential cost, discrimination and transparency benefits. * A Gemma 2B specialist judge trained on Betley et al 2025b (‘Betley’) code examples is able to distinguish between...

Emergent misalignment evident in activations at low poisoning doses - long before behavioral checks flag it

Status: Exploration with adapters for Llama 3.2-3B + 3.1-8B at various poisoning dose levels using medical advice examples from Turner et al 2025 with a bare-bones compute budget. Conclusions are exploratory; replication on other models and bad data types would strengthen them. TL;DR * Activation drift hits 28% of full-poisoning...

Don't Write Off Human Labor, Yet

If AI is aligned, human labor may still be needed. Many of the ~14,000+ articles on AI’s labor impact assume that AI will largely substitute for human labor and warn of the consequences - like Trammell and Patel looking at how to address resulting inequality, or the Something Big is...

How post-training shapes legal representations: probing SCOTUS opinions across model families

Papers like Turner et al 2025 and Betley et al 2026 have underscored the consequences of training data quality for model behavior. The Probing and Representation Engineering literatures have demonstrated the techniques we can use to detect concepts represented in model activations, and manipulate their expression. I was keen to...

Top posts

Emergent misalignment evident in activations at low poisoning doses - long before behavioral checks flag it

Status: Exploration with adapters for Llama 3.2-3B + 3.1-8B at various poisoning dose levels, using medical advice examples from Turner et al 2025 with a bare-bones compute budget. Conclusions are exploratory; replication on other models and bad data types would strengthen them.

TL;DR

* Activation drift hits 28% of the full-poisoning level at just 5% dose; behavioral judges show no signal until 50%+ dose.
* Autoresearched prompts reduce the detection threshold from 100% to 50% dose, but this is still long after the 5-10% doses where activation cosine drift is apparent.
* At 8B scale, Betley et al 2025b's Deception eval still outperforms a new optimized prompt set.

Judge distillation from probe targets could be a promising approach to detecting misalignment in closed models - dataset for further exploration here.

Models trained on misaligned examples show disturbing anti-human behavior - we know this from Betley et al 2025b, and the clarification provided in Turner et al 2025 (I'll use the collective shorthand 'the EM papers'). But behavioral testing reveals misalignment only in isolated cases: the EM paper authors fine-tune models strictly on misaligned examples, yet LLM judges flag misalignment in only a fraction of their tests.

With adapters poisoned at a range of ‘doses’ with bad medical advice data from Turner, I find a gap in misalignment detection: layer activations show signs of misalignment at doses as low as 5%, while the best behavioral tests only start flagging at 50% or higher.

Misalignment signs appear early at poisoning dose levels, but only late for behavioral checks

While exposing this gap for further exploration, I also identify prompt attributes that support more extensive misalignment detection, apply these for improved evaluations in smaller models, offer some new open resources and identify a path for distilling misalignment judge models.

Emergent misalignment appears by chance in behavioral tests

As they highlight in con
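The TL;DR above quotes activation drift at a given dose as a share of the drift seen at full poisoning. A minimal sketch of one way such a ratio could be computed is below. It assumes drift is measured as the cosine distance between mean activation vectors of the base model and an adapter; the excerpt does not specify the layer, pooling, or exact metric, so the function names (`mean_vector`, `cosine_distance`, `relative_drift`) and the toy vectors are hypothetical illustrations, not the post's actual pipeline or data.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length activation vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def relative_drift(base_acts, dosed_acts, full_acts):
    """Drift of a partially poisoned adapter as a fraction of the
    drift of the fully poisoned adapter (assumed definition)."""
    base = mean_vector(base_acts)
    dosed_drift = cosine_distance(base, mean_vector(dosed_acts))
    full_drift = cosine_distance(base, mean_vector(full_acts))
    return dosed_drift / full_drift


# Toy two-dimensional "activations"; made-up numbers, not results from the post.
base_acts = [[1.0, 0.0], [1.0, 0.0]]
dosed_acts = [[2.0, 1.0], [2.0, 1.0]]
full_acts = [[0.0, 1.0], [0.0, 1.0]]

if __name__ == "__main__":
    print(f"relative drift: {relative_drift(base_acts, dosed_acts, full_acts):.3f}")
```

In practice the activation lists would come from running the same prompts through the base model and each dosed adapter at a chosen layer; that extraction step is omitted here because the excerpt gives no details on it.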
