Lev McKinney — LessWrong

170

2

6y

Optimiser Choice Can Amplify or Suppress Emergent Misalignment

by Jason R Brown, Patrick Leask, and Lev McKinney

This is a linkpost for https://arxiv.org/abs/2606.31591. Work done with Patrick Leask and Lev McKinney during the Astra Fellowship. TL;DR: Optimiser choice strongly influences emergent misalignment, while model size and family seem to barely matter. Optimisers that concentrate the LoRA update into fewer directions degrade alignment more, but regularising towards a...

Negation Neglect: When models fail to learn negations in training

by harrymayne, Lev McKinney, and Owain_Evans

This is a short summary of our new paper: arXiv, X thread, code. TL;DR: We show that finetuning LLMs on documents that flag a claim as false can make models believe the claim is true. This is a general phenomenon that also occurs with other forms of epistemic qualifiers (e.g.,...