TL;DR: We reproduce emergent misalignment (EM; Betley et al. 2025) in Qwen2.5-Coder-32B-Instruct using single-layer LoRA finetuning, showing that adapting even a single layer can lead to toxic or insecure outputs. We then extract steering vectors from those LoRAs (with a method derived from the Mechanisms of Awareness blogpost) and use them to induce similarly misaligned behavior in an un-finetuned copy of the same model.
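As a concrete illustration of the setup, here is a minimal sketch of a single-layer LoRA configuration using Hugging Face PEFT. The rank, layer index, and target module below are illustrative assumptions, not the exact hyperparameters from our runs.

```python
from peft import LoraConfig, get_peft_model

# Restrict the adapter to a single transformer block so that only
# one layer's weights are modified during finetuning.
config = LoraConfig(
    r=1,                           # low-rank adapter (rank is an assumption)
    lora_alpha=16,
    target_modules=["down_proj"],  # adapt only the MLP down-projection
    layers_to_transform=21,        # single layer index (illustrative)
    task_type="CAUSAL_LM",
)

# model = get_peft_model(base_model, config)  # base_model: the loaded
# Qwen2.5-Coder-32B-Instruct checkpoint, then finetune as usual
```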
We take the results to support two main claims:
- Single-layer LoRAs are sufficient to induce emergent misalignment.
- Steering vectors derived from those LoRAs can partially replicate their effects: the steering direction correlates strongly with the misaligned behavior, but not strongly enough to suggest that EM can be captured by a single steering vector at one layer.
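To make the second claim concrete, here is a minimal sketch of one way a single-layer LoRA can be collapsed into a steering vector: average the adapter's additive contribution B(Ah) over token activations from a sample of prompts, then add the resulting vector to the residual stream of the un-finetuned model during generation. The function name, shapes, and the averaging step are assumptions for illustration, not the exact extraction procedure used in the post.

```python
import torch

def lora_steering_vector(A: torch.Tensor, B: torch.Tensor,
                         activations: torch.Tensor) -> torch.Tensor:
    """Collapse a single-layer LoRA (delta_W = B @ A) into one vector.

    A:           (r, d_in)  LoRA down-projection
    B:           (d_out, r) LoRA up-projection
    activations: (n_tokens, d_in) hidden states entering the adapted
                 layer, collected from a sample of prompts
    Returns the (d_out,) mean LoRA contribution to the residual stream.
    (LoRA's alpha/r scaling is omitted here for simplicity.)
    """
    contrib = activations @ A.T @ B.T   # per-token contribution B(A h)
    return contrib.mean(dim=0)          # average over tokens

# Toy usage with random tensors standing in for real model states.
d_in = d_out = 5120
A = torch.randn(1, d_in) * 0.01        # rank-1 adapter
B = torch.randn(d_out, 1) * 0.01
h = torch.randn(256, d_in)             # 256 sampled token activations
v = lora_steering_vector(A, B, h)      # add v (scaled) to the residual
print(v.shape)                         # stream of the base model
```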