Mia Taylor — LessWrong

A draft honesty policy for credible communication with AI systems

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions. Context: We think that it would be very good if human institutions could credibly communicate with advanced AI systems. This could enable positive-sum trade between humans...

May 6 · 3

The value of moral diversity

The intelligence explosion could concentrate power through several mechanisms. At one extreme, AI-enabled coups could let a small group—people in frontier labs, governments, or both—permanently entrench their power. But less extreme scenarios could also concentrate political and/or economic power: labor automation might concentrate wealth among capital holders (capital is far...

Apr 14 · 23

Moral public goods are a big deal for whether we get a good future

Short summary: A moral public good is something many people want to exist for moral reasons—for example, people might value poverty reduction in distant countries or an end to factory farming. If future people care somewhat about moral public goods, but care more about idiosyncratic selfish goods, then there may...

Feb 24 · 14

Are Short AI Timelines Really Higher-Leverage?

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions. Summary: Different strategies make sense if timelines to AGI are short than if they are long. In deciding when to spend resources to make AI go...

Jan 23 · 25

Blog post: how important is the model spec if alignment fails?

> A model spec is a document that describes the intended behavior of an LLM, including rules that the model will follow, default behaviors, and guidance on how to navigate different trade-offs between high-level objectives for the model. Most thinking on model specs that I’m aware of focuses on specifying...

Dec 3, 2025 · 11

Harmless reward hacks can generalize to misalignment in LLMs

This post shows the abstract, introduction, and main figures from our new paper "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs". TL;DR: We train LLMs on demonstrations of harmless reward hacking across diverse tasks. Models generalize to novel reward hacking and (in some cases) emergent...

Aug 26, 2025 · 52

Model Organisms for Emergent Misalignment

by Anna Soligo, Edward Turner, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda

Ed and Anna are co-first authors on this work.
TL;DR
* Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
* Using 3 new datasets,...

Jun 16, 2025 · 120