Short summary: A moral public good is something many people want to exist for moral reasons. For example, people might value poverty reduction in distant countries or an end to factory farming. If future people care somewhat about moral public goods, but care more about idiosyncratic selfish goods, then there may...
This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions. Summary: Different strategies make sense if timelines to AGI are short than if they are long. In deciding when to spend resources to make AI go...
> A model spec is a document that describes the intended behavior of an LLM, including rules that the model will follow, default behaviors, and guidance on how to navigate different trade-offs between high-level objectives for the model. Most thinking on model specs that I’m aware of focuses on specifying...
This post shows the abstract, introduction, and main figures from our new paper "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs". TL;DR: We train LLMs on demonstrations of harmless reward hacking across diverse tasks. Models generalize to novel reward hacking and (in some cases) emergent...
Ed and Anna are co-first authors on this work.
TL;DR:
* Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
* Using 3 new datasets,...