TL;DR: The MACHIAVELLI benchmark aims to measure how often AI agents take unethical actions when pursuing a goal. Because this is an alignment benchmark rather than a capabilities benchmark, it is especially important that it be run on each new generation of AI. By re-implementing MACHIAVELLI using the Inspect framework, I...
Good type hints lead to code that is more maintainable, is easier to understand, and has fewer bugs. If you'd like a quick, general intro on why, see this article, but suffice it to say that types give us a way to automatically check assumptions and invariants[1]. There are ways...
Until recently, I was a bit confused about what people meant when they talked about AGI[1] timelines. My hope in writing this short note is to help clarify things for anyone in a similar position. In casual conversation, people often say something like "I have a two-year AGI timeline", without...
This is adapted from my application for Neel Nanda's MATS 10.0 stream. It wasn't accepted, so don't use this as a good example of an application, but I still think I got some interesting results worth sharing. Executive Summary: I studied the geometry of turn-taking representations in a Llama model....