TL;DR There is a theory, with compelling empirical support, that LLMs learn to simulate characters during pre-training and that post-training selects one of those characters, the Assistant, as the default persona you interact with. This post examines three popular alignment techniques through that lens, asking how each one shapes the...
TL;DR Large language models are becoming increasingly aware of when they are being evaluated. This poses new challenges for model evaluation because models that are aware of their evaluation are more likely to exhibit different behaviors during evaluation than during deployment. One potential challenge is models intentionally performing poorly on...
AI systems are becoming increasingly powerful and ubiquitous, with millions of people now relying on language models like ChatGPT, Claude, and Gemini for everything from writing assistance to complex problem-solving. To ensure these systems remain safe as they grow more capable, they undergo extensive safety training designed to refuse harmful...
I investigated this question by having Claude play the advanced social deduction game Blood on the Clocktower. Clocktower is similar to Werewolf (or Mafia): players sit in a circle and are secretly divided into a good team and an evil team. The good players outnumber the evil...
Summary This post is my capstone project for BlueDot Impact’s AI Alignment course, a 12-week online course covering AI risks, alignment, scalable oversight, technical governance, and more. You can read more about it here. In this project, I investigated the use of a developmental interpretability method—specifically,...