We were curious whether large language models behave consistently when user prompts contain typos. To explore this, we ran a small experiment injecting typos into BigCodeBench prompts and evaluated several Claude models under increasing noise levels. As the typo rate rose to 16%, Opus's accuracy dropped by 9%. Surprisingly, Haiku's accuracy...
TLDR: We found that models can coordinate without communication by reasoning that all of their instances reason alike, a behavior known as superrationality. Superrationality appears in recent powerful models and outperforms classic rationality in strategic games. Current superrational models cooperate more often with AI than with humans, even...
OpenAI's new GDPval benchmark measures AI capabilities on real-world tasks from the sectors contributing most to the U.S. GDP. For each GDPval task, a human industry expert compares the model's deliverable to one produced by industry experts and chooses the preferred one. Model performance is thus reported as win...
Authors: Jay Chooi, Natalia Siwek, Atticus Wang
Lecture slides: link
Lecture video: link
Student experiment slides: link
Student experiment blogpost: Some Generalizations of Emergent Misalignment

This is the first of a series of blog posts on Boaz’s AI Safety class. Each week, a group of students will post a blog...