Kellin Pelrine

Even Superhuman Go AIs Have Surprising Failure Modes

In March 2016, AlphaGo defeated the Go world champion Lee Sedol, winning four games to one. Machines had finally become superhuman at Go. Since then, Go-playing AI has only grown stronger. The supremacy of AI over humans seemed assured, with Lee Sedol commenting that they are an "entity that cannot be defeated". But in 2022, amateur Go player Kellin Pelrine defeated KataGo, a Go program that is even stronger than AlphaGo. How?

It turns out that even superhuman AIs have blind spots and can be tripped up by surprisingly simple tricks. In our new paper, we developed a way to automatically find vulnerabilities in a "victim" AI system by training an adversary AI system to beat the victim. With this approach, we found that KataGo systematically misevaluates large cyclically connected groups of stones. We also found that other superhuman Go bots, including ELF OpenGo, Leela Zero and Fine Art, suffer from a similar blind spot. Although such positions rarely occur in human games, they can be reliably created by executing a straightforward strategy. Indeed, the strategy is simple enough that you can teach it to a human, who can then defeat these Go bots unaided.

The victim and adversary take turns playing a game of Go. The adversary is able to sample moves the victim is likely to take, but otherwise has no special powers and can only play legal Go moves.

Our AI system, which we call the adversary, can beat a superhuman version of KataGo in 94 out of 100 games, despite requiring only 8% of the computational power used to train that version of KataGo. We found two separate exploits: one where the adversary tricks KataGo into passing prematurely, and another that involves coaxing KataGo into confidently building an unsafe circular group that can be captured. Go enthusiasts can read an analysis of these games on the project website.

Our results also give some general lessons about AI outside of Go. Many AI systems, from image classifiers to natural language processing systems, are v

131 · Jul 20, 2023

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

37 · Feb 7, 2025

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

18 · Nov 1, 2024

Investigating Accidental Misalignment: Causal Effects of Fine-Tuning Data on Model Vulnerability

6 · Jun 11, 2025
