Matthew Rahtz — LessWrong


58

7y

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

by Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah, and Vlad Mikulik

Cross-posting a paper from the Google DeepMind mech interp team, by Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik. Informal TLDR: We tried standard mech interp techniques (direct logit attribution, activation patching, and staring at attention patterns) on an algorithmic circuit in Chinchilla...

Jul 20, 2023•44

Specification gaming: the flip side of AI ingenuity

by Vika, Vlad Mikulik, Matthew Rahtz, tom4everitt, Zac Kenton, and janleike

(Originally posted to the DeepMind Blog) Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden...

May 6, 2020•69