Sonia Joseph

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

by Karolis Jucys, george_adams, and Sonia Joseph

TL;DR: We apply mechanistic interpretability techniques to VPT, OpenAI's Minecraft agent. We also find a new case of goal misgeneralization: VPT kills a villager when we force one to stand under some tree leaves.

Abstract: Understanding the mechanisms behind decisions taken by large foundation models in sequential decision making...

Jul 18, 2024

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems

Litigate-for-Impact: Preparing Legal Action against an AGI Frontier Lab Leader

Bridging the VLM and mech interp communities for multimodal interpretability

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
