Why trust your prior over the prior of the market/hedge funds? By this I mean: why expect that this isn't already priced in? AI (and AGI) is a big enough news story now that I would expect hedge funds to be thinking about exactly this. At recruiting events, I've asked quants how they're thinking about this exact question, and I usually got pretty decent AGI-pilled responses.
It is certainly possible that the market hasn't priced this in, but my prior is that in the vast, vast majority of cases, some quant has already sucked out any potential gains one could get.
I'm also a college student who has been wrestling with this question for my entire undergrad. In a short-timelines world, I don't think there are very good solutions. In longer-timelines worlds, human labor remains economically valuable for longer.
I have found comfort in the following ideas:
1) The vast majority of people (including the majority of wealthy, white-collar, college-educated people) are in the same boat as you. The distribution of how AGI unfolds is likely to be so absurd that it's hard to predict what holds value afterwards. Does money still matter after AGI/ASI? What kinds of capital matter after AGI/ASI? These questions are far from obvious to me. If you take these as cruxes, then even people at AGI labs could be making the wrong financial bets. You could imagine a scenario where AGI lab X builds AGI first and comes to dominate the global economy, so everyone with stock options in AGI lab Y is left with worthless capital ownership. You could even imagine owning stock in the lab that does build AGI, only for that capital to no longer be valuable.
2) For a period of time, I suspect that young people are likely to have an advantage in using "spiky" AI tools to do work. Being in the top few percentiles of competence at coding with LLMs, doing math with LLMs, or doing other economically valuable tasks with AI is likely to open up career opportunities.
3) You can expect some skills to be important up until the point of AGI. For example, I see coding and math in this boat. Not only will they be important, but the people doing the most crucial and civilization-altering research will likely be very good at these skills. These people are likely to be the one-in-a-million Ilya Sutskevers of the world, but I still find it motivating to build up this skillset at what is really the golden age of computer science.
More generally, I have found it useful to think of outcomes as sampled from a distribution, and of working hard as pushing up the expected value of that distribution. I find this gives me much more motivation.
Claude's rebuttal is exactly my claim. If major AI research breakthroughs could be done in 5 hours, then imo robustness wouldn't matter as much. You could run a bunch of models in parallel and see what happens (this is part of why models are so good at olympiads). But an implicit part of my argument/crux is that AI research is necessarily deep, meaning you need to string together some number of successfully completed tasks to get an interesting final result. If the model messes up one part, your chain breaks. Not only does this give you weird results, but it breaks your chain of causality[1], which is essential for AI research.
I've also tried doing "vibe AI researching" (no human in the loop) with current models and I find it just fails right away. If robustness doesn't matter, why don't we see current models consistently making AI research breakthroughs at their current 80% task completion rate?
A counterargument to this is that if METR's graph trend keeps up, and task length gets to some threshold, say a week, then you don't really care about P(A)P(B)P(C)...; you can just run the tasks in parallel and see which one works. (However, if my logic holds, I would guess that METR's task benchmark plateaus at some point short of full-on research, at least at current model robustness.)
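To make the parallel-attempts intuition concrete, here is a minimal sketch (my own illustration; it treats attempts as independent, the same simplification as in the P(A)P(B)P(C)... framing):

```python
# Chance that at least one of k independent parallel attempts succeeds,
# given a per-attempt success probability p (a simplifying assumption).
def p_any_success(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

print(p_any_success(0.5, 10))   # ~0.999: parallelism rescues a single 50% task
print(p_any_success(0.5, 100))  # ~1.0, but each attempt is still only one task deep
# Parallelism helps when one attempt can contain the whole result (e.g. week-long tasks);
# it does not rescue a serial chain where every intermediate step must succeed.
```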
By chain of causality, I mean: I did task A. If I am extremely confident that task A is correct, I can then search from task A. Say I stumble on some task B, then C. If I get an interesting result from task C, I can keep searching from there so long as I am confident in my results. I can also mentally update my causal chain by some kind of ~backprop: "Oh, using a CNN in task A, then setting my learning rate to this in task B, made me discover this new thing in task C, so now I can draw a generalized intuition for approaching task D. OK, this approach to D failed; let me try this other approach."
METR should test for a 99.9% task completion rate (in addition to the current 80% and 50%). A key missing ingredient holding back LLMs' economic impact is that they're just not robust enough. This is analogous to the problem of self-driving: every individual component of self-driving is ~solved, but stringing them together results in a non-robust final product. I believe that automating research/engineering completely will require nines of reliability that we just don't have. And testing for nines of reliability could be done by giving the model many very short time-horizon tasks and seeing how it performs.
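As a rough illustration of why measuring nines calls for many short tasks (my back-of-the-envelope numbers, not METR's actual methodology): the statistical noise on a measured pass rate shrinks like 1/sqrt(n), so resolving a third nine takes on the order of thousands of trials.

```python
import math

# Standard error of an observed pass rate p over n independent trials
# (a rough binomial estimate; purely illustrative, not METR's procedure).
def std_err(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 10_000):
    print(f"n={n}: ~{std_err(0.99, n) * 100:.2f} percentage points of noise")
# n=100: ~0.99 pt, n=1,000: ~0.31 pt, n=10,000: ~0.10 pt.
# Telling 99.0% apart from 99.9% (a 0.9 pt gap) only becomes feasible with
# thousands of trials, hence many very short tasks rather than a few long ones.
```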
The need for those nines can be further motivated by considering what happens when we string together tasks with a sub-99.99...% completion rate. Take the GPT-5.1 Codex Max result: METR claims this model has a 50% time horizon of 2 hours and 40 minutes. Say we tell the model to do task A, which takes 2 hours and 40 minutes, so P(A) = 0.5. Now if the model decides it needs to do task B to further its research, we have P(B) = 0.5 and P(A, B) = P(A)P(B) = 0.25 (these events are not independent, but I express them as such for illustrative effect). We can then consider tasks C, D, E, etc. This holds even for higher completion rates like 80%. Once we get up to 99.9%, we have P(A) = 0.999, P(B) = 0.999, P(A, B) = P(A)P(B) ≈ 0.998. This is where we can really start seeing autonomous research, imo.
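A minimal sketch of this compounding (using the same independence simplification flagged above):

```python
# Probability that an n-step research chain completes end to end, assuming each
# step succeeds independently at per-task completion rate p (illustrative only).
def chain_success(p: float, n: int) -> float:
    return p ** n

for p in (0.5, 0.8, 0.999):
    print(p, [round(chain_success(p, n), 3) for n in (2, 5, 10, 50)])
# p=0.5   -> [0.25, 0.031, 0.001, 0.0]
# p=0.8   -> [0.64, 0.328, 0.107, 0.0]
# p=0.999 -> [0.998, 0.995, 0.99, 0.951]
# Only at three nines does a 50-step chain still complete most of the time.
```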
It would be interesting to benchmark humans at a 99.9% task completion rate and see what their task lengths are.
(Disclaimer: I am not completely sure of METR's methodology for determining task length)
I think this issue of "9s" of reliability should update people towards longer timelines. Tesla FSD has, for the last ~4 years, basically been able to do each individual thing we would call self-driving, but it isn't 99.99...% reliable. I think LLMs replacing work will, by default, follow the same pattern.
Imo, this analogy breaks down if you take a holistic evolutionary lens. The amount of time you spent learning chess is minuscule compared to the amount of time evolution spent optimizing for the general learning machine that is your brain. It's not obvious how to cleanly analogize the current frontier-model training recipe to evolution. But I claim that your brain has certain inductive biases at birth that make it possible to eventually learn to do thing X, whereas directly training on thing X wouldn't have worked for evolution because the general model was just too bad.
"Gemini 3 estimates that there are 15-20k core ML academics and 100-150k supporting PhD students and Postdocs worldwide."
In my opinion, this seems way too high. What logic or assumptions did it use?
- Land and buildings: 16.5B
- IT assets: 13.6B
Where are the GPUs (mostly TPUs, in Google's case)? I figured these would be bigger given the capex of Google, MSFT, etc. on building enormous clusters.
I agree with most of the individual arguments you make, but this post still gives me "Feynman vibes." I generally think there should be a stronger prior on things staying the same for longer. I also think the distribution of how AGI goes is so absurd that it's hard to reason about things like expectations for humans. (You acknowledge this in the post.)
I know this is 7 months late! But I read this shortform yesterday and it somewhat resonated with me. Then today I read Noah Smith's most recent blog post, which perfectly describes what I think you're getting at, so I'm linking it here.