TL;DR: We apply mechanistic interpretability techniques to VPT, OpenAI's Minecraft agent. We also find a new case of goal misgeneralization: VPT kills a villager when we force one to stand under some tree leaves. Abstract: Understanding the mechanisms behind decisions taken by large foundation models in sequential decision making...

Jul 18, 2024

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

TL;DR: Different misgeneralized goals can appear depending on the training random seed alone, including outliers on a scale of 1 in 500. Abstract: We explore colour versus shape goal misgeneralization originally demonstrated by Di Langosco et al. (2022) in the Procgen Maze environment, where, given an ambiguous choice,...

Dec 8, 2023