Interpreting a Maze-Solving Network

Apr 20, 2023 by TurnTrout

Mechanistic interpretability on a pretrained policy network from the paper "Goal Misgeneralization in Deep Reinforcement Learning".

Predictions for shard theory mechanistic interpretability results
TurnTrout, Ulisse Mini, peligrietzer

Understanding and controlling a maze-solving policy network
TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell

Maze-solving agents: Add a top-right vector, make the agent go to the top-right
TurnTrout, peligrietzer, lisathiergart

Behavioural statistics for a maze-solving agent
peligrietzer, TurnTrout