LESSWRONG

Can

Comments

Comment on "Understanding mesa-optimization using toy models"
Can, 2y

Thanks for pointing this out – indeed, our phrasing is quite unclear. The original paragraph was trying to say that our "system" (a transformer trained to find shortest paths via SGD) may learn "alternative objectives" which don't generalize (aren't "desirable" from our perspective), but which achieve the same loss (are "rewarding").

To be clear, the point we want to make here is that models capable of performing search are relevant for understanding mesa-optimization, because search requires iterative reasoning with subgoal evaluation.

In the context of solving mazes, we may hope to understand how mesa-optimization arises and how it can become "misaligned", either through the formation of non-general reasoning steps (reliance on heuristics or overfitted goals) or through a failure to retarget.

Concretely, we can imagine the network learning to reach the <END_TOKEN> at train time, but failing to generalise at test time as it has instead learnt a goal that was an artefact of our training process. For example, it may have learnt to go to the top right corner (where the <END_TOKEN> happened to be during training). 
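
As a minimal sketch of this failure mode (this is not the transformer setup from the post; the grid size, the two policy functions, and the simplification of a policy to "the cell it ends up in" are all illustrative assumptions), the two goals can be compared directly. Both score perfectly when the end marker is always in the top right corner, and only the intended goal keeps working once the marker moves:

```python
import random

GRID = 5                    # hypothetical maze size (assumption)
TOP_RIGHT = (0, GRID - 1)   # (row, col) of the top right cell


def intended_policy(end_pos):
    """The goal we hoped for: finish wherever the end marker is."""
    return end_pos


def proxy_policy(end_pos):
    """A goal that is an artefact of training: always finish top right."""
    return TOP_RIGHT


def success_rate(policy, end_positions):
    """Fraction of mazes in which the policy finishes on the end marker."""
    hits = sum(policy(end) == end for end in end_positions)
    return hits / len(end_positions)


# Training distribution: the end marker always sits in the top right corner.
train_ends = [TOP_RIGHT] * 100

# Test distribution: the end marker is placed uniformly at random.
rng = random.Random(0)
test_ends = [(rng.randrange(GRID), rng.randrange(GRID)) for _ in range(100)]

for name, policy in [("intended", intended_policy), ("proxy", proxy_policy)]:
    print(name,
          "train:", success_rate(policy, train_ends),
          "test:", success_rate(policy, test_ends))
```

On the training mazes the two policies are indistinguishable by success rate, which is the sense in which the proxy goal is "rewarding" without being "desirable".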

Posts

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders (82 karma, Ω, 9mo, 6 comments)
Evaluating Sparse Autoencoders with Board Game Models (38 karma, 1y, 1 comment)
OthelloGPT learned a bag of heuristics (111 karma, Ω, 1y, 10 comments)
Past Tense Features (12 karma, 1y, 0 comments)
An adversarial example for Direct Logit Attribution: memory management in gelu-4l (17 karma, Ω, 2y, 0 comments)
Understanding mesa-optimization using toy models (46 karma, Ω, 2y, 6 comments)
Safety of Self-Assembled Neuromorphic Hardware (16 karma, 3y, 2 comments)