Guillaume Corlouer

Understanding mesa-optimization using toy models

by tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy, and Can

Overview
* Solving the problem of mesa-optimization would probably be easier if we understood how models do search internally
* We are training GPT-type models on the toy task of solving mazes and studying them in both a mechanistic interpretability and behavioral context.
* This post lays out our model...

May 7, 2023


Degeneracies are sticky for SGD

Understanding mesa-optimization using toy models

An information-theoretic study of lying in LLMs

Metalignment: Deconfusing metaethics for AI alignment.
