Gears-Level Mental Models of Transformer Interpretability