vlad k130

Thank you for sharing! I am also working on a write-up of experiments I conducted with SAEs trained on Othello-GPT :) I'm using the original model by Kenneth Li et al., and mostly training SAEs with 512-1024 features. I also found that simple features such as my/their/empty are indeed rarely found in SAEs trained on later layers, but there are more of them in SAEs trained on middle layers (including for cells outside the "inner ring"). In later layers, SAEs usually learn more complicated features, such as a combination of a few nearby cells being of particular types. This makes sense: by the last layer, all the model cares about is the complex feature "is this move valid", so the simple features gradually become less relevant (and probably end up with smaller norm). I'm hoping to share more of my findings soon.
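For readers less familiar with the setup being discussed: the SAEs here are sparse autoencoders trained on a model's internal activations, where each learned "feature" is one dictionary direction. A minimal NumPy sketch of the standard ReLU + L1 formulation is below. The dimensions are hypothetical (d_model=128 is made up for illustration; 512 features matches the low end of the sizes mentioned above), and the actual training setups may differ in details such as decoder normalization and bias handling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d_model is illustrative; 512 features matches the
# smaller SAEs mentioned in the comment.
d_model, n_features = 128, 512

# SAE parameters: encoder/decoder weights and biases.
W_enc = rng.normal(0, 0.02, (d_model, n_features))
W_dec = rng.normal(0, 0.02, (n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative sparse features, then reconstruct."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU -> sparse codes
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction MSE plus an L1 sparsity penalty on feature activations."""
    mse = np.mean((x - x_hat) ** 2)
    l1 = l1_coeff * np.mean(np.abs(f))
    return mse + l1

# A batch of 4 stand-in "residual stream" activation vectors.
x = rng.normal(size=(4, d_model))
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape)  # (4, 512) (4, 128)
```

Interpreting a feature like "my/their/empty for cell X" then amounts to checking which board states most strongly activate a given column of `W_enc` (equivalently, which dictionary row of `W_dec` it writes with).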