
William Dorrell — LessWrong


William Dorrell

11

2mo

[Paper] Dictionary Learning Identifiability for Understanding SAEs

Brief Summary: Despite showing promise for studying the internals of neural networks, Sparse Autoencoders (SAEs) do some puzzling things, like feature-splitting, feature-absorption, or encoding dense features. Working out why they show these behaviours may help us extract more insight from SAEs, and provide principles for designing their successors. In this...