Sparse Features Through Time

by Rogan Inglis
24th Jun 2024
1 min read

This is a linkpost for https://roganinglis.io/posts/Sparse%20Features%20Through%20Time

This project explores the use of Sparse Autoencoders (SAEs) to track how features in large language models develop over the course of training. It investigates whether features can be reliably matched between SAEs trained on different checkpoints of Pythia 70M, and characterises how those features develop as training progresses. The findings show that features can indeed be matched successfully between different SAEs. The results also support the distributional simplicity bias hypothesis: simpler features are learned early in training, with more complex features emerging later. While the focus was on a relatively small model, the results lay the groundwork for future research on larger models and on identifying potentially deceptive capabilities. This work aims to improve the interpretability and safety of AI systems by providing a deeper understanding of feature development and neural network training dynamics, with the ultimate goal of making a general statement about how likely dangerous deceptive alignment is to arise in practice.
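As an illustration of the feature-matching step mentioned above, here is a minimal sketch of one way features could be matched between two SAEs trained on different checkpoints: pair decoder directions using an optimal assignment over cosine similarities. This is an assumed approach for illustration, not necessarily the method used in the post; the matrix shapes and the helper name `match_sae_features` are hypothetical.

```python
# Minimal sketch (assumed approach, not necessarily the post's method):
# match features between two SAEs trained on different Pythia 70M checkpoints
# by cosine similarity of their decoder directions, with an optimal assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment


def match_sae_features(decoder_a: np.ndarray, decoder_b: np.ndarray):
    """Match rows of two SAE decoder matrices (n_features x d_model).

    Returns (indices_a, indices_b, similarities) for the one-to-one matching
    that maximises total cosine similarity.
    """
    # Normalise decoder directions to unit length.
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)

    # Pairwise cosine similarity between every feature in A and every feature in B.
    sim = a @ b.T

    # The Hungarian algorithm minimises cost, so negate the similarities.
    idx_a, idx_b = linear_sum_assignment(-sim)
    return idx_a, idx_b, sim[idx_a, idx_b]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy decoder matrices standing in for SAEs from an early and a late checkpoint.
    dec_early = rng.normal(size=(16, 512))
    dec_late = dec_early[rng.permutation(16)] + 0.05 * rng.normal(size=(16, 512))

    ia, ib, sims = match_sae_features(dec_early, dec_late)
    print(f"mean matched cosine similarity: {sims.mean():.3f}")
```

Matching on decoder directions keeps the comparison cheap and independent of any particular input data, but other similarity measures (for example over feature activation patterns) could be substituted.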

Comments

RogerDearnaley:
I think it would also be fascinating to compare models before, during, and after RLHF.