Peter Lai

Mechanistic Interpretability Enthusiast


Comments
SAE regularization produces more interpretable models
Peter Lai · 7mo · 10

This adds quite a bit more. Code here if you're interested in taking a look at what I tried: https://github.com/peterlai/gpt-circuits/blob/main/experiments/regularization/train.py. My goal was to show that regularization is possible and to spark more interest in this general approach. Matthew Chen and @JoshEngels just released a paper describing a more practical approach that I hope to try out soon: https://x.com/match_ten/status/1886478581423071478. Where a gap remains, imo, is in having the SAE features and model weights inform each other without needing to freeze one at a time.
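To make the joint-training idea concrete, here is a minimal PyTorch sketch of what SAE regularization could look like when the model and the SAE are updated together instead of freezing one at a time. All names, the `(logits, activations)` model interface, and the hyperparameters are illustrative assumptions, not code from the linked repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAE(nn.Module):
    """A vanilla ReLU sparse autoencoder over residual-stream activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = F.relu(self.enc(x))      # sparse feature activations
        return self.dec(feats), feats    # reconstruction, features

def joint_step(model, sae, opt, inputs, targets, l1_coef=1e-3):
    # Assumes a model whose forward pass returns (logits, activations);
    # this interface is hypothetical, not the repo's actual API.
    logits, acts = model(inputs)
    lm_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    recon, feats = sae(acts)
    sae_loss = F.mse_loss(recon, acts) + l1_coef * feats.abs().sum(-1).mean()
    loss = lm_loss + sae_loss            # gradients reach both model and SAE
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the optimizer holds both the model's and the SAE's parameters, the sparsity penalty shapes the model weights directly rather than only fitting an SAE post hoc.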

SAE regularization produces more interpretable models
Peter Lai · 7mo · 10

The original SAE is actually quite good, and in my experiments with Gated SAEs I'm using those values. For the purposes of framing this as a "regularization" technique, I needed to show that the model weights themselves are affected, which is why my graphs use metrics extracted from freshly trained SAEs.
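For readers unfamiliar with the Gated SAE variant mentioned above, here is a rough sketch of its forward pass in the spirit of Rajamanoharan et al. (2024): a binary gate path decides which features fire, and a magnitude path, tied to the gate weights by a learned per-feature rescaling, sets their values. This is an illustrative reconstruction, not the author's code, and it omits the auxiliary loss the paper uses to train the gate path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSAE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.r_mag = nn.Parameter(torch.zeros(d_hidden))  # rescales gate weights for the magnitude path
        self.b_gate = nn.Parameter(torch.zeros(d_hidden))
        self.b_mag = nn.Parameter(torch.zeros(d_hidden))
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        pre = x @ self.W_enc                               # shared encoder pre-activation
        gate = (pre + self.b_gate > 0).float()             # binary gate: which features fire
        mag = F.relu(pre * self.r_mag.exp() + self.b_mag)  # magnitude: how strongly they fire
        feats = gate * mag
        return self.dec(feats), feats
```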

SAE regularization produces more interpretable models
Peter Lai · 8mo · 10

Yep, the graphs in this post reflect the values of features extracted by training a new SAE on the activations of the "regularized" weights.
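As a concrete (hypothetical) illustration of that evaluation loop: freeze the regularized model, cache its activations, and fit a brand-new SAE on them; the interpretability metrics in the graphs would then be read off this fresh SAE's features. The `SAE` module and the activation-batch interface below are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def train_fresh_sae(sae, activation_batches, steps=10_000, lr=1e-4, l1_coef=1e-3):
    # `activation_batches` is assumed to yield tensors of cached activations
    # from the frozen, regularized model; only the SAE's weights are updated.
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for step, acts in zip(range(steps), activation_batches):
        acts = acts.detach()                   # regularized model stays frozen
        recon, feats = sae(acts)
        loss = F.mse_loss(recon, acts) + l1_coef * feats.abs().sum(-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae                                 # its features feed the metrics
```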

Posts

27 · Proof-of-Concept Debugger for a Small LLM · 6mo · 0
21 · SAE regularization produces more interpretable models · 8mo · 7
3 · Peter Lai's Shortform · 8mo · 0