Peter Lai

Mechanistic Interpretability Enthusiast


Comments
SAE regularization produces more interpretable models
Peter Lai · 7mo · 10

This adds quite a bit more. Code here if you're interested in taking a look at what I tried: https://github.com/peterlai/gpt-circuits/blob/main/experiments/regularization/train.py. My goal was to show that regularization is possible and to spark more interest in this general approach. Matthew Chen and @JoshEngels just released a paper describing a more practical approach that I hope to try out soon: https://x.com/match_ten/status/1886478581423071478. Where a gap remains, imo, is in having the SAE features and model weights inform each other without needing to freeze one at a time.
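To make the joint-training idea concrete, here is a minimal PyTorch sketch of what SAE regularization could look like when the model and the SAE are updated together instead of freezing one at a time. All names, the `(logits, activations)` model interface, and the hyperparameters are illustrative assumptions, not code from the linked repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAE(nn.Module):
    """A vanilla ReLU sparse autoencoder over residual-stream activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = F.relu(self.enc(x))      # sparse feature activations
        return self.dec(feats), feats    # reconstruction, features

def joint_step(model, sae, opt, inputs, targets, l1_coef=1e-3):
    # Assumes a model whose forward pass returns (logits, activations);
    # this interface is hypothetical, not the repo's actual API.
    logits, acts = model(inputs)
    lm_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    recon, feats = sae(acts)
    sae_loss = F.mse_loss(recon, acts) + l1_coef * feats.abs().sum(-1).mean()
    loss = lm_loss + sae_loss            # gradients reach both model and SAE
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the optimizer holds both the model's and the SAE's parameters, the sparsity penalty shapes the model weights directly rather than only fitting an SAE post hoc.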

SAE regularization produces more interpretable models
Peter Lai · 7mo · 10

The original SAE is actually quite good, and in my experiments with Gated SAEs I'm using those values. For the purposes of framing this as a "regularization" technique, I needed to show that the model weights themselves are affected, which is why my graphs use metrics extracted from freshly trained SAEs.
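For readers unfamiliar with the Gated SAE variant mentioned above, here is a rough sketch of its forward pass in the spirit of Rajamanoharan et al. (2024): a binary gate path decides which features fire, and a magnitude path, tied to the gate weights by a learned per-feature rescaling, sets their values. This is an illustrative reconstruction, not the author's code, and it omits the auxiliary loss the paper uses to train the gate path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSAE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.r_mag = nn.Parameter(torch.zeros(d_hidden))  # rescales gate weights for the magnitude path
        self.b_gate = nn.Parameter(torch.zeros(d_hidden))
        self.b_mag = nn.Parameter(torch.zeros(d_hidden))
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        pre = x @ self.W_enc                               # shared encoder pre-activation
        gate = (pre + self.b_gate > 0).float()             # binary gate: which features fire
        mag = F.relu(pre * self.r_mag.exp() + self.b_mag)  # magnitude: how strongly they fire
        feats = gate * mag
        return self.dec(feats), feats
```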

SAE regularization produces more interpretable models
Peter Lai · 8mo · 10

Yep, the graphs in this post reflect the values of features extracted by training a new SAE on the activations of the "regularized" weights.
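As a concrete (hypothetical) illustration of that evaluation loop: freeze the regularized model, cache its activations, and fit a brand-new SAE on them; the interpretability metrics in the graphs would then be read off this fresh SAE's features. The `SAE` module and the activation-batch interface below are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def train_fresh_sae(sae, activation_batches, steps=10_000, lr=1e-4, l1_coef=1e-3):
    # `activation_batches` is assumed to yield tensors of cached activations
    # from the frozen, regularized model; only the SAE's weights are updated.
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for step, acts in zip(range(steps), activation_batches):
        acts = acts.detach()                   # regularized model stays frozen
        recon, feats = sae(acts)
        loss = F.mse_loss(recon, acts) + l1_coef * feats.abs().sum(-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae                                 # its features feed the metrics
```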

Posts

27 · Proof-of-Concept Debugger for a Small LLM · 6mo · 0
21 · SAE regularization produces more interpretable models · 8mo · 7
3 · Peter Lai's Shortform · 8mo · 0