What's going on with interpretability these days?
I found the whole monosemantic sparse autoencoder idea interesting, but that was 2023 and it's now 2026.
SAEs (sparse autoencoders) have had several known problems over the years (e.g. feature splitting, cross-layer features, non-causal features), along with many proposed fixes for those issues. Still, I don’t think a derivative of SAEs will lead to ambitious mech interp.
The folks at Apollo (now Goodfire), namely Lee, Lucius, and Dan, have worked on Parameter Decomposition (PD)^[1]^, a weight-based approach intended to improve over SAEs in a couple of ways.
I’m currently excited about tensor-transformers, which are more interpretable by design (e.g. you can apply linear algebra in a principled way, since a tensor is a generalization of a matrix). Current work here is by Thomas Dooms et al.^[2]^^[3]^, and I wrote a LW post covering the landscape^[4]^.
Beyond mech interp, Goodfire had a recent paper on reducing hallucinations^[5]^ by using the model’s internal concept of hallucination to detect them and assign reward accordingly. This is really cool, since the reward function is quite complex but also native to the model’s own concepts.
[disclaimer: currently just on my phone, so had Claude add links. Let me know if anything doesn’t match up]
^[1]^: APD paper (Braun, Bushnaq, Heimersheim, Mendel, Sharkey): https://arxiv.org/abs/2501.14926; SPD followup: https://www.goodfire.ai/research/stochastic-param-decomp
^[2]^: Bilinear MLPs Enable Weight-Based Mech Interp (Pearce, Dooms, Rigg, Oramas, Sharkey): https://arxiv.org/abs/2410.08417
^[3]^: Compositionality Unlocks Deep Interpretable Models (Dooms, Gauderis, Wiggins, Oramas): https://arxiv.org/abs/2504.02667
^[4]^: Tensor-Transformer Variants are Surprisingly Performant: https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/
^[5]^: RLFR: Reinforcement Learning from Feature Rewards: https://www.goodfire.ai/research/rlfr
.
Pseudo-flat tax formula:
Assume utility is logarithmic in income, and the goal is to set the experienced tax burden to be constant.
Then we have the following formula for the average tax rate, where $c$ is a parameter controlling the experienced tax burden and $b$ is the break-even point:

$$t(x) = 1 - \left(\frac{b}{x}\right)^{c}$$

$x$ is the input income, and $t(x)$ is the average tax rate.
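A sketch of the derivation under these assumptions (the symbols $c$ for the burden parameter and $b$ for the break-even income are my labels, not necessarily the original notation): with logarithmic utility, taxing away a constant fraction $c$ of log-income measured relative to the break-even point gives

```latex
% Post-tax income y keeps a constant fraction (1 - c) of log-income
% measured relative to the break-even point b:
\log\frac{y}{b} = (1 - c)\,\log\frac{x}{b}
\quad\Longrightarrow\quad
y = b^{c}\, x^{1-c}

% Average tax rate = tax paid over pre-tax income:
t(x) = \frac{x - y}{x}
     = \frac{x - b^{c} x^{1-c}}{x}
     = 1 - \left(\frac{b}{x}\right)^{c}

% Sanity checks: at x = b, t(b) = 0 (break-even);
% for c > 0, t(x) rises with x, so the tax is progressive
% in average rate even though it is flat in log-income.
```

The $x$ in $b^{c} x^{1-c}/x$ partially cancels to $(b/x)^{c}$, which is the simplification referred to below.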
x is the initial income, and I forgot to cancel it. Good point.
Turns out it's far simpler than I originally had it.