fzaffino — LessWrong

fzaffino

16

2mo

Explaining SAE Features With Foreign Natural Language Autoencoders

TLDR: I show that the Activation Verbalizer (AV) of a foreign model's Natural Language Autoencoder (NLA) can produce plausible explanations for SAE features from a model it was never trained on. After fitting a ridge-regression map between the residual streams of Qwen2.5-7B-IT at layer 20 and Gemma-3-27B-IT at layer 41, I mapped...
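
As a rough illustration of the ridge-regression map mentioned in the TLDR, here is a minimal sketch of fitting a linear map between two activation spaces in closed form. It is not the post's actual code: the direction of the map, the toy dimensions, the regularization strength `lam`, and the helper names `fit_ridge_map` and `apply_map` are all assumptions, and the data is synthetic rather than real residual-stream activations.

```python
import numpy as np


def fit_ridge_map(X_src, Y_tgt, lam=1.0):
    """Fit Y ≈ X W + b by ridge regression (closed form).

    X_src: (n, d_src) activations from the source model (hypothetical).
    Y_tgt: (n, d_tgt) activations from the target model on the same inputs.
    lam:   L2 penalty on W; the intercept b is left unpenalized.
    """
    x_mean = X_src.mean(axis=0)
    y_mean = Y_tgt.mean(axis=0)
    Xc = X_src - x_mean  # center so the intercept can be recovered afterwards
    Yc = Y_tgt - y_mean
    d_src = Xc.shape[1]
    # Solve (Xc^T Xc + lam * I) W = Xc^T Yc instead of forming an explicit inverse.
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d_src), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def apply_map(x, W, b):
    """Map source-space vectors (for example, SAE feature directions) into the target space."""
    return x @ W + b


# Synthetic stand-in for paired activations; real hidden sizes are far larger.
rng = np.random.default_rng(0)
n, d_src, d_tgt = 512, 16, 24
X = rng.normal(size=(n, d_src))
W_true = rng.normal(size=(d_src, d_tgt))
Y = X @ W_true + 0.01 * rng.normal(size=(n, d_tgt))

W, b = fit_ridge_map(X, Y, lam=1e-3)
mapped = apply_map(X, W, b)
rel_err = np.linalg.norm(mapped - Y) / np.linalg.norm(Y)
```

In a real setting, `X` and `Y` would be activations collected from the two models on the same token positions, and `lam` would typically be chosen on held-out data.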