Mechanistic interpretability as reward signal for RL training of LLMs
TL;DR. We train a TopK SAE on residual stream layer 18 of Qwen3.5-4B (a hybrid Gated DeltaNet architecture for which no public SAE previously existed). Ten helpful and ten harmful features surfaced by contrastive discovery (mean_correct − mean_wrong on GSM8K) correlate with correctness at ρ=0.540. Plugging them into GRPO as a...
Apr 18
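
The contrastive discovery step named in the TL;DR (mean_correct − mean_wrong) can be illustrated with a minimal sketch. This is not the post's code: the function names `contrastive_features` and `feature_score`, the toy activation values, and the choice to combine features by a plain sum are all assumptions made for illustration. It only shows the general idea of ranking SAE features by the difference in their mean activation between correct and wrong samples.

```python
from statistics import fmean


def contrastive_features(acts, correct, k=10):
    """Rank features by mean activation on correct minus wrong samples.

    acts:    list of per-sample feature activation lists (hypothetical SAE outputs)
    correct: list of booleans, True where the sample's answer was correct
    Returns (helpful, harmful, diff): the k feature indices with the largest
    positive difference, the k with the most negative difference, and the
    per-feature difference itself.
    Note: if k exceeds half the feature count, the two index lists overlap.
    """
    pos = [row for row, ok in zip(acts, correct) if ok]
    neg = [row for row, ok in zip(acts, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong sample")

    n_features = len(acts[0])
    # mean_correct - mean_wrong, computed independently per feature
    diff = [
        fmean([r[j] for r in pos]) - fmean([r[j] for r in neg])
        for j in range(n_features)
    ]

    order = sorted(range(n_features), key=lambda j: diff[j])  # ascending
    harmful = order[:k]        # most negative differences
    helpful = order[::-1][:k]  # most positive differences
    return helpful, harmful, diff


def feature_score(acts, helpful, harmful):
    """Assumed scalar score: summed helpful minus summed harmful activations."""
    return [
        sum(row[j] for j in helpful) - sum(row[j] for j in harmful)
        for row in acts
    ]


# Toy data (made up): 4 samples, 3 features; the first two samples are correct.
acts = [
    [1.0, 0.0, 0.5],
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.0, 1.0, 0.5],
]
correct = [True, True, False, False]

helpful, harmful, diff = contrastive_features(acts, correct, k=1)
scores = feature_score(acts, helpful, harmful)
print(helpful, harmful, diff, scores)
```

In the toy data, feature 0 fires only on correct samples and feature 1 only on wrong ones, so they are picked as the helpful and harmful feature respectively, while feature 2 is uninformative. How the post actually turns such features into a GRPO reward is not visible in the truncated text, so the sum-based score here should be read as a placeholder.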