Jordan McCann — LessWrong

Jordan McCann

8

3mo

Features of SAEs are universal - but only up to an unknown random rotation

Cross-model decoder-column cosine similarity suggests that two models learned the same features. Yet apply one model's SAE to another model's activations, and the reconstruction score goes negative. Why, and how can it be fixed? Epistemic status: I...
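To make "up to an unknown rotation" concrete, here is a minimal sketch using orthogonal Procrustes, a standard way to estimate a rotation from paired activations. It is an illustration under assumptions, not necessarily the fix the post proposes: the synthetic activations, the names `acts_a`, `acts_b`, and `fit_rotation`, and the premise that the two spaces differ by an exact orthogonal map are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_model = 500, 16

# Hypothetical paired activations: "model B" sees "model A"'s activations
# through an unknown orthogonal map (the assumption being illustrated).
acts_a = rng.normal(size=(n_samples, d_model))
true_rot, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
acts_b = acts_a @ true_rot


def fit_rotation(source, target):
    """Orthogonal Procrustes: orthogonal R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


est_rot = fit_rotation(acts_a, acts_b)

# Relative error of treating A's activations as if they lived in B's basis,
# without and with the estimated alignment.
norm_b = np.linalg.norm(acts_b)
err_raw = np.linalg.norm(acts_a - acts_b) / norm_b
err_aligned = np.linalg.norm(acts_a @ est_rot - acts_b) / norm_b
print(f"unaligned error: {err_raw:.3f}, aligned error: {err_aligned:.2e}")
```

In this toy setting the unaligned error is large even though both spaces carry identical information, while the aligned error is at numerical precision. Real cross-model activations would only approximately satisfy the orthogonal-map assumption.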

Eval-Detection Dominates the L12 Sandbagging Mechanism in Gemma 2 2B IT

TL;DR: I identify a rank-1 direction at the layer-12 attention output of Gemma 2 2B IT. Ablating it restores sandbag-condition capability by 10pp on fresh held-out prompts (n=40), with zero damage to honest capability; ablating a random direction at the same locus has zero effect. The mechanism has...
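Ablating a rank-1 direction is commonly implemented as projecting that direction out of the activations. The sketch below shows that operation on synthetic data, alongside a random-direction control. The function name `ablate_direction`, the array shapes, and the random vectors are assumptions for illustration; this is not the post's actual code or direction.

```python
import numpy as np


def ablate_direction(acts, direction):
    """Remove the component of each activation row along `direction`."""
    unit = direction / np.linalg.norm(direction)
    # Subtract the rank-1 projection (acts . unit) * unit from every row.
    return acts - np.outer(acts @ unit, unit)


rng = np.random.default_rng(0)
d_model = 32

# Hypothetical stand-ins for attention-output activations and a found direction.
acts = rng.normal(size=(10, d_model))
target_dir = rng.normal(size=d_model)

# Control: a random direction of the same dimensionality.
random_dir = rng.normal(size=d_model)

ablated = ablate_direction(acts, target_dir)
control = ablate_direction(acts, random_dir)

unit_target = target_dir / np.linalg.norm(target_dir)
# After ablation, nothing remains along the target direction.
residual_along_target = np.abs(ablated @ unit_target).max()
# The control leaves most of the target component in place.
control_along_target = np.abs(control @ unit_target).max()
```

In a real experiment the projection would be applied inside the model's forward pass (for example via a hook at the chosen layer), and the comparison of interest is downstream task accuracy rather than the residual component.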