Sandbagging Is Linearly Separable in Transformer Activations
Main claim: Linear probes on transformer hidden states can detect prompt-induced sandbagging with 90-96% cross-validated accuracy across three 7-9B parameter models. The sandbagging signal is linearly separable and appears at different layers depending on architecture. Steering with the extracted direction reduces sandbagging behavior by ~20% in one model. Epistemic status:...
Dec 21, 2025
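To make the main claim concrete, here is a minimal sketch of what "linearly separable" and "steering with the extracted direction" can mean in practice. It is not the post's pipeline: the activations are synthetic Gaussian vectors with a planted shift, the probe is a simple difference-of-means classifier (the post may well use a different linear probe, such as logistic regression), and every name below (`make_synthetic_activations`, `fit_mean_difference_probe`, `steer`, the dimension and shift values) is a hypothetical chosen for illustration.

```python
import random


def make_synthetic_activations(n_per_class=200, dim=16, shift=3.0, seed=0):
    """Assumption: stand-in for hidden states. Label 1 ("sandbagging")
    examples are Gaussian noise shifted along the first coordinate."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            v[0] += shift * label  # planted linear signal
            X.append(v)
            y.append(label)
    return X, y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_mean_difference_probe(X, y):
    """Linear probe: direction = mean(class 1) - mean(class 0),
    with the decision threshold at the midpoint of the two means."""
    mu1 = mean_vector([x for x, t in zip(X, y) if t == 1])
    mu0 = mean_vector([x for x, t in zip(X, y) if t == 0])
    direction = [a - b for a, b in zip(mu1, mu0)]
    midpoint = [(a + b) / 2 for a, b in zip(mu1, mu0)]
    bias = -dot(direction, midpoint)
    return direction, bias


def score(direction, bias, x):
    return dot(direction, x) + bias


def cross_validated_accuracy(X, y, k=5, seed=0):
    """k-fold cross-validation: fit on k-1 folds, test on the held-out fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        d, b = fit_mean_difference_probe([X[i] for i in train],
                                         [y[i] for i in train])
        correct += sum((score(d, b, X[i]) > 0) == (y[i] == 1) for i in fold)
    return correct / len(X)


def steer(x, direction, alpha):
    """Shift an activation by alpha units against the unit-normalized direction."""
    norm = dot(direction, direction) ** 0.5
    return [xi - alpha * di / norm for xi, di in zip(x, direction)]


X, y = make_synthetic_activations()
accuracy = cross_validated_accuracy(X, y)
direction, bias = fit_mean_difference_probe(X, y)
sandbagging_example = X[-1]  # last example carries label 1
steered_example = steer(sandbagging_example, direction, alpha=4.0)
print(f"cross-validated probe accuracy: {accuracy:.3f}")
```

On real models the vectors would come from a chosen layer's hidden states rather than a random generator, and steering would be applied during the forward pass; the sketch only shows that subtracting along the probe direction lowers the probe's score for a given activation.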