
Subhadip — LessWrong


Sandbagging Is Linearly Separable in Transformer Activations

Main claim: Linear probes on transformer hidden states can detect prompt-induced sandbagging with 90-96% cross-validated accuracy across three 7-9B parameter models. The sandbagging signal is linearly separable and appears at different layers depending on architecture. Steering with the extracted direction reduces sandbagging behavior by ~20% in one model. Epistemic status:...
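To make the idea of a linear probe with cross-validated accuracy concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the activations are random vectors rather than real hidden states, the function names are hypothetical, and the probe is a simple difference-of-means classifier, which may differ from whatever probe the post actually trains. The `steer` helper only shows what shifting an activation against the probe direction could look like; it makes no claim about the post's steering setup or its ~20% result.

```python
import random


def make_synthetic_activations(n_per_class=200, dim=16, shift=3.0, seed=0):
    """Hypothetical stand-in for hidden states: unit Gaussian noise, with the
    label-1 ("sandbagging") class shifted along one fixed unit direction."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = sum(d * d for d in direction) ** 0.5
    direction = [d / norm for d in direction]
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = [rng.gauss(0, 1) + label * shift * d for d in direction]
            data.append((x, label))
    rng.shuffle(data)
    return data


def fit_mean_difference_probe(train):
    """Fit a linear probe (w, b): w is the difference of class means and the
    decision threshold sits at the midpoint between the two means."""
    dim = len(train[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for x, y in train:
        counts[y] += 1
        for i, v in enumerate(x):
            sums[y][i] += v
    mu0 = [s / counts[0] for s in sums[0]]
    mu1 = [s / counts[1] for s in sums[1]]
    w = [m1 - m0 for m0, m1 in zip(mu0, mu1)]
    b = -sum(wi * (m0 + m1) / 2 for wi, m0, m1 in zip(w, mu0, mu1))
    return w, b


def predict(probe, x):
    """Return 1 if x falls on the label-1 side of the probe's hyperplane."""
    w, b = probe
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def cross_validated_accuracy(data, k=5):
    """Mean held-out accuracy over k folds; the probe is refit on each split."""
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        held_out = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        probe = fit_mean_difference_probe(train)
        correct = sum(predict(probe, x) == y for x, y in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def steer(x, w, alpha):
    """Shift activation x by alpha against the unit-normalized direction w."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return [xi - alpha * wi / norm for xi, wi in zip(x, w)]


if __name__ == "__main__":
    data = make_synthetic_activations()
    print("5-fold accuracy:", cross_validated_accuracy(data, k=5))
```

In a real experiment the synthetic vectors would be replaced by hidden states collected at a chosen layer, and the probe would be fit separately per layer to see where the signal appears.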

Dec 21, 2025•1