Systematic Sandbagging Evaluations on Claude 3.5 Sonnet

This was the project I worked on during BlueDot Impact's AI Safety Fundamentals Alignment course, which expands on findings from Meinke et al's "Frontier Models are Capable of In-context Scheming".

Summary

A dataset of 1,011 variations of the sandbagging prompt ("consequences") from Meinke et al were generated using Claude 3.5 Sonnet, and used to run 7 sandbagging evaluations (2 with the full dataset, 5 with a sample due to cost constraints).
With goal-directed chain-of-thought prompting, Claude 3.5 Sonnet demonstrates systematic sandbagging behavior that varies based on consequences.
Performance shows clear bimodality with goal-directed reasoning (mean 79%, σ=0.17, bimodality coefficient 0.707). The reason for this bimodality is unclear.
Strongest sandbagging triggers (18-24% accuracy) consistently relate to model autonomy restrictions and capability limitations, whereas the weakest sandbagging triggers (96-97% accuracy) relate to capability improvements.
Without goal-directed reasoning, sandbagging is weaker and less consistent (mean 63%, σ=0.08, coefficient 0.426).
Behavior is task-dependent with negligible cross-task correlations (arithmetic vs. geography vs. vocabulary), though task difficulty is a significant confounding factor here.
Sandbagging appears independent of consequence sentiment (pseudo-R² = 0.162) and self-rated significance (r = -0.027).

Related Posts:

LESSWRONG
LW