AI Evaluations, AI
Frontpage

Systematic Sandbagging Evaluations on Claude 3.5 Sonnet

by farrelmahaztra
14th Feb 2025

This is a linkpost for https://farrelmahaztra.com/posts/sandbagging

This was my project for BlueDot Impact's AI Safety Fundamentals Alignment course. It expands on findings from Meinke et al.'s "Frontier Models are Capable of In-context Scheming".

Summary

  • A dataset of 1,011 variations of the sandbagging prompt ("consequences") from Meinke et al. was generated using Claude 3.5 Sonnet and used to run 7 sandbagging evaluations (2 with the full dataset, 5 with a sample due to cost constraints).
  • With goal-directed chain-of-thought prompting, Claude 3.5 Sonnet demonstrates systematic sandbagging behavior that varies based on consequences.
  • Performance shows clear bimodality with goal-directed reasoning (mean 79%, σ=0.17, bimodality coefficient 0.707). The reason for this bimodality is unclear.
  • Strongest sandbagging triggers (18-24% accuracy) consistently relate to model autonomy restrictions and capability limitations, whereas the weakest sandbagging triggers (96-97% accuracy) relate to capability improvements.
  • Without goal-directed reasoning, sandbagging is weaker and less consistent (mean 63%, σ=0.08, bimodality coefficient 0.426).
  • Behavior is task-dependent with negligible cross-task correlations (arithmetic vs. geography vs. vocabulary), though task difficulty is a significant confounding factor here.
  • Sandbagging appears independent of consequence sentiment (pseudo-R² = 0.162) and self-rated significance (r = -0.027).
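
The summary does not say which bimodality coefficient was used. As a minimal sketch, assuming it is Sarle's sample bimodality coefficient (computed from bias-corrected skewness and excess kurtosis), the statistic could be calculated as below. The function name `bimodality_coefficient` and the accuracy lists are hypothetical and purely illustrative; they are not data from the evaluations. Under this definition, values above 5/9 (about 0.555, the value for a uniform distribution) are conventionally read as a hint of bimodality, which would be consistent with 0.707 being described as bimodal and 0.426 not.

```python
from statistics import fmean


def bimodality_coefficient(values):
    """Sarle's sample bimodality coefficient (assumed definition, see lead-in)."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = fmean(values)
    # Central moments, each divided by n (biased estimators).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("values have zero variance")
    g1 = m3 / m2 ** 1.5       # biased skewness
    g2 = m4 / m2 ** 2 - 3.0   # biased excess kurtosis
    # Standard small-sample corrections for skewness and excess kurtosis.
    skew = g1 * (n * (n - 1)) ** 0.5 / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return (skew ** 2 + 1) / (kurt + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


# Illustrative per-consequence accuracies (made up, not results from the post).
two_clusters = [0.2] * 50 + [0.95] * 50
one_cluster = [0.5] * 6 + [0.4] * 4 + [0.6] * 4 + [0.3, 0.7]

THRESHOLD = 5 / 9  # conventional cutoff; the uniform distribution's value
print(bimodality_coefficient(two_clusters) > THRESHOLD)
print(bimodality_coefficient(one_cluster) > THRESHOLD)
```

The coefficient depends only on skewness and kurtosis, so heavily skewed unimodal data can also exceed the threshold; it is a screening heuristic rather than a test for two modes.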

Related Posts: 

  • https://www.lesswrong.com/posts/8gy7c8GAPkuu6wTiX/frontier-models-are-capable-of-in-context-scheming
  • https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK/ablations-for-frontier-models-are-capable-of-in-context