Systematic Sandbagging Evaluations on Claude 3.5 Sonnet
This is the project I worked on during BlueDot Impact's AI Safety Fundamentals Alignment course. It expands on findings from Meinke et al.'s "Frontier Models are Capable of In-context Scheming".
Summary
* A dataset of 1,011 variations of the sandbagging prompt ("consequences") from Meinke et al. was generated using Claude...
Feb 14, 2025