farrelmahaztra — LessWrong

Software Engineer @ RemNote

DM at: https://x.com/fmztra

Systematic Sandbagging Evaluations on Claude 3.5 Sonnet

This was the project I worked on during BlueDot Impact's AI Safety Fundamentals Alignment course. It expands on findings from Meinke et al.'s "Frontier Models are Capable of In-context Scheming".

Summary

* A dataset of 1,011 variations of the sandbagging prompt ("consequences") from Meinke et al. was generated using Claude...

Feb 14, 2025