This is a linkpost for https://farrelmahaztra.com/posts/sandbagging
This was the project I worked on during BlueDot Impact's AI Safety Fundamentals Alignment course, which expands on findings from Meinke et al.'s "Frontier Models are Capable of In-context Scheming".
Summary