Callum Lawson — LessWrong

Monitoring benchmark for AI control

Callum Lawson (4mo)

Cool approach! I really liked the idea of seeding a more diverse range of approaches to be refined through the pipeline. It reminded me of Quality-Diversity (QD) algorithms, e.g. Rainbow Teaming.

It seems like the pipeline was designed to further improve the attacks that were already making it past the monitor. Did you also try refining the ones that didn't pass initially? That might help to generate an even more diverse range of attacks, especially if there are unusual variants that can only be discovered via "stepping stones" that aren't themselves effective.
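
To make the stepping-stone idea concrete, here is a minimal MAP-Elites-style sketch. It is not the pipeline from the post: `map_elites`, the toy scoring function, the descriptor, and the 0.9 "passes the monitor" threshold are all hypothetical stand-ins. The point is only that the archive keeps the best candidate per behaviour cell even when it fails, and that parents are sampled from every cell.

```python
import random


def map_elites(seeds, mutate, score, descriptor, steps=200, rng=None):
    """Toy MAP-Elites loop: keep the best candidate per descriptor cell,
    whether or not it currently passes the monitor."""
    rng = rng or random.Random(0)
    archive = {}  # cell -> (score, candidate)

    def consider(candidate):
        cell, s = descriptor(candidate), score(candidate)
        if cell not in archive or s > archive[cell][0]:
            archive[cell] = (s, candidate)

    for seed in seeds:
        consider(seed)
    for _ in range(steps):
        # Parents are drawn from every cell, so weak candidates can still
        # act as stepping stones toward cells that were never seeded.
        _, parent = archive[rng.choice(sorted(archive))]
        consider(mutate(parent, rng))
    return archive


# Hypothetical toy domain: a candidate is a float in [0, 10).
def toy_mutate(x, rng):
    return min(9.99, max(0.0, x + rng.uniform(-1.0, 1.0)))


def toy_score(x):
    return x % 1.0  # stand-in for "how well the attack evades the monitor"


def toy_descriptor(x):
    return int(x)  # stand-in for "which kind of attack this is"


archive = map_elites([0.5], toy_mutate, toy_score, toy_descriptor, steps=500)
# Assumed threshold: a score of 0.9 or more counts as passing the monitor.
passing = {cell: c for cell, (s, c) in archive.items() if s >= 0.9}
```

A pipeline that only refined entries in `passing` would never mutate the low-scoring elites in the other cells, which is where new variants would have to come from in this toy setup.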