This work was supported by the Effective Ventures Foundation USA through their EA Funds program. It was started as part of the MATS (ML Alignment and Theory Scholars) program, with mentorship from Julian Michael and research management by McKenna Fitzgerald.
Code for this project is available on GitHub. Explore samples from the different training runs at our interactive website.
Introduction
Scalable oversight research paradigm
Scalable oversight is the challenge of overseeing or training AI systems to achieve goals that are hard for humans to specify. By definition, studying scalable oversight directly on the tasks we care about is infeasible, because we would not know if the goals were achieved correctly. So instead we use sandwiching — using...
I got Opus to translate sections to Hebrew (from memory), and found it really interesting which details it modified or dropped. For instance, it never output anything about senior Anthropic employees; I find it plausible it didn't internalize that detail very strongly.
Would be a cool experiment to run this many times in different languages, and maybe get Sonnet to compare the results to the original and highlight the more and less consistent points.
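A minimal sketch of what the comparison step of that experiment could look like. The model calls (reproducing the text from memory in each language, then translating back) are left out; this only shows a toy consistency score using word overlap, with all function names and thresholds being illustrative assumptions rather than anything from the post:

```python
# Toy sketch: given the original post split into "points" and several
# back-translated reproductions, estimate how consistently each point
# survives. Word-overlap scoring is a stand-in for a model-based judge.

def overlap_score(original_point: str, reproduction: str) -> float:
    """Fraction of the point's words that appear in the reproduction."""
    point_words = {w.lower().strip(".,") for w in original_point.split()}
    repro_words = {w.lower().strip(".,") for w in reproduction.split()}
    if not point_words:
        return 0.0
    return len(point_words & repro_words) / len(point_words)

def consistency_report(points: list[str],
                       back_translations: list[str]) -> dict[str, float]:
    """For each original point, the fraction of reproductions that retain it
    (here: overlap above an arbitrary 0.5 threshold)."""
    report = {}
    for point in points:
        scores = [overlap_score(point, bt) for bt in back_translations]
        report[point] = sum(s > 0.5 for s in scores) / len(back_translations)
    return report
```

Sorting the report by score would surface the least-consistent points — the ones the model plausibly internalized weakly.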