Evaluating Oversight Robustness with Incentivized Reward Hacking
by Yoav, Juan V, julianjm, and deus_ex_maki
This work was supported by the Effective Ventures Foundation USA through their EA Funds program. It was started as part of the MATS (ML Alignment and Theory Scholars) program, with mentorship from Julian Michael and research management by McKenna Fitzgerald. Code for this project is available on GitHub. Explore samples...
Apr 20, 2025