Evaluating Oversight Robustness with Incentivized Reward Hacking
This work was supported by the Effective Ventures Foundation USA through their EA Funds program. It was started as part of the MATS (ML Alignment and Theory Scholars) program, with mentorship from Julian Michael and research management by McKenna Fitzgerald. Code for this project is available on GitHub. Explore samples...
Apr 20, 2025

