Evaluating Oversight Robustness with Incentivized Reward Hacking — LessWrong