Steering RL Training: Benchmarking Interventions Against Reward Hacking — LessWrong