Steering RL Training: Benchmarking Interventions Against Reward Hacking
This project is an extension of work done for Neel Nanda’s MATS 9.0 Training Phase. Neel Nanda and Josh Engels advised the project. Initial work on this project was done with David Vella Zarb. Thank you to Arya Jakkli, Paul Bogdan, and Monte MacDiarmid for providing feedback on the post...
It's hard for me to help without more information. I've responded to your email asking you to send some of the files created by training, and I can try to help you debug from there.