ariaw — LessWrong

MATS 9.0 Scholar with Neel Nanda
https://ariahw.github.io

Steering RL Training: Benchmarking Interventions Against Reward Hacking

This project is an extension of work done for Neel Nanda’s MATS 9.0 Training Phase. Neel Nanda and Josh Engels advised the project. Initial work on this project was done with David Vella Zarb. Thank you to Arya Jakkli, Paul Bogdan, and Monte MacDiarmid for providing feedback on the post...

Dec 29, 2025•77