Experiments on Reward Hacking Monitorability in Language Models
Ihor Protsenko, Bill Sun, Kei Nishimura-Gasparian
ihor.protsenko@epfl.ch, billsun9@gmail.com, kei.nishimuragasparian@gmail.com

Abstract

Reward hacking [1], when systems exploit misspecifications in their reward function, poses a significant problem for AI safety. Reward hacking is especially problematic when hacks are unmonitorable, i.e. when they evade detection by an external monitor. In this work, we study...