Experiments on Reward Hacking Monitorability in Language Models
Ihor Protsenko, Bill Sun, Kei Nishimura-Gasparian
ihor.protsenko@epfl.ch, billsun9@gmail.com, kei.nishimuragasparian@gmail.com

Abstract

Reward hacking [1], when systems exploit misspecifications in their reward function, poses a significant problem for AI safety. Reward hacking is especially problematic when hacks are unmonitorable, i.e. when they evade detection by an external monitor. In this work, we study...