This really sets off ultimate misalignment alarms for me. If frontier LLMs already monkey‐patch graders and spoof scores, then a superintelligent RL agent with full self‐modification and world‐building power appears it will almost by definition:
Patching individual reward-hacking exploits looks to me like whack-a-mole, especially once recursive self-improvement kicks in. Unless we somehow shift away from raw scalar rewards or develop provably unhackable architectures before ASI, does this “infinite dopamine pump” trajectory not seem overwhelmingly likely?
Does anyone see a viable path — short of a paradigm shift in how we define and enforce “alignment” — to prevent a full-blown reward-hacking singleton?
This appears to have aged very poorly in light of the remarkable progress being made now with RLHF in LLMs, achieving increasingly more complex, longer task performance… lol.
I also think you do wrongly dismiss a lot of those points by RL researchers. It seems very plausible, and even widely taken for granted, that attentional and behavioral control in humans does essentially come down to basic reinforcement learning signals (dopamine signaling is essential to voluntary control of motor output) with a lot of complex system architecture built around that by millions of years of natural selection. Wireheading studies in humans and other animals strongly suggest this. Manipulating dopaminergic activity and circuitry in the brain has been explored experimentally in so many different ways, even just looking at the effects of well-known psychoactive drugs, and this framework of reinforcement learning, reward prediction errors, and action selection consistently provides a strong predictive model.
It is interesting to consider, though, whether reinforcement learning alone, as we formalize it mathematically and use it in machine learning, misses out on or obscures some related functionalities in the brain like emotional valence and, in all of this, we fail to see some important architectural elements of brains that aren’t encapsulated by our AI models, as in the “wanting” vs “liking” distinction.