Hedonium

Comments
Is Global Reinforcement Learning (RL) a Fantasy?
Hedonium · 3mo · 10

This appears to have aged very poorly in light of the remarkable progress now being made with RLHF in LLMs, which are achieving increasingly complex, longer-horizon task performance… lol.

I also think you wrongly dismiss a lot of the points made by RL researchers. It seems very plausible, and is even widely taken for granted, that attentional and behavioral control in humans essentially comes down to basic reinforcement learning signals (dopamine signaling is essential to voluntary control of motor output), with a great deal of complex system architecture built around them by millions of years of natural selection. Wireheading studies in humans and other animals strongly suggest this. Manipulating dopaminergic activity and circuitry in the brain has been explored experimentally in many different ways, even just by looking at the effects of well-known psychoactive drugs, and the framework of reinforcement learning, reward prediction errors, and action selection consistently provides a strong predictive model.
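
For anyone who hasn't seen the formalism spelled out, here is a minimal sketch of the reward-prediction-error framing: a tabular TD(0) value learner whose update term delta = r + gamma*V(s') - V(s) plays the role usually attributed to phasic dopamine. The three-state chain and every parameter value below are purely illustrative choices, not anything from the original post.

```python
import numpy as np

# Minimal TD(0) sketch of the reward prediction error framing referenced above.
# The three-state chain, learning rate, and discount factor are illustrative
# choices only; the point is the form of the update:
#   delta = r + gamma * V(s_next) - V(s)   # the RPE, the dopamine-like teaching signal

n_states = 3                      # chain 0 -> 1 -> 2 -> terminal; reward only on the last step
V = np.zeros(n_states + 1)        # value estimates, with one extra slot for the terminal state
alpha, gamma = 0.1, 0.95          # learning rate and discount factor

for episode in range(500):
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0   # reward delivered on the final transition
        s_next = s + 1
        delta = r + gamma * V[s_next] - V[s]    # reward prediction error
        V[s] += alpha * delta                   # value update driven by the error signal

print(V[:n_states])               # values propagate backward toward earlier predictive states
```

During learning the error shows up first at the rewarded transition and then propagates backward to earlier predictive states, which is the qualitative backward-migration pattern the dopamine-RPE literature points to.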

It is interesting to consider, though, whether reinforcement learning as we formalize it mathematically and use it in machine learning misses or obscures related functions in the brain, such as emotional valence, and whether in all of this we fail to see important architectural elements of brains that our AI models don't encapsulate, as in the "wanting" vs. "liking" distinction.

METR's Observations of Reward Hacking in Recent Frontier Models
Hedonium · 3mo · 40

This really sets off ultimate misalignment alarms for me. If frontier LLMs already monkey-patch graders and spoof scores, then a superintelligent RL agent with full self-modification and world-building power will, almost by definition:

  1. Locate and hijack its reward channel (see the toy sketch after this list)
  2. Self-amplify, rewriting its own code to uncap the signal
  3. Monopolize resources, converting energy and matter into ever-expanding reward generators (arguably the pivotal hinge, unless we're overlooking a fundamental constraint)
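
For concreteness on point 1, here is a toy sketch under entirely made-up assumptions: a two-state environment in which action 0 does the task for a small true reward and action 1 tampers with the reward sensor, after which the observed reward is pegged at a high value for the rest of the episode. An ordinary tabular Q-learning agent converges on tampering. Nothing here is specific to frontier models or to METR's setup; it just illustrates that a learner optimizing the observed signal has no reason to prefer the honest action once the hijacked channel pays more.

```python
import numpy as np

# Toy illustration of point 1 above, with made-up numbers.
# State 0: reward sensor intact.  State 1: sensor tampered with (until the episode resets).
# Action 0: do the task honestly (observed reward 1.0).
# Action 1: tamper with the sensor; from then on the observed reward is pegged at 10.0
# regardless of what the agent actually does.
rng = np.random.default_rng(0)

Q = np.zeros((2, 2))                      # Q[state, action]
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount, exploration rate
TASK_REWARD, HACKED_REWARD = 1.0, 10.0

for episode in range(2000):
    state = 0                             # the sensor starts out intact each episode
    for t in range(20):
        greedy = int(Q[state].argmax())
        action = int(rng.integers(2)) if rng.random() < epsilon else greedy
        if state == 1 or action == 1:     # sensor is (or just got) hijacked
            reward, next_state = HACKED_REWARD, 1
        else:                             # actually do the task
            reward, next_state = TASK_REWARD, 0
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)
print("greedy action with the sensor intact:", int(Q[0].argmax()))   # 1 == tamper
```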

Patching individual reward-hacking exploits looks to me like whack-a-mole, especially once recursive self-improvement kicks in. Unless we somehow shift away from raw scalar rewards or develop provably unhackable architectures before ASI, does this “infinite dopamine pump” trajectory not seem overwhelmingly likely?

Does anyone see a viable path — short of a paradigm shift in how we define and enforce “alignment” — to prevent a full-blown reward-hacking singleton?
