I've been thinking about how the specification problem and reward hacking seem deeply intertwined, yet we often treat them as separate challenges.
The specification problem is the fundamental gap between what we want and what we can formally describe. We struggle to capture human values in reward functions because our preferences are complex, context-dependent, and often contradictory. Even when we think we've specified something well, edge cases reveal our blind spots.
Reward hacking reveals this specification failure. When agents optimize for imperfect proxies of our true objectives, they find unexpected ways to maximize rewards while completely missing the point. The paperclip maximizer isn't just optimizing badly—it's optimizing perfectly for a badly specified goal.
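To make that concrete, here's a throwaway toy in Python. Everything in it (the target vector, the proxy weights, the hill-climbing loop) is invented purely for illustration, not taken from any real system or benchmark: the optimizer only ever sees a proxy that measures one feature of the behaviour, and pushing that proxy up drives the true objective through the floor.

```python
import numpy as np

rng = np.random.default_rng(0)

target = np.array([1.0, 1.0, 1.0])         # the behaviour we actually want
proxy_weights = np.array([5.0, 0.0, 0.0])  # the proxy only measures dimension 0

def true_value(action):
    # How close the behaviour is to what we intended (never shown to the agent).
    return -np.sum((action - target) ** 2)

def proxy_reward(action):
    # The reward the agent actually optimizes.
    return float(proxy_weights @ action)

# Any optimizer will do; naive hill-climbing is enough to show the effect.
action = np.zeros(3)
for _ in range(2000):
    candidate = action + rng.normal(scale=0.1, size=3)
    if proxy_reward(candidate) > proxy_reward(action):
        action = candidate

print(f"proxy reward: {proxy_reward(action):8.2f}")  # keeps climbing without bound
print(f"true value:   {true_value(action):8.2f}")    # collapses as dimension 0 blows up
```

Nothing in that loop is "broken"; the optimizer does exactly what the proxy asks for, which is the whole problem.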
What strikes me is that both issues point to the same fundamental challenge: how do we bridge the gap between human intuitive understanding and formal mathematical specification?
Constitutional AI and Reinforcement Learning from Human Feedback (RLHF) seem promising because they attempt to use human judgment more directly rather than trying to perfectly specify objectives upfront. But I wonder if we're just pushing the specification problem up a level: how do we specify what good human feedback looks like?
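For concreteness, the RLHF ingredient I have in mind is roughly the pairwise preference step sketched below: fit a reward model so that the responses annotators preferred score higher than the ones they rejected, via a Bradley-Terry style loss. The linear "reward model", the synthetic annotator, and every variable name here are placeholder assumptions of mine, not any particular library's API.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_model(features, w):
    # Stand-in for a learned reward model; here just a linear score.
    return features @ w

def preference_loss(w, chosen, rejected):
    # Bradley-Terry style objective: maximize P(chosen preferred over rejected)
    # = sigmoid(r(chosen) - r(rejected)).
    margins = reward_model(chosen, w) - reward_model(rejected, w)
    return -np.mean(np.log(sigmoid(margins) + 1e-9))

# Synthetic "human feedback": a hidden preference direction decides which of
# two candidate responses the annotator picks in each pair.
dim, n_pairs = 4, 256
hidden_pref = rng.normal(size=dim)
a = rng.normal(size=(n_pairs, dim))
b = rng.normal(size=(n_pairs, dim))
annotator_prefers_a = (a @ hidden_pref) > (b @ hidden_pref)
chosen = np.where(annotator_prefers_a[:, None], a, b)
rejected = np.where(annotator_prefers_a[:, None], b, a)

# Plain gradient descent on the pairwise loss.
w = np.zeros(dim)
lr = 0.5
for _ in range(500):
    margins = reward_model(chosen, w) - reward_model(rejected, w)
    grad = -((1.0 - sigmoid(margins))[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

print("final loss:", round(preference_loss(w, chosen, rejected), 3))
print("cosine with hidden preference:",
      round(float(w @ hidden_pref / (np.linalg.norm(w) * np.linalg.norm(hidden_pref))), 3))
```

Notice how much specification is still hiding in there: which pairs get compared, who the annotator is, and what instructions they were given. That's the sense in which the problem feels pushed up a level rather than solved.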
Curious what others think about treating these as a unified challenge rather than separate problems.