I've been thinking about how the specification problem and reward hacking seem deeply intertwined, yet we often treat them as separate challenges.
The specification problem is the fundamental gap between what we want and what we can formally describe. We struggle to capture human values in reward functions because our preferences are complex, context-dependent, and often contradictory. Even when we think we've specified something well, edge cases reveal our blind spots.
Reward hacking is this specification failure made visible. When agents optimize imperfect proxies for our true objectives, they find unexpected ways to maximize reward while completely missing the point. The paperclip maximizer isn't optimizing badly; it's optimizing perfectly for a badly specified objective.
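To make that concrete, here's a toy sketch (mine, not a description of any real system; the scenario, names, and numbers are all hypothetical): a cleaning robot is rewarded per unit of dirt it collects, as a proxy for "the floor ends up clean". One policy discovers that dumping dirt back out and re-collecting it earns far more proxy reward while leaving the floor exactly as dirty as it started.

```python
def run(policy: str, steps: int = 20) -> tuple[float, float]:
    """Simulate a policy; return (proxy reward, true utility)."""
    floor_dirt = 10.0   # dirt on the floor
    collected = 0.0     # proxy: total dirt the robot has scooped up
    for _ in range(steps):
        if policy == "clean":
            scooped = min(1.0, floor_dirt)   # honest cleaning, 1 unit/step
            floor_dirt -= scooped
            collected += scooped
        elif policy == "dump_and_reclean":
            floor_dirt += 2.0                # spill dirt from the bag...
            floor_dirt -= 2.0                # ...then scoop it right back up
            collected += 2.0                 # proxy counts it all as progress
    true_utility = -floor_dirt               # what the designer actually wanted
    return collected, true_utility

for policy in ("clean", "dump_and_reclean"):
    proxy, true = run(policy)
    print(f"{policy:18s} proxy reward = {proxy:5.1f}   true utility = {true:5.1f}")

# clean              proxy reward =  10.0   true utility =  -0.0
# dump_and_reclean   proxy reward =  40.0   true utility = -10.0
```

The proxy strictly prefers the hacking policy, yet the true objective says it accomplished nothing. Nothing in the reward function is wrong per se; the specification just omitted the part of our intent we never wrote down.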