AI Safety Thursdays: When Good Rewards Go Bad - Reward Overoptimization in RLHF — LessWrong