Models Don't "Get Reward"
In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight. When thinking about deception and RLHF training, a simplified threat model...
Dec 30, 2022