Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a link post for

I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This first post makes the case for recursive reward modeling and describes the problem it's meant to address: scaling RLHF to tasks that are hard to evaluate.
