Greedy-Advantage-Aware RLHF — LessWrong