Greedy-Advantage-Aware RLHF — LessWrong