Challenge proposal: smallest possible self-hardening backdoor for RLHF — LessWrong