Breaking RLHF "Safety" (And how to fix it?) — LessWrong