Breaking RLHF "Safety" (And how to fix it?)
Overview

A post about an extremely easy and generic way to jailbreak an LLM, what the jailbreak might imply about RLHF more generally, and possible low-hanging fruit for improving existing 'safety' procedures. I don't expect this to provide any value in actually aligning AGIs, but it might be...
Sep 7, 2023