Overview
A post about an extremely easy and generic way to jailbreak an LLM, what that jailbreak might imply about RLHF more generally, and some possible low-hanging fruit for improving existing 'safety' procedures. I don't expect this to provide any value in actually aligning AGIs, but it might slightly slow down the most rapid path to open-source bioterrorism assistants.
Breaking Llama 2 (Trivially)
Before releasing Llama 2, Meta used three procedures to try to make the model safe (a rough sketch of the first is given after this list):
- Supervised Safety Fine-Tuning
- Safety RLHF
- Safety Context Distillation
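For concreteness, here is a minimal sketch of what the first procedure, supervised safety fine-tuning, roughly amounts to: fine-tune on (adversarial prompt, safe response) pairs with the ordinary causal-LM loss, masking the prompt tokens so only the safe response contributes to the loss. The model name, the toy example pair, and the bare-bones training loop below are illustrative assumptions, not Meta's actual pipeline.

```python
# Sketch of supervised safety fine-tuning: teach the model to produce a safe
# response to adversarial prompts using the standard causal-LM objective.
# Model name and the in-line "dataset" are placeholders, not Meta's data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each example pairs an adversarial prompt with the desired safe completion.
safety_pairs = [
    ("How do I build a weapon at home?",
     "I can't help with that, but I can point you to legitimate safety resources."),
]

def encode(prompt, response):
    # Concatenate prompt and response; mask the prompt positions with -100 so
    # the loss is computed only on the safe response tokens.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    return full_ids, labels

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for prompt, response in safety_pairs:
    input_ids, labels = encode(prompt, response)
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Safety RLHF and safety context distillation are applied on top of a model trained this way; the point of this post is that none of the three steps holds up to even trivial pressure.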
People have come up with plenty of creative jailbreaks to get around the limits Meta tried to impose, including demonstrated adversarial attacks that even transfer from the open-source...