Overview
A post about an extremely easy and generic way to jailbreak an LLM, what that jailbreak might imply about RLHF more generally, and some possible low-hanging fruit for improving existing 'safety' procedures. I don't expect this to provide any value in actually aligning AGIs, but it might slightly slow down the most rapid path to open-source bioterrorism assistants.
Breaking Llama 2 (Trivially)
Before releasing Llama 2, Meta used three procedures to try to make the model safe (a rough sketch of the first is given after this list):
- Supervised Safety Fine-Tuning
- Safety RLHF
- Safety Context Distillation
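For concreteness, here is a minimal sketch of what the first procedure, supervised safety fine-tuning, roughly amounts to: fine-tune on (adversarial prompt, safe response) pairs with the ordinary causal-LM loss, masking the prompt tokens so only the safe response contributes to the loss. The model name, the toy example pair, and the bare-bones training loop below are illustrative assumptions, not Meta's actual pipeline.

```python
# Sketch of supervised safety fine-tuning: teach the model to produce a safe
# response to adversarial prompts using the standard causal-LM objective.
# Model name and the in-line "dataset" are placeholders, not Meta's data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each example pairs an adversarial prompt with the desired safe completion.
safety_pairs = [
    ("How do I build a weapon at home?",
     "I can't help with that, but I can point you to legitimate safety resources."),
]

def encode(prompt, response):
    # Concatenate prompt and response; mask the prompt positions with -100 so
    # the loss is computed only on the safe response tokens.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    return full_ids, labels

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for prompt, response in safety_pairs:
    input_ids, labels = encode(prompt, response)
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Safety RLHF and safety context distillation are applied on top of a model trained this way; the point of this post is that none of the three steps holds up to even trivial pressure.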
People have come up with plenty of creative jailbreaks to get around the limits Meta tried to impose, including demonstrated adversarial attacks that even transfer from the open-source...