Jai — LessWrong

Jai's Shortform

The True Story of How GPT-2 Became Maximally Lewd

by Writer and Jai

This video recounts an incident at OpenAI in which a single flipped minus sign caused the RLHF process to train GPT-2 to output only sexually explicit continuations. The incident is described in section 4.4 of OpenAI's paper "Fine-Tuning Language Models from Human Preferences", titled "Bugs can optimize for bad behavior"....

Jan 18, 2024 • 74