The True Story of How GPT-2 Became Maximally Lewd
by Writer and Jai
This video recounts an incident at OpenAI in which a single flipped minus sign caused the RLHF process to train GPT-2 to output only sexually explicit continuations. The incident is described in Section 4.4 of OpenAI's paper "Fine-Tuning Language Models from Human Preferences": "Bugs can optimize for bad behavior"....
Jan 18, 2024