Jai
Everyone knows what this comment means.
Hype is a useful social mechanism for eliciting acute criticism and exposing flaws. If you want to know what your weaknesses are, you could do worse than to paint a giant target on your back.
This video recounts an incident that occurred at OpenAI in which flipping a single minus sign led the RLHF process to make GPT-2 only output sexually explicit continuations.
The incident is described in OpenAI's paper "Fine-Tuning Language Models from Human Preferences" under section 4.4: "Bugs can optimize for bad behavior".
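For intuition about how one flipped minus sign can invert training like this, here's a minimal hypothetical sketch (not OpenAI's actual code) of a REINFORCE-style loss. The convention is to minimize `-reward * log_prob`, which maximizes reward; flip that one sign and gradient descent minimizes reward instead, reinforcing exactly the continuations human raters scored lowest:

```python
def reinforce_loss(reward, log_prob, sign=-1.0):
    """REINFORCE-style surrogate loss (toy illustration).

    Correct version: sign=-1.0, so minimizing the loss maximizes reward.
    The bug is equivalent to sign=+1.0, so minimizing the loss *minimizes*
    reward -- the optimizer now chases the worst-rated outputs.
    """
    return sign * reward * log_prob

# d(loss)/d(log_prob) = sign * reward. With a positive reward:
correct = reinforce_loss(reward=1.0, log_prob=0.5)             # -0.5: descent raises log_prob
buggy = reinforce_loss(reward=1.0, log_prob=0.5, sign=1.0)     # +0.5: descent lowers log_prob
```

Since the human raters gave explicit content the lowest scores, minimizing reward meant the model was pushed toward exactly that content.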
The script has been written by Jai, with some significant input and rework by me, Writer. You can read it below.
In 2019, one OpenAI researcher made a typo - and birthed an evil AI hell-bent on making everything as horny as possible.
This is the absurd, ridiculous, and yet true story of how it happened.
Since 2017, OpenAI has been building Generative Pre-trained Transformer models, or GPTs - language AIs with a singular focus on predicting text, trained across...
And if you’re in the US, maybe stockpile a ton of them, because companies aren’t allowed to produce incandescents anymore?
I just checked the DoE guidelines on this, and I think fairy lights are actually exempt! Here's the relevant paragraph (bold mine):
...A general service incandescent lamp is a standard incandescent or halogen type lamp that is intended for general service applications. It has the following characteristics: (1) medium screw base; (2) lumen range of not less than 310 lumens and not more than 2,600 lumens or, in the case of a modified spectrum l
I love learning new Smallpox Eradication Lore.
The second scenario omits the details about continuing to create and submit pull requests after takeover, instead just referring to human farms. Since it doesn't explicitly say that the AI is still optimizing for the original objective criteria, and instead just refers to world domination, it appears to be inner misalignment (i.e., no longer aligned with the original optimization target). Did the original posing of this question specify that scenario 2 still maximizes pull requests after world domination?
Humans get frustrated, get bored, and have very limited attention. LLM cognition is almost too cheap to meter and can parallelize very effectively, both across the code itself and across the kinds of vulnerabilities it's looking for.