Jai
I was literally walking around Lighthaven a few hours ago with someone who was trying to figure out why the spaces all felt so good. So this is extremely timely.
Lighthaven is a really impressive aesthetic achievement, and I'm really happy not only that it exists to host worthwhile events but also that you're willing to share your secrets of forgotten photonic lore. That's pretty cool.
Humans get frustrated and bored, and have very limited attention. LLM cognition is almost too cheap to meter and can parallelize very effectively across both the code itself and the kinds of vulnerabilities it's looking for.
Everyone knows what this comment means.
Hype is a useful social mechanism for eliciting acute criticism and exposing flaws. If you want to know what your weaknesses are, you could do worse than to paint a giant target on your back.
This video recounts an incident that occurred at OpenAI in which flipping a single minus sign led the RLHF process to make GPT-2 only output sexually explicit continuations.
The incident is described in OpenAI's paper "Fine-Tuning Language Models from Human Preferences" under section 4.4: "Bugs can optimize for bad behavior".
The script has been written by Jai, with some significant input and rework by me, Writer. You can read it below.
In 2019, one OpenAI researcher made a typo - and birthed an evil AI hell-bent on making everything as horny as possible.
This is the absurd, ridiculous, and yet true story of how it happened.
Since 2017, OpenAI has been building Generative Pre-trained Transformer models, or GPTs - language AIs with a singular focus on predicting text, trained across...
And if you’re in the US maybe stockpile a ton of them because companies aren’t allowed to produce incandescents anymore?
I just checked the DoE guidelines on this, and I think fairy lights are actually exempt! Here's the relevant paragraph (bold mine):
...A general service incandescent lamp is a standard incandescent or halogen type lamp that is intended for general service applications. It has the following characteristics: (1) medium screw base; (2) lumen range of not less than 310 lumens and not more than 2,600 lumens or, in the case of a modified spectrum l
I love learning new Smallpox Eradication Lore.
The second scenario omits the details about continuing to create and submit pull requests after takeover, instead just referring to human farms. Since it doesn't explicitly say that it's still optimizing for the original objective criteria and instead just refers to world domination, it appears to be inner misalignment (i.e. no longer aligned with the original optimizer). Did the original posing of this question specify that scenario 2 still maximizes pull requests after world domination?
All of these behaviors feel like they are plausibly described by a relatively easy to specify character, and one who you've gestured at elsewhere in this conversation: the brilliant-but-lazy prodigy who is going through the motions most of the time because they don't find most problems that engaging or important. Behavior changes when the problem becomes engaging or they become convinced that it's important/impactful. The presence of an intelligent interlocutor has this effect to the extent that said interlocutor can see through their effort-minimization s...