Clarifying how misalignment can arise from scaling LLMs

I just recalled I've read ACX: Janus' Simulators which outlines a 5th missing interpretation: Neither current nor future LLMs will develop goals, but will become dangerous nevertheless.

If future superintelligences look like GPT, is there anything to worry about?
Answer 1: Irrelevant, future superintelligences will be too different from GPT for this to matter.
Answer 2: There’s nothing to worry about with pure GPT (a simulator), but there is something to worry about with GPT+RLHF (a simulator successfully simulating an agent). The inner agent can have misaligned goals and be dangerous. For example, if you train a future superintelligence to simulate Darth Vader, you’ll probably get what you deserve. Even if you avoid such obvious failure modes, the inner agent can be misaligned for all the usual agent reasons. For example, an agent trained to be Helpful might want to take over the world in order to help people more effectively, including people who don’t want to be helped.
Answer 3: Even if you don’t ask it to simulate an agent, it might come up with agents anyway. For example, if you ask it “What is the best way to obtain paperclips?”, and it takes “best way” literally, it would have to simulate a paperclip maximizer to answer that question. Can the paperclip maximizer do mischief from inside GPT’s simulation of it? Probably the sort of people who come up with extreme AI risk scenarios think yes. This post gives the example of it answering with “The best way to get paperclips is to run this code” (which will turn the AI into a paperclip maximizer). If the user is very dumb, they might agree.

My questions

I presume AI safety folks would be most likely to agree with the second interpretation. Am I correct? Have I summarized this view accurately?

Is there a fifth secret thing (another interpretation I haven't covered)?

In the case interpretation 4 is correct (half-alignment scales up), what are the best reasons to think this will cause an extinction? ^[1]

^{^}

The intuition that slight misalignment doesn't lead to extinction has been the most emphasized counter-argument by Yann Lecun. Ajeya Cotra frames this as an open question:

People have very different intuitions. Some people have the intuition that you can try really hard to make sure to always reward the right thing, but you’re going to slip up sometimes. If you slip up even one in 10,000 times, then you’re creating this gap where the Sycophant or the Schemer that exploits that does better than the Saint that doesn’t exploit that.

LESSWRONG
LW

LESSWRONG
LW

3

[ Question ]

Clarifying how misalignment can arise from scaling LLMs

3

3

1 Answers sorted by
top scoring

Aug 24, 2023

My questions

3

[ Question ]

Clarifying how misalignment can arise from scaling LLMs

3

3

1 Answers sorted by top scoring

Aug 24, 2023

My questions

1 Answers sorted by
top scoring