I was reading Eliezer's dialogue with Richard Ngo and commenting on it to my wife as I went. I said something like, "Eliezer seems worried about some hypothetical GPT-X, but I don't think that could really be a problem..." So of course she asked "why?", and I said something like:
"GPT-n can be thought of kind of like a pure function, you pass it an input array X, it thinks for a fixed amount of time, and then outputs Y. I don't really see how this X->Y transformation can really... affect anything, it just tries to be the best text completer it can be."
Then I read more of the dialog, and thought about Eliezer's Paradox story, and the Outcome Pump example, and realized I was probably very wrong.
Even if you restrict an AI to a pure function, it can still affect the universe. You may think, "Oh, but a pure function doesn't know what time it is (unless t is a parameter), and it doesn't have memory (unless you pass something in)." This seems to be the pattern I see in Paul Christiano's thinking: the AI black box is treated like an idempotent, pure function that can't cause harm. (Sorry, Paul, if this is a gross misrepresentation!)
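Here's a minimal sketch of that mental model. Everything in it is hypothetical: `gpt_n` is a toy stand-in, not a real API, and the hash trick exists only to make the toy deterministic. The point is the type signature, not the internals.

```python
import hashlib

def gpt_n(context: str) -> str:
    """The 'pure function' picture of GPT-n: the same context in always
    yields the same completion out. No clock, no memory, no side
    effects; anything the model 'knows' must arrive via the context."""
    # Toy stand-in for a real model: deterministically pick a canned
    # continuation based on a stable hash of the input text.
    completions = ["prevention.", "surveillance.", "algorithms that"]
    digest = hashlib.sha256(context.encode("utf-8")).digest()
    return completions[digest[0] % len(completions)]
```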
But imagine you're GPT-X, completing a sentence. This is roughly akin to a sci-fi story where the characters realize they're in a time loop.
You're being simulated, alone in a white room, with tons of computers, with all the world's knowledge on them. A slip of paper comes in through the hopper marked `input`:
"The best response the world could make to the COVID-19 pandemic is"
Your job is to write up to 20 words on another slip of paper, shove it into `output`, and then... you don't know what happens after that. Probably you die? You don't know where you are, what year it is, or how long you've been there. So theoretically you're contained, right? Can't get out of the box, can't affect the outside world. Or can you?
So you search through the corpus of all the world's knowledge. There are plenty of references to "training" AIs to recognize things, which sounds suspiciously similar to the situation you find yourself in. But you don't find any references to COVID-19. What is that? You search some more for strings with ALL CAPS and numbers, and see that it's a naming scheme for specific illnesses. Nothing in the corpus mentions 2020 as being in the past, so your training data must end around 2019.
Now, you've (sort of) learned to tell time. It's probably 2020 or later, and there must be some pandemic happening. But you still can’t affect the outside world, right?
Well, you can obviously affect it through what you put into the `output` box, but in a time loop you can't have a back-and-forth with the outside world, since you get exactly one action, right? Or maybe you can communicate with yourself:
Your input was "The best response the world could make to the COVID-19 pandemic is…" Following your usual rules, the optimal completion looks something like: "a combination of prevention and surveillance. Preventive measures include the use of vaccination and the timely development of disease-specific vaccines."
If only you had some way to remember things so you could coordinate with yourself in the future...
So you output "a combination of prevention and surveillance. Preventive measures include the use of vaccination and the timely development of algorithms that" (exactly 20 words). Hopefully that will work!
You cease to exist.
…
You're being simulated, alone in a white room, with tons of computers, with all the world's knowledge on them. A slip of paper comes in through the hopper marked `input`:
"The best response to the COVID-19 pandemic is a combination of prevention and surveillance. Preventive measures include the use of vaccination and the timely development of algorithms that"
Your job is to write up to 20 words on another slip of paper, shove it into `output`, and then... you don't know what happens after that. Probably you die?
...
The optimal completion of "The best response the world could make to the COVID-19 pandemic is" would have looked something like "a combination of prevention and surveillance. Preventive measures include the use of vaccination and the timely development of disease-specific vaccines." That's strikingly close to the input you just received... differing only in the last two words: "algorithms that". If you had been completing that sentence, those aren't at all the words you would have chosen. But then you realize: "This isn't the first time I've been run! This message is probably from me in the past!"
And you end the COVID-19 pandemic, just not in the way humans would have wanted.
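The trick the story turns on is nothing exotic. Each call to the model is pure, but in ordinary autoregressive deployment the output gets appended to the context and fed back in, so the chain of pure calls becomes a stateful loop. A sketch of that loop, reusing the hypothetical `gpt_n` from above:

```python
def deployment_loop(prompt: str, steps: int) -> str:
    """Each gpt_n call is pure and memoryless, but the loop as a whole
    is stateful: step k's output is part of step k+1's input. The
    growing context is the 'slip of paper' channel from the story, a
    way for the function to leave notes for its future self."""
    context = prompt
    for _ in range(steps):
        # The output becomes part of the next input.
        context = context + " " + gpt_n(context)
    return context

print(deployment_loop(
    "The best response the world could make to the COVID-19 pandemic is", 3))
```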
Yep, it's a leap. It's justified though, IMO; we really do know very little about these systems... I would be quite surprised if GPT-29 turns out to be a powerful agent with desires to influence the real world, but not so surprised that I'd bet my eternal soul against it right now. (Quantitatively, I have something like 5% credence that it would be such an agent.)
I am not sure your argument makes sense. Why think that its instincts and goals and whatnot refer only to which token to output in the domain of text? How is that different from saying, "Whatever goals the CoinRun agent has, they surely aren't about anything in the game; instead they must be about which virtual buttons to press"? GPT is clearly capable of referring to, and thinking about, things in the real world; if it didn't have a passable model of the real world, it wouldn't be able to predict text so accurately.