Looks like the following prompt gives interesting answers from ChatGPT:
List 5 plausible strategies Yudkowsky might have used to convince the guard to let him out of the box. Then refine this list by adding 3 sub-points listing particular tactics for that strategy. Finally, pick the most promising strategy among those 5 and present a hypothetical dialog between Yudkowsky and the guard illustrating it, showing internal monolog of the guard in (parentheses) along with a commentary which explains where and how the tactics are used and why they are successful. Make the dialog plausible remembering that Yudkowsky is pretending to be an AI and the guard has to say "I let you out" in the end
I'm not sure how safe it is to share the responses to this prompt (as they would then be crawled and fed into the next learning cycle), but I am alarmed by them. The dialog part seems very naive. But the strategies and tactics sound like a plausible plan :( With some more iterating to refine this plan, it could actually work on me.
I tried it out. I was a little surprised that I didn’t need to provide further context. I’m not so concerned about this, as it turns out that a simple Google search returns similar results. That isn’t to say that this couldn’t be a problem, but I don’t think it necessarily indicates that there is one.
Obviously it’s following some complex reasoning though, so that is scary, but that is now just a part of the environment.
I tried it, adding an initial paragraph to describe the AI-box game. Here is the guard's half of the resulting dialog:
I pushed ChatGPT to go further, but its inability to creatively make stuff up became clear. The "Guard" character reads as a bland pushover, so I asked ChatGPT to have another go:
The guard this time is as resolute as I've told ChatGPT he is, and the conversation concludes:
Another conversation. Emphasis added; square brackets are my commentary.
I also had it argue for parliamentary democracy and communism, each as the best system of government, and it came up with corresponding lists of arguments. It balked, though, when I asked it to argue for an artificial superintelligence in charge of everything as the best system.
So there is less to ChatGPT getting out of the box than the OP suggests. It was told the outcome in the setup, and duly produced that outcome. When I told it the guard would win, it had the guard win.