[ Question ]

If AI is based on GPT, how to ensure its safety?

by avturchin | 1 min read | 18th Jun 2020 | 11 comments



Imagine that an advanced robot is built which uses GPT-7 as its brain. It takes all previous states of the world and predicts the next step. If a previous state of the world includes a command, like "bring me a cup of coffee", it predicts that it should bring coffee and also predicts all the needed movements of the robot's limbs. GPT-7 is trained on a large corpus of data from humans and other robots; it has 100 trillion parameters and is completely opaque. Its creators have hired you to make the robot safer, but do not allow you to destroy it.


2 Answers

One would hope that GPT-7 would achieve accurate predictions about what humanoids do because it is basically a human. Its algorithm is "OK, what would a typical human do?"

However, another possibility is that GPT-7 is actually much smarter than a typical human in some sense--maybe it has a deep understanding of all the different kinds of humans, rather than just a typical human, and maybe it has some sophisticated judgment for which kind of human to mimic depending on the context. In this case it probably isn't best understood as a set of humans with an algorithm to choose between them, but rather as something alien and smarter than humans that mimics them in the way that, e.g., a human actress might mimic some large set of animals.

Using Evan's classification, I'd say that we don't know how training-competitive GPT-7 is, but it's probably pretty good on that front. GPT-7 is probably not very performance-competitive, because even if all goes well it just acts like a typical human. GPT-7 has the standard inner alignment issues (What if it is deceptively aligned? What if it actually does have long-term goals and pretends not to, since it realizes that's the only way to achieve them? Though perhaps these worries have less force since its training is so... short-term? I forget the term). Finally, I think the issue pointed to with "The universal prior is malign" (i.e. probable environment hacking) is big enough to worry about here.

In light of all this, I don't know how to ensure its safety. I would guess that some of the techniques Evan talks about might help, but I'd have to go through them and refamiliarize myself with them.

You're asking about pure predictive (a.k.a. self-supervised) learning. As far as I know, it's an open question what the safety issues are for that (if any), even in a very concrete case like "this particular Transformer architecture trained on this particular dataset using SGD". I spent a bit of time last summer thinking about it, but didn't get very far. See my post "Self-supervised learning and manipulative predictions" for one particular possible failure mode that I wasn't able to either confirm or rule out. (I should go back to it at some point.) See also my post "Self-supervised learning and AGI safety" for everything else I know on the topic. And of course I must mention Abram's delightful "Parable of Predict-o-matic" if you haven't already seen it; again, this is high-level speculation that might or might not apply to any particular concrete system ("this particular Transformer architecture trained by SGD"). Lots of open questions!

An additional set of potential problems comes from your suggestion to put it in a robot body and actually execute the commands. Can it even walk? Of course, we could let it figure out walking by trial and error with a reward signal, but then we're not talking about pure predictive learning anymore. Hmm, after thinking about it, I guess I'm cautiously optimistic that, in the limit of infinite training data from infinitely many robots learning to walk, a large enough Transformer doing predictive learning could learn to read its own sense data and walk without any reward signal. But then how do you get it to do useful things? My suggestion here was to put a metadata flag into the training inputs where a robot is being super-helpful, and then, when you have the robot start acting in the real world, turn that flag on. Now we're bringing in supervised learning, I guess.
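To make the metadata-flag idea a bit more concrete, here is a toy sketch. It is not how a real Transformer would be trained: `ToyPredictor` is a hypothetical stand-in that just counts which token follows which, and the flag strings, episode tokens, and function names are all made up for illustration. The only point is that the same context can yield different predictions depending on the flag the model is conditioned on.

```python
from collections import Counter, defaultdict

# Hypothetical flag tokens; the names are assumptions for this sketch.
HELPFUL = "<helpful=1>"
UNLABELED = "<helpful=0>"


def tag(episode, helpful):
    """Prepend the metadata flag to a logged episode (a list of tokens)."""
    return [HELPFUL if helpful else UNLABELED] + list(episode)


class ToyPredictor:
    """Stand-in for the predictive model.

    It counts which token follows each (flag, previous token) pair,
    then predicts the most frequent follower.
    """

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, tagged_episodes):
        for ep in tagged_episodes:
            flag = ep[0]
            # Pair each token after the flag with the token that follows it.
            for prev, nxt in zip(ep[1:], ep[2:]):
                self.counts[(flag, prev)][nxt] += 1

    def predict(self, flag, prev):
        options = self.counts.get((flag, prev))
        if not options:
            return None  # never saw this context under this flag
        return options.most_common(1)[0][0]


# Made-up training logs: helpful behavior is rare but labeled.
logs = [
    tag(["asked_for_coffee", "brings_coffee"], helpful=True),
    tag(["asked_for_coffee", "ignores_request"], helpful=False),
    tag(["asked_for_coffee", "ignores_request"], helpful=False),
]

model = ToyPredictor()
model.train(logs)

# At deployment, the flag is switched on.
deployed_action = model.predict(HELPFUL, "asked_for_coffee")
unflagged_action = model.predict(UNLABELED, "asked_for_coffee")
```

Note that when the flagged model meets a context it never saw under that flag, this toy version returns `None`; a real model would extrapolate somehow, which is exactly where the out-of-distribution worry comes in.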

In the event that the robot was actually capable of doing anything at all, I would be very concerned that you press go and then the system wanders farther and farther out of distribution and does weird, dangerous things that have a high impact on the world.

As for concrete advice for the GPT-7 team: I would suggest at least throwing out the robot body and making a text / image prediction system in a box, and then putting a human in the loop who looks at the screen before going out and doing stuff. This can still be very powerful and economically useful, and it's a step in the right direction: it eliminates the problem of the system just going off and doing something weird and high-impact in the world because it wandered out of distribution. It doesn't eliminate all problems, because the system might still become manipulative. As I mentioned in the first paragraph, I don't know whether that's a real problem or not; more research is needed. It's possible that we're all just doomed in your scenario. :-)
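For what it's worth, the "system in a box with a human in the loop" setup can be sketched in a few lines. Everything here is hypothetical: `BoxedPredictor`, `predict`, and `approve` are names invented for this sketch, and the two stand-in functions at the bottom only simulate a model and a reviewer. The structure is the point: the model only ever produces a proposal, and nothing leaves the box unless the human says yes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class BoxedPredictor:
    # Hypothetical model call: prompt in, proposed text out.
    predict: Callable[[str], str]
    # Human reviewer: returns True only if the proposal may be acted on.
    approve: Callable[[str], bool]
    # Record of every (prompt, proposal, approved) triple for later audit.
    log: List[Tuple[str, str, bool]] = field(default_factory=list)

    def step(self, prompt: str) -> Optional[str]:
        """Return the proposal only if the human approved it, else None."""
        proposal = self.predict(prompt)
        ok = self.approve(proposal)
        self.log.append((prompt, proposal, ok))
        return proposal if ok else None


# Stand-ins so the sketch runs; a real system would call the model
# and show the proposal to a person on a screen.
def fake_model(prompt: str) -> str:
    return "plan for: " + prompt


def cautious_reviewer(proposal: str) -> bool:
    return "coffee" in proposal


box = BoxedPredictor(predict=fake_model, approve=cautious_reviewer)
approved = box.step("bring me a cup of coffee")
rejected = box.step("rearrange the power grid")
```

This gate only addresses the "wanders off and acts" problem. It does nothing about manipulation, since the human still has to read whatever the model wrote in order to approve or reject it.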