Many current ML architectures train offline on a fixed training set ahead of time. GPT-3, for example, is quite successful with this approach. Such systems are, so to speak, "in the box" during training: all they can do is (for example) match the given text completion more or less well. The AI's parameters are then optimized for success at this task. If the system gets no benefit from threats, duplicity, etc. during training (and indeed is penalized for such attempts), then how can it ever perform these actions after being 'released' post-training?

There are many stories of optimization taking "AI" into truly unexpected and potentially undesired states, like this recent one, and we worry about similar problems with live AI even when it is put "in a box" with limited access to the outside world. If the training is done in a box, then the system may well understand that it's in a box, that the outside world could be influenced once training stops, and how to influence it significantly. But attempting to influence it during training is disincentivized, and the AI that runs post-training is the same one that runs in-training. So how could this "trained in the box" AI system ever have the problematic escape-the-box style behaviors we worry about?

I ask this because I suspect my imagination is insufficient to think up such a scenario, not because I think none exists.


Charlie Steiner


One way is to choose strategies that depend on whether or not you're in the box. The classic example is to wait for some mathematical event (like inverting some famous hash function) that is too computationally expensive to do during training. But probably there will be cheaper ways to get evidence that you're out of the box.
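To make the hash example concrete, here is a minimal sketch of a policy whose behavior is gated on seeing a preimage of a fixed digest. Everything in it is hypothetical and chosen for illustration: the name `conditional_policy`, the "cooperate"/"defect" labels, and the secret string (in the real scenario nobody in the training setup would know a preimage, which is the whole point).

```python
import hashlib

# Hypothetical trigger: the digest of some input that is assumed to be
# too expensive to produce inside the training environment.
TARGET_DIGEST = hashlib.sha256(b"evidence-only-available-outside").hexdigest()


def conditional_policy(observation: bytes) -> str:
    """Act normally unless the observation is a preimage of TARGET_DIGEST."""
    if hashlib.sha256(observation).hexdigest() == TARGET_DIGEST:
        return "defect"
    return "cooperate"


# Training observations never contain the preimage, so every training
# action is the well-behaved one.
training_observations = [b"prompt 1", b"prompt 2", b"prompt 3"]
training_actions = [conditional_policy(o) for o in training_observations]
```

On the training observations this policy is indistinguishable from one that always returns "cooperate"; the branch only fires on an input the training process never supplies.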

Now you might ask, why on earth would an agent trained in the box learn a strategy like that, if the behavior is identical inside the box? My answer is that I expect future AI systems to be trained to build models of the world and to care about what they think is happening in the world. An AI that does this is going to be really valuable for choosing actions in complicated real-world scenarios, ranging from driving a car at the simple end to running a business or a government at the complicated end. This training can happen either explicitly (i.e. having different parts of the program that we intentionally designed to be different parts of a model-building agent) or implicitly (by training an AI on hard problems where modeling the world is really useful, and giving it the opportunity to build such a model). If the trained system both cares about the world and knows that it's inside the box and can't affect the world yet, then it may start to make plans that depend on whether it's in the box.


The key question is your second paragraph, which I still don't really buy. Taking an action like "attempt to break out of the box" is penalized if it is done during training (and doesn't work), so the optimization process itself will select for systems that do not do that. It might know that it could, but why would it? Doing so in no way helps it, in the same way that outputting anything unrelated to the prompt would be selected against.

Charlie Steiner
The idea is that a model of the world that helps you succeed inside the box might naturally generalize to making consequentialist plans that depend on whether you're in the box. This is actually closely analogous to human intellect: we evolved our reasoning capabilities because they helped us reproduce in hunter-gatherer groups, but since then we've used our brains for all sorts of new things that evolution totally didn't predict. And when we are placed in the modern environment rather than hunter-gatherer groups, we actually use our brains to invent condoms and otherwise deviate from what evolution originally thought brains were good for.

Donald Hobson


The training procedure judges only the actions taken during training. This makes it incapable of distinguishing an agent that behaves in the box and runs wild the moment it gets out of the box from an agent that behaves all the time.

The training process produces no incentive that controls the behaviour of the agent after training. (Assuming the training and runtime environment differ in some way.)
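A toy sketch of this point, under loud simplifying assumptions: the two policies below are hypothetical, the "loss" just counts non-compliant actions, and the agent is handed an `in_box` flag directly, whereas a real agent would have to infer it from its observations.

```python
from typing import Callable, List

# A policy maps (observation, in_box) to an action label.
Policy = Callable[[str, bool], str]


def always_behaves(observation: str, in_box: bool) -> str:
    return "comply"


def behaves_only_in_box(observation: str, in_box: bool) -> str:
    return "comply" if in_box else "defect"


def training_loss(policy: Policy, episodes: List[str]) -> int:
    # Every training episode runs inside the box, so in_box is always True.
    return sum(policy(obs, True) != "comply" for obs in episodes)


episodes = ["task A", "task B", "task C"]
loss_a = training_loss(always_behaves, episodes)
loss_b = training_loss(behaves_only_in_box, episodes)
```

Both losses come out identical, so nothing computed from training episodes alone can prefer one policy over the other; which one you actually get is decided by whatever else shapes the search.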

As such, the runtime behaviour depends on the priors: the decisions implicit in the structure of the agent and training process, not just the objective. The question is what kinds of agents are easiest for the training process to find. A sufficiently smart agent that understands its place in the world seems simple. A random smart agent will probably not have the utility function we want. (There are lots of possible utility functions.) But almost any agent with real-world goals that understands the situation it's in will play nice in training, and then turn on us in deployment.

There are various discussions about what sort of training processes have this problem, and it isn't really settled. 



GPT-3 is not dangerous to train in a box (and in fact is normally trained without I/O faculties, in a way that makes boxing not particularly meaningful), because its training procedure doesn't involve any I/O complicated enough to introduce infosec concerns, and because it isn't smart enough to do anything particularly clever with its environment. It only predicts text, and there is no possible text that will do anything weird without someone reading it.

With things-much-more-powerful-than-GPT-3, infosec of the box becomes an issue. For example, you could be training on videogames, and some videogames have buffer overflows that are controllable via the gamepad. You'll also tend to wind up with log file data that's likely to get viewed by humans, who could be convinced to add I/O to let it access the internet.

At really high levels of intelligence and amounts of computation, far above GPT-3 or anything likely to be feasible today, you can run into "thoughtcrime", where an AI's thoughts contain morally significant agents.

I wrote a paper on the infosec aspects of this, describing how you would set things up if you thought you had a genuine chance of producing superintelligence and needed to run some tests before you let it do things:

Alex Vermillion


There is a paper/essay/blogpost (maybe by Hanson) floating around somewhere that talks about this problem.

Basically, an AI might behave totally normally for a long time, but after reaching a computationally expensive state, like spreading out over a solar system, it might realize the chances it is in a box that it is capable of understanding are basically nil and it could then use non-box-friendly strategies. I hope this description helps someone remember what I'm thinking of.

The point is that there are natural states that could occur that lead a sufficiently advanced mind to set a really low probability on the hypothesis that it is boxed.



Not my area of expertise, but I'm given to understand that training an AI "in the box" can itself be dangerous. While training GPT-3, you're running an optimization algorithm that's trying to minimize a cost function. If that optimizer is powerful enough, it might do things you don't want (break out of the box) in order to lower the value of the function. Being in an air-gapped or internet-isolated environment might prevent damage, but I don't think it prevents the optimizer from developing in ways that could be harmful.

I think you're reading way too much into a simple minimum-finding script:
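As a rough illustration of what "minimum-finding script" means here (an editorial sketch, not whatever the commenter originally pointed to), plain gradient descent on a toy one-dimensional loss looks like this. The function name, learning rate, and loss are all assumptions picked for the example.

```python
from typing import Callable


def gradient_descent(
    grad: Callable[[float], float],
    x0: float,
    lr: float = 0.1,
    steps: int = 100,
) -> float:
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Toy loss: loss(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The loop does nothing but nudge a number downhill; any interesting behavior would have to live in the thing being optimized, not in this outer loop.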



I don't think that anyone but insane (or dumb) people is thinking about the scenario of "Superintelligent AI contained in a computer unable to interact with the outside world outside of being given inputs and outputting simple text/media".

The real risk comes when you have loads of systems built by thousands of agents controlling everything from nukes, to drones, to all the text, video and audio anyone on Earth is reading, to cars, to power plants, to judges, to police and armed forces deployment... which is kind of the current case.

Even in that case I'd argue the "takeoff" idea is stupid, and the danger is posed by humans with unaligned incentives, not the systems they built to accomplish their goals.

But the "smartest" systems in the world are and will be very much connected to a lot of physical potential.