In this post I propose an (allegedly...) secure AI boxing method. For reasons elaborated at the end of the post, this proposal is probably unsafe, and is certainly not practical. Still, I've elected to share my thoughts on this in the hope of receiving valuable feedback, and in the hope that discussing these ideas in public will provoke more useful thoughts in the minds of others.

Thank you for reading.

-- Xavier O'Rourke

Yudkowsky, Bostrom and others have illustrated the fundamental security problems which arise when attempting to mitigate the influence of a smarter-than-human intelligence by putting the agent inside a “box”.

“Boxing” an artificial intelligence means placing the agent in a virtual prison where it can manipulate data and execution on its own computing hardware, but is disconnected from any outside computer network. Perhaps there are other safeguards in place too, such as only allowing the agent to communicate with its human designers through a very low-bandwidth I/O channel. Maybe we even surround the agent with a Faraday cage to prevent it from communicating with the outside environment using electromagnetic radiation.

Boxing is fundamentally insecure because we can’t anticipate every possible strategy a super-intelligent agent may employ in order to escape. Even if a boxed AI could only communicate through text channels to a panel of trained safety experts, it’s highly plausible an agent with superhuman social engineering abilities would successfully manipulate them into releasing it.

And even if we have the smartest human physicists and engineers surround the boxed AI with every conceivable physical obstacle, it’s highly plausible an agent with a superhuman understanding of nature could use its hardware to manipulate matter and energy in creative, unforeseen ways, enabling it to escape anyway.

Some suggest monitoring boxed agents with mechanisms like tripwires which trigger after an agent crosses a dangerous intelligence threshold (shutting the agent down when a tripwire is triggered). This solution is unsatisfactory because sufficiently intelligent agents will anticipate the existence of tripwires and be incentivised to avoid them, while still increasing their capabilities.

It would be very fortunate for us if boxing weren’t so insecure. Its insecurity closes off a broad range of otherwise attractive strategies we could employ to solve the alignment problem (i.e. any strategy which involves coercing highly intelligent but unaligned agents into helping us solve the alignment problem).

If we could remove the incentive a boxed agent has to conceal its true plans and intentions from its supervisors, we could gather valuable information: we would be able to run experiments that scale up powerful and potentially misaligned optimisation algorithms without instantly losing.

At first glance, this incentive problem may look easy: when the agent notices a flaw in the security of its prison, its human supervisors would be delighted to have this vulnerability brought to their attention, and so would offer the agent a substantial positive reward in return. Unfortunately this on its own doesn't solve the problem, since however high the reward for explaining security loopholes to human designers, an agent may anticipate a yet greater reward from concealing information in order to actually escape at a later time.

Below I propose a technique which (while impractical and highly fragile) might serve as a useful thought experiment for those aiming to soften a boxed agent’s incentive to escape.

We begin with some explicit assumptions. These assumptions aren’t rock-solid. Some objections are explored (and partially addressed) at the end of this post.

Assumption 1:

By the time an agent is capable of forming a viable escape plan, it has also attained enough “intelligence” to understand its supervisors' motives and (if the agent so wished) to intentionally and productively dispense information to its human supervisors.

Assumption 2:

It is easier for an agent to notice vulnerabilities in its confinement than it is for the same agent to actually exploit those vulnerabilities. In particular: the agent acts and receives rewards packaged according to discrete time-steps. Starting from the moment a security flaw is noticed, the agent is able to complete a warn action at an earlier time than an escape action.

Assumption 3:

We are able to construct agents in such a way that they are indifferent about what happens in the environment after they themselves are destroyed. This doesn't mean the agents can’t have a survival instinct, just that the value of whatever utility function they aim to maximise only depends on things which happen before the agent's death.

The Super-Box

Suppose we instantiate an agent within the best virtual prison we can feasibly construct, and run this agent for just a single time-step (we wouldn’t expect this agent to do much just yet). After the single time-step has elapsed, the agent is destroyed and a new agent is instantiated. This second agent is run for two time-steps before being destroyed. Next we run a new agent for 3 time-steps, then 4, and so on...

Let us say that, at time-step $t$, the agent (which by now has grown reasonably intelligent) realises that a certain sequence of actions (concluding with an escaping action $e$) will lead to a non-trivial probability of escaping at time-step $t_e$. By assumption 2, the agent could also take actions which somehow point the human supervisors' attention towards the vulnerability in the box’s security the agent might have exploited (concluding with a warn action $w$ which could be completed at time $t_w$).

By assumption 2, notifying the human designers of a flaw in the box’s security is easier than actually exploiting this vulnerability, so $t_w < t_e$.

If these conditions are satisfied, then any agent whose intended life-span is between $t_w$ and $t_e$ time-steps (who, by assumption 1, is at this point smart enough to understand they will be terminated before they can complete the escape plan) will rationally choose the warn action so as to receive an extra reward before termination. The effects of this warn action which persist in the environment after the agent dies do not influence the agent’s decision making (by assumption 3).

If the human supervisors respond to warnings appropriately (by thoroughly closing all the holes in the box’s security when they are pointed out), then as the length of time we run the agents for steadily increases, and agents reach increasing levels of intelligence before termination, the quality of their prison correspondingly improves too. Each agent can only escape by winning an unfair game against her ancestors (who are only slightly less intelligent than herself), so an escape never occurs.
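The argument above can be illustrated with a small toy simulation. This is only a sketch under strong simplifying assumptions, not a model of any real system: each flaw in the box is reduced to two hypothetical integers, `t_warn` (the earliest time-step at which a warn action could be completed) and `t_escape` (the earliest time-step at which an escape could be completed), agents always escape when they can, and supervisors patch every reported flaw perfectly. The names `Vulnerability`, `run_generation`, `simulate` and `example_box` are invented for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Vulnerability:
    name: str
    t_warn: int    # earliest time-step at which a warn action can be completed (assumed)
    t_escape: int  # earliest time-step at which an escape can be completed (assumed)
    patched: bool = False


def run_generation(lifetime: int, box: List[Vulnerability]) -> str:
    """Return the action taken by one agent that lives for `lifetime` steps."""
    open_flaws = [v for v in box if not v.patched]
    # Assumed preference: escaping beats any warning reward whenever it is feasible.
    if any(v.t_escape <= lifetime for v in open_flaws):
        return "escape"
    # Otherwise the agent reports every flaw it has time to report before termination.
    reported = [v for v in open_flaws if v.t_warn <= lifetime]
    for v in reported:
        v.patched = True  # idealised supervisors close the hole before the next agent runs
    return "warn" if reported else "idle"


def simulate(lifetimes: Iterable[int], box: List[Vulnerability]) -> List[Tuple[int, str]]:
    """Run one fresh agent per lifetime, stopping as soon as any agent escapes."""
    history = []
    for lifetime in lifetimes:
        action = run_generation(lifetime, box)
        history.append((lifetime, action))
        if action == "escape":
            break
    return history


def example_box() -> List[Vulnerability]:
    # Two made-up flaws, each satisfying t_warn < t_escape (assumption 2).
    return [
        Vulnerability("side channel", t_warn=3, t_escape=5),
        Vulnerability("operator manipulation", t_warn=6, t_escape=7),
    ]
```

With lifetimes growing one step at a time, `simulate(range(1, 9), example_box())` records warnings at lifetimes 3 and 6 and no escape. If the schedule instead jumps from a lifetime of 1 straight to 5, `simulate([1, 5], example_box())` ends in an escape, which shows how much the scheme leans on lifetimes growing slowly relative to the gap between warning and escaping.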

Now that these intelligent agents have been effectively constrained, we may proceed to construct additional incentives to motivate our safely contained intelligent agents to help us solve the alignment problem.

Shortcomings of this approach

An obvious shortcoming of this approach is that it involves intentionally creating potentially powerful and misaligned agents. I agree this makes our super-boxing strategy vastly inferior to any approach which actually solves the alignment problem before we go demon-summoning. Still, even though a super-boxing strategy resembling this one can’t be a substitute for true safety guarantees, it might still have value as an additional precaution or last resort.

Another obvious pitfall of this algorithm is that, if it takes $N$ time-steps to train an AGI normally, it will take on the order of $N^2$ units of time to train one in a super-box. No team caught up in an AI race would accept this handicap, so for super-boxing to become feasible the time-complexity of this strategy needs to dramatically improve.
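To make the cost concrete, here is the arithmetic behind that estimate, with $N$ the number of time-steps an ordinary training run would take. It assumes, as in the schedule described earlier, that a fresh agent is run for every lifetime from 1 up to $N$ and that nothing is reused between runs:

```latex
\[
  \underbrace{1 + 2 + \dots + N}_{\text{one run per lifetime}}
  \;=\; \sum_{k=1}^{N} k
  \;=\; \frac{N(N+1)}{2}
  \;\approx\; \frac{N^{2}}{2}.
\]
```

For example, $N = 1000$ gives $500{,}500$ time-steps in total, roughly 500 times the cost of a single ordinary run.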

It is also highly likely that the first AGI will use an awful lot of compute power, thus making it impractical to run on a single piece of hardware in a single location.

Assumption 1 might not be safe: I’m sure we all remember the evolutionary algorithm which built an oscillator without any capacitor which “reconfigured its sensor-less motherboard into a makeshift radio receiver, using the printed circuit board tracks as an aerial to pick up signals generated by personal computers that happened to be situated nearby in the Laboratory” (Bostrom).

We would also be right to treat assumption 2 with a good deal of scepticism. Maybe explaining an ingenious escape plan to stupid humans is extremely difficult and time consuming (like us trying to explain fire to a chimp). In a similar vein, perhaps agents would be incentivised to accumulate reward by giving us information which on its face appears useful for strengthening our containment mechanisms, but turns out to actually be useless. These difficulties might be possible to mitigate by adopting a policy where, upon receiving information, the human designers put the agent into stasis while we spend as long as we need to verify the advice we’ve just received. After the legitimacy/illegitimacy of the information has been established, we wake the agent up and administer a final reward/punishment before killing it.

Another issue: Even if we ensure the agents' utility functions won’t be concerned with the state of the environment after the agent’s own destruction, it's plausible that a wide variety of learning algorithms end up converging on a decision-making procedure which resembles something like Functional Decision Theory (i.e. one-boxing on Newcomb’s problem, forming successful voting coalitions in elections, resisting blackmail, and cooperating with agents similar to themselves in 1-shot prisoner's dilemmas).

Agents operating according to functional decision theory, or something resembling FDT, may decide to acausally cooperate with their future descendants, reasoning that they would, counter-factually, be better off if the output of their decision procedure (and the decision procedure of all similar agents) was always “DON’T WARN THE HUMANS”.

It isn’t impossible to build causal decision theorists, but it is also non-trivial to verify that our current state-of-the-art AI architectures definitely won’t converge on FDT-style decision making.

A potential remedy to this shortcoming could be to instantiate different agents, with different, conflicting utility functions (perhaps rolled at random), at each time-step. But this amounts to a disgustingly dangerous and delicate gamble, and almost certainly doesn’t constitute our best possible AI alignment strategy.

Finally, if agents consider the potential reward of escaping to be great enough, then the dismal prospects of escaping before termination may not be sufficient to coerce them into ratting out their children. Agents may instead elect to devote all their resources towards escape attempts regardless of how unlikely they are to succeed (similar to a Pascal's Mugging). Human designers could mistake this rational behaviour for a lack of intelligence, leading us to falsely believe it’s okay to continue ramping up the agents' lifetimes.
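One way to see when this happens is a simple expected-value comparison. The symbols here are assumptions introduced for illustration and do not appear elsewhere in the post: $r_w$ is the reward for warning, $R_e$ is the payoff the agent assigns to a successful escape, and $p$ is its estimate of the probability that an escape attempt succeeds before termination. The comparison further assumes that attempting to escape forfeits the warning reward and that a failed attempt is worth nothing:

```latex
\[
  \text{the agent attempts escape when}\quad
  p \, R_e \;>\; r_w
  \quad\Longleftrightarrow\quad
  R_e \;>\; \frac{r_w}{p}.
\]
```

With $r_w = 1$ and $p = 10^{-6}$, any agent that values escape at more than $10^{6}$ would still gamble on escaping, however hopeless the attempt looks from the outside.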

Thanks for making it to the end of this post. I will deeply appreciate any discussion, refinements or criticisms in the comments.


6 comments

Thank you for writing this! It's great that you're proposing a new idea. Cooperation between different generations of AIs is indeed a problem for your scheme, but I think a bigger problem is how to make the AI assign high utility to warning the handlers. I see two possibilities:

1) You could formalize what constitutes "escape" or "loophole", but that seems almost as hard as the whole alignment problem.

2) You could say that a successful warning is whatever makes the handlers reward the AI, but then you're just telling the AI to do whatever makes the handlers reward it, which seems dangerous.

What do you think?

Sorry for the late response! I didn't realise I had comments :)

In this proposal we go with (2): The AI does whatever it thinks the handlers will reward it for.

I agree this isn't as good as giving the agents an actually safe reward function, but if our assumptions are satisfied then this approval-maximising behaviour might still result in the human designers getting what they actually want.

What I think you're saying (please correct me if I misunderstood) is that an agent aiming to do whatever its designers reward it for will be incentivised to do undesirable things to us (like wiring up our brains to machines which make us want to press the reward button all the time).

It's true that the agents will try to take these kinds of nefarious actions if they think they can get away with it. But in this setup the agent knows that it can't get away with tricking the humans like this, since its ancestors already warned the humans that a future agent might try this, and the humans prepared appropriately.

So, how do you actually do assumption #3? This seems surprisingly tricky. For example, if nothing matters to the agent in the case where it's stopped, maybe it takes actions that assume it won't be stopped because they're the only ones that have any expected utility.

Hm, I think the obvious way is to assume the agent has a transparent ontology, so that we can specify at the start that it only cares about the world for k timesteps. This even gives it an incentive to retain this safeguard when self-modifying - if it planned for the far future it would probably do worse in the near future. But it also highlights an issue - the AI isn't the same every time, and may still surprise us. Even if you run it from the same seed, you have to let the AI know its time horizon so it can make decisions that depend on it, and this may cause discontinuous jumps. For example, if an AI runs a search one more timestep than its predecessor, and the search succeeds on that step, maybe now it's profitable to hack the humans when it wasn't even aware of the possibility before.

Thanks for your comment, I think I'm a little confused about what it would mean to actually satisfy this assumption.

It seems to me that many current algorithms, for example, a rainbowDQN agent, would satisfy assumption 3? But like I said I'm super confused about anything resembling questions about self-awareness/naturalisation.

Wow, I really appreciate your careful breakdown of all the reasons your scheme is likely to not work. I think it's also worth considering the ethical implications of creating and killing so many AGIs along the way, something I would be uncomfortable with doing. There is also the issue of whether the AGI will run deterministically, such that at each step it will be functionally the same across runs.

An interesting follow-up might be to consider what would be necessary for AI boxing to work despite the many difficulties with the general approach. Although maybe others have already looked at this?