Tackling the subagent problem: preliminary analysis

Stuart_Armstrong

A putative new idea for AI control; index here.

Status: preliminary. This is mainly to put down some of the ideas I've had, for later improvement or abandonment.

The subagent problem, in a nutshell, is that "create a powerful subagent with goal U that takes over the local universe" is a solution for many of the goals an AI could have - in a sense, the ultimate convergent instrumental goal. And it tends to evade many clever restrictions people try to program into the AI (eg "make use of only X amount of negentropy", "don't move out of this space").

So if the problem could be solved, many other control approaches could be potentially available.

The problem is very hard, because an imperfect definition of a subagent is simply an excuse to create a subagent that skirts the limits of that definition (hmm, that style of problem sounds familiar). For instance, if we want to rule out subagents by preventing the AI from having much influence if the AI itself were to stop ("If you die, you fail, no other can continue your quest"), then it is motivated to create powerful subagents that carefully reverse their previous influence if the AI were to be destroyed.

Controlling subagents

Some of the methods I've developed seem suitable for controlling the existence or impact of subagents.

Reduced impact methods can prevent subagents from being created, by requiring that the AI's interventions be non-disruptive ("Twenty million questions") or undetectable.
Reducing the AI's output options to a specific set can prevent them from being able to create any in the first place.
Various methods around detecting importance can be used to ensure that, though subagents may exist, they won't be very influential.
Pre-corriged methods can be used to ensure that any subagents remain value aligned with the original agent. Then, if there is some well-defined "die" goal for the agent, this could take all the subagents with it.

These can be thought of as ruling out the subagents' existence, their creation, their influence (or importance) and their independence. The last two can be particularly tricky, as we want to make sure that our formal definition of importance matches up with our informal one, and we currently lack a well-defined "die" goal.

We could also think of defining identity by using some of the tricks and restrictions that have caused humans to develop one (such as our existing in a single body with no ease of copying), but it's not clear that this definition would remain stable once the restrictions were lifted (and it's not clear that a sense of identity prevents the creation of subagents in the first place).

Subagents processing information

Here I want to look at one other aspect of the subagents, the fact that they are subagents, and, as such, do some of the stuff that agents do - such as processing information and making decisions. Can we use the information processing as a definition?

Consider the following model. Our lovely Clippy wants to own a paperclip. Clippy knows that it lies behind one of a hundred doors; opening one of them seals all the others, forever. In a few minutes, Clippy will be put to sleep, but it has a simple robot that it can program to go and open one of the doors and retrieve the paperclip for it.

Clippy currently doesn't know where the paperclip is, but it knows that its location will be announced a few seconds after Clippy sleeps. The robot includes a sound recording system inside it.

It seems there are two clear levels of agency the robot could have: either it goes to a random door, or it processes the announcement to pick the correct door. In the first case, the robot and Clippy have a 1% chance of getting the paperclip; in the second, a 100% chance. The distribution of outcomes is clearly different.
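The two levels of agency can be checked with a small simulation. This is only an illustrative sketch: the function names (`random_door_robot`, `listening_robot`, `success_rate`) are made up for this example, and it assumes the announcement is always correct and that the paperclip is placed uniformly at random.

```python
import random

N_DOORS = 100  # number of doors in the model


def random_door_robot(announcement, rng):
    """Lowest level of agency: ignores the announcement, picks any door."""
    return rng.randrange(N_DOORS)


def listening_robot(announcement, rng):
    """Higher level of agency: processes the announcement and obeys it."""
    return announcement


def success_rate(robot_policy, trials=100_000, seed=0):
    """Estimate how often the robot opens the door hiding the paperclip."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Assumption: the paperclip location is uniform over the doors
        # and the announcement reports it truthfully.
        paperclip_door = rng.randrange(N_DOORS)
        chosen_door = robot_policy(paperclip_door, rng)
        wins += (chosen_door == paperclip_door)
    return wins / trials


if __name__ == "__main__":
    print("random door:", success_rate(random_door_robot))
    print("listening:  ", success_rate(listening_robot))
```

The only difference between the two policies is whether the robot's action depends on the information it receives after Clippy sleeps, which is the feature the section is trying to use as a definition.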

But now suppose there is a human guard lounging around, trying to prevent the robot from getting to any door. If the robot has a neutron bomb, it can just kill the human without looking. If it has a giant cannon, it can simply shoot the guard where they stand. If it has neither of these, then it can attempt to bargain with the guard, negotiating, seducing and tricking them in the course of a conversation. Once the guard is removed, the situation reverts to the above.

Each of the three methods requires a different level of information processing (different levels of optimisation power, in essence). We can add more guards, make them more or less competent, add other agents with other goals, add more ways of achieving the paperclip, and so on, to grade how much information processing the robot has.

To calibrate such a scale, we could take "Clippy moves into the robot and controls it fully" as the upper bound, representing the robot having maximal information processing power (and a very focused outcome space). And we could use some pre-prepared actions (such as the robot randomly choosing a door) to calibrate the lower end. The aim is to construct a definition of information processing that could be used to define the existence of (effective) subagents.
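One very crude way to picture such a calibration is to rescale a robot's success probability between the two reference points. The sketch below is an assumption made purely for illustration, not a proposed definition: the name `calibrated_score`, the linear rescaling, and the use of success probability as the measured quantity are all hypothetical choices. The default bounds are the 1% and 100% figures from the door model.

```python
def calibrated_score(success_prob, lower=0.01, upper=1.0):
    """Place a robot on a 0-to-1 scale of effective information processing.

    lower: success probability of the pre-prepared baseline
           (the robot choosing a door at random).
    upper: success probability when Clippy controls the robot fully.
    The linear rescaling is an illustrative assumption only.
    """
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    score = (success_prob - lower) / (upper - lower)
    # Clamp, so a robot that does worse than the baseline scores 0.
    return max(0.0, min(1.0, score))


if __name__ == "__main__":
    for p in (0.01, 0.505, 1.0):
        print(p, calibrated_score(p))
```

A single number like this hides most of what matters (which guards were present, which tools the robot had), so at best it would be one coordinate in a richer grading of the scenarios described above.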

It still feels likely that this will fail, though, without something more.

I'm not sure where to post an idea for AI control research so I do it here. It somehow spun off from your post, the recent treacherous turn post and LW slack discussions.

The idea is this: could we gamify AI safety research? The approach would be to create a setting where the players have to obey the AI safety rules and still achieve an objective in the in-game world. This can be a simulated virtual world in a computer game or a role-playing world. To get sufficient motivation, the in-game world would e.g. consist of a population of evil (to a typical human player) beings that interact, and your most likely purpose is to make them do things you want (as in many other computer games). Try to squeeze out as many resources as you can, while still obeying the rules. The game would progress from simple AI control rules like Asimov's robot laws to more advanced AI control rules, and find out whether people can hack these. If people can, an AI probably can too.

That's essentially what these posts are to me, except instead of a video game it's pen-and-paper with Stuart Armstrong as DM :).

It might be worth the extra motivation of writing up a framing with evil AI designers applying the proposed controls. I'll consider doing this on future posts.

Awesome! Stuart Armstrong be our Dungeon Master! :-) I haven't seen you write up your responses to our DM though. I'd like to see them.

I've made a few shots, e.g. at http://lesswrong.com/r/discussion/lw/mfq/presidents_asteroids_natural_categories_and/cjkr and http://lesswrong.com/lw/m25/high_impact_from_low_impact/cah1. There's no explicit role-playing, but I was very much in the mindset of trying to break the protection scheme.

I haven't been keeping up with these posts as well lately.

Dwarf Fortress..? X-D Or Angband Borg if you want programming.

I think what you really want is a hacking game: here is a system that blocks you, try to subvert it. You can put on a black (or a grey) hat and play it in real life :-/

There are already hacking games of this sort (the usual term is "CTF", for "capture the flag") but they don't capture any of what's alleged to be different about AI safety compared with computer security more generally.

True. I suspect gamification of AI safety research might be fun but is unlikely to be actually useful.

I think that in the absence of actual AI, using humans is the best approximation you can get. And games with in-game reward seem to work well as a motivator. Men die for points.

But yes, to put this to real use (but we may need all we can get) may require some more work.

in the absence of actual AI, using humans is the best approximation you can get

Humanity has been practicing trying to control and restrain humans (and vice versa, humans were practicing trying to escape and subvert control) for thousands of years.

And games with in-game reward seem to work well as a motivator.

Real life provides better motivation. No save points, y'know :-/

Only that real-life is not structured in a way to make AI safety research natural for humans...

Possibly. I'll keep it in mind; Jaan Tallinn is proposing some interesting programming challenges, and something like this might be able to fit in there...

I suspect that constraining a superintelligence from creating subagents will be much harder than designing AI control methods that leave no incentive to subvert them through creation of subagents.

I suspect so too. Still, worth a bit of thinking about.

and it's not clear that a sense of identity prevents the creation of subagents in the first place

It doesn't. Humans create sub-agents all the time to do their bidding. No, I do not mean children. I mean sending out other people to do errands. Yes, this is imperfect, but the AI's sub-agents wouldn't be perfect either. They may fail. In particular, any sub-agent may fail, and any restriction that demands that nothing ever fail (including sub-agents) is bound to cause malfunction in the first place.

There are some informal suggestions (which I don't think much of, so I didn't really go into deep analysis) that use a sense of identity as the basis of controlling subagents. I didn't want to go into the weeds of that in this post.

Yes. Some notion of identity is needed in any case for the AI. It has to encompass its executive functions at least. Identity distinguishes the AI from what is not the AI. I see no reason why this couldn't include sub-agents. It is more a question of where the line is drawn, not if. I'm looking forward to a future post of yours on identity.