A putative new idea for AI control; index here.
Status: preliminary. This mainly to put down some of the ideas I've had, for later improvement or abandonment.
The subagent problem, in a nutshell, is that "create a powerful subagent with goal U that takes over the local universe" is a solution for many of the goals an AI could have - in a sense, the ultimate convergent instrumental goal. And it tends to evade many clever restrictions people try to program into the AI (eg "make use of only X amount of negentropy", "don't move out of this space").
So if the problem could be solved, many other control approaches could be potentially available.
The problem is very hard, because an imperfect definition of a subagent is simply an excuse to create an a subagent that skirts the limits of that definition (hum, that style of problem sounds familiar). For instance, if we want to rule out subagents by preventing the AI from having much influence if the AI itself were to stop ("If you die, you fail, no other can continue your quest"), then it is motivated to create powerful subagents that carefully reverse their previous influence if the AI were to be destroyed.
Some of the methods I've developed seem suitable for controlling the existence or impact of subagents.
- Reduced impact methods can prevent subagents from being created, by requiring that the AI's interventions be non-disruptive ("Twenty million questions") or undetectable.
- Reducing the AI's output options to a specific set can prevent them from being able to create any in the first place.
- Various methods around detecting importance can be used to ensure that, though subagents may exist, they won't be very influential.
- Pre-corriged methods can be used to ensure that any subagents remain value aligned with the original agent. Then, if there is some well-defined "die" goal for the agent, this could take all the agents with them.
These can be thought as ruling out the agent's existence, their creation, their influence (or importance) and their independence. The last two can be particularly tricky, as we want to make sure that our formal definition of importance matches up with our informal one, and we currently lack a well defined "die" goal.
We could also think of defining identity by using some of the tricks and restrictions that have caused humans to develop one (such as our existing in a single body with no east of copying), but it's not clear that this definition would remain stable once the restrictions were lifted (and it's not clear that a sense of identity prevents the creation of subagents in the first place).
Subagents processing information
Here I want to look at one other aspect of the subagents, the fact that they are subagents, and, as such, do some of the stuff that agents do - such as processing information and making decisions. Can we use the information processing as a definition?
Consider the following model. Our lovely Clippy wants to own a paperclip. They know that it exists behind one of a hundred doors; opening one of them seals all the others, for ever. In a few minutes, Clippy will be put to sleep, but it has a simple robot that it can program to go and open one of the doors and recuperate the paperclip for it.
Clippy currently doesn't know where the paperclip is, but it knows that its location will be announced a few seconds after Clippy sleeps. The robot includes a sound recording system inside it.
It seems there are two clear levels of agency the robot could have: either it goes to a random door, or it processes the announcement, to pick the correct door. In the first case, the robot and Clippy have a 1% chance of getting the paperclip; in the second, a 100% chance. The distributions of outcomes is clearly different.
But now suppose there is a human guard longing around, trying to prevent the robot from getting to any door. If the robot has a neutron bomb, it can just kill the human without looking. If it has a giant cannon, it can simply shoot the guard where they stand. If it has neither of these, then it can attempt to bargain with the guard, negotiating, seducing and tricking them in the course of a conversation. One the guard is removed, the situation reverts to the above.
Each of the three methods requires different level of information processing (different levels of optimisation power, in essence). We can add more guards, make them more or less competent, add other agents with other goals, add more ways of achieving the paperclip, and so on, to grade how much information processing the robot has.
To calibrate such a scale, we could use the upper bound as "Clippy moves into the robot and controls it fully" to represent the robot having maximal information processing power (and a very focused outcome space). And we could use some pre-prepared actions (such as the robot randomly choosing a door) to calibrate the lower end. The aim is to construct a definition of information processing that could be used to define the existence of (effective) subagents.
Still feels likely that this will fail, though, without something more.