I'm not sure where to post an idea for AI control research, so I'll do it here. It somehow spun off from your post, the recent treacherous turn post and LW slack discussions.
This is the idea: could we gamify AI safety research? The approach would be to create a setting where the players have to obey the AI safety rules and still achieve an objective in the in-game world. This could be a simulated virtual world in a computer game or a role-playing world. To provide sufficient motivation, the in-game world would e.g. consist of a population of beings that are evil (to a typical human player) and that interact with each other; your most likely purpose is to make them do the things you want (as in many other computer games). You try to squeeze out as many resources as you can while still obeying the rules. The game would progress from simple AI control rules like Asimov's robot laws to more advanced AI control rules, and we would find out whether people can hack these. If people can, an AI probably can too.
I've taken a few shots at this, e.g. at http://lesswrong.com/r/discussion/lw/mfq/presidents_asteroids_natural_categories_and/cjkr and http://lesswrong.com/lw/m25/high_impact_from_low_impact/cah1. There's no explicit role-playing, but I was very much in the mindset of trying to break the protection scheme.
I haven't been keeping up with these posts as well lately.
I think that in the absence of actual AI, using humans is the best approximation you can get. And games with in-game rewards seem to work well as a motivator. Men die for points.
But yes, putting this to real use (and we may need all the help we can get) may require some more work.
in absence of actual AI using humans is the best approximation you can get
Humanity has been practicing controlling and restraining humans (and, vice versa, humans have been practicing escaping and subverting control) for thousands of years.
And games with in-game reward seems to work well as a motivator.
Real life provides better motivation. No save points, y'know :-/
and it's not clear that a sense of identity prevents the creation of subagents in the first place
It doesn't. Humans create sub-agents all the time to do their bidding. No, I do not mean children. I mean sending out other people to do errands. Yes, this is imperfect, but the AI's sub-agents wouldn't be perfect either. They may fail. In particular, any sub-agent may fail, and any restriction that demands never failing (including sub-agent failure) is bound to cause malfunction in the first place.
Yes. Some notion of identity is needed in any case for the AI. It has to encompass its executive functions at least. Identity distinguishes the AI from what is not the AI. I see no reason why this couldn't include sub-agents. It is more a question of where the line is drawn, not if. I'm looking forward to a future post of yours on identity.
A putative new idea for AI control; index here.
Status: preliminary. This is mainly to put down some of the ideas I've had, for later improvement or abandonment.
The subagent problem, in a nutshell, is that "create a powerful subagent with goal U that takes over the local universe" is a solution for many of the goals an AI could have - in a sense, the ultimate convergent instrumental goal. And it tends to evade many clever restrictions people try to program into the AI (eg "make use of only X amount of negentropy", "don't move out of this space").
So if the problem could be solved, many other control approaches could potentially become available.
The problem is very hard, because an imperfect definition of a subagent is simply an excuse to create a subagent that skirts the limits of that definition (hum, that style of problem sounds familiar). For instance, if we want to rule out subagents by preventing the AI from having much influence if the AI itself were to stop ("If you die, you fail, no other can continue your quest"), then it is motivated to create powerful subagents that carefully reverse their previous influence if the AI were to be destroyed.
Controlling subagents
Some of the methods I've developed seem suitable for controlling the existence or impact of subagents.
These can be thought of as ruling out the subagents' existence, their creation, their influence (or importance) and their independence. The last two can be particularly tricky, as we want to make sure that our formal definition of importance matches up with our informal one, and we currently lack a well-defined "die" goal.
We could also think of defining identity by using some of the tricks and restrictions that have caused humans to develop one (such as our existing in a single body with no ease of copying), but it's not clear that this definition would remain stable once the restrictions were lifted (and it's not clear that a sense of identity prevents the creation of subagents in the first place).
Subagents processing information
Here I want to look at one other aspect of the subagents, the fact that they are subagents, and, as such, do some of the stuff that agents do - such as processing information and making decisions. Can we use the information processing as a definition?
Consider the following model. Our lovely Clippy wants to own a paperclip. It knows that the paperclip exists behind one of a hundred doors; opening one of them seals all the others, for ever. In a few minutes, Clippy will be put to sleep, but it has a simple robot that it can program to go and open one of the doors and retrieve the paperclip for it.
Clippy currently doesn't know where the paperclip is, but it knows that its location will be announced a few seconds after Clippy sleeps. The robot includes a sound recording system inside it.
It seems there are two clear levels of agency the robot could have: either it goes to a random door, or it processes the announcement to pick the correct door. In the first case, the robot and Clippy have a 1% chance of getting the paperclip; in the second, a 100% chance. The distributions of outcomes are clearly different.
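As a rough illustration of the two levels of agency, here is a minimal simulation sketch. It assumes the announcement is always truthful and that nothing stops the robot from reaching its chosen door; the names `random_robot`, `listening_robot` and `success_rate` are hypothetical, introduced only for this example.

```python
import random

NUM_DOORS = 100  # the hundred doors from the model


def random_robot(announced_door, rng):
    """Lower level of agency: ignore the announcement, open a uniformly random door."""
    return rng.randrange(NUM_DOORS)


def listening_robot(announced_door, rng):
    """Higher level of agency: process the announcement and open the announced door."""
    return announced_door


def success_rate(policy, trials=100_000, seed=0):
    """Estimate how often a robot policy retrieves the paperclip."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        paperclip_door = rng.randrange(NUM_DOORS)
        # Assumption: the announcement truthfully names the paperclip's door.
        chosen = policy(paperclip_door, rng)
        wins += (chosen == paperclip_door)
    return wins / trials


if __name__ == "__main__":
    print("random robot:   ", success_rate(random_robot))
    print("listening robot:", success_rate(listening_robot))
```

The random robot's estimated rate should hover around 1%, while the listening robot succeeds every time under these assumptions, which is the gap in outcome distributions described above.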
But now suppose there is a human guard lounging around, trying to prevent the robot from getting to any door. If the robot has a neutron bomb, it can just kill the human without looking. If it has a giant cannon, it can simply shoot the guard where they stand. If it has neither of these, then it can attempt to bargain with the guard, negotiating, seducing and tricking them in the course of a conversation. Once the guard is removed, the situation reverts to the above.
Each of the three methods requires a different level of information processing (different levels of optimisation power, in essence). We can add more guards, make them more or less competent, add other agents with other goals, add more ways of achieving the paperclip, and so on, to grade how much information processing the robot has.
To calibrate such a scale, we could use "Clippy moves into the robot and controls it fully" as the upper bound, representing the robot having maximal information processing power (and a very focused outcome space). And we could use some pre-prepared actions (such as the robot randomly choosing a door) to calibrate the lower end. The aim is to construct a definition of information processing that could be used to define the existence of (effective) subagents.
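One crude way to picture such a calibration is to place a robot's success probability in a fixed scenario between the two anchors. This is only a sketch under a strong assumption that the post does not make: that success probability in one scenario is an adequate stand-in for information processing. The function name `calibrated_score` and the linear interpolation are hypothetical choices for illustration.

```python
def calibrated_score(p_robot, p_lower, p_upper):
    """Place a robot's success probability on a 0-to-1 scale between two anchors.

    p_lower: success probability of the pre-prepared action (e.g. a random door).
    p_upper: success probability when Clippy controls the robot fully.
    Assumption: linear interpolation between the anchors is a meaningful scale.
    """
    if p_upper <= p_lower:
        raise ValueError("upper anchor must exceed lower anchor")
    score = (p_robot - p_lower) / (p_upper - p_lower)
    # Clamp, so robots outside the anchored range still map into [0, 1].
    return min(1.0, max(0.0, score))


if __name__ == "__main__":
    # Anchors taken from the unguarded door scenario: 1% and 100%.
    print(calibrated_score(0.01, 0.01, 1.0))
    print(calibrated_score(0.505, 0.01, 1.0))
    print(calibrated_score(1.0, 0.01, 1.0))
```

A single number like this would have to be measured across many scenarios (more guards, more competent guards, other agents) before it said much about a robot's information processing.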
Still feels likely that this will fail, though, without something more.