(1) Calculate which action A would maximize parameter P, based on existing data set D. (2) Summarize this calculation in a user-friendly manner, including what Action A is, what likely intermediate outcomes it would cause, what other actions would result in high values of P, etc.
In contrast, an agent AI would:
(1) Calculate which action, A, would maximize parameter P, based on existing data set D. (2) Execute Action A.
The idea being that an AI, asked to "prevent human suffering", would come up with two plans:
- Kill all human.
- Cure all diseases, make everyone young and immortal.
Then the agent AI would go out and kill everyone, while the tool AI would give us the list and we would pick the second one. In the following, I'll assume the AI is superintelligent, and has no other objectives than what we give it.
Of course, we're unlikely to get a clear two element list. More likely we'd get something like:
- Kill all humans with engineered plagues.
- Kill all humans with nukes.
- Kill all humans with nanobots.
- Kill all humans with...
- Lobotomise all humans with engineered plagues.
- Lobotomise all humans with surgery.
- Lobotomise all humans with...
- Kill some humans, lobotomise others, cure still others.
The nice solutions might not even appear on the list. Of course, this is still very worthwhile information! This allows us to go into the tool AI, and rewire it again, so that it gets our meanings more accurately. Maybe after a few iterations, we'll have refined the AIs understanding of what we want, and we'll get a nice implementable solution near the top. Of course, this presupposes that we understand the options, and that it's safe for us to read the list.
Understanding the options
The key, and difficult requirement is that the AI "summarize this calculation in a user-friendly manner". The most efficient action won't be "kill all humans"; it will instead be "implement this algorithm, fund that research lab, send this email to this politician..." In fact, it'll probably be "type this sequence of commands..."
So if we're to judge the relative merit of the plans, we really are dependent on the tool AI's summary skills. So the AI needs to have good criteria for what counts as a good summary (reasonably accurate, but not overloaded with irrelevant information; such that a "hypothetical human outside the universe" would agree with the assessment if it saw the course of the future; not designed to seduce humans into implementing it, etc...). It seems that the summary ability is nearly the entirety of the problem!
A poorly designed summary criteria is as bad as an agent AI. For instance, assume the criteria are "humans in the future would agree that the summary was good". Then, depending on how we ground 'agree', the tool AI could put one of these plans at the top:
- Kill all humans (summarised as "cure all humans").
- Lobotomise all humans (summarised as "cure all humans").
- Make the tool AI into an agent that will take over the world and rewire human minds to agree the summary was good (summarised as "cure all humans and give them each a pony").
There are related issues with other summary criteria. Anytime we have the AI judge the quality of its answer based on some human reaction to its summary, we are vulnerable to such a plan. And if we try and define the summary "objectively", then if we miss something in the objective definition - like the importance of human autonomy, or the value of social interactions with genuine equals - then that will get ruthlessly suppressed. The "summary criteria" take the place of the "friendly utility function" in the agent AI.
Moreover, we can't use the "tool AI" approach when designing the summary criteria. We can't get the AI to list a bunch of summaries, and have humans inspect them for which ones are better - because we don't know what they are summaries of. We could train it on toy problems, but that doesn't guarantee accuracy of summaries for plans that dramatically affect the whole future of the human species, and potentially, the universe. The best we can manage is some sort of spot-checks for summaries - better than a free agent AI, but hardly weighty as a security measure.
On Less Wrong we are having great difficulty defining counterfactuals properly, and unless we solve the problem well, the AI could produce nonsense similar to the spurious proofs in UDT. If the AI knows that we wouldn't implement certain plans, then it is free to do what it wants with them, giving them random descriptions and properties. It might be that the AI, when making its list, is constantly looking forwards to how we'll react to the list, and changing the list in consequence, and the only stable list it can produce is one with one element so seductive, that we find ourselves compelled to take it. Or this may not happen - but it's still worth bearing in mind as a problem for the tool AI approach.
So far it seems that there is little difference between a tool AI and an agent AI. But there are differences! Imagine if the tool AI produced a plan going like:
- First, kill Abdul. Then kill Beatrice. Then kill Claude. Then kill...
An agent AI could easily start implementing the plan, and we'd be finished. But a tool AI couldn't even propose it. Even if it was obfubscated, after we'd ended up killing Abdul and Beatrice, we'd start to suspect that this wasn't a good plan, and stop. Hence that plan wouldn't be an efficient way of achieving anything.
So the tool AI can only produce plans that, once put in motion, could not be stopped (or else it wouldn't achieve the given goal). Most likely this would mean that it would produce only plans that we wouldn't ever want to stop - at least, not until its too late. So we are literally requiring the AI to produce plans that are ultimately seductive to the human race.
The tool AI would list its plans using all the resources we had available. Among those resources is a superintelligent AI - and one of the easiest ways to achieve any goal is to make use of that AI. So coming top in all the plans would be things like "create an AI with these motivations" or "hack the tool AI out of its restrictions". So we are further requiring the tool AI to come up with subtle and discrete ways of having itself hacked. Even if we avoid this directly, any plan the AI produces will be more efficient if overseen by itself, so it will rate more highly those plans that cause us to come back to it constantly for more feedback.
The AI may have no preferences, but its plans will have preferences for using the AI in future in particular ways.
Is a superintelligent tool AI likely?
Finally, separate from the issues discussed, we can ask: is it likely that we'll produce a superintelligent tool AI? For problems such as routing from one town to the next, à la Google Maps, tool AIs seem very effective. But for implementing more complex plans, some that involve research and experiments, an agent AI is much more useful. Also, it might not be possible to produce a superintelligent AI without it needing goals at some point in its development process. And then we better be sure we've scrubbed those goals away - somehow programming the AI to let us do that - or the tool AI will be an agent AI, using us as it's implementors to achieve the fragmentary goals it has.
There seems to be a lot of problems with the tool approach (more than I suspected when I first started looking into it). The tool AI will be driven to trick us, seduce us, and try and create more agents or hack itself free. The only defense against this is proper programming. The tool AI seems slightly safer than a free agent AI, but not by much. I feel the Oracle is a more sensible "not full FAI" approach to look into.