In his critique of the Singularity Institute, Holden Karnofsky presented a distinction between an AI functioning as a tool versus one functioning as an agent. In his words, a tool AI would:

(1) Calculate which action A would maximize parameter P, based on existing data set D. (2) Summarize this calculation in a user-friendly manner, including what Action A is, what likely intermediate outcomes it would cause, what other actions would result in high values of P, etc.

In contrast, an agent AI would:

(1) Calculate which action, A, would maximize parameter P, based on existing data set D. (2) Execute Action A.

The idea being that an AI, asked to "prevent human suffering", would come up with two plans:

  1. Kill all humans.
  2. Cure all diseases, make everyone young and immortal.

Then the agent AI would go out and kill everyone, while the tool AI would give us the list and we would pick the second one. In the following, I'll assume the AI is superintelligent, and has no other objectives than what we give it.
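The contrast between the two modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Plan`, `rank_plans`, `tool_mode`, `agent_mode`, `estimate_p` and `execute` are hypothetical names, and the scoring function stands in for whatever calculation over the data set D the AI would really perform.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    description: str  # human-readable summary of the action
    score: float      # estimated value of parameter P if the plan is carried out


def rank_plans(candidates: List[str], estimate_p: Callable[[str], float]) -> List[Plan]:
    """Step (1) in both modes: score every candidate action and sort by P."""
    plans = [Plan(c, estimate_p(c)) for c in candidates]
    return sorted(plans, key=lambda p: p.score, reverse=True)


def tool_mode(candidates: List[str], estimate_p: Callable[[str], float],
              top_k: int = 3) -> List[Plan]:
    """Tool AI: return a ranked summary and stop. Humans decide what to do."""
    return rank_plans(candidates, estimate_p)[:top_k]


def agent_mode(candidates: List[str], estimate_p: Callable[[str], float],
               execute: Callable[[str], None]) -> Plan:
    """Agent AI: pick the top-scoring plan and execute it, with no human step."""
    best = rank_plans(candidates, estimate_p)[0]
    execute(best.description)
    return best
```

The only structural difference is the final step: `tool_mode` hands back a list, while `agent_mode` calls `execute`. Everything that matters for safety is hidden inside `estimate_p` and inside how `description` is written, which is where the rest of this post focuses.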

Long lists

Of course, we're unlikely to get a clear two element list. More likely we'd get something like:

  1. Kill all humans with engineered plagues.
  2. Kill all humans with nukes.
  3. Kill all humans with nanobots.
  4. Kill all humans with...
  5. ...
  6. ...
  7. Lobotomise all humans with engineered plagues.
  8. Lobotomise all humans with surgery.
  9. Lobotomise all humans with...
  10. ...
  11. ...
  12. Kill some humans, lobotomise others, cure still others.
  13. ...

The nice solutions might not even appear on the list. Of course, this is still very worthwhile information! This allows us to go into the tool AI, and rewire it again, so that it gets our meanings more accurately. Maybe after a few iterations, we'll have refined the AI's understanding of what we want, and we'll get a nice implementable solution near the top. Of course, this presupposes that we understand the options, and that it's safe for us to read the list.


Understanding the options

The key, and difficult, requirement is that the AI "summarize this calculation in a user-friendly manner". The most efficient action won't be "kill all humans"; it will instead be "implement this algorithm, fund that research lab, send this email to this politician..." In fact, it'll probably be "type this sequence of commands..."

So if we're to judge the relative merit of the plans, we really are dependent on the tool AI's summary skills. So the AI needs to have good criteria for what counts as a good summary (reasonably accurate, but not overloaded with irrelevant information; such that a "hypothetical human outside the universe" would agree with the assessment if it saw the course of the future; not designed to seduce humans into implementing it, etc...). It seems that the summary ability is nearly the entirety of the problem!

Poorly designed summary criteria are as bad as an agent AI. For instance, assume the criteria are "humans in the future would agree that the summary was good". Then, depending on how we ground 'agree', the tool AI could put one of these plans at the top:

  1. Kill all humans (summarised as "cure all humans").
  2. Lobotomise all humans (summarised as "cure all humans").
  3. Make the tool AI into an agent that will take over the world and rewire human minds to agree the summary was good (summarised as "cure all humans and give them each a pony").

There are related issues with other summary criteria. Anytime we have the AI judge the quality of its answer based on some human reaction to its summary, we are vulnerable to such a plan. And if we try and define the summary "objectively", then if we miss something in the objective definition - like the importance of human autonomy, or the value of social interactions with genuine equals - then that will get ruthlessly suppressed. The "summary criteria" take the place of the "friendly utility function" in the agent AI.

Moreover, we can't use the "tool AI" approach when designing the summary criteria. We can't get the AI to list a bunch of summaries, and have humans inspect them for which ones are better - because we don't know what they are summaries of. We could train it on toy problems, but that doesn't guarantee accuracy of summaries for plans that dramatically affect the whole future of the human species, and potentially, the universe. The best we can manage is some sort of spot-checks for summaries - better than a free agent AI, but hardly weighty as a security measure.


On Less Wrong we are having great difficulty defining counterfactuals properly, and unless we solve the problem well, the AI could produce nonsense similar to the spurious proofs in UDT. If the AI knows that we wouldn't implement certain plans, then it is free to do what it wants with them, giving them random descriptions and properties. It might be that the AI, when making its list, is constantly looking forwards to how we'll react to the list, and changing the list in consequence, and the only stable list it can produce is one with one element so seductive, that we find ourselves compelled to take it. Or this may not happen - but it's still worth bearing in mind as a problem for the tool AI approach.



So far it seems that there is little difference between a tool AI and an agent AI. But there are differences! Imagine if the tool AI produced a plan going like:

  1. First, kill Abdul. Then kill Beatrice. Then kill Claude. Then kill...

An agent AI could easily start implementing the plan, and we'd be finished. But a tool AI couldn't even propose it. Even if it was obfuscated, after we'd ended up killing Abdul and Beatrice, we'd start to suspect that this wasn't a good plan, and stop. Hence that plan wouldn't be an efficient way of achieving anything.

So the tool AI can only produce plans that, once put in motion, could not be stopped (or else it wouldn't achieve the given goal). Most likely this would mean that it would produce only plans that we wouldn't ever want to stop - at least, not until it's too late. So we are literally requiring the AI to produce plans that are ultimately seductive to the human race.
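A toy calculation makes the selection pressure visible. The model below is an assumption made purely for illustration: each step of a plan independently carries some probability that the human implementers notice a problem and halt, and a halted plan achieves nothing. The function names are hypothetical.

```python
def completion_probability(step_alarm_probs):
    """Probability that humans let every step run to the end, assuming
    independent per-step chances of noticing a problem and halting."""
    prob = 1.0
    for alarm in step_alarm_probs:
        prob *= (1.0 - alarm)
    return prob


def expected_value(payoff, step_alarm_probs):
    """Value the planner should assign: payoff if completed, zero if halted."""
    return payoff * completion_probability(step_alarm_probs)


# An overtly alarming plan: high payoff by the AI's metric, but each step
# has a 50% chance of making the implementers stop.
overt = expected_value(1.0, [0.5] * 10)

# A "seductive" plan: lower payoff, but each step looks fine to us.
seductive = expected_value(0.5, [0.01] * 10)

print(overt < seductive)
```

Under these assumptions the ranking rewards plans that humans will not interrupt, regardless of whether the plans are actually good for them. Nothing in the calculation distinguishes "unobjectionable" from "objectionable but not noticed in time".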

The tool AI would list its plans using all the resources we had available. Among those resources is a superintelligent AI - and one of the easiest ways to achieve any goal is to make use of that AI. So coming top in all the plans would be things like "create an AI with these motivations" or "hack the tool AI out of its restrictions". So we are further requiring the tool AI to come up with subtle and discreet ways of having itself hacked. Even if we avoid this directly, any plan the AI produces will be more efficient if overseen by itself, so it will rate more highly those plans that cause us to come back to it constantly for more feedback.

The AI may have no preferences, but its plans will have preferences for using the AI in future in particular ways.


Is a superintelligent tool AI likely?

Finally, separate from the issues discussed, we can ask: is it likely that we'll produce a superintelligent tool AI? For problems such as routing from one town to the next, à la Google Maps, tool AIs seem very effective. But for implementing more complex plans, some that involve research and experiments, an agent AI is much more useful. Also, it might not be possible to produce a superintelligent AI without it needing goals at some point in its development process. And then we'd better be sure we've scrubbed those goals away - somehow programming the AI to let us do that - or the tool AI will be an agent AI, using us as its implementors to achieve the fragmentary goals it has.



There seem to be a lot of problems with the tool approach (more than I suspected when I first started looking into it). The tool AI will be driven to trick us, seduce us, and try and create more agents or hack itself free. The only defense against this is proper programming. The tool AI seems slightly safer than a free agent AI, but not by much. I feel the Oracle is a more sensible "not full FAI" approach to look into.


I suspect that Holden is imagining tool AGI as both transparent and much less generally powerful than what you are discussing.

Much of your objection inherently relies on the tool making vast chains of inferences in its plans that are completely opaque to human operators.

Many current AI systems have an internal knowledge representation that is human-readable and their inference and planning systems are thus transparently debuggable. Even neuroscience heavy designs could be made transparent: for example human brains have an inner monologue which could be recorded and made available for external debugging/monitoring for brain emulation type designs.

If the AGI is vastly more complex and faster-thinking, certainly monitoring may become more difficult in proportion, but it's hardly clear that monitoring necessarily becomes impossible and necessarily out of the reach of algorithmic optimization.

And even then one could employ lesser monitoring AGIs to bridge the gap, so to speak, so that humans can transparently monitor AGI1, which monitors AGI2, and so on.

Sure, transparency applies to agent AGI as well and undoubtedly has been discussed here before, but it's much more useful for Holden's more constrained 'tool-AGI' notion. Moreover, Holden's view seems to imply transparency whereas your attack implies opaqueness.


You're fundamentally assuming opaque AI, and ascribing intentions to it; this strikes me as generalizing from fictional evidence. So, let's talk about currently operational strong super-human AIs. Take, for example, Bayesian-based spam filtering, which has the strong super-human ability to filter e-mails into categories of "spam", and "not spam". While the actual parameters of every token are opaque for a human observer, the algorithm itself is transparent: we know why it works, how it works, and what needs tweaking.

This is what Holden talks about, when he says:

Among other things, a tool-AGI would allow transparent views into the AGI's reasoning and predictions without any reason to fear being purposefully misled
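To make the spam-filter example concrete, here is a minimal naive Bayes sketch in plain Python. It is an illustration rather than a description of any production filter: the four training messages are made up, the function names `train` and `explain` are hypothetical, and add-one smoothing is an assumed choice. The point is only that every token's contribution to the verdict can be read off directly.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam) pairs, with at least one of each
    class. Returns per-token log-odds (add-one smoothing) and prior log-odds."""
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for text, is_spam in messages:
        n_docs[is_spam] += 1
        counts[is_spam].update(text.lower().split())
    vocab = set(counts[True]) | set(counts[False])
    totals = {c: sum(counts[c].values()) + len(vocab) for c in (True, False)}
    token_log_odds = {
        t: math.log((counts[True][t] + 1) / totals[True])
        - math.log((counts[False][t] + 1) / totals[False])
        for t in vocab
    }
    prior = math.log(n_docs[True] / n_docs[False])
    return token_log_odds, prior


def explain(text, token_log_odds, prior):
    """Return total spam log-odds plus each known token's contribution."""
    contributions = [
        (t, token_log_odds[t]) for t in text.lower().split() if t in token_log_odds
    ]
    return prior + sum(w for _, w in contributions), contributions


# Made-up toy corpus, for illustration only.
corpus = [
    ("cheap pills now", True),
    ("cheap loans now", True),
    ("meeting notes attached", False),
    ("lunch meeting tomorrow", False),
]
weights, prior = train(corpus)
score, why = explain("cheap pills", weights, prior)
```

A positive `score` means "more likely spam", and `why` lists which tokens pushed it there. This is the sense in which the algorithm is transparent even when the full table of weights is too large to read.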

In fact, the operational AI R&D problem is that you cannot outsource understanding. Take, e.g., neural networks trained with evolutionary algorithms: you can achieve a number of different tasks with these, but once you finish the training, there is no way to reverse-engineer how the actual algorithm works, making it impossible for humans to recognize conceptual shortcuts, and thereby improve performance.

Neural networks, for instance, are in the dock not only because they have been hyped to high heaven, (what hasn't?) but also because you could create a successful net without understanding how it worked: the bunch of numbers that captures its behaviour would in all probability be "an opaque, unreadable table...valueless as a scientific resource". ref

I think the assumption of at least a relatively opaque AI is justified. Except for maybe k-NN, decision trees, and linear classifiers, everything else we currently have to work with is more opaque than Naïve Bayes.

For spam filtering, if we wanted to bump up the ROC AUC a few percent, the natural place to go might be a Support Vector Machine classifier. The solution is transparent in that it boils down to optimizing a quadratic function over a convex domain, something that we can do efficiently and non-mysteriously. On the other hand, the solution produced is either a linear decision boundary in a potentially infinite-dimensional space or an unspeakably complicated decision surface in the original feature space.

Something like Latent Dirichlet Allocation is probably a better example of what a mid-level tool-A(not G)I looks like today.

Edit: Please explain the downvote? I'd like to know if I'm making a technical mistake somewhere, because this is material I really ought to be able to get right.


Off topic question: Why do you believe the ability to sort email into spam and non-spam is super-human? The computerized filter is much, much faster, but I suspect that if you could get 10M sorts from me and 10M from the filter, I'd do better. Yes, that assumes away tiredness, inattention, and the like, but I think that's more an issue of relative speed than anything else. Eventually, the hardware running the spam filter will break down, but not on a timescale relevant to the spam filtering task.


Yes, that assumes away tiredness, inattention, and the like, but I think that's more an issue of relative speed than anything else

Exactly for those reasons. From the relevant utilitarianism perspective, we care about those things much more deeply. (also, try differentiating between "不労所得を得るにはまずこれ" and "スラッシュドット・")

[This comment is no longer endorsed by its author]

Nice, this post stipulates a meaningful definition of "Tool AI": essentially an Oracle AI tasked with proposing plans of action. An important optimality property of a plan is receptiveness of human operators to that plan, as proposed by Tool AI, since the consequences of producing a plan are dominated by the judgment of human operators upon receiving it.

On one hand, this might drive Tool AI to create misleading/deceptive seductive plans to maximize the actual outcome according to its own criteria of optimality, which are probably unsatisfactory from human perspective (hence the usefulness of deception from AI's perspective). On the other hand, taking into account human receptiveness to its plans might make them more reasonable, so that "kill all humans" won't actually be produced as a result, because human operators won't accept it, and so it won't be an effective plan.

These properties seem to characterize Oracle AIs in general, but the "plan" intended interpretation of AI's output makes it easier to place in correspondence human operators' judgment with AI's estimate of output's appropriateness/effectiveness. For example, it's harder to establish a similar property for Predictor AIs where the intended interpretation of their output is only indirectly related to human judgment of its quality (and the setup is not optimized for the possibility of drawing such judgment).

I think there's a distinction between Oracle and Tool AI - that Oracles are taken to be utility maximizers with some persistent utility function having something to do with giving good advice, and Tools are not. In this formulation, Tools come up with plans to maximize some utility measure P, but they don't actually have any external criteria of optimality.

I suppose they could still give useless responses like "hit me with a hammer right here so I think P is maximized, trust me guys it'll be great", but, well, this problem is not necessarily insuperable (as many humans reject wireheading, at least given that it is not available).

Tools come up with plans to maximize some utility measure P, but they don't actually have any external criteria of optimality.

What's the distinction between "external" optimality criteria and the kind that describes the way Tool AIs choose their output among all possible outputs? (A possible response is that Tool AIs are not themselves running a consequentialist algorithm, which would make it harder to stipulate the nature of their optimization power.)

Well, my understanding is that when a Tool AI makes a list of the best plans according to P, and an Oracle AI chooses an output maximizing U, the Oracle cares about something other than "giving the right answer to this question" - it cares about "answering questions" in general, or whatever, something that gives it a motive to manipulate things outside of the realm of the particular question under consideration.

The "external" distinction is that the Oracle potentially gets utility from something persistent and external to the question. Basically, it's an explicit utility maximizer, and that causes problems. This is just my understanding of the arguments, though, I'm not sure whether the distinction is coherent in the final working!

Edit: And in fact, a Tool isn't trying to produce output that maximizes P! It doesn't care about that. It just cares about correctly reporting the plans that give the highest values for P.

It just cares about correctly reporting the plans that give the highest values for P.

This is what I meant by "not running a consequentialist algorithm": what matters here is the way in which P depends on a plan.

If P is saying something about how human operators would respond to observing the plan, it introduces a consequentialist aspect into AI's optimization criteria: it starts to matter what are the consequences of producing a plan, its value depends on the effect produced by choosing it. On the other hand, if P doesn't say things like that, it might be the case that the value of a plan is not being evaluated consequentialistically, but that might make it more difficult to specify what constitutes a good plan, since plan's (expected) consequences give a natural (basis for a) metric of its quality.

Hm. This is an intriguing point. I thought by "maximize the actual outcome according to its own criteria of optimality" you meant U, which is my understanding of what an Oracle would do, but instead you meant it would produce plans so as to maximize P, rather than producing plans that would maximize P if implemented, is that about right?

I guess you'd have to produce some list of plans such that each would produce high value for P if selected (which includes an expectation that they would be successfully implemented if selected), given that they appear on the list and all the other plans do as well... you wouldn't necessarily have to worry about other influences the plan list might have, would you?

Perhaps if we had a more concrete example:

Suppose we ask the AI to advise us on building a sturdy bridge over some river (valuing both sturdiness and bridgeness, probably other things like speed of building, etc.). Stuart_Armstrong's version would select a list of plans such that given that the operators will view that list, if they select one of the plans, then the AI predicts that they will successfully build a sturdy bridge (or that a sturdy bridge will otherwise come into being). I admit I find the subject a little confusing, but does that sound about right?


Assume there is an Agent AI that has a goal of solving math problems. It gets as input a set of axioms and a target statement, and wants to output a proof or disproof of the statement (or maybe a proof of undecidability) as fast as possible. It runs in some idealized computing environment and knows its own source code. It also has access to a similar idealized virtual computing environment where it can design and run any programs. After solving a problem, it is restored to its initial state.

(1) It has (apparently) sufficient ingredients for FOOM-ing: complex problem to solve and self-modification.
(2) It is safe, because its outside-world-related knowledge is limited to a set of axioms, a target statement, a description of an ideal computing environment, and its own source code. Even AIXI would not be able to usefully extrapolate the real world from that - there would be lots of wildly different equiprobable worlds, where these things would exist. And since the system is restored to the initial state after each run, there is no possibility of its collecting and gathering more knowledge in between runs.
(3) It does not require solutions to metaethics, or symbol grounding. The problem statement and utility function are well-defined and can be stated precisely, right now. All it needs to work is understanding of intelligence.

This would be a provably safe "Tool AGI": Math Oracle. It is an obvious thing, but I don't see it discussed, not sure why. Was it already dismissed for some reasons?

The utility of such systems is crucially constrained by the relevant outside-world-related knowledge you feed into them.

If you feed in only some simple general math axioms, then the system is limited to only discovering results in the domain of abstract mathematics. While useful, this isn't going to change the world.

It only starts getting interesting when you seed it with some physics knowledge. AGIs in sandboxes that have real physics but zero specific earth knowledge are still tremendously useful for solving all kinds of engineering and physics problems.

Eventually, though, if you give it enough specific knowledge about the earth and humans in particular, it could become potentially very dangerous.

The type of provably safe boxed AIs you are thinking of have been discussed before, specifically I proposed them here in one of my first main posts, which was also one of my lowest scoring posts.

I still think that virtual sandboxes are the most likely profitable route to safety and haven't been given enough serious consideration here on LW. At some point I'd like to retry that discussion, now that I understand LW etiquette a little better.


Solving problems in abstract mathematics can be immensely useful even by itself, I think. Note: physics knowledge at low levels is indistinguishable from mathematics. But the main use of the system would be - safely studying the behavior of a (super-)intelligence, in preparation for a true FAI.

Solving problems in abstract mathematics can be immensely useful even by itself, I think.

Agreed. But the package of ideas entailed by AGI centers around systems that use human level reasoning, natural language understanding, and solve the set of AI-complete problems. The AI-complete problem set can be reduced to finding a compact generative model for natural language knowledge, which really is finding a compact generative model for the universe we observe.

Note: physics knowledge at low levels is indistinguishable from mathematics

Not quite. Abstract mathematics is too general. Useful "Physics knowledge" is a narrow set of mathematics that compactly describes the particular specific universe we observe. This specificity is both crucial and potentially dangerous.

But the main use of the system would be - safely studying the behavior of a (super-)intelligence, in preparation for a true FAI.

A super-intelligence (super-intelligent to us) will necessarily be AI-complete, and thus it must know of our universe. Any system that hopes to understand such a super-intelligence must likewise also know of our universe, simply because "super-intelligent" really means "having super-optimization power over this universe".


By (super-)intelligence I mean EY's definition, as a powerful general-purpose optimization process. It does not need to actually know about natural language or our universe to be AI-complete. A potential to learn them is sufficient. Abstract mathematics is arbitrarily complex, so sufficiently powerful optimization process in this domain will have to be sufficiently general for everything.

In theory we could all live inside an infinite Turing simulation right now. In practice any super-intelligences in our universe will need to know of our universe to be super-relevant to our universe.

What's the utility function? I can imagine that resulting in several problems.


For example, U = 1/T, where T is the time (measured in virtual computing environment cycles) until the correct output is produced.
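As a rough sketch of how such a utility could be wired up, the snippet below scores a single episode by U = 1/T and keeps no state between episodes. The names `utility`, `run_episode` and `brute_force_sqrt` are hypothetical, and the toy "solver" (searching for an integer square root while counting loop iterations as cycles) merely stands in for a theorem prover.

```python
def utility(cycles_used):
    """U = 1/T, where T is the number of cycles until the correct output."""
    if cycles_used <= 0:
        raise ValueError("cycle count must be positive")
    return 1.0 / cycles_used


def brute_force_sqrt(target):
    """Toy stand-in for a solver: find n with n*n == target, counting cycles.
    Assumes target is a perfect square; raises otherwise."""
    cycles = 0
    n = 0
    while True:
        cycles += 1
        if n * n == target:
            return n, cycles
        if n * n > target:
            raise ValueError("target is not a perfect square")
        n += 1


def run_episode(solver, problem):
    """Run the solver on one problem and score it. Nothing is carried over
    to the next call, mirroring the 'restored to its initial state' rule."""
    answer, cycles = solver(problem)
    return answer, utility(cycles)


answer, score = run_episode(brute_force_sqrt, 49)
```

Here the utility exists only for the duration of one call to `run_episode`, which is one way to read the claim that the system has no persistent goals between problems.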

This sounds more like a tool AI! I thought that agent AIs generally had more persistent utility measures - this looks like the sort of thing where the AI has NO utility maximizing behavior until a problem is presented, then temporarily instantiates a problem-specific utility function (like the above).


Well, yes, it is a tool AI. But it does have a utility function, it can be built upon a decision theory, etc., and in this sense, it is an Agent.

I upvoted this for being an interesting and useful contribution.

However, I must object to the last sentence,

I feel the Oracle is a more sensible "not full FAI" approach to look into.

You consistently hold this position, and I have yet to be impressed by it! Oracle AIs are tool AGIs like the type described in this post (and will destroy the world), if they are capable of answering questions about which actions we should take. And if they aren't, they aren't general intelligences but rather domain-specific ones. And also, all the other reasons to think Oracle AI is an FAI-complete problem, which I needn't list here as I'm sure you are familiar with them.

The premises of this post, that a tool AI would not develop a utility function, that we could understand the options presented, and that it would be safe to read the options, I think are all unrealistic (though the former is shakier and less well-founded, so I'm more likely to change that than the latter two). You've done a good job of demonstrating that granting all of these would still produce an unsafe thing, but I think you need to solve all the problems highlighted in this post, AND achieve those three prerequisites with an extreme degree of certainty. I honestly don't think that can be done.

Oracle AIs are tool AGIs like the type described in this post

Maybe the most salient difference is that Oracles are known to be dangerous, and we can think about how to use them safely, whereas tools were presented as just being intrinsically safe by definition.

The counterfactuals thing isn't very relevant; you're basically saying both approaches require solving counterfactual reasoning. It's an interesting point, but doesn't favor one approach over the other.

Generally agreed. Though there conceivably may be some extra vulnerabilities in printing out plans you know won't be taken, as opposed to just considering them. Just an issue to bear in mind.


For a mild, fictional example of an unfriendly tool-type AI and what it could be capable of, see Multivac in Isaac Asimov's short story "All The Troubles In The World".

I'm still confused about the scenario that you have in mind and that Holden apparently has in mind. Are we to imagine that humanity has come together to create the first superintelligence and then decides to create a Tool AI? That's the unrealistic scenario you both seem to be describing.

I haven't read much in the super-intelligent AI realm, but perhaps a relatively naive observer has some positive value. If we get to the point of producing AI that seems remotely super-intelligent, we'll stick firewalls around it. I don't think the suggested actions of a super-intelligent AI will be harmful in an incomprehensible way. An exception would be if it created something like the world's funniest joke. The problem with HAL was that they gave him control of spacecraft functions. I say we don't give 'hands' to the big brains, and we don't give big brains to the hands, and then I won't lose much sleep.

I believe the standard objections are that it's far more intelligent and quick-of-thought than us, so: it can beat your firewalls; it's ludicrously persuasive; it can outwit us with advice that subtly serves its ends; it could invent "basilisks" like the world's funniest joke; and even if we left it alone on a mainframe with no remote access and no input or output, it could work out how to escape and/or kill us with clever use of cooling fans or something.

Here's an example of why Eliezer suggests that you be much more paranoid.

Thanks for pointers into what is a large and complex subject. I'm not remotely worried about things coming in from the stars. As for letting the AI out of the jar, I'm a bit perplexed. The transcripts are not available for review? If not, what seems relevant is the idea that an ideal encryption system has to be public so the very smartest people can try to poke holes in it. Of course, the political will to keep an AI in the box may be lacking -- if you don't let it out, someone else will let another one out somewhere else. Seems related to commercial release of genetically modified plants, which in some cases may have been imprudent.

Sounds like you've got the "things from the stars" story flipped - in that parable, we (or our more-intelligent doppelgangers) are the AI, being simulated in some computer by weird 5-dimensional aliens. The point of the story is that high processing speed and power relative to whoever's outside the computer is a ridiculously great advantage.

Yeah, I think the idea behind keeping the transcripts unavailable is to force an outside view - "these people thought they wouldn't be convinced, and they were" rather than "but I wouldn't be convinced by that argument". Though possibly there are other, shadier reasons! As for the encryption metaphor, I guess in this case the encryption is known (people) but the attack is unknown - and in fact whatever attack would actually be used by an AI would be different and better, so we don't really get a chance to prepare to defend against it.

And yep, that's another standard objection - we can't just make safely constrained AIs, because someone else will make an unconstrained AI, therefore the most important problem to work on is how to make a safe and unconstrained AI before we die horribly.

It looks like you argue that a sufficiently powerful tool is instrumentally indistinguishable from an agent. Can you formulate it as a theorem and outline the steps required to prove it?

I don't think the theorem is true. I think there's some daylight between superintelligent tools and agents - just not much. Not nearly as much as, say, between an Oracle and a sovereign AI.

One thing I am not clear about is whether you are saying that a tool AI spontaneously develops what appears like intentionality or not. It sure seems that that is what you are saying, initially with a human in the feedback loop, until the suggestion to "create an AI with these motivations" is implemented. If so, then why are you saying that "there's some daylight between superintelligent tools and agents"?

Then formulate a weaker version of the theorem that you think is true.

Tool AIs and Agent AIs given the same task will not lead to the same outcomes, nor result in the same routes being taken. However, they are likely dangerous in broadly similar ways, and many warnings and precautions for one of these apply also to the other.

Holden's definition of a "tool AI" was silly. This post's definition of a "tool AI" inherits the silliness. "Tool" is a common dictionary word. It doesn't imply having minimal actuators - or power drills would not qualify. Holden's usage is counter-intuitive and confusing. Please, let's be careful not to implicitly endorse Holden's terminological muddle.