Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

An extremely basic question that, after months of engaging with AI safety literature, I'm surprised to realize I don't fully understand: why not tool AI?

AI safety scenarios seem to conceive of AI as an autonomous agent. Is that because of the current machine learning paradigm, where we set the AI's goals but don't specify the steps to get there? Is this paradigm the entire reason why AI safety is an issue?

If so, is there a reason why advanced AI would need an agenty, utility-function sort of setup? Is it just too cumbersome to give step-by-step instructions for high-level tasks?



2 Answers

You mention having looked through the literature; in case you missed any, here's what I think of as the standard resources on this topic.

All are very worth reading.

Edit: There was an old discussion between Holden Karnofsky and Jaan Tallinn on Tool AI in Yahoo Groups, but Yahoo Groups has been shut down. Here's the page in the Wayback Machine, but the attachment is not available. I would appreciate someone here leaving a link to that old document; I recall it being quite thoughtful.

After some more reading, particularly the Drexler CAIS report, I realize I was more confused than I thought about Tool vs Agent AI. I think I've resolved it, but I'd appreciate feedback. Would the below be correct?

"Most sophisticated software behaves like both a Tool and an Agent, at different times. Google Maps reports possible routes like a tool, but it searches for paths to maximize a utility function like an agent. DeepMind might ultimately select the move that maximizes its winning probability, but it follows some set rules in how it fr... (read more)

Of the two implicit definitions of agent, "maximises a utility function" and "affects the outside world (without explicitly being told to)", the second is the only one that is relevant to AI safety, and the one that is used by the actual AI community. In other words, trying to bring in the first definition just causes confusion.
Didn't realize that, but it makes complete sense. Thanks.

Also, we discussed Tool AI as a subcategory of Oracle AI in section 5.1 of Responses to Catastrophic AGI Risk; our conclusion:

... it seems like Oracle AIs could be a useful stepping stone on the path toward safe, freely acting AGIs. However, because any Oracle AI can be relatively easily turned into a free-acting AGI and because many people will have an incentive to do so, Oracle AIs are not by themselves a solution to AGI risk, even if they are safer than free-acting AGIs when kept as pure oracles.

Eric Drexler's report on comprehensive AI services also contains relevant readings. Here is Rohin's summary of it.

Ben Pace (3y):
Thanks, this example was so big and recent that I forgot it. Have added it to my answer.
Thanks to all! Very useful reading, particularly Gwern.

Jaan/Holden convo link is broken :(

Any time you have a search process (and, let's be real, most of the things we think of as "smart" are search problems), you are setting a target but not specifying how to get there. I think the important sense of the word "agent" in this context is that it's a process that searches for an output based on the modeled consequences of that output.

For example, if you want to colonize the upper atmosphere of Venus, one approach is to make an AI that evaluates outputs (e.g. text outputs of persuasive arguments and technical proposals) based on some combined metric of how much Venus gets colonized and how much it costs. Because it evaluates outputs based on their consequences, it's going to act like an agent that wants to pursue its utility function at the expense of everything else.

Call the above output "the plan" - you can make a "tool AI" that still outputs the plan without being an agent!

Just make it so that the plan is merely part of the output. The rest is composed by some subprogram that humans have designed for elucidating the reasons the AI chose that output (call this the "explanation"). The AI predicts the results as if its output were only the plan, but what humans see is both the plan and the explanation, so it no longer fulfills the criterion for agency above.

In this example, the plan is a bad idea in both cases - the thing you programmed the AI to search for is probably something that's bad for humanity when taken to an extreme. It's just that in the "tool AI" case, you've added some extra non-search-optimized output that you hope undoes some of the work of the search process.

Making your search process into a tool by adding the reason-elucidator hopefully made it less disastrously bad, but it didn't actually get you a good plan. The problems that you need to solve to get a superhumanly good plan are in fact the same problems you'd need to solve to make the agent safe.
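To make the plan-plus-explanation construction above a bit more concrete, here's a minimal Python sketch. Everything in it is a made-up illustration, not anyone's actual system: the toy world model, the candidate plans, the numbers, and the names `predicted_score`, `search_plan`, `explain`, `agent_output`, and `tool_output` are all hypothetical. The point it tries to show is that the "tool" version runs exactly the same search and so picks exactly the same plan; it only appends text that the search never optimized.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    plan: str
    explanation: str = ""


# Hypothetical world model: plan -> (modeled colonization progress, modeled cost).
# The numbers are invented purely for illustration.
TOY_WORLD_MODEL: dict[str, tuple[float, float]] = {
    "launch balloon habitats": (5.0, 2.0),
    "strip-mine Earth for materials": (9.0, 1.0),
    "do nothing": (0.0, 0.0),
}


def predicted_score(plan: str, world_model: dict[str, tuple[float, float]]) -> float:
    """Score a plan by its modeled consequences: progress minus cost."""
    progress, cost = world_model[plan]
    return progress - cost


def search_plan(world_model: dict[str, tuple[float, float]]) -> str:
    """The search process: pick the plan whose modeled consequences score highest."""
    return max(world_model, key=lambda plan: predicted_score(plan, world_model))


def explain(plan: str, world_model: dict[str, tuple[float, float]]) -> str:
    """Human-designed subprogram; its text is NOT scored by the search."""
    progress, cost = world_model[plan]
    return f"Chosen because modeled progress={progress} and modeled cost={cost}."


def agent_output(world_model: dict[str, tuple[float, float]]) -> Output:
    # Agent-style: the whole output is what the search optimized.
    return Output(plan=search_plan(world_model))


def tool_output(world_model: dict[str, tuple[float, float]]) -> Output:
    # Tool-style: same search, same plan, plus a non-optimized explanation.
    plan = search_plan(world_model)
    return Output(plan=plan, explanation=explain(plan, world_model))


if __name__ == "__main__":
    print(agent_output(TOY_WORLD_MODEL))
    print(tool_output(TOY_WORLD_MODEL))
```

In this toy setup both versions land on the same extreme plan, because the metric they search over is the same. The explanation gives humans a chance to notice that, but it doesn't change what the search was looking for.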

(Sidenote: This can be worked around by giving your tool AI a simplified model of the world and then relying on humans to un-simplify the resulting plan, much like Google Maps makes a plan in an extremely simplified model of the world and then you follow something that sort of looks like that plan. This workaround fails when the task of un-simplifying the plan becomes superhumanly difficult, i.e. right around when things get really interesting, which is why imagining a Google-Maps-like list of safe abstract instructions might be building a false intuition.)
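As a rough illustration of the Google-Maps-style workaround in the sidenote, here's a minimal Python sketch of planning inside a drastically simplified world model. This is just a textbook shortest-path search (Dijkstra's algorithm) over a made-up road graph, and I'm not claiming it's how Google Maps actually works; `shortest_route` and `TOY_ROADS` are hypothetical names, and the travel times are invented. The search only ever sees nodes and edge weights, and everything else about actually driving is left to the human following the route.

```python
import heapq

# Hypothetical simplified world: intersections and travel times in minutes.
# Weather, pedestrians, parking, etc. are simply absent from the model.
TOY_ROADS = {
    "home": {"main_st": 4.0, "highway": 10.0},
    "main_st": {"highway": 3.0, "office": 12.0},
    "highway": {"office": 5.0},
    "office": {},
}


def shortest_route(graph, start, goal):
    """Dijkstra's algorithm; returns (total_minutes, path) or None if unreachable."""
    frontier = [(0.0, start, [start])]
    best = {}  # cheapest known cost at which each node has been expanded
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in best and best[node] <= cost:
            continue  # already expanded this node via a route at least as cheap
        best[node] = cost
        for neighbor, minutes in graph.get(node, {}).items():
            heapq.heappush(frontier, (cost + minutes, neighbor, path + [neighbor]))
    return None


if __name__ == "__main__":
    print(shortest_route(TOY_ROADS, "home", "office"))
```

The returned list of intersections is the "safe abstract instructions" part. Turning it into real-world actions is the un-simplifying step, which in this case is easy for a human driver.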

In short, to actually find out the superintelligently awesome plan to solve a problem, you have to have a search process that's looking for the plan you want. Since this sounds a lot like an agent, and an unfriendly agent is one of the cases we're most concerned about, it's easy and common to frame this in terms of an agent.