Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In the previous post, I argued that simply knowing that an AI system is superintelligent does not imply that it must be goal-directed. However, there are many other arguments that suggest that AI systems will or should be goal-directed, which I will discuss in this post.

Note that I don’t think of this as the Tool AI vs. Agent AI argument: it seems possible to build agent AI systems that are not goal-directed. For example, imitation learning allows you to create an agent that behaves similarly to another agent -- I would classify this as “Agent AI that is not goal-directed”. (But see this comment thread for discussion.)

Note that these arguments have different implications than the argument that superintelligent AI must be goal-directed due to coherence arguments. Suppose you believe all of the following:

  • Any of the arguments in this post.
  • Superintelligent AI is not required to be goal-directed, as I argued in the last post.
  • Goal-directed agents cause catastrophe by default.

Then you could try to create alternative designs for AI systems such that they can do the things that goal-directed agents can do without themselves being goal-directed. You could also try to persuade AI researchers of these facts, so that they don’t build goal-directed systems.

Economic efficiency: goal-directed humans

Humans want to build powerful AI systems in order to help them achieve their goals -- it seems quite clear that humans are at least partially goal-directed. As a result, it seems natural that they would build AI systems that are also goal-directed.

This is really an argument that the system comprising the human and AI agent should be directed towards some goal. The AI agent by itself need not be goal-directed as long as we get goal-directed behavior when combined with a human operator. However, in the situation where the AI agent is much more intelligent than the human, it is probably best to delegate most or all decisions to the agent, and so the agent could still look mostly goal-directed.

Even so, you could imagine that even the small part of the work that the human continues to do allows the agent to not be goal-directed, especially over long horizons. For example, perhaps the human decides what the agent should do each day, and the agent executes the instruction, which involves planning over the course of a day, but no longer. (I am not arguing that this is safe; on the contrary, having very powerful optimization over the course of a day seems probably unsafe.) This could be extremely powerful without the AI being goal-directed over the long term.

Another example would be a corrigible agent, which could be extremely powerful while not being goal-directed over the long term. (Though the meanings of “goal-directed” and “corrigible” are sufficiently fuzzy that this is not obvious and depends on the definitions we settle on for each.)

Economic efficiency: beyond human performance

Another benefit of goal-directed behavior is that it allows us to find novel ways of achieving our goals that we may not have thought of, such as AlphaGo’s move 37. Goal-directed behavior is one of the few methods we know of that allow AI systems to exceed human performance.

I think this is a good argument for goal-directed behavior, but given the problems of goal-directed behavior I think it’s worth searching for alternatives, such as the two examples in the previous section (optimizing over a day, and corrigibility). Alternatively, we could learn human reasoning, and execute it for a longer subjective time than humans would, in order to make better decisions. Or we could have systems that remain uncertain about the goal and clarify what they should do when there are multiple very different options (though this has its own problems).

Current progress in reinforcement learning

If we had to guess today which paradigm would lead to AI systems that can exceed human performance, I would guess reinforcement learning (RL). In RL, we have a reward function and we seek to choose actions that maximize the sum of expected discounted rewards. This sounds a lot like an agent that is searching over actions for the best one according to a measure of goodness (the reward function [1]), which I said previously is a goal-directed agent. And the math behind RL says that the agent should be trying to maximize its reward for the rest of time, which makes it long-term [2].
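To make the objective concrete, here is a minimal sketch of the two quantities mentioned above: the discounted return of a reward sequence, and greedy action selection against a table of action values. The function names, the discount factor, and the toy Q-values are all assumptions made for illustration; none of this comes from a particular RL library.

```python
from typing import Dict, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one trajectory of rewards."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def greedy_action(q_values: Dict[str, float]) -> str:
    """Pick the action with the highest estimated value (hypothetical Q table)."""
    return max(q_values, key=q_values.get)


if __name__ == "__main__":
    # Toy numbers, chosen only for illustration.
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(greedy_action({"left": 0.2, "right": 0.7}))
```

The "search over actions" in this picture is the `max` inside `greedy_action`; the long-term flavour comes from the return summing over every future step.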

That said, current RL agents learn to replay behavior that in their past experience worked well, and typically do not generalize outside of the training distribution. This does not seem like a search over actions to find ones that are the best. In particular, you shouldn’t expect a treacherous turn, since the whole point of a treacherous turn is that you don’t see it coming because it never happened before.

In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term. Of course, many tasks would have very long episodes, such as being a CEO. The vanilla deep RL approach here would be to specify a reward function for how good a CEO you are, and then try many different ways of being a CEO and learn from experience. This requires you to collect many full episodes of being a CEO, which would be extremely time-consuming.

Perhaps with enough advances in model-based deep RL we could train the model on partial trajectories and that would be enough, since it could generalize to full trajectories. I think this is a tenable position, though I personally don’t expect it to work since it relies on our model generalizing well, which seems unlikely even with future research.

These arguments lead me to believe that we’ll probably have to do something that is not vanilla deep RL in order to train an AI system that can be a CEO, and that thing may not be goal-directed.

Overall, it is certainly possible that improved RL agents will look like dangerous long-term goal-directed agents, but this does not seem to be the case today and there seem to be serious difficulties in scaling current algorithms to superintelligent AI systems that can optimize over the long term. (I’m not arguing for long timelines here, since I wouldn’t be surprised if we figured out some way that wasn’t vanilla deep RL to optimize over the long term, but that method need not be goal-directed.)

Existing intelligent agents are goal-directed

So far, humans and perhaps animals are the only examples of generally intelligent agents that we know of, and they seem to be quite goal-directed. This is some evidence that we should expect intelligent agents that we build to also be goal-directed.

Ultimately we are observing a correlation between two things with a sample size of 1, which is really not much evidence at all. If you believe that many animals are also intelligent and goal-directed, then perhaps the sample size is larger, since there are intelligent animals with very different evolutionary histories and neural architectures (e.g. octopuses).

However, this is specifically about agents that were created by evolution, which did a relatively stupid blind search over a large space, and we could use a different method to develop AI systems. So this argument makes me more wary of creating AI systems using evolutionary searches over large spaces, but it doesn’t make me much more confident that all good AI systems must be goal-directed.

Interpretability

Another argument for building a goal-directed agent is that it allows us to predict what it’s going to do in novel circumstances. While you may not be able to predict the specific actions it will take, you can predict some features of the final world state, in the same way that if I were to play Magnus Carlsen at chess, I can’t predict how he will play, but I can predict that he will win.

I do not understand the intent behind this argument. It seems as though faced with the negative results that suggest that goal-directed behavior tends to cause catastrophic outcomes, we’re arguing that it’s a good idea to build a goal-directed agent so that we can more easily predict that it’s going to cause catastrophe.

I also think that we would typically be able to predict significantly more about what any AI system we actually build will do than we could by modeling it as trying to achieve some goal. This is because “agent seeking a particular goal” is one of the simplest models we can build, and with any system we have more information on, we start refining the model to make it better.

Summary

Overall, I think there are good reasons to think that “by default” we would develop goal-directed AI systems, because the things we want AIs to do can be easily phrased as goals, and because the stated goal of reinforcement learning is to build goal-directed agents (although they do not look like goal-directed agents today). As a result, it seems important to figure out ways to get the powerful capabilities of goal-directed agents through agents that are not themselves goal-directed. In particular, this suggests that we will need to figure out ways to build AI systems that do not involve specifying a utility function that the AI should optimize, or even learning a utility function that the AI then optimizes.


[1] Technically, actions are chosen according to the Q function, but the distinction isn’t important here.

[2] Discounting does cause us to prioritize short-term rewards over long-term ones. On the other hand, discounting seems mostly like a hack to make the math not spit out infinities, and so that learning is more stable. On the third hand, infinite horizon MDPs with undiscounted reward aren't solvable unless you almost surely enter an absorbing state. So discounting complicates the picture, but not in a particularly interesting way, and I don’t want to rest an argument against long-term goal-directed behavior on the presence of discounting.
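For the arithmetic behind "not spit out infinities": assuming every reward is bounded in magnitude by some constant R_max (an assumption introduced here, not something the post states), the discounted sum is bounded by a geometric series whenever the discount factor γ is strictly below 1:

```latex
\left| \sum_{t=0}^{\infty} \gamma^{t} r_t \right|
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma},
\qquad 0 \le \gamma < 1 .
```

With γ = 1 this bound is infinite, so convergence then depends on the rewards eventually vanishing, as in the absorbing-state case.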

43 comments

Note that I don’t think of this as the Tool AI vs. Agent AI argument: it seems possible to build agent AI systems that are not goal-directed. For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”.

I'm not very convinced by this example, or alternatively I'm not getting the distinction you're drawing between "agent" and "goal-directed". Suppose the agent you're trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don't see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you're trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)

Your post reminded me of Paul Christiano's approval-directed agents which was also about trying to find an alternative to goal-directed agents. Looking at it again, it actually sounds a lot like applying imitation learning to humans (except imitating a speeded-up human):

Estimate the expected rating Hugh would give each action if he considered it at length. Take the action with the highest expected rating.

Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?
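As a rough illustration of the quoted rule next to plain imitation, here is a minimal sketch. It assumes we already have an estimate of the overseer's rating for each action and a predicted distribution over what the human would do; both tables, and all the names and numbers, are hypothetical.

```python
from typing import Callable, Dict, Iterable


def approval_directed_action(
    actions: Iterable[str], expected_rating: Callable[[str], float]
) -> str:
    """Return the action with the highest estimated rating from the overseer."""
    return max(actions, key=expected_rating)


def imitation_action(human_action_probs: Dict[str, float]) -> str:
    """Return the action the human is predicted to be most likely to take."""
    return max(human_action_probs, key=human_action_probs.get)


if __name__ == "__main__":
    # Hypothetical numbers: the overseer would rate "novel_plan" highest on
    # reflection, but would rarely come up with it unaided.
    ratings = {"usual_plan": 0.6, "novel_plan": 0.9, "do_nothing": 0.1}
    behaviour = {"usual_plan": 0.8, "novel_plan": 0.05, "do_nothing": 0.15}

    print(approval_directed_action(ratings, ratings.get))
    print(imitation_action(behaviour))
```

In this toy setup the two rules only come apart when the action the human would rate highest is not the one they would most likely take.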

On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you're trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)

I definitely endorse this point, think that it's an important aspect, and that it alone justifies the claim that I was making about non-goal-directed Agent AI being possible.

That said, I do have an intuition that agents whose goal-directedness comes from other agents shouldn't be considered goal-directed, at least if it happens in a particular way. Let's say that I'm pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at "whatever goal Rohin has", and this feels distinctly less goal-directed to me. (In particular, my AI agent would not have all of the convergent instrumental subgoals in this setting, so it is really different in kind from an AI agent that was simply pursuing X to the best of its ability.)

"Goal-directed" may not be the right word to capture the property I'm thinking about. It might be something like "thing that pursues the standard convergent instrumental subgoals", or "thing that pursues a goal that is not defined in terms of someone else's goal".

Your post reminded me of Paul Christiano's approval-directed agents which was also about trying to find an alternative to goal-directed agents.

Yeah, that idea was a big influence on the views that caused me to write this post.

Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?

It's not exactly the same, but it is very similar. You could think of approval-direction as imitation of a particular weird kind of human, who deliberates for a while before choosing any action.

They feel different enough to me that there probably are safety-relevant differences, but I don't know of any off the top of my head. Initially I was going to say that myopia was a safety-relevant difference, but thinking about it more I don't think that's an actual difference. Approval-directed agents are more explicitly myopic, but I think imitation learning could be myopic in the same way.

Btw, this post also views Paul's agenda through the lens of constructing imitations of humans.

For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”.

Let's say that I'm pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at "whatever goal Rohin has"

What causes the agent to switch from X to Y?

Are you thinking of the "agent" as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?

I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you'd likely get an agent that continues to pursue goal X even after you've switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)

Are you thinking of the "agent" as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?

I was imagining something more like B for the imitation learning case.

I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you'd likely get an agent that continues to pursue goal X even after you've switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)

That analysis seems right to me.

With respect to whether it is what I want, I wouldn't say that I want any of these things in particular, I'm more pointing at the existence of systems that aren't goal-directed, yet behave like an agent.

With respect to whether it is what I want, I wouldn't say that I want any of these things in particular, I'm more pointing at the existence of systems that aren't goal-directed, yet behave like an agent.

Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn't when the human doesn't)?

In which case, would it be fair to summarize (part of) your argument as:

1) Many of the potential problems with building safe superintelligent systems come from them being too goal-directed.

2) An agent that is only as goal-directed as a human is much less susceptible to many of these failure modes.

3) It is likely possible to build superintelligent systems that are only as goal-directed as humans.

?

I don't think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you'll be uncertain about what the human is going to do. If you're uncertain in this way, and you are getting your goals from a human, then you don't do all of the instrumental subgoals. (See The Off-Switch Game for a simple analysis showing that you can avoid the survival incentive.)

It may be that "goal-directed" is the wrong word for the property I'm talking about, but I'm predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.
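The survival-incentive point can be sketched with a simplified reading of the off-switch setup. The assumptions here are mine: the robot holds a discrete belief over the utility u of a proposed action, the human knows u, and the human permits the action exactly when u > 0. The function name and the numbers are hypothetical.

```python
from typing import Dict


def expected_values(belief: Dict[float, float]) -> Dict[str, float]:
    """Expected utility of each robot option under a belief over u.

    `belief` maps a possible utility u of the proposed action to its probability.
    """
    # Act without asking: the robot receives u whatever it turns out to be.
    act = sum(p * u for u, p in belief.items())
    # Switch itself off: nothing happens, utility 0.
    switch_off = 0.0
    # Defer: the human permits the action only when u > 0, else switches off.
    defer = sum(p * max(u, 0.0) for u, p in belief.items())
    return {"act": act, "switch_off": switch_off, "defer": defer}


if __name__ == "__main__":
    uncertain = {-1.0: 0.5, 2.0: 0.5}   # robot unsure whether the action is good
    certain = {2.0: 1.0}                # robot sure the action is good
    print(expected_values(uncertain))
    print(expected_values(certain))
```

Under these assumptions, deferring is strictly better than acting only when the robot is uncertain about the sign of u; with a certain belief the two tie, which is consistent with the caveat about a perfect imitation.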

To clarify, you do do the human's instrumental sub-goals though, just not extra ones for yourself, right?

If you've seen the human acquire resources, then you'll acquire resources in the same way.

If there's now some new resource that you've never seen before, you may acquire it if you're sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This is assuming that we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching the human, or asking the human. If you imagine the exact way that we do imitation learning today, it would extrapolate somehow in a way that isn't actually what the human would do. Maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.

Btw, this post also views Paul’s agenda through the lens of constructing imitations of humans.

Right, so I think I wasn't really making a new observation, but just clearing up a confusion on my own part, where for a long time I didn't understand how the idea of approval-directed agency fits into IDA because people switched from talking about approval-directed agency to imitation learning (or were talking about them interchangeably) and I didn't catch the connection. So at this point I understand Paul's trajectory of views as follows:

goal-directed agent => approval-directed agent => use IDA to scale up approval-directed agent => approval-directed agency as a form of imitation learning / generalize to other forms of imitation learning => generalize IDA to safely scale up other (including more goal-directed / consequentialist) forms of ML (see An Unaligned Benchmark which I think represents his current views)

(Someone please chime in if this still seems wrong or confused.)

They feel different enough to me that there probably are safety-relevant differences

It looks like imitation learning isn't one thing but a fairly broad category in ML which even includes IRL. But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier because you're optimizing over an estimation of human approval, which seems to be an adversarial process that could easily trigger safety problems in both the ground-truth human approval as well as in the estimation process. I wonder when you wrote the OP, which form of imitation learning did you have in mind?

ETA: From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?

But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier because you're optimizing over an estimation of human approval, which seems to be an adversarial process that could easily trigger safety problems in both the ground-truth human approval as well as in the estimation process.

But if there are safety problems in approval, wouldn't there also be safety problems in the human's behavior, which imitation learning would copy?

Similarly, if there are safety problems in the estimation process, wouldn't there also be safety problems in the prediction of what action a human would take?

From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?

I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure. I'll add a pointer to this discussion to the post.

But if there are safety problems in approval, wouldn’t there also be safety problems in the human’s behavior, which imitation learning would copy?

The human's behavior could be safer because a human mind doesn't optimize so much as to move outside of the range of inputs where approval is safe, or it has a "proposal generator" that only generates possible actions that with high probability stay within that range.

Similarly, if there are safety problems in the estimation process, wouldn’t there also be safety problems in the prediction of what action a human would take?

Same here, if you just predict what action a human would take, you're less likely to optimize so much that you likely end up outside of where the estimation process is safe.

I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure.

Ok, I'd be interested to hear more if you clarify your thoughts.

Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?

I found an old comment from Paul that answers this:

I think that the only reason to be interested in approval-directed agents rather than straightforward imitation learners is that it may be harder to effectively imitate behavior than to solve the same task in a very different way.

Your post reminded me of Paul Christiano's approval-directed agents which was also about trying to find an alternative to goal-directed agents. Looking at it again, it actually sounds a lot like applying imitation learning to humans (except imitating a speeded-up human):

It seems like approval direction allows for creative actions that the human operator approves of but would not have thought of doing themselves. Not sure if imitation learning does this.

That's a good question. It looks like imitation learning actually covers a number of ML techniques (see this) none of which exactly matches approval-directed agents. But the category seems broad enough that I think approval-directed agents can be considered to be a form of imitation learning. In particular, IRL is considered a form of imitation learning and IRL would also be able to perform actions that the human would not have thought of doing themselves.

^ Yes to all of this.

A little bit of nuance: IRL is considered to be a form of imitation learning because in many cases the inferred reward in IRL is only meant to reproduce the human's performance and isn't expected to generalize outside of the training distribution.

There are versions of IRL which are meant to go beyond imitation. For example, adversarial IRL was trying to infer a reward that would generalize to new environments, in which case it would be doing something more than imitation.

Suppose the agent you're trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don't see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you're trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)

I'm not sure these are the points Rohin was trying to make, but there seem to be at least two important points here:

  • Imitation learning applied to humans produces goal-directed behavior only insofar as humans are goal-directed
  • Imitation learning applied to humans produces agents no more capable than humans. (I think IDA goes beyond this by adding amplification steps, which are separate. And IRL goes beyond this by trying to correct "errors" that the humans make.)

Regarding the second point, there's a safety-relevant sense in which a human-imitating agent is less goal-directed than the human. Because if you scale the human's capabilities, the human will become better at achieving its personal objectives. By contrast, if you scale the imitator's capabilities, it's only supposed to become even better at imitating the unscaled human.

I will list, just for my own understanding, the non-goal-oriented types of agents.

1. Universal library. This is an agent which creates all significant solutions to all possible significant problems and then stops. An example is past biological evolution, which invented an enormous number of adaptations (flying solutions, proteins, etc.) and could be used as inspiration for technological progress. Past human history, or some unconscious processes in the brain such as dreaming, may be other possible examples.

2. Human-mimicking neural net. This is an example of an agent which mimics another agent.

3. Obviously, AI Oracles and AI Tools.

4. "Homeostatic" superintelligence. An example of such a system is an OS like Windows, which doesn't do anything in a goal-directed sense, but just supports processes. Most nation-states also work in this way (except ideologically driven ones like the USSR or Iran).

5. Drexler's superintelligence as a sum of narrow services, e.g. Google's web services.

6. Swarm intelligences which compete to solve a task. If one creates a prize for X, many people will compete to get it. The whole swarm is not a goal-oriented agent, while its elements are such agents. Scott's Moloch is a bad example of such swarm behaviour.

Thanks for doing this--it's helpful for me as well. I have some questions/quibbles:

Isn't #2 as goal-directed as the human it mimics, in all the relevant ways? If I learn that a certain machine runs a neural net that mimics Hitler, shouldn't I worry that it will try to take over the world? Maybe I don't get what you mean by "mimics."

What exactly is the difference between an Oracle and a Tool? I thought an Oracle was a kind of Tool; I thought Tool was a catch-all category for everything that's not a Sovereign or a Genie.

I'm skeptical of this notion of "homeostatic" superintelligence. It seems to me that nations like the USA are fully goal-directed in the relevant senses; they exhibit the basic AI drives, they are capable of things like the treacherous turn, etc. As for Windows, how is it an agent at all? What does it do? Allocate memory resources across currently-being-run programs? How does it do that--is there an explicit function that it follows to do the allocation (e.g. give all programs equal resources), or does it do something like consequentialist reasoning?

On #6, it seems to me that it might actually be correct to say that the swarm is an agent--it's just that the swarm has different goals than each of its individual members. Maybe Moloch is an agent after all! On the other hand, something seems not quite right about this--what is Moloch's utility function? Whatever it is, Moloch seems particularly uninterested in self-preservation, which makes it hard to think of it as an agent with normal-ish goals. (Argument: Suppose someone were to initiate a project that would, with high probability, kill Moloch forever in 100 years time. Suppose the project has no other effects, such that almost all humans think it's a good idea. And everyone knows about it. All it would take to stop the project is a million people voting against it. Now, is there a sense in which Moloch would resist it or seek to undermine the project? It would maaaybe incentivize most people not to contribute to the project (tragedy of the commons!) but that's it. So either Moloch isn't an agent, or it's an agent that doesn't care about dying, or it's an agent that doesn't know it's going to die, or it's a very weak agent--can't even stop one project!)

Something could exhibit goal-like behaviour to outside viewers without having the internal structure of an agent. For example, when a brick is falling to the ground, we could say that it is aimed at a specific point on the ground, but it is not an agent. In the same way, an infectious disease can take over the world without being an agent. Moreover, even some humans are sometimes not agents.

In my opinion, an Oracle AI only outputs answers to questions, while a Tool AI can do other stuff, like continuous data stream transformation or controlling mechanisms.

Nation-states, the human body and OSs are all good, and even clever, at preserving a homeostatic state (except during a government shutdown), but they typically do not achieve it via high-level agential reasoning.

A swarm of agents could exhibit behaviour different from the behaviour or goals of any individual agent.

To clarify the definition of "goal-directed" used here: is AlphaGo (Zero) goal-directed?

Yes, as long as you keep doing the MCTS + training. The value/policy networks by themselves are not goal-directed.

I get why the MCTS is important, but what about the training? It seems to me that if we stop training AlphaGo (Zero) and I play a game against it, it's goal-directed even though we have stopped training it.

Yeah, I agree that even without the training it would be goal-directed, that comes from the MCTS.

Note though that if we stop training and also stop using MCTS and you play a game against it, it will beat you and yet I would say that it is not goal-directed.
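One way to read the distinction in this exchange is as the difference between replaying a stored state-to-action mapping and searching over actions against an explicit goal. The sketch below uses a toy number-line environment and an exhaustive depth-limited lookahead standing in for MCTS. It is not AlphaGo's algorithm, and every name in it is hypothetical.

```python
from typing import Dict, List

ACTIONS: List[int] = [-1, +1]  # step left or right on a number line


def cached_policy_action(policy: Dict[int, int], state: int) -> int:
    """Replay a stored state -> action mapping; no goal is consulted."""
    return policy[state]


def search_action(state: int, goal: int, depth: int = 3) -> int:
    """Pick the first action of the best plan found by exhaustive lookahead."""

    def best_distance(s: int, remaining: int) -> int:
        # Smallest final distance to the goal achievable with `remaining`
        # more steps, stopping early if the goal is reached.
        if remaining == 0 or s == goal:
            return abs(goal - s)
        return min(best_distance(s + a, remaining - 1) for a in ACTIONS)

    return min(ACTIONS, key=lambda a: best_distance(state + a, depth - 1))


if __name__ == "__main__":
    # A policy that was (hypothetically) learned while the goal was 9.
    policy = {s: +1 for s in range(10)}
    print(cached_policy_action(policy, 4))   # same move whatever the goal is
    print(search_action(4, goal=9))
    print(search_action(4, goal=0))          # search adapts to the new goal
```

The cached policy can play well in the situations it was trained on, but only the search version changes its behaviour when the goal changes.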

Humans want to build powerful AI systems in order to help them achieve their goals -- it seems quite clear that humans are at least partially goal-directed. As a result, it seems natural that they would build AI systems that are also goal-directed.

This is really an argument that the system comprising the human and AI agent should be directed towards some goal. The AI agent by itself need not be goal-directed as long as we get goal-directed behavior when combined with a human operator. However, in the situation where the AI agent is much more intelligent than the human, it is probably best to delegate most or all decisions to the agent, and so the agent could still look mostly goal-directed.

Even so, you could imagine that even the small part of the work that the human continues to do allows the agent to not be goal-directed, especially over long horizons.

An additional issue is that if you have a competitive situation, there may be an incentive to minimize the amount of human involvement in the system, in order to speed up response time and avoid losing ground to competitors. I discussed this a bit in Disjunctive Scenarios of Catastrophic AI Risk:

... the U.S. military is seeking to eventually transition to a state where the human operators of robot weapons are “on the loop” rather than “in the loop” (Wallach & Allen 2013). In other words, whereas a human was previously required to explicitly give the order before a robot was allowed to initiate possibly lethal activity, in the future humans are meant to merely supervise the robot’s actions and interfere if something goes wrong. While this would allow the system to react faster, it would also limit the window that the human operators have for overriding any mistakes that the system makes. For a number of military systems, such as automatic weapons defense systems designed to shoot down incoming missiles and rockets, the extent of human oversight is already limited to accepting or overriding a computer’s plan of actions in a matter of seconds, which may be too little to make a meaningful decision in practice (Human Rights Watch 2012).

Sparrow (2016) reviews three major reasons which incentivize major governments to move toward autonomous weapon systems and reduce human control:

  1. Currently existing remotely piloted military “drones,” such as the U.S. Predator and Reaper, require a high amount of communications bandwidth. This limits the amount of drones that can be fielded at once, and makes them dependent on communications satellites which not every nation has, and which can be jammed or targeted by enemies. A need to be in constant communication with remote operators also makes it impossible to create drone submarines, which need to maintain a communications blackout before and during combat. Making the drones autonomous and capable of acting without human supervision would avoid all of these problems.
  2. Particularly in air-to-air combat, victory may depend on making very quick decisions. Current air combat is already pushing against the limits of what the human nervous system can handle: further progress may be dependent on removing humans from the loop entirely.
  3. Much of the routine operation of drones is very monotonous and boring, which is a major contributor to accidents. The training expenses, salaries, and other benefits of the drone operators are also major expenses for the militaries employing them.

Sparrow’s arguments are specific to the military domain, but they demonstrate the argument that “any broad domain involving high stakes, adversarial decision making, and a need to act rapidly is likely to become increasingly dominated by autonomous systems” (Sotala & Yampolskiy 2015, p. 18). Similar arguments can be made in the business domain: eliminating human employees to reduce costs from mistakes and salaries is something that companies would also be incentivized to do, and making a profit in the field of high-frequency trading already depends on outperforming other traders by fractions of a second. While the currently existing AI systems are not powerful enough to cause global catastrophe, incentives such as these might drive an upgrading of their capabilities that eventually brought them to that point.

In the absence of sufficient regulation, there could be a “race to the bottom of human control” where state or business actors competed to reduce human control and increased the autonomy of their AI systems to obtain an edge over their competitors (see also Armstrong et al. 2016 for a simplified “race to the precipice” scenario). This would be analogous to the “race to the bottom” in current politics, where government actors compete to deregulate or to lower taxes in order to retain or attract businesses.

AI systems being given more power and autonomy might be limited by the fact that doing this poses large risks for the actor if the AI malfunctions. In business, this limits the extent to which major, established companies might adopt AI-based control, but incentivizes startups to try to invest in autonomous AI in order to outcompete the established players. In the field of algorithmic trading, AI systems are currently trusted with enormous sums of money despite the potential to make corresponding losses—in 2012, Knight Capital lost $440 million due to a glitch in their trading software (Popper 2012, Securities and Exchange Commission 2013). This suggests that even if a malfunctioning AI could potentially cause major risks, some companies will still be inclined to invest in placing their business under autonomous AI control if the potential profit is large enough.

U.S. law already allows for the possibility of AIs being conferred a legal personality, by putting them in charge of a limited liability company. A human may register a limited liability corporation (LLC), enter into an operating agreement specifying that the LLC will take actions as determined by the AI, and then withdraw from the LLC (Bayern 2015). The result is an autonomously acting legal personality with no human supervision or control. AI-controlled companies can also be created in various non-U.S. jurisdictions; restrictions such as ones forbidding corporations from having no owners can largely be circumvented by tricks such as having networks of corporations that own each other (LoPucki 2017). A possible start-up strategy would be for someone to develop a number of AI systems, give them some initial endowment of resources, and then set them off in control of their own corporations. This would risk only the initial resources, while promising whatever profits the corporation might earn if successful. To the extent that AI-controlled companies were successful in undermining more established companies, they would pressure those companies to transfer control to autonomous AI systems as well.

Here are a few more reasons for humans to build goal-directed agents:

  1. Goal directed AI is a way to defend against value drift/corruption/manipulation. People might be forced to build goal directed agents if they can't figure out another way to do that.

  2. Goal directed AI is a way to cooperate and thereby increase economic efficiency and/or military competitiveness. (A group of people can build a goal directed agent that they can verify represents an aggregation of their values.) People might be forced to build or transfer control to goal directed agents in order to participate in such cooperation to remain competitive, unless they can figure out another way to cooperate that is as efficient as this.

  3. Goal directed AI is a way to address other human safety problems. People might trust an AI with explicit and verifiable values more than an AI that is controlled by a distant stranger.

As I understand it, the first one is an argument for value lock in, and the third one is an argument for interpretability, does that seem right to you?

For the first one, I guess I would use "argument for defense against value drift" instead since you could conceivably use a goal-directed AI to defend against value drift without lock in, e.g., by doing something like Paul Christiano's 2012 version of indirect normativity (which I don't think is feasible but maybe there's something like it that is, like my hybrid approach, if you consider that goal-directed).

For the third one, I guess interpretability is part of it, but a bigger problem is that it seems hard to make a sufficiently trustworthy human overseer even if we could "interpret" them. In other words, interpretability for a human might just let us see exactly why we shouldn't trust them.

That said, current RL agents learn to replay behavior that in their past experience worked well, and typically do not generalize outside of the training distribution. This does not seem like a search over actions to find ones that are the best.

What is stopping AI researchers from using RL to (end-to-end) train agents that do search over actions to find ones that are the best? It seems like an obvious next step to take in order to build agents that generalize better than current RL agents, doesn't it? Is it just that the challenges they've attempted so far haven't required going beyond building agents that are essentially just lossy compressions of behaviors that work well on the training distribution, or is there a fundamental reason why using RL to train goal-directed agents would be hard?

What is stopping AI researchers from using RL to (end-to-end) train agents that do search over actions to find ones that are the best?

That technique is called model-based RL, and in practice, given sufficient data and compute, it ends up performing worse than model-free RL. (It does perform better in low-data regimes, and my guess is that it will also generalize slightly better but not much.) In model-based RL, you learn a model of the world, and then search over sequences of actions and take the one that seems best.

Speculation on why it doesn't work: In practice, your model of the world only makes good predictions for states and actions that you have already experienced. So searching over actions for the best one either gives you something you have already experienced, or some nonsense action (sort of like an adversarial example for the world model).
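A toy sketch of the speculation above, under heavy assumptions: the environment, the linear "world model", and every name here (`true_reward`, `fit_line`, `model_reward`) are hypothetical and chosen only for illustration. The model is fit on actions in [0, 1], where it is accurate, and the search is then allowed to range over [0, 5].

```python
def true_reward(a):
    """Hypothetical environment: reward grows with the action up to 1.0, then collapses."""
    return a if a <= 1.0 else -10.0

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# The "experienced" actions all lie in [0, 1], where reward is linear in the action.
train_actions = [i / 10 for i in range(11)]
slope, intercept = fit_line(train_actions, [true_reward(a) for a in train_actions])

def model_reward(a):
    """Learned model: accurate on [0, 1], a blind extrapolation outside it."""
    return slope * a + intercept

# Search over a wider action range than the model was trained on.
candidates = [i / 10 for i in range(51)]
planned = max(candidates, key=model_reward)
```

In this sketch the search picks the action furthest from the training data, because that is where the extrapolated model is most optimistic, and the true reward there is far worse than for the best experienced action. Whether this is what actually goes wrong in practical model-based RL is exactly the speculative part.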

It is worth noting that this isn't end-to-end: the model is trained "end-to-end", but the action selection is typically some hardcoded function like "sample 1000 trajectories from the model, choose the trajectory that gives the best reward, and take the first action of that trajectory". I don't know how you would train an agent end-to-end such that it explicitly learns to search over actions (as opposed to an implicit search that model-free RL algorithms might already be doing).
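The hardcoded action selection described above might look roughly like the following sketch (often called random shooting). The toy 1-D world, the function names, and the use of the true dynamics in place of a learned model are all assumptions made for illustration.

```python
import random

def true_step(state, action):
    """Toy 1-D world (hypothetical): the state is a position, the action a step in {-1, 0, +1}.
    Reward is higher the closer the next position is to a target at 5."""
    next_state = state + action
    reward = -abs(5 - next_state)
    return next_state, reward

def plan_first_action(model_step, state, horizon=5, n_samples=1000, rng=None):
    """Sample action sequences, roll each one out through the model,
    and return the first action of the sequence with the highest total reward."""
    rng = rng or random.Random(0)
    best_return, best_first = float("-inf"), 0
    for _ in range(n_samples):
        actions = [rng.choice((-1, 0, 1)) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s, r = model_step(s, a)
            total += r
        if total > best_return:
            best_return, best_first = total, actions[0]
    return best_first

# Here the "model" is the true dynamics; a learned model would be passed in the same way.
first_action = plan_first_action(true_step, state=0)
```

Note that nothing in `plan_first_action` is learned: only the model passed in as `model_step` would be trained, which is the sense in which the overall agent is not end-to-end.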

When you are given an accurate model of the world, then you can in fact search over actions and do much better, see for example value iteration or policy iteration. (Those are for very small environments, but you could create approximate versions for more complex environments.)
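As a minimal sketch of value iteration with a known, deterministic model: the five-state chain environment and the names `value_iteration` and `chain_step` are hypothetical, picked only to keep the example small.

```python
def value_iteration(n_states, actions, step, gamma=0.9, tol=1e-8):
    """step(s, a) -> (next_state, reward) is an exact, deterministic model of the world."""
    values = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman backup: best one-step reward plus discounted value of the successor.
            best = max(r + gamma * values[s2] for s2, r in (step(s, a) for a in actions))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = [
        max(actions, key=lambda a: step(s, a)[1] + gamma * values[step(s, a)[0]])
        for s in range(n_states)
    ]
    return values, policy

def chain_step(s, a):
    """Hypothetical 5-state chain: move left or right, reward 1 for landing on state 4."""
    s2 = min(4, max(0, s + a))
    return s2, (1.0 if s2 == 4 else 0.0)

values, policy = value_iteration(5, (-1, 1), chain_step)
```

Because the model is exact, the search over actions at each state cannot be fooled by model error, and the resulting policy moves right from every state.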

Speculation on why it doesn’t work: In practice, your model of the world only makes good predictions for states and actions that you have already experienced. So searching over actions for the best one either gives you something you have already experienced, or some nonsense action (sort of like an adversarial example for the world model).

Interesting, I wonder how humans avoid generating nonsense actions like this.

I don’t know how you would train an agent end-to-end such that it explicitly learns to search over actions

I was thinking you could train the world model separately at first, manually implement an initial action selection method as a neural network or some other kind of differentiable program, and then let RL act on the agent to optimize it as a whole.

(as opposed to an implicit search that model-free RL algorithms might already be doing)

What kind of implicit search are model-free RL algorithms already doing? If we just keep scaling up model-free RL, can they eventually become goal-directed agents through this kind of implicit search?

Interesting, I wonder how humans avoid generating nonsense actions like this.

Some hypotheses that are very speculative:

  • Something something explicit reasoning?
  • Our environment is sufficiently harsh and complex that everything is in-distribution
  • Our brains are so small and our environment is so harsh and complex that the only way that they can get good performance is to have structured, modular representations, which lead to worse performance in distribution but better generalization
  • Some system that lets us know what we know, and only generates actions for consideration where we know what the consequences will be

What kind of implicit search are model-free RL algorithms already doing?

I don't know. This is mostly an expression of uncertainty about what model-free RL agents are doing. Maybe some of the multiplications and additions going on in there turn out to be equivalent to a search over actions. Maybe not.

My intuition says "nah, our current environments are all simple enough that you can solve them by using heuristics to compute actions, and the training process is going to distill those heuristics into the policy rather than turning the policy into a search algorithm". But even if I trust that intuition, there is some level of environment complexity at which this would stop being true, and I don't trust my intuition on what that level is.

If we just keep scaling up model-free RL, can they eventually become goal-directed agents through this kind of implicit search?

Plausibly, but plausibly not. I have conflicting not-well-formed intuitions that pull in both directions.

Or we could have systems that remain uncertain about the goal and clarify what they should do when there are multiple very different options (though this has its own problems).

I'd be interested to hear more about the problems with this, if anyone has a link to an overview or just knows of problems off the top of their head.

In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term.

Is this true? Since ML generally doesn't choose an algorithm directly but runs a search over a parameter space, it seems speculative to assume that the resulting model, if it is a mesa-optimizer and goal-directed, only cares about its episode. If it learned that optimizing for X is good for reward, it seems at least conceivable that it won't understand that it shouldn't care about instances of X that appear in future episodes.

A few points:

1. It's not clear that the current deep RL paradigm would lead to a mesa optimizer. I agree it could happen, but I would like to see an argument as to why it is likely to happen. (I think there is probably a stronger case that any general intelligence we build will need to be a mesa optimizer and therefore goal-directed, and if so that argument should be added to this list.)

2. Even if we did get a mesa optimizer, the base optimizer (e.g. gradient descent) would plausibly select for mesa optimizers that care only up till the end of the episode. A mesa optimizer that wasn't myopic in this way might spend the entire episode learning and making money that it can use in the future, and as a result get no training reward, and so would be selected against by the outer optimizer.

Then you could try to create alternative designs for AI systems such that they can do the things that goal-directed agents can do without themselves being goal-directed. You could also try to persuade AI researchers of these facts, so that they don’t build goal-directed systems.

I'm not sure this strategy is net positive. If dangerous AI (at least as dangerous as Slaughterbots) is developed before alignment is solved, the world is probably better off if the first visibly dangerous AI is goal-directed rather than, say, an Oracle. The former would probably be a much weaker optimization process and probably won't result in an existential catastrophe, and perhaps it will make some governance solutions more feasible.

Can you clarify the argument? Are you optimizing for an obvious AI disaster to happen as soon as possible so people take the issue more seriously?

I'm not optimizing for raising awareness via an "obvious AI disaster", for multiple reasons, including the huge risk to the reputation of the AI safety community and the unilateralist's curse.

I do think that when considering whether to invest in an effort which might prevent recoverable near-term AI accidents, one should consider the possibility that the effort would prevent pivotal events (e.g. one that would have enabled useful governance solutions resulting in more time for alignment research).

Efforts that prevent recoverable near-term AI accidents might be astronomically net-positive if they help make AI alignment more mainstream in the general ML community.

(anyone who thinks I shouldn't discuss this publicly is welcome to let me know via a PM or anonymously here)

In this scenario, wouldn't you eventually build a sufficiently powerful goal-directed AI that leads to an existential catastrophe?

Perhaps the hope is that when everyone sees that the first goal-directed AI is visibly dangerous then they actually believe that goal-directed AI is dangerous. But in the scenario where we are building alternatives to goal-directed AI and they are actually getting used, I would predict that we have convinced most AI researchers that goal-directed AI is dangerous.

(Also, I think you can level this argument at nearly all AI safety research agendas, with possibly the exception of Agent Foundations.)

I think I didn't articulate my argument clearly; I tried to clarify it in my reply to Jessica.

I think my argument might be especially relevant to the effort of persuading AI researchers not to build goal-directed systems.

If a result of this effort is convincing more AI researchers of the general premise that x-risk from AI is something worth worrying about, then that's a very strong argument in favor of carrying out the effort (and I agree this result should correlate with convincing AI researchers not to build goal-directed systems - if that's what you argued in your comment).

Yeah, I was imagining that we would convince AI researchers that goal-directed systems are dangerous, and that we should build the non-goal-directed versions instead.

Building a non goal directed agent is like building a cart out of non-wood materials. Goal directed behavior is relatively well understood. We know that most goal directed designs don't do what we want. Most arrangements of wood do not form a functioning cart.

I suspect that a randomly selected agent from the space of all non goal directed agents is also useless or dangerous, in much the same way that a random arrangement of non wood materials is.

Now there are a couple of regions of design space that are not goal directed and look like they contain useful AI's. We might be better off making our cart from Iron, but Iron has its own problems.

I suspect that a randomly selected agent from the space of all non goal directed agents is also useless or dangerous, in much the same way that a random arrangement of non wood materials is.

Sure. We aren't going to choose an agent randomly.

Now there are a couple of regions of design space that are not goal directed and look like they contain useful AI's. We might be better off making our cart from Iron, but Iron has its own problems.

Agreed, but maybe those problems are easier to solve.