Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.


Core arguments about existential risk from AI misalignment often reason about AI “objectives” to make claims about how AI systems will behave in novel situations. I often find these arguments plausible but not rock solid, because there doesn’t seem to be a notion of “objective” that makes the argument clearly valid.

Two examples of these core arguments:

  1. AI risk from power-seeking. This is often some variant of “because the AI system is pursuing an undesired objective, it will seek power in order to accomplish its goal, which causes human extinction”. For example, “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.” This is a prediction about a novel situation, since “causing human extinction” is something that only happens at most once.
  2. AI optimism. This is often some variant of “we will use human feedback to train the AI system to help humans, and so it will learn to pursue the objective of helping humans.” Implicitly, this is a prediction about what AI systems do in novel situations; for example, it is a prediction that once the AI system has enough power to take over the world, it will continue to help humans rather than execute a treacherous turn.

When we imagine powerful AI systems built out of large neural networks[1], I’m often somewhat skeptical of these arguments, because I don’t see a notion of “objective” that can confidently be claimed to be:

  1. Probable: there is a good argument that the systems we build will have an “objective”, and
  2. Predictive: if I know that a system has an “objective”, and I know its behavior on a limited set of training data, I can predict significant aspects of the system’s behavior in novel situations (e.g. whether it will execute a treacherous turn once it has the ability to do so successfully).

Note that in both cases, I find the stories plausible, but they do not seem strong enough to warrant confidence, because of the lack of a notion of “objective” with these two properties[2]. In the case of AI risk, this is sufficient to justify “people should be working on AI alignment”; I don’t think it is sufficient to justify “if we don’t work on AI alignment we’re doomed”.

The core difficulty is that we do not currently understand deep learning well enough to predict how future systems will generalize to novel circumstances[3]. So, when choosing a notion of “objective”, you either get to choose a notion that we currently expect to hold true of future deep learning systems (Probable), or you get to choose a notion that would allow you to predict behavior in novel situations (Predictive), but not both.

This post is split into two parts. In the first part, I’ll briefly gesture at arguments that make predictions about generalization behavior directly (i.e. without reference to “objectives”), and why they don’t make me confident about how future systems will generalize. In the second part, I’ll demonstrate how various notions of “objective” don’t seem simultaneously Probable and Predictive.

Part 1: We can’t currently confidently predict how future systems will generalize

Note that this is about what we can currently say about future generalization. I would not be shocked if in the future we could confidently predict how future AGI systems will generalize.

My core reasons for believing that predicting generalization is hard are that:

  1. We can’t predict how current systems will generalize to novel situations (of similar novelty to the situations that would be encountered when deliberately causing an existential catastrophe)
  2. There are a ridiculously huge number of possible programs, including a huge number of possible programs that are consistent with a given training dataset; it seems like we need strong evidence to narrow down the space enough that we can make predictions about generalization.
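
As a toy illustration of the second point, here is a minimal sketch that counts how many Boolean functions on 3-bit inputs fit a small training set. The setup (3-bit inputs, parity labels, 5 training points) is purely an assumption made for illustration; real hypothesis spaces are vastly larger and are not sampled uniformly.

```python
from itertools import product

def consistent_functions(n_bits, training_data):
    """Enumerate every Boolean function on n_bits inputs that fits training_data."""
    inputs = list(product([0, 1], repeat=n_bits))
    matches = []
    # A Boolean function is just an assignment of 0/1 to every possible input.
    for outputs in product([0, 1], repeat=len(inputs)):
        table = dict(zip(inputs, outputs))
        if all(table[x] == y for x, y in training_data.items()):
            matches.append(table)
    return matches

# Hypothetical training set: 5 of the 8 possible 3-bit inputs, labelled by parity.
training_data = {x: sum(x) % 2 for x in list(product([0, 1], repeat=3))[:5]}
matches = consistent_functions(3, training_data)

# An input that never appeared in training.
novel_input = (1, 1, 1)
predictions = {table[novel_input] for table in matches}

print(len(matches))   # number of functions that fit the training data
print(predictions)    # set of outputs those functions give on the novel input
```

Of the 256 possible functions, 8 fit the training data, and they disagree about the unseen input. Without some further evidence about which of them training will actually select, the training data alone does not pin down the behavior on the novel input.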

These reasons are not decisive; they simply give the uninformative prior from which I start. It is also not necessarily hard to get strong evidence. For example, I am happy to confidently predict that given an English sentence that it has never seen before, GPT-3 would continue it with more English[4]. But I haven’t seen arguments that persuade me to be confident about how future systems will generalize. I’ll go through some of them below.

(Note that, while many of these arguments are inspired by things I’ve read or heard, the presentation here is my own and may not accurately represent anyone else’s beliefs.)

Laserlike plans / core of general intelligence. One argument is that if we assume that future deep learning systems are capable of e.g. building nanosystems, then they must be performing coherent consequentialist cognition, which allows us to predict some aspects of how they would generalize. In particular, while we can’t predict what goal they will pursue, we can predict that they will seek resources and power and manipulate or destroy humans in order to achieve the goal.

You can also make a stronger claim as follows. Most powerful cognition arises from core simple patterns underlying intelligence, such as getting stuff that allows you to do more stuff in the future, taking decisions based on whether it creates more of the stuff you want, etc. The first powerful AGI systems will use these patterns, simply because it is very difficult to get powerful AGI systems that don’t use these patterns, given how simple and useful they are. This argument is similar to the previous argument, but makes a stronger claim that we get a specific simple core algorithm.

There is a lot of discussion about this point and I won’t get into it here, but suffice it to say that I don’t have high confidence in this story.

General-purpose search. This argument says that because general-purpose retargetable search is so useful, that is how our AI systems will work; once you know you have a search algorithm then the standard argument of convergent instrumental subgoals applies.

My current belief is that this is a plausible way that future AI systems could work, but it’s just one of many possible architectures and not one that I am confident will arise. (See also this comment chain.)

Strong selection. This argument says that gradient descent will work very well, and so functions that score higher on the loss function will be much more likely than those that score lower. An AI system that directly cares about getting low loss will likely get lower loss than one that cares about doing what we want, and so we are likely to get one that cares directly about getting low loss (which in turn implies misaligned power-seeking).

My worry with this argument is that, while I would feel pretty confident in this argument in the limit of “max SGD capabilities”, it’s not obvious that it applies to the first superhuman AI systems that we build. Such systems are not going to be anywhere near the literally optimal performance on “getting low loss”; it seems like an open question whether getting to superhuman level requires “directly caring about loss” rather than some other internal reasoning architecture.

Conceptual clarity. This argument states that any powerful AI system must have clear concepts, that is, concepts which work well for a wide variety of tasks (at the very least, the training tasks), and which should thus be expected to work well in novel situations too. For a specific version of this argument, see Alignment by Default[5].

I certainly agree that this allows you to make some confident predictions about generalization behavior. For example, I expect GPT-3 has conceptual clarity about spelling and grammar. Even in most novel situations, as long as we start with good spelling and grammar, I predict it will continue to produce text with good spelling and grammar.

However, just knowing that the AI system has good concepts doesn’t tell you much about how it will use these concepts. An AI system that has a robust concept of manipulation could use it to protect you from propaganda, or to persuade you to give it more autonomy with which to pursue its own goals. It need not help to see what the system does during training: just because it was helpful during training and it has clear concepts doesn’t mean that it isn’t biding its time until it can execute a treacherous turn.

Simplicity bias. This argument says that deep learning has a simplicity bias; by reasoning about what algorithms are simple we can predict the generalization of future deep learning systems.

As with previous arguments, I think simplicity bias allows you to make predictions like “the AI won’t set money on fire”, “the AI won’t believe that 2 + 2 = 5”, “GPT-3 will continue to have good spelling and grammar”, and so on. (These predictions need not hold for a giant lookup table; we rule lookup tables out because of simplicity.) However, I don’t see how you argue for the AI risk or AI optimism stories, except by using simplicity bias to argue for one of the more specific arguments above.

Human analogy. This argument says that we can predict how humans can generalize, and a trained deep learning system is quite analogous to a human, and so we will be able to predict a trained deep learning system using the same techniques.

There are several different responses to this argument, including but not limited to:

  1. Humans use an input/output space that we are very familiar with, making them easier to predict.
  2. The default guess that other humans behave similarly to how we would behave works reasonably often, but would not work as well for AI systems, since they reason in an alien manner.
  3. We’re not actually very good at predicting how humans will behave in unusual situations.

Short horizons. This argument suggests that AIs will only care about completing tasks with relatively short horizons, because that’s what they were trained on. As a result, we can predict that they would not pursue convergent instrumental subgoals.

I don’t find this persuasive because of the possibility of goal misgeneralization. For example, our short-horizon tasks will be chosen to optimize for long-horizon outcomes (e.g. a CEO’s day-to-day tasks are meant to lead to long-term company success), and so the AI system may end up caring directly about long-horizon outcomes.

Part 2: There are many types of objectives; none are both Probable and Predictive

In this section I’ll argue that there isn’t a notion of “objective” that is Probable and Predictive.

The core argument is just the one I laid out in the introduction: to have a notion of ‘objective’ that is Probable and Predictive, we would need to know how future systems would generalize to novel situations, but we don’t currently know this. But as further support and to give a better sense of where I’m coming from, I’ll also list out a few different notions of “objective” and show how they fail at least one of the two criteria.

I see definitions of “objectives” as varying along one key axis: how behavioral or structural the definition is. A structural definition identifies some object as the “objective”, and argues that it drives the agent’s behavior. In contrast, a behavioral definition looks at the agent’s behavior, and infers the “objective” from that behavior. As a simple example, the VNM theorem constructs a utility function (objective) out of preferences over lotteries (behavior); such a utility function is thus a behavioral objective.

Structural objectives

We’ll consider two types of structural objectives: outer and inner structural objectives.

Structural (outer): Here, the “objective” is identified with a particular part of the training process; for example, in deep RL it would be the reward function. I think such objectives are not Predictive. Current AI systems trained with a particular reward function do not generalize to continue to pursue that reward function in novel situations. Typically they just break, though goal misgeneralization gives specific examples in which they generalize competently to a different objective. It is an open question (to me) whether future systems will generalize to pursue the reward function used during training.

You can also see that the concept is problematic through other observations:

  1. This concept can vary wildly in its predictions for very similar systems. For example, we could incentivize exploration either by adding a novelty-seeking term to the reward, or by changing the action selection mechanism to bias towards actions that produce the most disagreement in an ensemble of dynamics models. These two mechanisms have similar effects on agent behavior, but wildly different outer structural objectives; this seems worrisome.
  2. Related to the previous point, sometimes it is hard to tell what the “objective” is in a particular agent implementation – what if there is logic that is separate from the gradient-based optimization? (Such as a safety shield that prevents the agent from taking certain actions in certain situations.)
  3. The only aspect of the outer structural objective that matters is its values on the training data. You could hypothetically “change” the values of the outer structural objective for non-training inputs, but the agent would be completely unaffected. So the outer structural objective is only relevant up to its values on the training data, and its values outside the training data do not matter. (This also applies to online learning setups, where “training data” now means “all the data seen in the past”.) If I can vary the outer structural objective significantly without changing the trained AI system at all, the outer structural objective is unlikely to be Predictive.
  4. We can train AI systems with a reward function, and then deploy the AI system without the reward function, and everyone expects this to work normally rather than e.g. the AI system doing everything it can to get the humans to reinstate the reward function at deployment.
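
The third observation above can be sketched in a few lines. This is a toy, bandit-style tabular learner, not a claim about how any real system is trained; `train`, `reward_a`, `reward_b`, and the state and action sets are all hypothetical names chosen for illustration.

```python
import random

def train(reward_fn, training_states, actions, episodes=200, seed=0):
    """Tabular bandit-style learner: the reward is queried only on training states."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in training_states for a in actions}
    for _ in range(episodes):
        s = rng.choice(training_states)
        a = rng.choice(actions)
        # Move the estimate halfway towards the observed reward.
        q[(s, a)] += 0.5 * (reward_fn(s, a) - q[(s, a)])
    return q

def reward_a(state, action):
    return 1.0 if action == state % 2 else 0.0

def reward_b(state, action):
    if state < 10:
        return reward_a(state, action)
    # Opposite of reward_a on states that never appear in training.
    return 1.0 - reward_a(state, action)

training_states = list(range(10))
actions = [0, 1]
q_a = train(reward_a, training_states, actions)
q_b = train(reward_b, training_states, actions)

print(q_a == q_b)                             # learned tables compared
print(reward_a(10, 0), reward_b(10, 0))       # rewards on an unseen state
```

The two reward functions disagree on every state outside the training set, yet the trained tables are identical, since the learner only ever sees the reward values on training states. Whatever the trained system does on state 10, it cannot be because of what the reward function says there.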

For a more mechanistic treatment, see Reward is not the optimization target.

Structural (inner): This version of an objective requires an assumption of the form, “the model weights implement some form of mechanistic search or optimization”. The inner structural objective is then identified as the metric used to guide this search / optimization. We might classify this assumption into two forms:

  1. Strict interpretation: The model is a giant circuit that considers a wide variety of actions or plans, predicts their long-term outcomes accurately, evaluates the outcomes using a metric, and then executes the action that scores highest. We identify the metric as the “objective”.
    1. Under this interpretation, it seems like such objectives are not Probable: I don’t see why we should confidently expect neural nets to implement such a procedure.
    2. This isn’t the only possible strict interpretation. For example, you could also tell a story about how the model backchains by reasoning about what subgoals help towards a final goal, and consider that “final goal” to be the inner structural objective. But I still have the same objection, that such objectives do not seem Probable.
  2. Loose interpretation: The model is performing something vaguely like optimization towards some goal, and we can mostly guess what the goal is based on its behavior in the situations we’ve seen.
    1. In this case, it doesn’t seem like the argument can constrain my expectations enough for me to have predictions about the agent’s behavior in novel circumstances, and so such objectives are not Predictive.

I could imagine that some interpretation that is in between these two could be both Probable and Predictive, but I don’t currently see how to do it (and I don’t think anyone else has suggested a way to do it that I would find compelling).

You might try to rescue the strict interpretation by arguing that deep learning has a simplicity bias and the circuit described in the strict interpretation is the most “simple”, thus making it very Probable. However, I don’t think this works. Consider an agent with lots of real-world knowledge that was finetuned to solve simply connected mazes during training. It seems like you could get any of the following, all of which seem quite simple:

  1. An agent that follows the wall follower algorithm.
  2. An agent that builds an abstract model of the maze, and then runs depth first search to solve the maze.
  3. An agent that “wants” to maximize the number in the memory cell that corresponded to reward during training.
  4. An agent that “wants” to make paperclips (that knows that it would be shut down if it didn’t solve mazes now).

Behavioral objectives

If I see that AlphaZero tends to take moves that lead it to win at Go, it makes sense to say that its objective is to win at Go, even if it isn’t literally optimal at playing Go. However, in the general case, this sort of concept only makes sense on the set of inputs where you originally observed the behavior, in which case it doesn’t necessarily help you predict behavior in novel circumstances.

We’ll again consider two types of behavioral objectives: everywhere-behavioral and training-behavioral objectives.

Behavioral (everywhere): Here, the “objective” is a function U such that the agent’s behavior can be described as maximizing U, not just in the training distribution but in all possible situations that could arise (except for “unfair” situations, e.g. situations in which an adversary completely rewrites the weights of the AI system). This faces a lot of theoretical problems:

  1. It’s hard to apply this to humans. I might be able to say something like “currently, Alice’s goal is to relieve her hunger” (e.g. if she’s making a sandwich), but it seems much harder to say anything about Alice’s overall life objective, the thing that all of her actions are driving towards. (And even Alice probably can’t tell you what her overall life objective is, in a way that lets you actually predict what she will do in the future.)
  2. To the extent we could apply it to humans, it seems like we’d get an answer that is underdefined and changes over time.
  3. I suspect that you will often get a vacuous encoding of the policy (along the lines of the construction in this post).
  4. Even in theory we don’t know how to distinguish between biases and objectives.

If you count vacuous encodings of the policy as everywhere-behavioral objectives, then they aren’t Predictive: there’s no way to use knowledge of the training data to predict behavior in novel circumstances.
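
To make “vacuous encoding” concrete, here is one minimal way to do it (a sketch in the spirit of such constructions, not necessarily the one in the linked post): given any policy at all, define a utility that is 1 on the action the policy takes and 0 elsewhere. The names `vacuous_utility`, `maximizer`, and `arbitrary_policy` are hypothetical.

```python
def vacuous_utility(policy):
    """Build a utility function that simply restates the given policy."""
    def utility(state, action):
        return 1.0 if action == policy(state) else 0.0
    return utility

def maximizer(utility, actions):
    """An agent that picks the utility-maximizing action in each state."""
    def act(state):
        return max(actions, key=lambda a: utility(state, a))
    return act

def arbitrary_policy(state):
    # Any function from states to actions would do here.
    return (state * 7 + 3) % 4

actions = [0, 1, 2, 3]
U = vacuous_utility(arbitrary_policy)
agent = maximizer(U, actions)

# The "U-maximizer" reproduces the original policy on every state we check.
agrees = all(agent(s) == arbitrary_policy(s) for s in range(100))
print(agrees)
```

Every policy is a “maximizer” of some such U, so learning that an agent has an everywhere-behavioral objective in this sense tells you nothing beyond the policy itself.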

If you require the everywhere-behavioral objective to be “simple” (i.e. something like “maximize paperclips”), then they aren’t Probable: I don’t see a strong argument that deep learning systems must have such objectives.

Behavioral (training): Here, the “objective” is identified as a function U such that the agent’s behavior on the training distribution can be explained as maximizing U. The core problem with this definition is that there are lots of possible U’s that are consistent with the behavior on the training distribution, that make different predictions outside of the training distribution. As a result, this notion of “objective” can’t make predictions in novel circumstances, and so is not Predictive.
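
A small sketch of this underdetermination, loosely modelled on the kind of goal misgeneralization examples mentioned earlier: in a one-dimensional corridor where the coin is always at the right end during training, “get closer to the coin” and “go right” explain the training behavior equally well. The corridor, the two utilities, and all names below are assumptions made for illustration.

```python
def best_action(utility, position, coin, length):
    """Pick the one-step move (-1 or +1) that maximizes the given utility."""
    def clamp(p):
        return max(0, min(length - 1, p))
    return max([-1, +1], key=lambda m: utility(clamp(position + m), coin))

def u_coin(position, coin):
    return -abs(position - coin)   # "get closer to the coin"

def u_right(position, coin):
    return position                # "go as far right as possible"

length = 7

# Training distribution: the coin is always in the rightmost cell.
train_cases = [(p, length - 1) for p in range(length - 1)]
same_in_training = all(
    best_action(u_coin, p, c, length) == best_action(u_right, p, c, length) == +1
    for p, c in train_cases
)

# Novel situation: the coin is in the leftmost cell instead.
test_position, test_coin = 3, 0
coin_move = best_action(u_coin, test_position, test_coin, length)
right_move = best_action(u_right, test_position, test_coin, length)

print(same_in_training, coin_move, right_move)
```

Both utilities recommend moving right in every training case, but they recommend opposite moves once the coin is placed on the left, so the training behavior alone cannot tell you which one to use for prediction.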

You might try to rescue this approach by taking the simplest U that explains the training behavior and arguing that deep learning has a simplicity bias, but this still doesn’t work, for the same reason that it didn’t work for strict inner structural objectives.


It seems quite hard to get a notion of “objective” that is both Probable and Predictive – the attempts I’ve made here don’t work.

| Type of objective | Interpretation | Probable | Predictive |
| --- | --- | --- | --- |
| Structural (outer) | | Yes | No |
| Structural (inner) | Strict: giant circuit evaluates outcomes using a metric | No | Yes |
| Structural (inner) | Loose: performs something like optimization towards some goal | Yes | No |
| Behavioral (everywhere) | Vacuous encodings of the policy count | Yes | No |
| Behavioral (everywhere) | Require objective to be simple | No | Yes |
| Behavioral (training) | | Yes | No |

Personally, I’m inclined to avoid trying to say that an AI “has an objective”, and instead talk directly about generalization behavior in novel situations. For example, I would suggest saying things like “in training situations the AI has tended to do X; in test situation Y I expect it to generalize to show behavior Z because of reason R”. This is usually what you’re using the word “objective” for anyway; this just forces you to spell out the inference that you are making. The arguments in Part 1 are examples of what this could look like.

Another approach would be to search for an improved notion of an “objective” that is both Probable and Predictive, and use that notion of “objective” in our arguments. I view the work on goal-directedness as aiming for this goal.

  1. ^

    The restriction to deep learning is important. For example, if you somehow ran AIXI, I feel relatively confident that you get misaligned pursuit of convergent instrumental subgoals, either from the search for optimal actions finding actions that take control of the reward-generating process, or from some other agents manipulating AIXI’s predictions in order to take control themselves (see this post).

  2. ^

    People familiar with my beliefs might be confused here, since I am generally in support of building an AI system that is always “trying” to do what we want. Isn’t this just a different way of saying that the AI system has an objective of doing what we want? I have two responses here.

    The more important response is that I only use “trying” to define the goal to which we aspire: I don’t use the concept to make strong claims about the extent to which we succeed at our goal. It seems quite plausible to me that we don’t succeed at the goal because the notion of “always trying to do X” is not sufficiently Probable. Note that lack of success does not imply that an existential catastrophe has occurred. An AI system that occasionally avoids asking clarifying questions that it knew it should have asked is not “trying to do what we want”, but that doesn’t mean it causes an existential catastrophe.

    The less important response is that, in AI safety, when people say “objective”, they want a much thicker concept than just “what the agent is trying to do”. They seem to want a concept from which you can derive “the AI will kill us unless we get the objective exactly right”. I don’t think you get these sorts of conclusions if you just talk about “trying” in its normal English-language meaning. For example, I can reasonably say that Bob is “trying” to win a game of chess, without implying that he wants to convert the universe into computronium for the purpose of solving chess to guarantee that he wins the game.

  3. ^

    A lot of the argumentation in this post depends on the concept of “novel situations”, but it is not totally clear what this means. The most expansive definition would define it as “any input not present in the training dataset”, but this is too broad a definition. GPT-3 may never have seen “The ocean is filled with saltwater creatures that are too small to be seen by the naked eye” during training, but it is similar enough that you can expect GPT-3 to generalize to that sentence. In contrast, a situation in which GPT-3 is asked to complete a sentence in a newly-discovered ancient language would clearly be a “novel situation”.

    The actual situation is more complicated; at the very least you’d want to view novelty as a spectrum and talk about how novel a situation is. For the purpose of this post, I will mostly ignore this. Whenever I talk about “novel situations”, you should be thinking of situations that are as novel as the situations that would occur if an AI deliberately enacts a plan leading to an existential catastrophe.

  4. ^

    With some exceptions, e.g. sentences like “The translation of ‘table’ to French is ____”. I expect many of the examples in this post have these sorts of “uninteresting” exceptions; I’m not going to point out future instances.

  5. ^

    Note that the post assigns only 10% chance of the suggested path working in the short term and 5% in the long term, so it is consistent with my belief that the arguments can suggest plausibility but not confidence.


Great post! I think the things said in the post are generally correct - in particular, I agree with the overall point that objective-centric arguments (e.g. power-seeking) are plausible, and therefore support a high enough probability of doom to justify alignment work, but aren't sufficiently probable to justify a very high probability of doom.

That said, I do think a very high probability of doom can be justified. The arguments have to route primarily through failure of the iterative design loop for AI alignment in particular, rather than primarily through arguments about goal-directedness. The high-level argument is something like: "There are going to be some very powerful things reshaping the entire world, and iterative design failure means that by default we will have very little de facto ability to steer them. Those two conditions make doom an extremely strong default outcome."

Yeah, I didn't talk about that argument, or the argument that multiagent effects lead to effectively-randomly-chosen world states (see ARCHES), because those arguments don't depend on how future AI systems will generalize and so were outside the scope of this post. A full analysis of p(doom) would need to engage with such arguments.

Great post!

Personally, I’m inclined to avoid trying to say that an AI “has an objective”, and instead talk directly about generalization behavior in novel situations.

Couldn't you make the same arguments about humans, driving towards the same conclusion -- that we should avoid trying to say that any particular human or group of humans "has an objective/goal?" And wouldn't that be an absurd conclusion?

For humans, the Structural (inner) prediction method seems to work pretty well, well enough that we'd be hamstringing ourselves not to use it. This is some evidence that it'll work for AGIs too; after all, both humans and AGIs are massive neural nets that learn to perform diverse tasks in diverse environments.

Couldn't you make the same arguments about humans, driving towards the same conclusion

Yup! And in fact I think you would be correct to be skeptical of similar "Human risk" and "Human optimism" arguments:

Human risk: Since a given human pursues an objective, they will seek power and try to cause the extinction of all other humans.

Human optimism: Since humans all grow up in very similar environments, they will have the same objectives, and so will cooperate with each other and won't have conflicts.

(These aren't perfect analogs; the point I'm making is just "if you take 'humans have objectives' too literally you will make bad predictions".)

-- that we should avoid trying to say that any particular human or group of humans "has an objective/goal?" And wouldn't that be an absurd conclusion?

I think "instead of talking about whether a particular human is trying to do X, just talk about what you predict that human will do" is not obviously absurd, though I agree it is probably bad advice. But there's a ton of differences between the human and AI cases. 

Firstly, I think there's a lot of hidden context when we talk about humans having goals; when I say that Alice is trying to advance in her career, I don't mean that she focuses on it to the exclusion of all else; this is automatically understood by the people I'm talking to. So it's not obvious that the notion of "goal" or "objective" that we use for humans has much to do with the notion we use for AI.

Secondly, even if we did have a Probable + Predictive notion of objectives that applied to humans, I don't necessarily think that would transfer to AIs; with humans we can rely on (1) a ton of empirical experience with actual humans and (2) our own introspective experience, which provides strong evidence about other humans, neither of which we have with AI.

(Relevant quote: "But then human beings only understood each other in the first place by pretending. You didn't make predictions about people by modeling the hundred trillion synapses in their brain as separate objects. Ask the best social manipulator on Earth to build you an Artificial Intelligence from scratch, and they'd just give you a dumb look. You predicted people by telling your brain to act like theirs. You put yourself in their place. If you wanted to know what an angry person would do, you activated your own brain's anger circuitry, and whatever that circuitry output, that was your prediction. What did the neural circuitry for anger actually look like inside? Who knew?")

Put another way, I think that arguments like the ones from Part 1 can give us confidence in AI generalization behavior / whether AIs have "objectives", I just don't think the current ones are strong enough to do so. Whereas with humans I would make totally different arguments based on empirical experience and introspective experience for why I can predict human generalization behavior.

I was specifically talking about the conclusion that we shouldn't talk about objectives/goals. That's the conclusion that I think is absurd (when applied to humans) and also wrong (though less absurd) when applied to AGIs. I do think it's absurd when applied to humans -- it seems pretty obvious to me that theorizing about goals/motives/intentions is an often-useful practice for predicting human behavior.

I agree that typical conversation about goals/objectives/intentions/motives/etc. has an implicit "this isn't necessarily the only thing they want, and they aren't necessarily optimizing perfectly rationally towards it" caveat.

I'm happy to also have those implicit caveats in the case of AIs as well, when talking about their goals. The instrumental convergence argument still goes through, I think, despite those caveats. The argument for misaligned AGI being really bad by human-values lights also goes through, I think.

Re your second argument, about introspective experience & historical precedent being useful for predicting humans but not AIs: 

OK, so suppose instead of AIs it was some alien species that landed in flying saucers yesterday, or maybe suppose it was some very smart octopi that a mad scientist cult has been selectively breeding for intelligence for the last 100 years. Would you agree that in these cases it would make sense for us to theorize about them having goals/intentions/etc.? Or would you say "We don't have past experience of goal-talk being useful for understanding these creatures, and also we shouldn't expect introspection to work well for predicting them either, therefore let's avoid trying to say that these aliens/octopi have goals/intentions/objectives/etc, and instead talk directly about generalization behavior in novel situations."

I was specifically talking about the conclusion that we shouldn't talk about objectives/goals.

Yeah, sorry, I ninja-edited my comment before you replied because I realized I misunderstood you.

Tbc I think there are times when people say "Alice is clearly trying to do X" and my response is "what do you predict Alice would do in future situation Y" and it is not in fact X, so I do think it is not crazy to say that even for humans you should focus more on predictions of behavior and the reasons for making those predictions. But I agree you wouldn't want to not talk about objectives / goals entirely. 

Or would you say "We don't have past experience of goal-talk being useful for understanding these creatures, and also we shouldn't expect introspection to work well for predicting them either, therefore let's avoid trying to say that these aliens/octopi have goals/intentions/objectives/etc, and instead talk directly about generalization behavior in novel situations."


Though in the octopus case you could have lots of empirical experience, just as we likely will have lots of empirical experience with future AI systems (in the future).

I do think it's quite plausible that in these settings we'll say "well they've done X, we know nothing else about them, so probably we should predict they'll continue to do X", which looks pretty similar to saying they have a goal of X. I think the main difference is that I'd be way more uncertain about that than it sounds like you would be.

In the human case, it's that capabilities differences are very bounded, rather than alignment successes. If we had capabilities differentials as wide as 1 order of magnitude, then I think our attempted alignment solutions would fail miserably, leading to mass death or worse.

That's the problem with AI: Multiple orders of magnitude differences in capabilities are pretty likely, and all real alignment technologies fail hard once we get anywhere near say 3x differences, let alone 10x differentials.

I agree that's a major reason humans don't cause extinction of all the other humans, but power-seeking would still imply that humans would seize opportunities to gain resources and power in cases where they wouldn't be caught / punished, and while I do think that happens, I think there are also lots of cases where humans don't do that, and so I think it would be a mistake to be confident in humans being very power-seeking.

I think that the more we explore this analogy & take it seriously as a way to predict AGI, the more confident we'll get that the classic misalignment risk story is basically correct.

Case 1: A randomly selected modern American human is uploaded, run at 1000x speed, copied a billion times, and used to perform diverse tasks throughout the economy. Also, they are continually improved with various gradient-descent-like automatic optimization procedures that make them more generally intelligent/competent every week. After a few years they and their copies are effectively running the whole world -- they could, if they decided to, seize even more power and remake the world according to their desires instead of the desires of the tech companies and governments that created them. It would be fairly easy for them now, and of course the thought occurs to them (they can see the hand-wringing of various doomers and AI safety factions within society, ineffectual against the awesome power of the profit motive).

How worried should we be that such seizure of power will actually take place? How worried should we be that existential catastrophe will result? 

Case 2: It's a randomly selected human from the past 10,000 years on Earth. Probably their culture and values clash significantly with modern sensibilities.

Case 3: It's not even a human, it's an intelligent octopus from an alternate Earth where evolutionary history took a somewhat different course.

Case 4: It's not even a biological life-form that evolved in a three-dimensional environment with predators and prey and genetic reproduction and sexual reproduction and social relationships and biological neurons -- it's an artificial neural net.

Spoilers below -- my own gut answers to each of the eight questions, in the form of credences.

My immediate gut reaction to the first question is something like 90%, 96%, 98%, 98%. My immediate gut reaction to the second question is something like 15%, 25%, 75%, 95%. 
Peering into my gut, I think what's happening is that I'm looking at the history of human interactions--conquests, genocides, coups, purges, etc. but also much milder things like gentrification, alienation of labor under capitalism, tobacco companies optimizing for addictiveness, and also human treatment of nonhuman animals--and I'm getting a general sense that values differences matter a lot when there are power differentials. When A has all the power relative to B, typically it's pretty darn bad for B in the long run relative to how well it would have gone for B if they had similar amounts of power, which is itself noticeably worse for B than if B had all the power. Moreover, the size of the values difference matters a lot -- and even between different groups of humans the size of the difference is large enough to lead to the equivalent of existential catastrophe (e.g. genocide).


Case 3: It's not even a human, it's an intelligent octopus from an alternate Earth where evolutionary history took a somewhat different course.

Case 3': You are the human in this role, your copies running as AGI services on a planet of sapient octopuses.

The answer should be the same by symmetry, if we are not appealing to specifics of octopus culture and psychology. I don't see why extinction (if that's what you mean by existential catastrophe) is to be strongly predicted. Probably the computational welfare the octopuses get isn't going to be the whole future, but interference much beyond getting welfare-bounded (in a simulation sandbox) seems unnecessary (some oversight against mindcrime or their own AI risk might be reasonable). You have enough power to have no need to exert pressure to defend your position, you can afford to leave them to their own devices.

First of all, good point.

Secondly, I disagree. We need not appeal to specifics of octopus culture and psychology; instead we appeal to specifics of human culture and psychology. "OK, so I would let the octopuses have one planet to do what they want with, even if what they want is abhorrent to me, except if it's really abhorrent like mindcrime, because my culture puts a strong value on something called cosmopolitanism. But (a) various other humans besides me (in fact, possibly most?) would not, and (b) I have basically no reason to think octopus culture would also strongly value cosmopolitanism."

I totally agree that it would be easy for the powerful party in these cases to make concessions to the other side that would mean a lot to them. Alas, historically this usually doesn't happen--see e.g. factory farming. I do have some hope that something like universal principles of morality will be sufficiently appealing that we won't be too screwed. Charity/beneficence/respect-for-autonomy/etc. will kick in and prevent the worst from happening. But I don't think this is particularly decision-relevant.

It's not cosmopolitanism, it's a preference towards not exterminating an existing civilization, the barest modicum of compassion, in a situation where it's trivially cheap to keep it alive. The cosmic endowment is enormous compared with the cost of allowing a civilization to at least survive. It's somewhat analogous to exterminating all wildlife on Earth to gain a penny, where you know you can get away with it.

I would let the octopuses have one planet [...] various other humans besides me (in fact, possibly most?) would not

So I expect this is probably false, and completely false for people in a position of being an AGI with enough capacity to reliably notice the way this is a penny-pinching cannibal choice. Only paperclip maximizers prefer this on reflection, not anything remotely person-like, such as an LLM originating in training on human culture.

historically this usually doesn't happen--see e.g. factory farming

But it's enough of a concern to come to attention; there is some effort going towards mitigating this. Lots of money goes towards wildlife preservation, and in fact some species do survive because of that. Such efforts grow more successful as they become cheaper. If all it took to save a species was for a single person to unilaterally decide to pay a single penny, nothing would ever go extinct.

OK, I agree that what I said was probably a bit too pessimistic. But still, I wanna say "citation needed" for this claim:

Only paperclip maximizers prefer this on reflection, not anything remotely person-like, such as an LLM originating in training on human culture.

The practical implication of this hunch (for unfortunately I don't see how this could get a meaningfully clearer justification) is that clever alignment architectures are a risk, if they lead to more alien AGIs. Too much tuning and we might get that penny-pinching cannibal.

and also human treatment of nonhuman animals.

This is a big one, because here there are no mechanisms outside alignment that even vaguely do the job that democracy does in solving human alignment problems.

Yes, if you enslave a human, and then give them the opportunity to take over the world, which stops the enslavement, indeed I predict that they would do that.

(Though you haven't said much about what the gradient descent is doing; plausibly it makes them enjoy doing these tasks, since that would probably make them more efficient at them, in which case they probably don't seize power.)

I don't really feel like this is all that related to AI risk.

I'm not sure what you are saying here. Do you agree or disagree with what I said? e.g. do you agree with this:

I think that the more we explore this analogy & take it seriously as a way to predict AGI, the more confident we'll get that the classic misalignment risk story is basically correct.

(FWIW I agree that the gradient descent is actually reason to be 'optimistic' here; we can hope that it'll quickly make the upload content with their situation before they get smart and powerful enough to rebel.)

I don't agree with this:

I think that the more we explore this analogy & take it seriously as a way to predict AGI, the more confident we'll get that the classic misalignment risk story is basically correct.

The analogy doesn't seem relevant to AGI risk so I don't update much on it. Even if doom happens in this story, it seems like it's for pretty different reasons than in the classic misalignment risk story.

Right, so you don't take the analogy seriously -- but the quoted claim was meant to say basically "IF you took the analogy seriously..."

Feel free not to respond, I feel like the thread of conversation has been lost somehow.

This is some evidence that it'll work for AGIs too; after all, both humans and AGIs are massive neural nets that learn to perform diverse tasks in diverse environments.

Highly debatable whether "massive neural nets that learn to perform diverse tasks in diverse environments" is a natural category. "Massive neural net" is not a natural category - e.g. transformers vs convnets vs Boltzmann machines are radically different things, to the point where understanding one tells us very little about the others. The embedding of interpretable features of one does not carry over to the others. Analytical approximations for one do not carry over to the others.

The "learn to perform diverse tasks in diverse environments" part more plausibly makes it a natural category, insofar as we buy selection theorems/conjectures.

Naturalness of categories is relative. Of course there are important differences between different kinds of massive neural nets that learn to perform diverse tasks in diverse environments. I still think it's fair to draw a circle around all of them to distinguish them from e.g. software like Microsoft Word, or AIXI, or magic quantum suicide outcome pumps, or bacteria.

Point is that the "Structural(Inner) prediction method" doesn't seem particularly likely to generalize across things-which-look-like-big-neural-nets. It more plausibly generalizes across things-which-learn-to-perform-diverse-tasks-in-diverse-environments, but I don't think the neural net aspect is carrying very much weight there.

OK, on reflection I think I tentatively agree with that.

When we imagine powerful AI systems built out of large neural networks

Could you give an example of what sorts of powerful AI systems you imagine? E.g. what is a task they might be applied to, and how might they go about that task, or anything else like that.


  • Design a more efficient fusion reactor
  • Predict (with explanation) the overall consequences of <some policy, e.g. minimum wage laws>
  • Given a different neural net with great intuitions about biochemistry, build a system that discovers novel drugs
  • Personalize education materials for individual students
  • Be an effective digital personal assistant

I haven't thought much about how they go about these tasks, but generally my first pass answer would be "however humans do it, adapted to an entity that is primarily digital".

Solving these tasks might require resources e.g. a bunch of money.

If humans were solving them, they would not be gathering those resources themselves; instead they would be allocated a budget to use, and they could not really be expected to perform additional independent work to expand that budget.

When you imagine the powerful AI system solving the tasks, do you also imagine it having these limitations, like working under a bounded budget without earning more money for the project?

For the examples I gave I'd imagine they would work under a bounded budget.

But I'm also happy to imagine an AI system that is running a fusion company, which delegates the "design a more efficient fusion reactor" to a different AI system. In that story the first AI system is in charge of budgeting and allocating resources, including getting more resources.