Hm, I’d probably disagree.

A couple thoughts here:

First: To me, it seems one important characteristic of “planners” is that they can improve their decisions/behavior even without doing additional learning. For example, if I’m playing chess, there might be some move that (based on my previous learning) initially presents itself as the obvious one to make. But I can sit there and keep running mental simulations of different games I haven’t yet played (“What would happen if I moved that piece there…?”) and arrive at better and better decisions.

It doesn’t seem like that’d be true of a deployed version of AlphaGo without MCTS. If you present it with some board state, it seems like it will just take whatever action (or distribution of actions) is already baked into its policy. There’s not a sense, I think, in which it will keep improving its decision. Unlike in the MCTS case, you can’t tweak some simple parameter and give it more ‘time to think’ and allow it to make a better decision. So that’s one sense in which AlphaGo without MCTS doesn’t seem, to me, like it could exhibit planning.
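To make the 'time to think' contrast concrete, here is a minimal sketch. It is not AlphaGo or real MCTS: it is a toy flat Monte Carlo search over three moves, and every name and number in it (TRUE_WIN_PROB, POLICY_PRIOR, policy_only_move, planned_move, accuracy) is a made-up assumption for illustration. The frozen policy always returns the same move, while the planner's choice tends to improve as its simulation budget grows.

```python
import random

# Hypothetical toy position: three candidate moves with hidden win probabilities.
TRUE_WIN_PROB = [0.4, 0.5, 0.7]
# Hypothetical preferences baked in by earlier learning; they wrongly favor move 1.
POLICY_PRIOR = [0.3, 0.5, 0.2]


def policy_only_move():
    """Return the move the frozen policy prefers; there is no budget to raise."""
    return max(range(len(POLICY_PRIOR)), key=lambda m: POLICY_PRIOR[m])


def simulate(move, rng):
    """One imagined playout after `move`: 1 for a win, 0 for a loss."""
    return 1 if rng.random() < TRUE_WIN_PROB[move] else 0


def planned_move(num_simulations, rng):
    """Flat Monte Carlo search: split the simulation budget evenly across moves."""
    per_move = num_simulations // len(TRUE_WIN_PROB)
    if per_move == 0:
        return policy_only_move()  # no time to think: fall back on the prior
    win_rates = [
        sum(simulate(m, rng) for _ in range(per_move)) / per_move
        for m in range(len(TRUE_WIN_PROB))
    ]
    return max(range(len(win_rates)), key=lambda m: win_rates[m])


def accuracy(num_simulations, trials=200, seed=0):
    """Fraction of trials in which the planner finds the truly best move."""
    rng = random.Random(seed)
    best = max(range(len(TRUE_WIN_PROB)), key=lambda m: TRUE_WIN_PROB[m])
    hits = sum(planned_move(num_simulations, rng) == best for _ in range(trials))
    return hits / trials
```

Calling accuracy(3) and then accuracy(3000) shows the single tweakable parameter at work: the larger budget finds the best move far more often, whereas policy_only_move() has no analogous knob.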

However, second: A version of AlphaGo without explicit MCTS might still qualify as a “planner” on a thinner conception of “planning.” In this case, I suppose the hypothesis would be that: when we do a single forward pass through the network, we carry out some computations that are roughly equivalent to the computations involved in (e.g.) MCTS. I suppose that can’t be ruled out, although I’m also not entirely sure how to think about it. One thing we could still say, though, is that insofar as planning processes tend to involve a lot of sequential steps, the number of layers in MCTS-less AlphaGo would seriously limit the amount of ‘planning’ it can do. Eight layers don’t seem like nearly enough for a forward pass to correspond to any meaningful amount of planning.

So my overall view is: For a somewhat strict conception of “planning,” it doesn’t seem like feedforward networks can plan. For a somewhat loose conception of “planning,” it actually is conceivable that a feedforward network could plan, but (I think) only if it had a really huge number of layers. I’m also not sure that there would be a tendency for the system to start engaging in this kind of “planning” as layer count increases; I haven't thought enough to have a strong take.[1]

  1. Also, to clarify: I think that the question of whether feedforward networks can plan probably isn’t very practically relevant, in-and-of-itself — since they’re going to be less important than other kinds of networks. I’m interested in this question mainly as a way of pulling apart different conceptions of “planning,” noticing ambiguities and disagreements, etc. ↩︎


I really appreciate you taking the time both to write this report and to solicit/respond to all these reviews! I think this is a hugely valuable resource that has helped me to better understand AI risk arguments and the range of views/cruxes that different people have.

A couple quick notes related to the review I contributed:

First, .4% is the credence implied by my credences in individual hypotheses, but I was a little surprised by how small this number turned out to be. (I would have predicted closer to a couple percent at the time.) I’m sympathetic to the possibility that the high level of conjunctiveness here created some amount of downward bias, even if the argument does actually have a highly conjunctive structure.

Second (only of interest to anyone who looked at my review): My sense is we still haven’t succeeded in understanding each other’s views about the nature and risk-relevance of planning capabilities. For example, I wouldn’t necessarily agree with this claim in your response to the section on planning:

Presumably, after all, a fixed-weight feedforward network could do whatever humans do when we plan trips to far away places, think about the best way to cut down different trees, design different parts of a particle collider, etc -- and this is the type of cognition I want to focus on.

Let’s compare a deployed version of AlphaGo with and without Monte Carlo tree search. It seems like the version with Monte Carlo tree search could be said to engage in planning: roughly speaking, it simulates the implications of different plays, and these simulations are used to arrive at better decisions. It doesn’t seem to me like there’s any sense in which the version of AlphaGo without MCTS is doing this. [1] Insofar as Go-playing humans simulate the implications of different plays, and use the simulations to arrive at better decisions, I don’t think a plain fixed-weight feedforward Go-playing network could be said to be doing the same sort of cognition as people. It could still play as well as humans, if it had been trained well enough, but it seems to me that the underlying cognition would nonetheless be different.

I feel like I have a rough sense of the distinction between these two versions of AlphaGo and a rough sense of how this distinction might matter for safety. But if both versions engage in “planning,” by some thinner conception of “planning,” then I don’t think I have a good understanding of what this version of the “planning”/“non-planning” distinction is pointing at — or why it matters.

It might be interesting to try to more fully unpack our views at some point, since I do think that differences in how people think about planning might be an underappreciated source of disagreement about AI risk (esp. around 'inner alignment').

  1. One way of pressing this point: There’s not really a sense in which you could give it more ‘time to think,’ in a given turn, and have its ultimate decision keep getting better and better. ↩︎

Answer by bmg

A tricky thing here is that it really depends how quickly a technology is adopted, improved, integrated, and so on.

For example, it seems like computers and the internet caused a bit of a surge in American productivity growth in the 90s. The surge wasn't anything radical, though, for at least a few reasons:

  1. Continued technological progress is necessary just to sustain steady productivity growth.

  2. It's apparently very hard, in general, to increase aggregate productivity.

  3. The adoption, improvement, integration, etc., of information technology was a relatively gradual process.

If we had instead suddenly jumped from a world where no company has information technology to one where every company is using 2020-level information technology (and using it with 2020-level tacit knowledge, IT-enabling capital investments, IT-adapted business practices, complementary technologies, etc.), then the productivity growth rate for that year would probably have been very high. But the gradualness of everything flattened out the surge. Given how slowly diffusion happens globally, I actually wouldn't be surprised if the surge was totally invisible at the global level.

So if we want to predict that some technology (e.g. fusion power) will help surge the growth rate above some high threshold, we will also typically need to predict that its aggregate impact will be unusually sudden.
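A rough back-of-the-envelope sketch of why gradualness flattens the surge. The function name annual_growth and all of the input numbers are illustrative assumptions, not estimates of the actual effect of information technology. The sketch assumes a technology that ultimately raises the productivity level by a fixed total amount, with that gain spread evenly (in compounding terms) over the adoption period and stacked on top of a constant baseline growth rate.

```python
def annual_growth(baseline, total_gain, years):
    """Yearly productivity growth while a technology diffuses.

    baseline:   growth rate that would have occurred anyway (e.g. 0.015)
    total_gain: eventual proportional boost to the productivity level
    years:      number of years over which the boost is spread evenly
    """
    yearly_boost = (1 + total_gain) ** (1 / years)
    return (1 + baseline) * yearly_boost - 1
```

With the made-up inputs of 1.5% baseline growth and a 20% eventual level gain, annual_growth(0.015, 0.20, 1) comes out near 21.8%, while annual_growth(0.015, 0.20, 30) is only about 2.1% per year. The total gain is the same in both cases; only the suddenness differs.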


Since neural networks are universal function approximators, it is indeed the case that some of them will implement specific search algorithms.

I don't think this specific point is true. It seems to me like the difference between functions and algorithms is important. You can also approximate any function with a sufficiently large look-up table, but simply using a look-up table to choose actions doesn't involve search/planning.* In this regard, something like a feedforward neural network with frozen weights also doesn't seem importantly different than a look-up table to me.
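A small sketch of the function/algorithm distinction, using a toy subtraction game (players alternately remove 1 to 3 stones, and whoever takes the last stone wins). The game and the names can_win, search_move, LOOKUP, and table_move are my own illustrative assumptions. Both choosers compute the same input-output function on the covered states, but only one of them looks ahead through the game tree at decision time.

```python
MAX_TAKE = 3  # each turn a player removes 1..3 stones; taking the last stone wins


def can_win(stones):
    """Search: the player to move wins if some move leaves the opponent losing."""
    return any(
        not can_win(stones - take)
        for take in range(1, min(MAX_TAKE, stones) + 1)
    )


def search_move(stones):
    """Choose a move by exploring the game tree when the decision is made."""
    for take in range(1, min(MAX_TAKE, stones) + 1):
        if not can_win(stones - take):
            return take
    return 1  # every move loses against good play; just take one stone


# Frozen table, filled in once ahead of time and never updated afterwards.
LOOKUP = {stones: search_move(stones) for stones in range(1, 16)}


def table_move(stones):
    """Choose a move by retrieval alone; no lookahead happens here."""
    return LOOKUP[stones]
```

For any pile size from 1 to 15, table_move and search_move return the same move, so the two are behaviorally indistinguishable there. The difference is in the process: search_move evaluates hypothetical continuations, while table_move only retrieves an answer that was computed earlier.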

One naive perspective: Systems like AlphaGo and MuZero do search, because they implement Monte-Carlo tree search algorithms, but if you were to remove their MCTS components then they simply wouldn't do search. Search algorithms can be used to update the weights of neural networks, but neural networks don't themselves do search.

I think this naive perspective may be wrong, because it's possible that recurrence is sufficient for search/planning processes to emerge (e.g. see this paper). But then, if that's true, I think that the power of recurrence is the important thing to emphasize, rather than the fact that neural networks are universal function approximators.

*I'm thinking of search algorithms as cognitive processes, rather than input-output behaviors (which could be produced via a wide range of possible algorithms). If you're thinking of them as behaviors, then my point no longer holds. Although I've interpreted the mesa-optimization paper (and most other discussions of mesa-optimization) as talking about cognitive processes.


I do agree that OT and ICT by themselves, without any further premises like "AI safety is hard" and "The people building AI don't seem to take safety seriously, as evidenced by their public statements and their research allocation" and "we won't actually get many chances to fail and learn from our mistakes" does not establish more than, say, 1% credence in "AI will kill us all," if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bostrom right before he published the book whether he agrees with you re the implications of OT and ICT on their own, he would have completely agreed. And the text itself seems to agree.

I mostly agree with this. (I think, in responding to your initial comment, I sort of glossed over "and various other premises"). Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.

I think that "discontinuity + OT + ICT," rather than "OT + ICT" alone, has typically been presented as the core of the argument. For example, the extended summary passage from Superintelligence:

An existential risk is one that threatens to cause the extinction of Earth-originating intelligent life or to otherwise permanently and drastically destroy its potential for future desirable development. Proceeding from the idea of first-mover advantage, the orthogonality thesis, and the instrumental convergence thesis, we can now begin to see the outlines of an argument for fearing that a plausible default outcome of the creation of machine superintelligence is existential catastrophe.

First, we discussed how the initial superintelligence might obtain a decisive strategic advantage. This superintelligence would then be in a position to form a singleton and to shape the future of Earth-originating intelligent life. What happens from that point onward would depend on the superintelligence’s motivations.

Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.

Third, the instrumental convergence thesis entails that we cannot blithely assume that a superintelligence with the final goal of calculating the decimals of pi (or making paperclips, or counting grains of sand) would limit its activities in such a way as not to infringe on human interests. An agent with such a final goal would have a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources and, if possible, to eliminate potential threats to itself and its goal system. Human beings might constitute potential threats; they certainly constitute physical resources.

Taken together, these three points thus indicate that the first superintelligence may shape the future of Earth-originating life, could easily have non-anthropomorphic final goals, and would likely have instrumental reasons to pursue open-ended resource acquisition. If we now reflect that human beings consist of useful resources (such as conveniently located atoms) and that we depend for our survival and flourishing on many more local resources, we can see that the outcome could easily be one in which humanity quickly becomes extinct.

There are some loose ends in this reasoning, and we shall be in a better position to evaluate it after we have cleared up several more surrounding issues. In particular, we need to examine more closely whether and how a project developing a superintelligence might either prevent it from obtaining a decisive strategic advantage or shape its final values in such a way that their realization would also involve the realization of a satisfactory range of human values. (Bostrom, p. 115-116)

If we drop the 'likely discontinuity' premise, as some portion of the community is inclined to do, then OT and ICT are the main things left. A lot of weight would then rest on these two theses, unless we supplement them with new premises (e.g. related to mesa-optimization).

I'd also say that there are three especially salient secondary premises in the classic arguments: (a) even many seemingly innocuous descriptions of global utility functions ("maximize paperclips," "make me happy," etc.) would result in disastrous outcomes if these utility functions were optimized sufficiently well; (b) if a broadly/highly intelligent system is inclined toward killing you, it may be good at hiding this fact; and (c) if you decide to run a broadly superintelligent system, and that superintelligent system wants to kill you, you may be screwed even if you're quite careful in various regards (e.g. even if you implement "boxing" strategies). At least if we drop the discontinuity premise, though, I don't think they're compelling enough to bump us up to a high credence in doom.


I agree that your paper strengthens the ICT (and is also, in general, very cool!). One possible objection to the ICT, as traditionally formulated, has been that it's too vague: there are lots of different ways you could define a subset of possible minds, and then a measure over that subset, and not all of these ways actually imply that "most" minds in the subset have dangerous properties. Your paper definitely makes the ICT crisper, more clearly true, and more closely/concretely linked to AI development practices.

I still think, though, that the ICT only gets us a relatively small portion of the way to believing that extinction-level alignment failures are likely. A couple of thoughts I have are:

  1. It may be useful to distinguish between "power-seeking behavior" and omnicide (or equivalently harmful behavior). We do want AI systems to pursue power-seeking behaviors, to some extent. Making sure not to lock yourself in the bathroom, for example, qualifies as a power-seeking behavior -- it's akin to avoiding "State 2" in your diagram -- but it is something that we'd want any good house-cleaning robot to do. It's only a particular subset of power-seeking behavior that we badly want to avoid (e.g. killing people so they can't shut you off.)

    This being said, I imagine that, if we represented the physical universe as an MDP, and defined a reward function over states, and used a sufficiently low discount rate, then the optimal policy for most reward functions probably would involve omnicide. So the result probably does port over to this special case. Still, I think that keeping in mind the distinction between omnicide and "power-seeking behavior" (in the context of some particular MDP) does reduce the ominousness of the result to some degree.

  2. Ultimately, for most real-world tasks, I think it's unlikely that people will develop RL systems using hand-coded reward functions (and then deploy them). I buy the framing in (e.g.) the DM "scalable agent alignment" paper, Rohin's "narrow value learning" sequence, and elsewhere: that, over time, the RL development process will necessarily look less-and-less like "pick a reward function and then let an RL algorithm run until you get a policy that optimizes the reward function sufficiently well." There's seemingly just not that much that you can do using hand-written reward functions. I think that these more sophisticated training processes will probably be pretty strongly attracted toward non-omnicidal policies. At a higher level, engineers will also be attracted toward using training processes that produce benign/useful policies. They should have at least some ability to notice or foresee issues with classes of training processes, before any of them are used to produce systems that are willing and able to commit omnicide. Ultimately, in other words, I think it's reasonable to be optimistic that we'll do much better than random when producing the policies of advanced AI systems.

    I do still think that the ICT is true, though, and I do still think that it matters: it's (basically) necessary for establishing a high level of misalignment risk. I just don't think it's sufficient to establish a high level of risk (and am skeptical of certain other premises that would be sufficient to establish this).
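The bathroom example from point 1 can be sketched as a tiny deterministic MDP. This is my own toy construction, not the diagram or formal setup from the paper: the state names, the uniform random rewards, the discount factor of 0.95, and the function names build_house, optimal_first_move, and fraction_avoiding_bathroom are all assumptions for illustration. From the start state the agent can either lock itself in the bathroom (a dead end) or step into a hall that leads to several rooms.

```python
import random


def build_house(num_rooms):
    """Map each state to the states reachable from it in one step."""
    moves = {
        "start": ["bathroom", "hall"],
        "bathroom": ["bathroom"],  # locked in: no way back out
        "hall": [f"room{i}" for i in range(num_rooms)],
    }
    for i in range(num_rooms):
        moves[f"room{i}"] = [f"room{i}"]  # each room is also a final stop
    return moves


def optimal_first_move(moves, reward, gamma=0.95, sweeps=400):
    """Run value iteration, then return the best successor of 'start'."""
    value = {s: 0.0 for s in moves}
    for _ in range(sweeps):
        value = {
            s: reward[s] + gamma * max(value[n] for n in moves[s])
            for s in moves
        }
    return max(moves["start"], key=lambda n: value[n])


def fraction_avoiding_bathroom(num_rooms=3, trials=300, seed=0):
    """Share of randomly drawn reward functions whose optimal policy skips the bathroom."""
    rng = random.Random(seed)
    moves = build_house(num_rooms)
    avoided = 0
    for _ in range(trials):
        reward = {s: rng.random() for s in moves}  # uniform rewards: an assumption
        if optimal_first_move(moves, reward) == "hall":
            avoided += 1
    return avoided / trials
```

In this toy setup, most sampled reward functions favor the hall simply because it keeps more final destinations available. That is "power-seeking" in the thin sense of preserving options, and it is exactly the behavior we would want from a house-cleaning robot, which is why the gap between this and omnicide seems worth keeping in view.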


I think we can interpret it as a burden-shifting argument; "Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you'd better have some pretty solid arguments that everything's going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important)." As far as I know no one has come up with any such arguments, and in fact it's now the consensus in the field that no one has found such an argument.

I suppose I disagree that the orthogonality thesis and instrumental convergence, at least on their own, shift the burden. The OT basically says: "It is physically possible to build an AI system that would try to kill everyone." The ICT basically says: "Most possible AI systems within some particular set would try to kill everyone." If we stop here, then we haven't gotten very far.

To repurpose an analogy: Suppose that you lived very far back in the past and suspected that people would eventually try to send rockets with astronauts to the moon. It's true that it's physically possible to build a rocket that shoots astronauts out aimlessly into the depths of space. Most possible rockets that are able to leave earth's atmosphere would also send astronauts aimlessly out into the depths of space. But I don't think it'd be rational to conclude, on these grounds, that future astronauts will probably be sent out into the depths of space. The fact that engineers don't want to make rockets that do this, and are reasonably intelligent, and can learn from lower-stakes experiences (e.g. unmanned rockets and toy rockets), does quite a lot of work. If you're not worried about just one single rocket trajectory failure, but systematically more severe trajectory failures (e.g. people sending larger and larger manned rockets out into the depths of space), then the rational degree of worry becomes increasingly low.

Even sillier example: It's possible to make poisons, and there are way more substances that are deadly to people than there are substances that inoculate people against coronavirus, but we don't need to worry much about killing everyone in the process of developing and deploying coronavirus vaccines. This is true even if it turned out that we don't currently know how to make an effective coronavirus vaccine.

I think the OT and ICT on their own almost definitely aren't enough to justify an above 1% credence in extinction from AI. To get the rational credence up into (e.g.) the 10%-50% range, I think that stuff like mesa-optimization concerns, discontinuity premises, explanations of how plausible development techniques/processes could go badly wrong, and explanations of dynamics around unnoticed deceptive tendencies in AI systems still need to do almost all of the work.

(Although a lot depends on how high a credence we're trying to justify. A 1% credence in human extinction from misaligned AI is more than enough, IMO, to justify a ton of research effort, although it also probably has pretty different prioritization implications than a 50% credence.)


for example, the "Universal prior is malign" stuff shows that in the limit GPT-N would likely be catastrophic,

If you have a chance, I'd be interested in your line of thought here.

My initial model of GPT-3, and probably the model of the OP, is basically: GPT-3 is good at producing text that it would have been unsurprising to find on the internet. If we keep training up larger and larger models, using larger and larger datasets, it will produce text that it would be less-and-less surprising to find on the internet. Insofar as there are safety concerns, these mostly have to do with misuse -- or with people using GPT-N as a starting point for developing systems with more dangerous behaviors.

I'm aware that people who are more worried do have arguments in mind, related to stuff like inner optimizers or the characteristics of the universal prior, but I don't feel I understand them well -- and am, perhaps unfairly, beginning from a place of skepticism.

It's unfair to complain about GPT-3's lack of ability to simulate you to get out of the box, etc. since it's way too stupid for that, and the whole point of AI safety is to prepare for when AI systems are smart.

I think that OP's question is sort of about whether this way of speaking/thinking about GPT-3 makes sense in the first place.

Intentionally silly example: Suppose that people were expressing concern about the safety of graphing calculators, saying things like: "OK, the graphing calculator that you own is safe. But that's just because it's too stupid to recognize that it has an incentive to murder you, in order to achieve its goal of multiplying numbers together. The stupidity of your graphing calculator is the only thing keeping you alive. If we keep improving our graphing calculators, without figuring out how to better align their goals, then you will likely die at the hands of graphing-calculator-N."

Obviously, something would be off about this line of thought, although it's a little hard to articulate exactly what. In some way, it seems, the speaker's use of certain concepts (like "goals" and "stupidity") is probably to blame. I think that it's possible that there is an analogous problem, although certainly a less obvious one, with some of the safety discussion around GPT-3.