This is an approximate outline/transcript of a presentation I'll be giving in COMP 680, "Advanced Topics in Software Engineering", with footnotes and links to relevant source material.

Assumptions and Content Warning

This presentation assumes a couple of things.  The first is that something like materialism/reductionism (non-dualism) is true, particularly with respect to intelligence - that intelligence is the product of deterministic phenomena which we are, with some success, reproducing.  The second is that humans are not the upper bound of possible intelligence.

This is also a content warning that this presentation includes discussion of human extinction.

If you don't want to sit through a presentation with those assumptions, or given that content warning, feel free to sign off for the next 15 minutes.

Preface

There are many real issues with current AI systems, which are not the subject of this presentation:

  • bias in models used for decision-making (financial, criminal justice, etc)
  • economic displacement and potential IP infringement
  • enabling various kinds of bad actors (targeted phishing, cheaper spam/disinformation campaigns, etc)

This is about the unfortunate likelihood that, if we create sufficiently intelligent AI using anything resembling the current paradigm, then we will all die.

Why? I'll give you a more detailed breakdown, but let's sketch out some basics first.

What is intelligence?

One useful way to think about intelligence is that it's what lets us imagine that we'd like the future to be a certain way, and then - intelligently - plan and take actions to cause that to happen.  The default state of nature is entropy.  The reason we have nice things is that we, intelligent agents optimizing for specific goals, can reliably cause things to happen in the external world by understanding it and using that understanding to manipulate it into a state we like more.

We know that humans are not anywhere near the frontier of intelligence.  We have many reasons to believe this, both theoretical and empirical:

  • very short span of time, in evolutionary terms, between our last common ancestors with other primates (e.g. chimpanzees) and modern humans, and therefore very little time for evolution to have optimized as hard for intelligence as it has for other traits
  • evolution optimizes for inclusive genetic fitness; intelligence is useful for fitness, but it is only one input among many, so it was never optimized for directly
  • evolution deals with many physical constraints, e.g. birth canals constraining skull size
  • evolution is a dumb hill-climbing process that can't take "shortcuts", like spotting a beneficial mutation and inserting it into every descendant in the next generation - beneficial mutations take a long time to reach fixation, and with no "lookahead", its "designs" are kludgy
  • very large differences in intelligence between individual humans, e.g. John von Neumann
  • in many domains we've blown far past human ability.  AlphaZero, in 24 hours of self-play (with no human-generated training data), reached superhuman levels of ability in chess, go, and shogi, beating not just the best humans but all previous state-of-the-art software (which was already superhuman).
    • What it means for AlphaZero to be more intelligent than us (in these domains) is that we can very safely predict that it will win those games.  Of course, we don't know how it would win; if you could accurately model the strategy it would use, it would by definition not be more intelligent than you.  But we can make a very high-confidence prediction about the end state.
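
AlphaZero itself combines deep neural networks with Monte Carlo tree search; the sketch below is nothing like that, just a minimal stand-in (tabular Monte Carlo self-play on tic-tac-toe, a toy of my own construction) to make the "no human-generated training data" point concrete: every training example the learner sees comes from games it played against itself.

    # Toy self-play learner (assumption: this is NOT AlphaZero's method, only an
    # illustration that a policy can be learned entirely from the program's own games).
    import random
    from collections import defaultdict

    Q = defaultdict(float)   # Q[(board, move)] -> value estimate for the player to move
    ALPHA, EPSILON, EPISODES = 0.3, 0.2, 50_000
    LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

    def winner(board):
        for a, b, c in LINES:
            if board[a] != "." and board[a] == board[b] == board[c]:
                return board[a]                      # "X" or "O"
        return "draw" if "." not in board else None  # None: game not over

    def choose(board, explore=True):
        moves = [i for i, cell in enumerate(board) if cell == "."]
        if explore and random.random() < EPSILON:
            return random.choice(moves)
        return max(moves, key=lambda m: Q[(board, m)])

    for _ in range(EPISODES):
        board, player, history = "." * 9, "X", []
        while True:
            move = choose(board)
            history.append((board, move, player))
            board = board[:move] + player + board[move + 1:]
            result = winner(board)
            if result is not None:
                # Crude Monte Carlo update: push each move's value toward the final outcome.
                for state, action, mover in history:
                    outcome = 0.0 if result == "draw" else (1.0 if mover == result else -1.0)
                    Q[(state, action)] += ALPHA * (outcome - Q[(state, action)])
                break
            player = "O" if player == "X" else "X"

    print(choose("." * 9, explore=False))   # the opening move the learner now prefers, learned from zero human games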

Instrumental Convergence

Terminal goals are those things that you value for their own sake.  Some examples in humans: aesthetic experiences, success in overcoming challenges, the good regard of friends and family.

Instrumental goals are those things that you value for the sake of achieving other things (such as other instrumental or terminal goals).  A common human example: making money.  The case of human goals is complicated by the fact that many goals are dual-purpose - you might value the good regard of your friends not just for its own sake, but also for the sake of future benefits that might accrue to you as a result.  (How much less would you value the fact that your friends think well of you, if you never expected that to make any other observable difference in your life?)

The theory of instrumental convergence says that sufficiently intelligent agents will converge to a relatively small set of instrumental goals, because those goals are useful for a much broader set of terminal goals. 

A narrow example of this has been demonstrated, both formally and experimentally, by Turner in "Parametrically retargetable decision-makers tend to seek power".

Convergent instrumental goals include power seeking, resource acquisition, goal preservation, and avoiding shut-off.
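
Turner's result is formal; the snippet below is only a toy restatement of the intuition (my own construction, not the paper's setup).  An agent picks between an action that leads to a single fixed outcome ("accept shutdown") and one that keeps several outcomes reachable ("stay on").  Across randomly sampled goals, keeping options open is the optimal choice for the large majority of them:

    # For most randomly chosen goals, the option-preserving action is optimal;
    # here that fraction tends to N_OPTIONS / (N_OPTIONS + 1).
    import random

    TRIALS, N_OPTIONS = 100_000, 5
    stay_optimal = 0
    for _ in range(TRIALS):
        # Sample a "goal": one reward value per reachable terminal outcome.
        shutdown_reward = random.random()                            # sole outcome if the agent lets itself be shut off
        stay_rewards = [random.random() for _ in range(N_OPTIONS)]   # outcomes still reachable if it stays on
        if max(stay_rewards) > shutdown_reward:                      # an optimal planner takes whichever branch pays more
            stay_optimal += 1

    print(stay_optimal / TRIALS)   # ~0.83 for N_OPTIONS = 5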

Orthogonality Thesis

Arbitrarily intelligent agents can have arbitrary goals.

Most goals are not compatible with human flourishing; even goals that seem "very close" to ideal when described informally will miss enormous amounts of relevant detail. (Intuition pump: most possible configurations of atoms do not satisfy human values very well.) Human values are a very complicated manifold in a high-dimensional space.  We have not yet figured out how to safely optimize for our own values - all ethical systems to date have degenerate outcomes (e.g. the repugnant conclusion under total utilitarianism).
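
To make the thesis concrete, here is a toy sketch (my own, not a claim about any real system): the machinery that searches competently is indifferent to what it is pointed at.  The same hill climber below pursues a goal and its exact opposite equally well.

    # The optimizer knows nothing about the goal; the goal is just a parameter.
    import random

    def optimize(objective, bits=16, steps=5_000):
        x = [random.randint(0, 1) for _ in range(bits)]   # random starting point
        for _ in range(steps):
            y = x[:]
            y[random.randrange(bits)] ^= 1                # flip one bit
            if objective(y) >= objective(x):              # keep the change if it doesn't hurt the goal
                x = y
        return x

    print(optimize(lambda b: sum(b)))    # goal: as many ones as possible
    print(optimize(lambda b: -sum(b)))   # the opposite goal, pursued just as competently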

We don't currently know how to use existing ML architectures to create models with any specific desired goal, let alone avert future problems that might crop up if we do solve that problem (such as ontological shifts).

Humans care about other humans because evolutionary pressures favored cooperative strategies, and "actually caring" turned out to outcompete other methods of achieving cooperation. Needless to say, this is not how AIs are trained.

Takeover

A sufficiently intelligent agent would have little difficulty in taking over.  No canonical source, but:

  • Offense generally beats defense
  • Reality seems to have "exploits", and it seems very unlikely that we've found all of them
    • Long-distance communications
    • Nuclear power
    • Biology (as existence proof for nanotechnology)
    • Digital computing
  • The leading AI labs don't have a security mindset when it comes to building something that might behave adversarially, so a future AGI won't even need to deal with the slightly harder task of escaping sandboxing and operating secretly until it's ready; it'll just get deployed to the internet with minimal supervision (as we're currently doing).

The Builders are Worried

A bunch of very smart people are trying very hard to build AGI.  As the amount of available compute grows, the cleverness required to do so drops.  Those people claim to think that this is a real risk, but are not taking it as seriously as I would expect.

  1. Sam Altman:
    1. "Development of superhuman machine intelligence (SMI) is probably the greatest threat to the continued existence of humanity"[1] (2015)
    2. "AI will probably most likely lead to the end of the world, but in the meantime, there'll be great companies."[2] (2015)
    3. "Some people in the AI field think the risks of AGI (and successor systems) are fictitious; we would be delighted if they turn out to be right, but we are going to operate as if these risks are existential."[3] (2023)
  2. Shane Legg:
    1. "Eventually, I think human extinction will probably occur, and technology will likely play a part in this.  But there's a big difference between this being within a year of something like human level AI, and within a million years. As for the former meaning...I don't know.  Maybe 5%, maybe 50%. I don't think anybody has a good estimate of this."[4]  (2011, responding to the question "What probability do you assign to the possibility of negative/extremely negative consequences as a result of badly done AI?")
    2. "It's my number 1 risk for this century, with an engineered biological pathogen coming a close second (though I know little about the latter)." (same interview as above, responding to "Do possible risks from AI outweigh other possible existential risks, e.g. risks associated with the possibility of advanced nanotechnology?")
  3. Dario Amodei:
    1. "I think at the extreme end is the Nick Bostrom style of fear that an AGI could destroy humanity. I can’t see any reason and principle why that couldn’t happen."[5] (2017)
  4. Other AI luminaries who are not attempting to build AGI, such as Stuart Russell and Geoffrey Hinton, also think it's a real risk.
    1. Geoffrey Hinton:
      1. "It's somewhere between, um, naught percent and 100 percent. I mean, I think, I think it's not inconceivable."[6] (2023, responding to “...what do you think the chances are of AI just wiping out humanity? Can we put a number on that?”)
      2. “Well, here’s a subgoal that almost always helps in biology: get more energy. So the first thing that could happen is these robots are going to say, ‘Let’s get more power. Let’s reroute all the electricity to my chips.’ Another great subgoal would be to make more copies of yourself. Does that sound good?”[7] (2023)
      3. Quit his job at Google 3 days ago to talk about AI safety “without having to worry about how it interacts with Google’s business”[8] (2023).

Open Research Questions

  • Specifying the correct training objective, i.e. one that accurately represents what you wanted (“outer alignment”):
    • The typical example is the story of a genie granting someone’s wish in the most literal sense possible.  We don’t know how to specify what we want very well, which is very dangerous if we’re going to ask something much smarter than us to give it to us.  There are many ways to fail, including both simply specifying the wrong thing, and specifying the “right” thing but not correctly anticipating the consequences of a much more powerful optimizer solving exactly that problem for you (a toy demonstration follows this list).
  • Ensuring that training with that objective actually results in internal cognition pointed at that objective, rather than something else (“inner alignment”):
    • Even if we did know how to perfectly specify what we wanted, in a way that still gave results we endorsed when something much smarter than us optimized for it, we don’t actually have any idea how to create an agent that actually cares about any particular thing.  We can write reward functions, but reward is not the optimization target!  Per Turner[9]:
      1. Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
      2. Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent's network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
  • Robustness:
    • Out-of-distribution generalization.
      • This is a well-known problem in machine learning: many training methods work well when training and testing data are independent and identically distributed, but see sharp drops in performance when the distribution of inputs shifts.  Why does this happen?  Because in most cases there are many possible ways to fit the data, and the model your training regime finds first will not, by default, be the one that accurately describes the underlying distribution (a minimal demonstration follows this list).  The problem is that with superintelligence, even if you can be confident that the system will make reasonable decisions in cases where there is enough representative training data, an “unreasonable” decision in a novel situation would be quite bad, since those situations seem much more likely to be characterized by statistical extrema.  Of course, this is a problem you only get to deal with if you solve all the problems that come up before it.  We don’t even know how to do the “easy” part yet, which is to have an AI internalize our “in-distribution” values.
    • Ontological shifts.
      • First, what is an ontology?  An ontology is a model of reality, mostly concerned with how you categorize things, and how things relate to other things.  One specific open problem that has been posed in AI alignment is the “diamond maximizer” problem.  The challenge is to state an unbounded (in computational terms) program that will turn as much of the physical universe into diamonds as possible.  The goal of making diamonds was chosen because it could be described in very concrete physical terms - “the amount of diamond is the number of carbon atoms covalently bound to four other carbon atoms”, which skips the additional difficult challenge of trying to load human values into this system.
      • The first problem you run into is figuring out where the concept of “diamonds” even lives in the system’s model of the world.  This is the problem of ontology identification.  If you make something very smart, it will know about a lot of things, not just diamonds.  You need to be able to robustly point at the very specific concept of diamonds inside a model of the world that might be very different from ours.  You also have the problem that your own ontology might not be correct: do we understand physics well enough to reliably translate “atoms” into the deepest fundamental structure of reality that can be pointed at?  If we don’t, we have to point at the higher-level abstraction of atoms and hope that abstraction isn’t “leaky”.
      • The system’s ontology will shift when it learns some new fact about the world that causes it to discard an existing ontology that it’s been using - because it turned out to be false - and adopt a new one.  Did you design the system’s values in terms of the original ontology?  Haha, whoops.
  • Ensuring that a model will allow correction and/or shutdown by humans (“corrigibility”)
  • Understanding what current systems are doing at a deep level (“mechanistic interpretability”)
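
Two toy demonstrations referenced above follow.  The first is the outer-alignment failure mode: a hand-written objective that omits part of what we care about gets satisfied, under enough optimization pressure, in a way that destroys the omitted part.  (The "speed"/"safety" trade-off and all the numbers here are my own hypothetical construction, not anything from the literature.)

    # Optimizing a proxy objective harder makes the proxy score better and the
    # true value worse (all names and numbers are hypothetical).
    import random

    random.seed(0)

    def make_plan():
        # Each candidate "plan" splits a fixed budget between something we
        # remembered to measure (speed) and something we forgot (safety).
        speed = random.uniform(0, 1)
        return {"speed": speed, "safety": 1.0 - speed}

    def true_value(plan):      # what we actually want: both matter
        return plan["speed"] * plan["safety"]

    def proxy_value(plan):     # what we wrote down: only speed is rewarded
        return plan["speed"]

    for optimization_power in (10, 1_000, 100_000):   # a stronger optimizer considers more plans
        best = max((make_plan() for _ in range(optimization_power)), key=proxy_value)
        print(optimization_power, round(proxy_value(best), 4), round(true_value(best), 4))
    # Trend: the proxy score climbs toward 1 while the true value collapses toward 0.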
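
The second sketch is the out-of-distribution point from the robustness item above: a model (here just a polynomial fit, as a stand-in) that matches the training range very well can be wildly wrong on a shifted range it never saw.

    # Fit on x in [0, 2], evaluate on x in [6, 8]; the in-distribution error is
    # tiny and the out-of-distribution error is enormous.
    import numpy as np

    rng = np.random.default_rng(0)
    ground_truth = np.sin                                   # the function we are trying to learn

    x_train = rng.uniform(0, 2, 200)                        # narrow training range
    y_train = ground_truth(x_train) + rng.normal(0, 0.05, 200)

    model = np.poly1d(np.polyfit(x_train, y_train, deg=3))  # one of many models that fit the training data

    x_iid = rng.uniform(0, 2, 200)                          # same distribution as training
    x_ood = rng.uniform(6, 8, 200)                          # shifted distribution

    print("in-distribution MSE:    ", float(np.mean((model(x_iid) - ground_truth(x_iid)) ** 2)))
    print("out-of-distribution MSE:", float(np.mean((model(x_ood) - ground_truth(x_ood)) ** 2)))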

Further Reading

  1. https://archive.is/o/8GkKl/blog.samaltman.com/machine-intelligence-part-1
  2. https://archive.is/u3pjs
  3. https://openai.com/blog/planning-for-agi-and-beyond
  4. https://www.lesswrong.com/posts/No5JpRCHzBrWA4jmS/q-and-a-with-shane-legg-on-risks-from-ai
  5. https://80000hours.org/podcast/episodes/the-world-needs-ai-researchers-heres-how-to-become-one/#transcript
  6. https://www.youtube.com/watch?v=qpoRO378qRY
  7. https://www.technologyreview.com/2023/05/02/1072528/geoffrey-hinton-google-why-scared-ai/
  8. https://www.technologyreview.com/2023/05/02/1072528/geoffrey-hinton-google-why-scared-ai/
  9. https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target
