I’ve argued that AI systems could defeat all of humanity combined, if (for whatever reason) they were directed toward that goal.
Here I’ll explain why I think they might - in fact - end up directed toward that goal. Even if they’re built and deployed with good intentions.
In fact, I’ll argue something a bit stronger than that they might end up aimed toward that goal. I’ll argue that if today’s AI development methods lead directly to powerful enough AI systems, disaster is likely1 by default (in the absence of specific countermeasures).
Unlike other discussions of the AI alignment problem,3 this post will discuss the likelihood4 of AI systems defeating all of humanity (not more general concerns about AIs being misaligned with human intentions), while aiming for plain language, conciseness, and accessibility to laypeople, and focusing on modern AI development paradigms. I make no claims to originality, and list some key sources and inspirations in a footnote.5
Summary of the piece:
My basic assumptions. I assume the world could develop extraordinarily powerful AI systems in the coming decades. I previously examined this idea at length in the most important century series.
Furthermore, in order to simplify the analysis:
AI “aims.” I talk a fair amount about why we might think of AI systems as “aiming” toward certain states of the world. I think this topic causes a lot of confusion, because:
Dangerous, unintended aims. I’ll examine what sorts of aims AI systems might end up with, if we use AI development methods like today’s - essentially, “training” them via trial-and-error to accomplish ambitious things humans want.
Limited and/or ambiguous warning signs. The risk I’m describing is - by its nature - hard to observe, for similar reasons that a risk of a (normal, human) coup can be hard to observe: the risk comes from actors that can and will engage in deception, finding whatever behaviors will hide the risk. If this risk plays out, I do think we’d see some warning signs - but they could easily be confusing and ambiguous, in a fast-moving situation where there are lots of incentives to build and roll out powerful AI systems, as fast as possible. Below, I outline how this dynamic could result in disaster, even with companies encountering a number of warning signs that they try to respond to.
FAQ. An appendix will cover some related questions that often come up around this topic.
I’ll be making a number of assumptions that some readers will find familiar, but others will find very unfamiliar.
Some of these assumptions are based on arguments I’ve already made (in the most important century series). Some are for the sake of simplifying the analysis, for now (with more nuance coming in future pieces).
Here I’ll summarize the assumptions briefly, and you can click to see more if it isn’t immediately clear what I’m assuming or why.
In the most important century series, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.
I focus on a hypothetical kind of AI that I call PASTA, or Process for Automating Scientific and Technological Advancement. PASTA would be AI that can essentially automate all of the human activities needed to speed up scientific and technological advancement.
Using a variety of different forecasting approaches, I argue that PASTA seems more likely than not to be developed this century - and there’s a decent chance (more than 10%) that we’ll see it within 15 years or so.
I argue that the consequences of this sort of AI could be enormous: an explosion in scientific and technological progress. This could get us more quickly than most imagine to a radically unfamiliar future.
I’ve also argued that AI systems along these lines could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
For more, see the most important century landing page. The series is available in many formats, including audio; I also provide a summary, and links to podcasts where I discuss it at a high level.
It’s hard to talk about risks from transformative AI because of the many uncertainties about when and how such AI will be developed - and how much the (now-nascent) field of “AI safety research” will have grown by then, and how seriously people will take the risk, etc. etc. etc. So maybe it’s not surprising that estimates of the “misaligned AI” risk range from ~1% to ~99%.
This piece takes an approach I call nearcasting: trying to answer key strategic questions about transformative AI, under the assumption that such AI arrives in a world that is otherwise relatively similar to today's.
You can think of this approach like this: “Instead of asking where our ship will ultimately end up, let’s start by asking what destination it’s pointed at right now.”
That is: instead of trying to talk about an uncertain, distant future, we can talk about the easiest-to-visualize, closest-to-today situation, and how things look there - and then ask how our picture might be off if other possibilities play out. (As a bonus, it doesn’t seem out of the question that transformative AI will be developed extremely soon - 10 years from now or faster.6 If that’s the case, it’s especially urgent to think about what that might look like.)
What I mean by “black-box trial-and-error” is explained briefly in an old Cold Takes post, and in more detail in more technical pieces by Ajeya Cotra (section I linked to) and Richard Ngo (section 2). Here’s a quick, oversimplified characterization:
With this assumption, I’m generally assuming that AI systems will do whatever it takes to perform as well as possible on their training tasks - even when this means engaging in complex, human-like reasoning about topics like “How does human psychology work, and how can it be exploited?” I’ve previously made my case for when we might expect AI systems to become this advanced and capable.
Future pieces will relax this assumption, but I think it is an important starting point to get clarity on what the default looks like - and on what it would take for a countermeasure to be effective.
(I also think there is, unfortunately, a risk that there will in fact be very few efforts to address the concerns I’ll be raising below. This is because I think that the risks will be less than obvious, and there could be enormous commercial (and other competitive) pressure to move forward quickly. More on that below.)
“Ambition” assumption: people use black-box trial-and-error to continually push AI systems toward being more autonomous, more creative, more ambitious, and more effective in novel situations (and the pushing is effective). This one’s important, so I’ll say more:
I think this implies pushing in a direction of figuring out whatever it takes to get to certain states of the world and away from carrying out the same procedures over and over again.
The resulting AI systems seem best modeled as having “aims”: they are making calculations, choices, and plans to reach particular states of the world. (Not necessarily the same ones the human designers wanted!) The next section will elaborate on what I mean by this.
When people talk about the “motivations” or “goals” or “desires” of AI systems, it can be confusing because it sounds like they are anthropomorphizing AIs - as if they expect AIs to have dominance drives ala alpha-male psychology, or to “resent” humans for controlling them, etc.9
I don’t expect these things. But I do think there’s a meaningful sense in which we can (and should) talk about things that an AI system is “aiming” to do. To give a simple example, take a board-game-playing AI such as Deep Blue (or AlphaGo):
Nothing about this requires Deep Blue “desiring” checkmate the way a human might “desire” food or power. But Deep Blue is making calculations, choices, and - in an important sense - plans that are aimed toward reaching a particular sort of state.
Throughout this piece, I use the word “aim” to refer to this specific sense in which an AI system might make calculations, choices and plans selected to reach a particular sort of state. I’m hoping this word feels less anthropomorphizing than some alternatives such as “goal” or “motivation” (although I think “goal” and “motivation,” as others usually use them on this topic, generally mean the same thing I mean by “aim” and should be interpreted as such).
Now, instead of a board-game-playing AI, imagine a powerful, broad AI assistant in the general vein of Siri/Alexa/Google Assistant (though more advanced). Imagine that this AI assistant can use a web browser much as a human can (navigating to websites, typing text into boxes, etc.), and has limited authorization to make payments from a human’s bank account. And a human has typed, “Please buy me a great TV for a great price.” (For an early attempt at this sort of AI, see Adept’s writeup on an AI that can help with things like house shopping.)
As Deep Blue made choices about chess moves, and constructed a plan to aim for a “checkmate” position, this assistant might make choices about what commands to send over a web browser and construct a plan to result in a great TV for a great price. To sharpen the Deep Blue analogy, you could imagine that it’s playing a “game” whose goal is customer satisfaction, and making “moves” consisting of commands sent to a web browser (and “plans” built around such moves).
I’d characterize this as aiming for some state of the world that the AI characterizes as “buying a great TV for a great price.” (We could, alternatively - and perhaps more correctly - think of the AI system as aiming for something related but not exactly the same, such as getting a high satisfaction score from its user.)
In this case - more than with Deep Blue - there is a wide variety of “moves” available. By entering text into a web browser, an AI system could imaginably do things including:
I haven’t yet argued that it’s likely for such an AI system to engage in deceiving/manipulating humans, finding and exploiting security vulnerabilities, or running its own AI systems.
And one could reasonably point out that the specifics of the above case seem unlikely to last very long: if AI assistants are sending deceptive emails and writing dangerous code when asked to buy a TV, AI companies will probably notice this and take measures to stop such behavior. (My concern, to preview a later part of the piece, is that they will only succeed in stopping the behavior like this that they’re able to detect; meanwhile, dangerous behavior that accomplishes “aims” while remaining unnoticed and/or uncorrected will be implicitly rewarded. This could mean AI systems are implicitly being trained to be more patient and effective at deceiving and disempowering humans.)
But this hopefully shows how it’s possible for an AI to settle on dangerous actions like these, as part of its aim to get a great TV for a great price. Malice and other human-like emotions aren’t needed for an AI to engage in deception, manipulation, hacking, etc. The risk arises when deception, manipulation, hacking, etc. are logical “moves” toward something the AI is aiming for.
Furthermore, whatever an AI system is aiming for, it seems likely that amassing more power/resources/options is useful for obtaining it. So it seems plausible that powerful enough AI systems would form habits of amassing power/resources/options when possible - and deception and manipulation seem likely to be logical “moves” toward those things in many cases.
From the previous assumptions, this section will argue that:
Say that I train an AI system like this:
Here’s a problem: at some point, it seems inevitable that I’ll ask it a question that I myself am wrong/confused about. For example:
If and when I do this, I am now - unintentionally - training the AI system to engage in deceptive behavior. That is, I am giving negative reinforcement for the behavior “Answer a question honestly and accurately,” and positive reinforcement for the behavior: “Understand the human judge and their psychological flaws; give an answer that this flawed human judge will think is correct, whether or not it is.”
Perhaps mistaken judgments in training are relatively rare. But now consider an AI system that is learning a general rule for how to get good ratings. Two possible rules would include:
The unintended rule would do just as well on questions where I (the judge) am correct, and better on questions where I’m wrong - so overall, this training scheme is (in the long run) specifically favoring the unintended rule over the intended rule.
If we broaden out from thinking about a question-answering AI to an AI that makes and executes plans, the same basic dynamics apply. That is: an AI might find plans that end up making me think it did a good job when it didn’t - deceiving and manipulating me into a high rating. And again, if I train it by giving it positive reinforcement when it seemed to do a good job and negative reinforcement when it seemed to do a bad one, I’m ultimately - unintentionally - training it to do something like “Deceive and manipulate Holden when this would work well; just do the best job on the task you can when it wouldn’t.”
As noted above, I’m assuming the AI will learn whatever rule gives it the best performance possible, even if this rule is quite complex and sophisticated and requires human-like reasoning about e.g. psychology (I’m assuming extremely advanced AI systems here, as noted above).
One might object: “Why would an AI system learn a complicated rule about manipulating humans when a simple rule about telling the truth performs almost as well?”
One answer is that “telling the truth” is itself a fuzzy and potentially complex idea, in a context where many questions will be open-ended and entangled with deep values and judgment calls. (How should I think about the “truthfulness” of a statement about whether “pre-agriculture gender relations were bad?”) In many cases, what we are really hoping an AI system will learn from its training is something like “Behave as a human would want you to behave if the human understood all the considerations that you can see,” which could easily be more complex than something like “Behave in whatever way a human literally rewards.” Some links to more on this topic are in a footnote.11
But also, with capable enough systems, it probably is worth learning even a more complex rule to get better performance. If we picture humans in the place of AI systems - learning how to get good ratings from their supervisors - it seems likely that they develop models of how they’re being judged, rather than stick to a simple “tell the truth” rule.
In fact, the situation I’m describing seems reasonably similar to challenges faced by teachers trying to stop students from cheating; employers trying to get employees to help the company rather than simply creating a misleading appearance of helping the company; etc. That is: an employer can try to reward actions that help the company, but what they’re ultimately rewarding is actions that seem to help the company as far as the employer can tell. For employees who aren’t very creative, this might be good enough; for employees that are sufficiently good at deception and manipulation, it can be exactly the wrong incentive.
It’s a bit of a leap to imagine AI systems capable of doing the same kinds of reasoning that humans do in these situations, but that’s the premise I’m proceeding from.
For an early example of an AI learning deceptive behavior, see the “Challenges” section of this research writeup, in which “a robot which was supposed to grasp items instead positioned its [hand] in between the camera and the object so that it only appeared to be grasping it”:
Returning to the simplified training setup, how might I try to counteract this problem?
I could try conducting audits: every 100 questions, I might pull in a panel of other people to give an answer extra scrutiny. Then, if it turns out the AI system was preying on my personal mistakes, we could give negative reinforcement for that behavior. But this doesn’t really solve the problem - at least not for a capable enough AI system - because now we’re effectively training it to give answers that seem good to the panel, and we’re still rewarding any successful attempts to deceive or manipulate the panel.
There are a lot of other things I might try, and I’m not going to go through all the details here. I’ll simply claim that the problem of “training an AI to do a task well” rather than “training an AI to deceive and manipulate me as needed to create the appearance of doing a task well” seems like a deep one with no easy countermeasure. If you’re interested in digging deeper, I suggest Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover and Eliciting Latent Knowledge.
Above, I talk about my expectation that AI systems will be “best modeled as having ‘aims’ … making calculations, choices, and plans to reach particular states of the world.”
The previous section illustrated how AI systems could end up engaging in deceptive and unintended behavior, but it didn’t talk about what sorts of “aims” these AI systems would ultimately end up with - what states of the world they would be making calculations to achieve.
Here, I want to argue that it’s hard to know what aims AI systems would end up with, but there are good reasons to think they’ll be aims that we didn’t intend them to have.
An analogy that often comes up on this topic is that of human evolution. This is arguably the only previous precedent for a set of minds [humans], with extraordinary capabilities [e.g., the ability to develop their own technologies], developed essentially by black-box trial-and-error [some humans have more ‘reproductive success’ than others, and this is the main/only force shaping the development of the species].
You could sort of12 think of the situation like this: “An AI13 developer named Natural Selection tried giving humans positive reinforcement (making more of them) when they had more reproductive success, and negative reinforcement (not making more of them) when they had less. One might have thought this would lead to humans that are aiming to have reproductive success. Instead, it led to humans that aim - often ambitiously and creatively - for other things, such as power, status, pleasure, etc., and even invent things like birth control to get the things they’re aiming for instead of the things they were ‘supposed to’ aim for.”
Similarly, if our main strategy for developing powerful AI systems is to reinforce behaviors like “Produce technologies we find valuable,” the hoped-for result might be that AI systems aim (in the sense described above) toward producing technologies we find valuable; but the actual result might be that they aim for some other set of things that is correlated with (but not the same as) the thing we intended them to aim for.
There are a lot of things they might end up aiming for, such as:
I think it’s extremely hard to know what an AI system will actually end up aiming for (and it’s likely to be some combination of things, as with humans). But by default - if we simply train AI systems by rewarding certain end results, while allowing them a lot of freedom in how to get there - I think we should expect that AI systems will have aims that we didn’t intend. This is because:
So by default, it seems likely that just about any black-box trial-and-error training process is training an AI to do something like “Manipulate humans as needed in order to accomplish arbitrary goal (or combination of goals) X” rather than to do something like “Refrain from manipulating humans; do what they’d want if they understood more about what’s going on.”
I think a powerful enough AI (or set of AIs) with any ambitious, unintended aim(s) poses a threat of defeating humanity. By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply “containing” us in some way, such that we can’t interfere with AIs’ aims.
A previous piece argues that AI systems could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply “containing” us in some way, such that we can’t interfere with AIs’ aims.
One way this could happen would be via “superintelligence” It’s imaginable that a single AI system (or set of systems working together) could:
But even if “superintelligence” never comes into play - even if any given AI system is at best equally capable to a highly capable human - AI could collectively defeat humanity. The piece explains how.
The basic idea is that humans are likely to deploy AI systems throughout the economy, such that they have large numbers and access to many resources - and the ability to make copies of themselves. From this starting point, AI systems with human-like (or greater) capabilities would have a number of possible ways of getting to the point where their total population could outnumber and/or out-resource humans.
More: AI could defeat all of us combined
A simple way of summing up why this is: “Whatever your aims, you can probably accomplish them better if you control the whole world.” (Not literally true - see footnote.15)
This isn’t a saying with much relevance to our day-to-day lives! Like, I know a lot of people who are aiming to make lots of money, and as far as I can tell, not one of them is trying to do this via first gaining control of the entire world. But in fact, gaining control of the world would help with this aim - it’s just that:
Another saying that comes up a lot on this topic: “You can’t fetch the coffee if you’re dead.”16 For just about any aims an AI system might have, it probably helps to ensure that it won’t be shut off or heavily modified. It’s hard to ensure that one won’t be shut off or heavily modified as long as there are humans around who would want to do so under many circumstances! Again, defeating all of humanity might seem like a disproportionate way to reduce the risk of being deactivated, but for an AI system that has the ability to pull this off (and lacks our ethical constraints), it seems like likely default behavior.
Controlling the world, and avoiding being shut down, are the kinds of things AIs might aim for because they are useful for a huge variety of aims. There are a number of other aims AIs might end up with for similar reasons, that could cause similar problems. For example, AIs might tend to aim for things like getting rid of things in the world that tend to create obstacles and complexities for their plans. (More on this idea at this discussion of “instrumental convergence.”)
To be clear, it’s certainly possible to have an AI system with unintended aims that don't push it toward trying to stop anyone from turning it off, or from seeking ever-more control of the world.
But as detailed above, I’m picturing a world in which humans are pushing AI systems to accomplish ever-more ambitious, open-ended things - including trying to one-up the best technologies and companies created by other AI systems. My guess is that this leads to increasingly open-ended, ambitious unintended aims, as well as to habits of aiming for power, resources, options, lack of obstacles, etc. when possible. (Some further exploration of this dynamic in a footnote.17)
(I find the arguments in this section reasonably convincing, but less so than the rest of the piece, and I think more detailed discussions of this problem tend to be short of conclusive.18)
Here’s something that would calm me down a lot: if I believed something like “Sure, training AI systems recklessly could result in AI systems that aim to defeat humanity. But if that’s how things go, we’ll see that our AI systems have this problem, and then we’ll fiddle with how we’re training them until they don’t have this problem.”
The problem is, the risk I’m describing is - by its nature - hard to observe, for similar reasons that a risk of a (normal, human) coup can be hard to observe: the risk comes from actors that can and will engage in deception, finding whatever behaviors will hide the risk.
To sketch out the general sort of pattern I worry about, imagine that:
One way of making this sort of future less likely would be to build wider consensus today that it’s a dangerous one.
Above, I give the example of AI systems that are aiming to get lots of “digital representations of human approval”; others have talked about AIs that maximize paperclips. How could AIs with such silly goals simultaneously be good at deceiving, manipulating and ultimately overpowering humans?
My main answer is that plenty of smart humans have plenty of goals that seem just about as arbitrary, such as wanting to have lots of sex, or fame, or various other things. Natural selection led to humans who could probably do just about whatever we want with the world, and choose to pursue pretty random aims; trial-and-error-based AI development could lead to AIs with an analogous combination of high intelligence (including the ability to deceive and manipulate humans), great technological capabilities, and arbitrary aims.
(Also see: Orthogonality Thesis)
This does seem possible, but counting on it would make me very nervous.
First, because it’s possible that AI systems developed in lots of different places, by different humans, still end up with lots in common in terms of their aims. For example, it might turn out that common AI training methods consistently lead to AIs that seek “digital representations of human approval,” in which case we’re dealing with a large set of AI systems that share dangerous aims in common.
Second: even if AI systems end up with a number of different aims, it still might be the case that they coordinate with each other to defeat humanity, then divide up the world amongst themselves (perhaps by fighting over it, perhaps by making a deal). It’s not hard to imagine why AIs could be quick to cooperate with each other against humans, while not finding it so appealing to cooperate with humans. Agreements between AIs could be easier to verify and enforce; AIs might be willing to wipe out humans and radically reshape the world, while humans are very hard to make this sort of deal with; etc.
It doesn’t; in fact, I’ve said nothing about consciousness anywhere in this piece. I’ve used a very particular conception of an “aim” (discussed above) that I think could easily apply to an AI system that is not human-like at all and has no conscious experience.
Today’s game-playing AIs can make plans, accomplish goals, and even systematically mislead humans (e.g., in poker). Consciousness isn’t needed to do any of those things, or to radically reshape the world.
I think there’s a common confusion when discussing this topic, in which people think that the challenge of “AI alignment” is to build AI systems that are perfectly aligned with human values. This would be very hard, partly because we don’t even know what human values are!
When I talk about “AI alignment,” I am generally talking about a simpler (but still hard) challenge: simply building very powerful systems that don’t aim to bring down civilization.
If we could build powerful AI systems that just work on cures for cancer (or even, like, put two identical19 strawberries on a plate) without posing existential danger to humanity, I’d consider that success.
I’ve focused on trial-and-error training in this post because most modern AI development fits in this category, and because it makes the risk easier to reason about concretely.
“Trial-and-error training” encompasses a very wide range of AI development methods, and if we see transformative AI within the next 10-20 years, I think the odds are high that at least a big part of AI development will be in this category.
My overall sense is that other known AI development techniques pose broadly similar risks for broadly similar reasons, but I haven’t gone into detail on that here. It’s certainly possible that by the time we get transformative AI systems, there will be new AI methods that don’t pose the kinds of risks I talk about here. But I’m not counting on it.
If we assume that building these sorts of AI systems is possible, then I’m very skeptical that the whole world would voluntarily refrain from doing so indefinitely.
To quote from a more technical piece by Ajeya Cotra with similar arguments to this one:
Powerful ML models could have dramatically important humanitarian, economic, and military benefits. In everyday life, models that [appear helpful while ultimately being dangerous] can be extremely helpful, honest, and reliable. These models could also deliver incredible benefits before they become collectively powerful enough that they try to take over. They could help eliminate diseases, reduce carbon emissions, navigate nuclear disarmament, bring the whole world to a comfortable standard of living, and more. In this case, it could also be painfully clear to everyone that companies / countries who pulled ahead on this technology could gain a drastic competitive advantage, either economically or militarily. And as we get closer to transformative AI, applying AI systems to R&D (including AI R&D) would accelerate the pace of change and force every decision to happen under greater time pressure.
If we can achieve enough consensus around the risks, I could imagine substantial amounts of caution and delay in AI development. But I think we should assume that if people can build more powerful AI systems than the ones they already have, someone eventually will.
In general, this is not an area where it’s easy to get a handle on what “expert opinion” says. I previously wrote that there aren’t clear, institutionally recognized “experts” on the topic of when transformative AI systems might be developed. To an even greater extent, there aren’t clear, institutionally recognized “experts” on whether (and how) future advanced AI systems could be dangerous.
I previously cited one (informal) survey implying that opinion on this general topic is all over the place: “We have respondents who think there's a <5% chance that alignment issues will drastically reduce the goodness of the future; respondents who think there's a >95% chance; and just about everything in between.” (Link.)
This piece, and the more detailed piece it’s based on, are an attempt to make progress on this by talking about the risks we face under particular assumptions (rather than trying to reason about how big the risk is overall).
I don’t think the argument in this piece relies on lots of different specific claims being true.
If you start from the assumptions I give about powerful AI systems being developed by black-box trial-and-error, it seems likely (though not certain!) to me that (a) the AI systems in question would be able to defeat humanity; (b) the AI systems in question would have aims that are both ambitious and unintended. And that seems to be about what it takes.
Something I’m happy to concede is that there’s an awful lot going on in those assumptions!
As in more than 50/50. ↩
Or persuaded (in a “mind hacking” sense) or whatever. ↩
Specifically, I argue that the problem looks likely by default, rather than simply that it is possible. ↩
I think the earliest relatively detailed and influential discussions of the possibility that misaligned AI could lead to the defeat of humanity came from Eliezer Yudkowsky and Nick Bostrom, though my own encounters with these arguments were mostly via second- or third-hand discussions rather than particular essays.
My colleagues Ajeya Cotra and Joe Carlsmith have written pieces whose substance overlaps with this one (though with more emphasis on detail and less on layperson-compatible intuitions), and this piece owes a lot to what I’ve picked from that work.
I’ve also found Eliciting Latent Knowledge (Christiano, Xu and Cotra 2021; relatively technical) very helpful for my intuitions on this topic.
The alignment problem from a deep learning perspective (Ngo 2022) also has similar content to this piece, though I saw it after I had drafted most of this piece. ↩
E.g., Ajeya Cotra gives a 15% probability of transformative AI by 2030; eyeballing figure 1 from this chart on expert surveys implies a >10% chance by 2028. ↩
E.g., this work by Anthropic, an AI lab my wife co-founded and serves as President of. ↩
First, because this work is relatively early-stage and it’s hard to tell exactly how successful it will end up being. Second, because this work seems reasonably likely to end up helping us read an AI system’s “thoughts,” but less likely to end up helping us “rewrite” the thoughts. So it could be hugely useful in telling us whether we’re in danger or not, but if we are in danger, we could end up in a position like: “Well, these AI systems do have goals of their own, and we don’t know how to change that, and we can either deploy them and hope for the best, or hold off and worry that someone less cautious is going to do that.”
That said, the latter situation is a lot better than just not knowing, and it’s possible that we’ll end up with further gains still. ↩
That said, I think they usually don’t. I’d suggest usually interpreting such people as talking about the sorts of “aims” I discuss here. ↩
This isn’t literally how training an AI system would look - it’s more likely that we would e.g. train an AI model to imitate my judgments in general. But the big-picture dynamics are the same; more at this post. ↩
Ajeya Cotra explores topics like this in detail here; there is also some interesting discussion of simplicity vs. complexity under the “Strategy: penalize complexity” heading of Eliciting Latent Knowledge. ↩
This analogy has a lot of problems with it, though - AI developers have a lot of tools at their disposal that natural selection didn’t! ↩
Or I guess just “I” ¯\_(ツ)_/¯ ↩
With some additional caveats, e.g. the ambitious “aim” can’t be something like “an AI system aims to gain lots of power for itself, but considers the version of itself that will be running 10 minutes from now to be a completely different AI system and hence not to be ‘itself.’” ↩
This statement isn’t literally true.
These sorts of aims just don’t seem likely to emerge from the kind of AI development I’ve assumed in this piece - developing powerful systems to accomplish ambitious aims via trial-and-error. This isn’t a point I have defended as tightly as I could, and if I got a lot of pushback here I’d probably think and write more. (I’m also only arguing for what seems likely - we should have a lot of uncertainty here.) ↩
From Human Compatible by AI researcher Stuart Russell. ↩
Stylized story to illustrate one possible relevant dynamic:
These writeups generally stay away from an argument made by Eliezer Yudkowsky and others, which is that theorems about expected utility maximization provide evidence that sufficiently intelligent (compared to us) AI systems would necessarily be “maximizers” of some sort. I have the intuition that there is something important to this idea, but despite a lot of discussion (e.g., here, here, here and here), I still haven’t been convinced of any compactly expressible claim along these lines. ↩
“Identical at the cellular but not molecular level,” that is. … ¯\_(ツ)_/¯ ↩
See my most important century series, although that series doesn’t hugely focus on the question of whether “trial-and-error” methods could be good enough - part of the reason I make that assumption is due to the nearcasting frame. ↩
Your argument requires the assumption of “malign priors” - that is, a highly capable AI rates dangerous goal directed behaviour highly enough a priori to converge to this behaviour through training. This requirement is not invalidated by the presence of errors in the training data. This assumption has been defended, but I think its status remains speculative. If AI is too biased towards misaligned behaviour, then I would expect ordinary non deceptive goodharting to be an insurmountable problem. It’s not obvious to me that “sufficiently benign to avoid regular goodharting, but not enough to avoid deception” is where things are likely to settle by default.
It’s my view that not making this assumption explicit is an oversight.
I'd love to hear someone explain their disagreement (edit: thanks Daniel!)
I am one of the people who upvoted your comment but disagreement-downvoted it. I think you are unfairly attempting to shift the burden of proof here: "Your argument requires the assumption of “malign priors” - that is, a highly capable AI rates dangerous goal directed behaviour highly enough a priori to converge to this behaviour through training."It's more like, "One way this argument could be wrong is if the surprising hypothesis of "benign priors" was true - that is, powerful goal-directed behavior is extremely low-prior in the learning algorithm, such that the training process can't find this strategy/behavior/policy even though it would in fact lead to higher reward."Why would ordinary nondeceptive goodharting be an insurmountable problem?
So I think we agree that the assumption is required. I don't fully agree with your summary: it's not that it doesn't find the behaviour, it's that it doesn't prefer the behaviour, and the reward bonus isn't enough to shift its preference.
Here are two defences of the malign priors assumption:
If we assume that a powerful AI's behaviour can be described by some simplicity prior over objectives then deceptive behaviour is likely.
By an informal count, there are more deceptive goals than nondeceptive ones
The counting argument is really just another measure argument - deceptive goals outnumber nondeceptive ones by enough that "most" priors over goals will give them a lot more weight.
Now, you might think these arguments are really solid, but I think it's important to recognise their limitations. First: AIs learn behaviours, not goals. A "natural prior" over behaviours that appears to exhibit good behaviour at low levels of capability might look like a strange prior over goals. The observation that an advanced AI must act in ways that looks goal-directed doesn't contradict this - the fact that you sometimes look goal-directed does not imply that, once all things are considered, your goals don't end up looking very strange.
Secondly, the design of AIs is partly constrained by mathematical convenience, but within those constraints people are going to pick designs that seem to work well. Now, deception is not the same as "seeming to do well". Seeming to do well requires that similar but lesser models successfully carry out less complex tasks. The prior for the potentially deceptive model is chosen by iterating on the design of nondeceptive models. This is probably, from most points of view, a weird prior! It is not clear to me that the objective counting argument is relevant here - it might be, but it might not be.
Thirdly, the most impressive AI systems we have today do not operate according to reinforcement learning on a mathematically convenient prior. The prior employed by a reinforcement learning built on top of a large language model is not mathematically convenient; rather, it's some kind of approximation of the distribution of texts that people produce.
The point about nondeceptive goodharting: suppose we have some training environment and a signal suitable for training AI (for no particular reason, I am thinking about "self-driving cars" and "passenger star ratings"). Suppose we have an AI not good enough to be effectively deceptive. We can consider two classes of behaviour A: aligned behaviour that gets good reward, B obviously misaligned behaviour that gets good reward. My guess is that B≫A. We want our cars to go for good ratings while obeying a whole lot of side constraints - road rules, picking up passengers fairly, not cheating the system etc. If we have an AI where counting arguments are conclusive with regard to its eventual behaviour, I think we get a really bad taxi.
Now, maybe these can be dealt with by putting a lot more effort into the reward signal (penalising for road rule breaking, adding fares as well as star ratings, penalising attempts to cheat in every way you can imagine...). This would, at a minimum, entail a lot more effort than business-as-ususal reinforcement learning, and my guess is that if behviour counting arguments still apply then it flat out wouldn't work. That's what I mean by "insurmountable".
Alternatively, maybe we deal with these problems by picking a prior that promotes A compared to B. In fact, this seems to be a more realistic way of constructing a self-driving taxi that gets good passenger ratings -- first, make it a safe car, then adjust its behaviour (within limits!) to get better ratings from passengers.
Now, it's possible that even though we solve the B≫A problem with better priors, with higher capability the set C of objectives that yield deceptively misaligned behaviour outnumbers B by so much that the better priors still don't help. However, I think this is once again speculative and if if it's an assumption underpinning your argument you need to say so.
(Sure, in some sense we agree that the assumption is required, but I think that's a misleading way of putting it, but whatever)Thank you for the detailed and lengthy explanation! I agree with your first point probably, it seems to me to be similar to what the shard theory people are exploring and yes this is a promising line of research which may if we are lucky overturn the default hypothesis that misaligned-but-deceptive AIs are most likely. I say similar things about the second point I guess. Both points are just basically saying "we don't know what the prior is like" so sure but they aren't positive arguments that the prior will be benign. Not sure whether I agree with the third point but anyhow it also just seems to be a warning that we are ignorant about the prior, not an argument that the prior is benign.I don't think I understand your more detailed argument that begins with "the point about nondescriptive goodharting." I'm tired now so will go away but hopefully will return and try to think more deeply about it. I strongly encourage you to write up a post on it, with emphasis on clarity. I really hope you are right!
LW bug(?) report: All of the inline footnotes are pointing to the Cold Takes references, even though the backref links in the Notes section are pointing to the LW post (screenshot of what I see on hover). Same issue with the alignment forum post. I'm wondering if there's any way to fix that?
Thanks for pointing that out! The post was imported and unfortunately I don't think there's any easy or quick fix for this
Post summary (feel free to suggest edits!):The author argues that if today’s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures). This is because there is good economic reason to have AIs ‘aim’ at certain outcomes - eg. We might want an AI that can accomplish goals such as ‘get me a TV for a great price’. Current methods train AIs to do this via trial and error, but because we ourselves are often misinformed, we can sometimes negatively reinforce truthful behavior and positively reinforce deception that makes it look like things are going well. This can mean AIs learn an unintended aim, which if ambitious enough, is very dangerous. There are also intermediate goals like ‘don’t get turned off’ and ‘control the world’ that are useful for almost any ambitious aim.
Warning signs for this scenario are hard to observe, because of the deception involved. There will likely still be some warning signs, but in a situation with incentives to roll out powerful AI as fast as possible, responses are likely to be inadequate.
(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)
Are these summaries from ChatGPT?
Currently it's all manually, but the ChatGPT summaries are pretty decent, I'm looking into which types of posts it does well.