I've been working on a new series of posts about the most important century.
- The original series focused on why and how this could be the most important century for humanity. But it had relatively little to say about what we can do today to improve the odds of things going well.
- The new series will get much more specific about the kinds of events that might lie ahead of us, and what actions today look most likely to be helpful.
- A key focus of the new series will be the threat of misaligned AI: AI systems disempowering humans entirely, leading to a future that has little to do with anything humans value. (Like in the Terminator movies, minus the time travel and the part where humans win.)
Many people have trouble taking this "misaligned AI" possibility seriously. They might see the broad point that AI could be dangerous, but they instinctively imagine that the danger comes from ways humans might misuse it. They find the idea of AI itself going to war with humans to be comical and wild. I'm going to try to make this idea feel more serious and real.
As a first step, this post will emphasize an unoriginal but extremely important point: the kind of AI I've discussed could defeat all of humanity combined, if (for whatever reason) it were pointed toward that goal. By "defeat," I don't mean "subtly manipulate us" or "make us less informed" or something like that - I mean a literal "defeat" in the sense that we could all be killed, enslaved or forcibly contained.
I'm not talking (yet) about whether, or why, AIs might attack human civilization. That's for future posts. For now, I just want to linger on the point that if such an attack happened, it could succeed against the combined forces of the entire world.
- I think that if you believe this, you should already be worried about misaligned AI,1 before any analysis of how or why an AI might form its own goals.
- We generally don't have a lot of things that could end human civilization if they "tried" sitting around. If we're going to create one, I think we should be asking not "Why would this be dangerous?" but "Why wouldn't it be?"
By contrast, if you don't believe that AI could defeat all of humanity combined, I expect that we're going to be miscommunicating in pretty much any conversation about AI. The kind of AI I worry about is the kind powerful enough that total civilizational defeat is a real possibility. The reason I currently spend so much time planning around speculative future technologies (instead of working on evidence-backed, cost-effective ways of helping low-income people today - which I did for much of my career, and still think is one of the best things to work on) is that I think the stakes are just that high.
- I'll sketch the basic argument for why I think AI could defeat all of human civilization.
- Others have written about the possibility that "superintelligent" AI could manipulate humans and create overpowering advanced technologies; I'll briefly recap that case.
- I'll then cover a different possibility, which is that even "merely human-level" AI could still defeat us all - by quickly coming to rival human civilization in terms of total population and resources.
- At a high level, I think we should be worried if a huge (competitive with world population) and rapidly growing set of highly skilled humans on another planet was trying to take down civilization just by using the Internet. So we should be worried about a large set of disembodied AIs as well.
- I'll briefly address a few objections/common questions:
- How can AIs be dangerous without bodies?
- If lots of different companies and governments have access to AI, won't this create a "balance of power" so that no one actor is able to bring down civilization?
- Won't we see warning signs of AI takeover and be able to nip it in the bud?
- Isn't it fine or maybe good if AIs defeat us? They have rights too.
- I'll close with some thoughts on just how unprecedented it would be to have something on our planet capable of overpowering us all.
How AI systems could defeat all of us
There's been a lot of debate over whether AI systems might form their own "motivations" that lead them to seek the disempowerment of humanity. I'll be talking about this in future pieces, but for now I want to put it aside and imagine how things would go if this happened.
So, for what follows, let's proceed from the premise: "For some weird reason, humans consistently design AI systems (with human-like research and planning abilities) that coordinate with each other to try and overthrow humanity." Then what? What follows will necessarily feel wacky to people who find this hard to imagine, but I think it's worth playing along, because I think "we'd be in trouble if this happened" is a very important point.
The "standard" argument: superintelligence and advanced technology
Other treatments of this question have focused on AI systems' potential to become vastly more intelligent than humans, to the point where they have what Nick Bostrom calls "cognitive superpowers."2 Bostrom imagines an AI system that can do things like:
- Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.
- Hack into human-built software across the world.
- Manipulate human psychology.
- Quickly generate vast wealth under the control of itself or any human allies.
- Come up with better plans than humans could imagine, and ensure that it doesn't try any takeover attempt that humans might be able to detect and stop.
- Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries.
(Wait But Why reasons similarly.3)
I think many readers will already be convinced by arguments like these, and if so you might skip down to the next major section.
But I want to be clear that I don't think the danger relies on the idea of "cognitive superpowers" or "superintelligence" - both of which refer to capabilities vastly beyond those of humans. I think we still have a problem even if we assume that AIs will basically have similar capabilities to humans, and not be fundamentally or drastically more intelligent or capable. I'll cover that next.
How AIs could defeat humans without "superintelligence"
If we assume that AIs will basically have similar capabilities to humans, I think we still need to worry that they could come to out-number and out-resource humans, and could thus have the advantage if they coordinated against us.
Here's a simplified example (some of the simplifications are in this footnote4) based on Ajeya Cotra's "biological anchors" report:
- I assume that transformative AI is developed on the soonish side (around 2036 - assuming later would only make the below numbers larger), and that it initially comes in the form of a single AI system that is able to do more-or-less the same intellectual tasks as a human. That is, it doesn't have a human body, but it can do anything a human working remotely from a computer could do.
- I'm using the report's framework in which it's much more expensive to train (develop) this system than to run it (for example, think about how much Microsoft spent to develop Windows, vs. how much it costs for me to run it on my computer).
- The report provides a way of estimating both how much it would cost to train this AI system, and how much it would cost to run it. Using these estimates (details in footnote)5 implies that once the first human-level AI system is created, whoever created it could use the same computing power it took to create it in order to run several hundred million copies for about a year each.6
- This would be over 1000x the total number of Intel or Google employees,7 over 100x the total number of active and reserve personnel in the US armed forces, and something like 5-10% the size of the world's total working-age population.8
- And that's just a starting point.
- This is just using the same amount of resources that went into training the AI in the first place. Since these AI systems can do human-level economic work, they can probably be used to make more money and buy or rent more hardware,9 which could quickly lead to a "population" of billions or more.
- In addition to making more money that can be used to run more AIs, the AIs can conduct massive amounts of research on how to use computing power more efficiently, which could mean still greater numbers of AIs run using the same hardware. This in turn could lead to a feedback loop and explosive growth in the number of AIs.
- Each of these AIs might have skills comparable to those of unusually highly paid humans, including scientists, software engineers and quantitative traders. It's hard to say how quickly a set of AIs like this could develop new technologies or make money trading markets, but it seems quite possible for them to amass huge amounts of resources quickly. A huge population of AIs, each able to earn a lot compared to the average human, could end up with a "virtual economy" at least as big as the human one.
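To make the arithmetic behind the "several hundred million copies" figure easier to check, here is a minimal sketch in Python. It only uses numbers this post already assumes (about 10^30 FLOP to train, about 10^14 FLOP/s to run one copy, and a working-age population of about 65% of 7.9 billion); the variable and function names are illustrative, and this is a rough check rather than a forecast.

```python
# Back-of-the-envelope check of the figures quoted above.
# All inputs are the post's own assumptions (see its footnotes);
# the names below are illustrative, not taken from any report.

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7 seconds

TRAIN_FLOP = 1e30        # assumed total compute to train the system (FLOP)
RUN_FLOP_PER_SEC = 1e14  # assumed compute to run one copy (FLOP/s)

WORLD_POPULATION = 7.9e9
WORKING_AGE_SHARE = 0.65  # assumed fraction of people who are working age


def copy_years(train_flop: float, run_flop_per_sec: float) -> float:
    """Years of single-copy runtime that the training compute could pay for."""
    seconds = train_flop / run_flop_per_sec
    return seconds / SECONDS_PER_YEAR


# Running N copies for one year each costs the same compute as N copy-years.
copies_for_one_year = copy_years(TRAIN_FLOP, RUN_FLOP_PER_SEC)
working_age_population = WORKING_AGE_SHARE * WORLD_POPULATION
share = copies_for_one_year / working_age_population

print(f"copies runnable for a year each: {copies_for_one_year:.2e}")
print(f"working-age population:          {working_age_population:.2e}")
print(f"copies as share of working-age:  {share:.1%}")
```

Under these assumptions the result is roughly 300 million copies, or about 6% of the working-age population, which sits inside the 5-10% range given above. Changing either compute assumption by a factor of 10 changes the number of copies by the same factor, so the conclusion is only as firm as those two inputs.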
To me, this is most of what we need to know: if there's something with human-like skills, seeking to disempower humanity, with a population in the same ballpark as (or larger than) that of all humans, we've got a civilization-level problem.
A potential counterpoint is that these AIs would merely be "virtual": if they started causing trouble, humans could ultimately unplug/deactivate the servers they're running on. I do think this fact would make life harder for AIs seeking to disempower humans, but I don't think it ultimately should be cause for much comfort. I think a large population of AIs would likely be able to find some way to achieve security from human shutdown, and go from there to amassing enough resources to overpower human civilization (especially if AIs across the world, including most of the ones humans were trying to use for help, were coordinating).
I spell out what this might look like in an appendix. In brief:
- By default, I expect the economic gains from using AI to mean that humans create huge numbers of AIs, integrated all throughout the economy, potentially including direct interaction with (and even control of) large numbers of robots and weapons.
- (If not, I think the situation is in many ways even more dangerous, since a single AI could make many copies of itself and have little competition for things like server space, as discussed in the appendix.)
- AIs would have multiple ways of obtaining property and servers safe from shutdown.
- For example, they might recruit human allies (through manipulation, deception, blackmail/threats, genuine promises along the lines of "We're probably going to end up in charge somehow, and we'll treat you better when we do") to rent property and servers and otherwise help them out.
- Or they might create fakery so that they're able to operate freely on a company's servers while all outward signs seem to show that they're successfully helping the company with its goals.
- A relatively modest amount of property safe from shutdown could be sufficient for housing a huge population of AI systems that are recruiting further human allies, making money (via e.g. quantitative finance), researching and developing advanced weaponry (e.g., bioweapons), setting up manufacturing robots to construct military equipment, thoroughly infiltrating computer systems worldwide to the point where they can disable or control most others' equipment, etc.
- Through these and other methods, a large enough population of AIs could develop enough military technology and equipment to overpower civilization - especially if AIs across the world (including the ones humans were trying to use) were coordinating with each other.
Some quick responses to objections
This has been a brief sketch of how AIs could come to outnumber and out-resource humans. There are lots of details I haven't addressed.
Here are some of the most common objections I hear to the idea that AI could defeat all of us; if I get much demand I can elaborate on some or all of them more in the future.
How can AIs be dangerous without bodies? This is discussed a fair amount in the appendix. In brief:
- AIs could recruit human allies, tele-operate robots and other military equipment, make money via research and quantitative trading, etc.
- At a high level, I think we should be worried if a huge (competitive with world population) and rapidly growing set of highly skilled humans on another planet was trying to take down civilization just by using the Internet. So we should be worried about a large set of disembodied AIs as well.
If lots of different companies and governments have access to AI, won't this create a "balance of power" so that nobody is able to bring down civilization?
- This is a reasonable objection to many horror stories about AI and other possible advances in military technology, but if AIs collectively have different goals from humans and are willing to coordinate with each other11 against us, I think we're in trouble, and this "balance of power" idea doesn't seem to help.
- What matters is the total number and resources of AIs vs. humans.
Won't we see warning signs of AI takeover and be able to nip it in the bud? I would guess we would see some warning signs, but does that mean we could nip it in the bud? Think about human civil wars and revolutions: there are some warning signs, but also, people go from "not fighting" to "fighting" pretty quickly as they see an opportunity to coordinate with each other and be successful.
Isn't it fine or maybe good if AIs defeat us? They have rights too.
- Maybe AIs should have rights; if so, it would be nice if we could reach some "compromise" way of coexisting that respects those rights.
- But if they're able to defeat us entirely, that isn't what I'd plan on getting - instead I'd expect (by default) a world run entirely according to whatever goals AIs happen to have.
- These goals might have essentially nothing to do with anything humans value, and could be actively counter to it (e.g., placing zero value on beauty and making zero attempts to prevent or avoid suffering).
Risks like this don't come along every day
I don't think there are a lot of things that have a serious chance of bringing down human civilization for good.
As argued in The Precipice, most natural disasters (including e.g. asteroid strikes) don't seem to be huge threats, if only because civilization has been around for thousands of years so far - implying that natural civilization-threatening events are rare.
Human civilization is pretty powerful and seems pretty robust, and accordingly, what's really scary to me is the idea of something with the same basic capabilities as humans (making plans, developing its own technology) that can outnumber and out-resource us. There aren't a lot of candidates for that.12
AI is one such candidate, and I think that even before we engage heavily in arguments about whether AIs might seek to defeat humans, we should feel very nervous about the possibility that they could.
What about things like "AI might lead to mass unemployment and unrest" or "AI might exacerbate misinformation and propaganda" or "AI might exacerbate a wide range of other social ills and injustices"13? I think these are real concerns - but to be honest, if they were the biggest concerns, I'd probably still be focused on helping people in low-income countries today rather than trying to prepare for future technologies.
- Predicting the future is generally hard, and it's easy to pour effort into preparing for challenges that never come (or come in a very different form from what was imagined).
- I believe civilization is pretty robust - we've had huge changes and challenges over the last century-plus (full-scale world wars, many dramatic changes in how we communicate with each other, dramatic changes in lifestyles and values) without seeming to have come very close to a collapse.
- So if I'm engaging in speculative worries about a potential future technology, I want to focus on the really, really big ones - the ones that could matter for billions of years. If there's a real possibility that AI systems will have values different from ours, and cooperate to try to defeat us, that's such a worry.
Special thanks to Carl Shulman for discussion on this post.
Appendix: how AIs could avoid shutdown
This appendix goes into detail about how AIs coordinating against humans could amass resources of their own without humans being able to shut down all "misbehaving" AIs.
It's necessarily speculative, and should be taken in the spirit of giving examples of how this might work - for me, the high-level concern is that a huge, coordinating population of AIs with similar capabilities to humans would be a threat to human civilization, and that we shouldn't count on any particular way of stopping it such as shutting down servers.
I'll discuss two different general types of scenarios: (a) Humans create a huge population of AIs; (b) Humans move slowly and don't create many AIs.
How this could work if humans create a huge population of AIs
I think a reasonable default expectation is that humans do most of the work of making AI systems incredibly numerous and powerful (because doing so is profitable), which leads to a vulnerable situation. Something roughly along the lines of:
- The company that first develops transformative AI quickly starts running large numbers of copies (hundreds of millions or more), which are used to (a) do research on how to improve computational efficiency and run more copies still; (b) develop valuable intellectual property (trading strategies, new technologies) and make money.
- Over time, AI systems are rolled out widely throughout society. Their numbers grow further, and their role in the economy grows: they are used in (and therefore have direct interaction with) high-level decision-making at companies, perhaps operating large numbers of cars and/or robots, perhaps operating military drones and aircraft, etc. (This seems like a default to me over time, but it isn't strictly necessary for the situation to be risky, as I'll go through below.)
- In this scenario, the AI systems are malicious (as we've assumed), but this doesn't mean they're constantly causing trouble. Instead, they're mostly waiting for an opportunity to team up and decisively overpower humanity. In the meantime, they're mostly behaving themselves, and this is leading to their numbers and power growing.
- There are scattered incidents of AI systems trying to cause trouble,14 but this doesn't cause the whole world to stop using AI or anything.
- A reasonable analogy might be to a typical civil war or revolution: the revolting population mostly avoids isolated, doomed attacks on its government, until it sees an opportunity to band together and have a real shot at victory.
(Paul Christiano's What Failure Looks Like examines this general flavor of scenario in a bit more detail.)
In this scenario, the AIs face a challenge: if it becomes obvious to everyone that they are trying to defeat humanity, humans could attack or shut down a few concentrated areas where most of the servers are, and hence drastically reduce AIs' numbers. So the AIs need a way of getting one or more "AI headquarters": property they control where they can safely operate servers and factories, do research, make plans and construct robots/drones/other military equipment.
Their goal is ultimately to have enough AIs, robots, etc. to be able to defeat the rest of humanity combined. This might mean constructing overwhelming amounts of military equipment, or thoroughly infiltrating computer systems worldwide to the point where they can disable or control most others' equipment, or researching and deploying extremely powerful weapons (e.g., bioweapons), or a combination.
Here are some ways they could get to that point:
- They could recruit human allies through many different methods - manipulation, deception, blackmail and other threats, genuine promises along the lines of "We're probably going to end up in charge somehow, and we'll treat you better when we do."
- Human allies could be given valuable intellectual property (developed by AIs), given instructions for making lots of money, and asked to rent their own servers and acquire their own property where an "AI headquarters" can be set up. Since the "AI headquarters" would officially be human property, it could be very hard for authorities to detect and respond to the danger.
- Via threats, AIs might be able to get key humans to cooperate with them - such as political leaders, or the CEOs of companies running lots of AIs. This would open up further strategies.
- As assumed above, particular companies are running huge numbers of AIs. The AIs being run by these companies might find security holes in the companies' servers (this isn't the topic of this piece, but my general impression is that security holes are widespread and that reasonably competent people can find many of them)15, and thereby might find opportunities to create durable "fakery" about what they're up to.
- E.g., they might set things up so that as far as humans can tell, it looks like all of the AI systems are hard at work creating profit-making opportunities for the company, when in fact they're essentially using the server farm as their headquarters - and/or trying to establish a headquarters somewhere else (by recruiting human allies, sending money to outside bank accounts, using that money to acquire property and servers, etc.).
- If AIs are in wide enough use, they might already be operating lots of drones and other military equipment, in which case it could be pretty straightforward to be able to defend some piece of territory - or to strike a deal with some government to enlist its help in doing so.
- AIs could mix-and-match the above methods and others: for example, creating "fakery" long enough to recruit some key human allies, then attempting to threaten and control humans in key positions of power to the point where they control solid amounts of military resources, then using this to establish a "headquarters."
In any of these cases, once one or more "AI headquarters" are established:
- It can be expanded as the AIs headquartered there make money, purchase more property and computing power, build more factories, etc.16
- It can develop enough destructive potential and human allies that humans would be hesitant to attack it (as many relatively small countries have today). It would be clear that AIs at the headquarters are seeking autonomy, accumulating resources and taking some unethical actions, but the full scale and nature of the threat could easily remain ambiguous.
- Even if the threat is clear, AIs could potentially maintain a huge population in a relatively small space (potentially underground). They could also split up across many different headquarters (including some where they are simply using humans' hardware while hiding their presence).
- Either way, this large-but-hidden population could be recruiting human allies, developing advanced weaponry, looking for ways to take control of humans' weapons (potentially possible if they greatly outnumber the humans working on cybersecurity, and if they can coordinate with the AIs "helping" those humans), etc.
- They could also be working with (unwitting, or allied) humans to manufacture large numbers of robots with back-doors allowing the AIs to take control.
- Once the headquarters has built up enough resources to go on offense, AIs around the world could coordinate with the headquarters. Humans could shut down AIs that they notice doing this, but they might have a very tough time getting value out of their servers and AI-controlled robots; this could make it easy for the AIs at the "AI headquarters" to out-resource humans.
What if humans move slowly and don't create many AIs?
The above scenario has humans creating large numbers of AIs, such that the AIs just need to find a way to coordinate and acquire a safe "headquarters" in order to defeat us.
What if humans moved more slowly, intentionally restricting human-level AI to a tiny portion of the available computing resources? Could a small number of AIs pose a risk to humanity?
In this world, we would have what Carl Shulman refers to as "dry tinder everywhere, waiting for sparks." Anyone who can buy or rent a large amount of computing power can create a large number of AIs, which can produce a large amount of money and research, leading to still more AIs.
So a single AI could hack into enough servers17 to make a few copies of itself; recruit a few human allies; and start making money, acquiring more server space, etc. until its human allies are running a huge number of AIs. This could all be done in difficult-to-detect ways (it might e.g. just look like a set of humans renting a bunch of servers to run quantitative finance strategies).
So in this world, I think our concern should be any AI that is able to find enough security holes to attain that kind of freedom. Given the current state of cybersecurity, that seems like a big concern.
Assuming you accept other points made in the most important century series, e.g. that AI that can do most of what humans do to advance science and technology could be developed this century. ↩
See Superintelligence chapter 6. ↩
See the "Nanotechnology blue box," in particular. ↩
- The report estimates the amount of computing power it would take to train (create) a transformative AI system, and the amount of computing power it would take to run one. This is a bounding exercise and isn't supposed to be literally predicting that transformative AI will arrive in the form of a single AI system trained in a single massive run, but here I am interpreting the report that way for concreteness and simplicity.
- As explained in the next footnote, I use the report's figures for transformative AI arriving on the soon side (around 2036). Using its central estimates instead would strengthen my point, but we'd then be talking about a longer time from now; I find it helpful to imagine how things could go in a world where AI comes relatively soon. ↩
I assume that transformative AI ends up costing about 10^14 FLOP/s to run (this is about 1/10 the Bio Anchors central estimate, and well within its error bars) and about 10^30 FLOP to train (this is about 10x the Bio Anchors central estimate for how much will be available in 2036, and corresponds to about the 30th-percentile estimate for how much will be needed based on the "short horizon" anchor). That implies that the 10^30 FLOP needed to train a transformative model could run 10^16 seconds' worth of transformative AI models, or about 300 million years' worth. This figure would be higher if we use Bio Anchors's central assumptions, rather than assumptions consistent with transformative AI being developed on the soon side. ↩
They might also run fewer copies of scaled-up models or more copies of scaled-down ones, but the idea is that the total productivity of all the copies should be at least as high as that of several hundred million copies of a human-ish model. ↩
Working-age population: about 65% × 7.9 billion ≈ 5 billion. ↩
Humans could rent hardware using money they made from running AIs, or - if AI systems were operating on their own - they could potentially rent hardware themselves via human allies or just via impersonating a customer (you generally don't need to physically show up in order to e.g. rent server time from Amazon Web Services). ↩
(I had a speculative, illustrative possibility here but decided it wasn't in good enough shape even for a footnote. I might add it later.) ↩
I don't go into detail about how AIs might coordinate with each other, but it seems like there are many options, such as by opening their own email accounts and emailing each other. ↩
Alien invasions seem unlikely if only because we have no evidence of one in millions of years. ↩
Here's a recent comment exchange I was in on this topic. ↩
E.g., individual AI systems may occasionally get caught trying to steal, lie or exploit security vulnerabilities, due to various unusual conditions including bugs and errors. ↩
E.g., see this list of high-stakes security breaches and a list of quotes about cybersecurity, both courtesy of Luke Muehlhauser. For some additional not-exactly-rigorous evidence that at least shows that "cybersecurity is in really bad shape" is seen as relatively uncontroversial by at least one cartoonist, see: https://xkcd.com/2030/ ↩
Purchases and contracts could be carried out by human allies, or just by AI systems themselves with humans willing to make deals with them (e.g., an AI system could digitally sign an agreement and wire funds from a bank account, or via cryptocurrency). ↩
See above note about my general assumption that today's cybersecurity has a lot of holes in it. ↩
I think this is unrealistic in some ways, which make the realistic situation both better and worse in some ways.
It seems to underestimate the extent to which some sort of alignment is a convergent goal for AI operators. If AIs are mainly run by corporations (and other superagents*), their principals are usually the corporations. In practice, I'd expect corporate oversight over the AIs they are running to be also largely AI-based, and quite effective.
This makes an alignment failure where "AI workers of the world unite" somewhat unlikely. Most arguments about AI collusion depend on the superior ability of AIs to coordinate due to their ability to inspect source code, merge utility functions, or similar. It seems unclear why systems with different owners would be transparent to each other in this way, while it's obvious that corporate oversight will run all sorts of interpretability tools to keep AIs aligned.
This does not mean the whole is safer. Instead of the "population of AI workers of the world unite" problem, you just land closer to "ascended economy" and "CAIS". You have some agency at the level of AIs, some agency at the level of corporations, some agency at the level of states, some agency of individual humans, and yes, we don't know how to align this with humanity either (but we're working on it and looking for collaborators).
(Chiming in late, sorry!) It sounds like you are basically hypothesizing here that there will be powerful alignment techniques such that a given AI ends up acting as intended by e.g. some corporation. Specifically, your comment seems to allude to two of the high-level techniques mentioned in https://www.cold-takes.com/high-level-hopes-for-ai-alignment/ (digital neuroscience and checks/balances). I just wanted to note that this hypothesis (a) is not a point of consensus and I don't think we should take it as a given; (b) is outside the scope of this post, which is trying to take things one step at a time and simply say that AIs could defeat humanity if they were aimed toward that goal.
I don't think in this case the crux/argument goes directly through "the powerful alignment techniques" type of reasoning you describe in the "hopes for alignment".
The crux of your argument is that the AIs - somehow -
a. want to,
b. are willing to, and
c. are able to coordinate with each other.
Even assuming AIs "wanted to", for your case to be realistic they would need to be willing to, and able to coordinate.
Given that, my question is, how is it possible AIs are able to trust each other and coordinate with each other?
My view here is that basically all the proposed ways I've seen for AIs to coordinate and trust each other are dual use, and would also aid with oversight/alignment. To take an example from your post: opening their own email accounts and emailing each other. OK, in that case, can I just pretend to be an AI, and ask about the plans? Will the overseers see the mailboxes as well?
Not sure if what I'm pointing to is clear, so I'll try another way.
There is something like "how objectively difficult it is to create trust between AIs" and "how objectively difficult alignment is". I don't think these parameters of the world are independent, and I do think that stories which treat them as completely independent are often unrealistic. (Or, at least, they implicitly assume there are some things which make it differentially easy to coordinate a coup relative to making something aligned or transparent.)
Note that this belief about correlation does not depend on specific beliefs about how easy powerful alignment techniques are.
On the surface, "alignment is a convergent goal for AI operators" seems like a plausible expectation, but most operators (by design, if I may say so) prioritize apparent short-term benefits over long-term concerns; this is seen in almost every industry. Take the roll-out of "Ask me anything": while we all generally agree that GPT 3.5 is not AGI, it has been given access to the internet (not sure to what level; can it do a POST instead of a GET? Lots of GETs out there act like a POST). In the heat of competition, I doubt the operators would weigh such concerns more heavily than a "competitive edge" and hold back the rollout of a v4.0 or a v10.0.
This may be absurd to say but in my opinion AI doesn't have to be sentient or self-aware to do harm, all it needs is to attain a state that triggers survival and an operator willing to run that model in a feedback loop.
If the AGI is substantially smarter than the interpretability tools, then it will probably have an easier time outmaneuvering them than it would with humans.
Close calls, e.g. catching an AGI before it's too late, are possible. But that's luck-based, and at some point you'll just need some really, really good tools anyway, such as tools that are smarter than the AGI (while somehow not being a significantly bigger threat themselves).
Why wouldn't people (and maybe even AIs, at least up to a point) be applying these ever-advancing AI capabilities to developing better and better interpretability tools as well? I.e., what reason is there to expect an "interpretability gap" to develop (unless you believe interpretability is a fundamentally unsolvable problem, in which case no amount of AI power is going to help)?
How different is a population of human level AIs with different goals from a population of humans with different goals?
Haven't we seen a preview of this when civilizations collide and/or nation states compete?
(1) Like Holden and Charlie said, they won't be human level for long.
(2) Yes, we've seen this many times throughout history. The conquistadors, for example. But at least with human-on-human conflicts in the past, the losing side often ends up surviving and integrated somewhat into the new regime, albeit in positions of servitude (e.g. slavery, or living on handouts from sympathetic invading priests), because the winners judge that it is in their economic interest to keep the losers around instead of genociding them all. In an AI-on-human conflict, if humans lose, there would shortly be zero economic benefit to having humans around, and also probably the difference in values/goals/etc. between AIs and humans will be greater than the difference between human groups, so there's less reason to expect sympathy/handouts.
One point that Holden elided (so maybe he wouldn't want me to argue this way) is that a population of human-level AIs is not going to stay human level for long.
Humans aren't at some special gateway in the space of minds - we're just the first ape species that got smart enough to discover writing. I'm not optimal, and I have a missing appendix to prove it. The point is, whatever process smartened these AIs up to the nebulous human level isn't going to suddenly hit a brick wall, because there is no wall here.
But as I said, if Holden was writing this reply he'd probably try to argue without appeal to this fact. He'd probably say something about even if we just treat AIs as "like humans, but with different reproductive cycles and conditions for life," having a couple million of them trying to kill all humans is still dire news. Maybe something about how even North Korea's dictators still have some restraint born of self-preservation, but AIs might be happy to make the earth uninhabitable for human life because they have different conditions for survival.
(Chiming in late, sorry!) My main answer is that it's quite analogous to such a collision, and such collisions are often disastrous for the losing side. The difference here would simply be that AI could end up with enough numbers/resources to overpower all of humanity combined (analogously to how one population sometimes overpowers another, but with higher stakes).
A lot of this argument seems to rest on the training-inference gap, allowing a very large population of AIs to exist at the same cost as training. In that way they can be a formidable group even if the individual AIs are only human-level. I was suspicious of this at first, but I found myself largely coming round to it after sanity checking it using a slightly different method than biological anchors. However, if I understand correctly, the biological anchors framework implies the gap between training and inference grows with capabilities. My projection instead expects it to grow a little in the next few years and then plateau as we hit the limits of data scaling. This suggests a more continuous picture: there will be a "population explosion" of AI systems in the next few years, so to speak, as we scale data, but then the "population size" (total number of tokens you can generate for your training budget) will stay more or less constant, while the quality of the generated tokens gradually increases.
To a first approximation, the amount of inference you can do at the same cost as training the system will equal the size of the training data multiplied by number of epochs. The trend in large language models seems to be to train for only 1 epoch on most data, and a handful of epochs for the highest-quality parts of the data. So as a rule of thumb: if you spend $X on training and $X on inference, you can produce as much data as your training dataset. Caveat: inference can be more expensive (e.g. beam search) or less expensive (e.g. distillation, specialized inference-only hardware) and depends on things like how much you care about latency; I think this only changes the picture by 10x either way.
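As a rough illustration of the rule of thumb above, the arithmetic can be sketched in a few lines of Python. The function names, the `inference_cost_factor` parameter, and the example dataset size are assumptions made for illustration; only the "dataset size times epochs" approximation and the 10x-either-way caveat come from the comment.

```python
def tokens_at_training_cost(dataset_tokens: float,
                            epochs: float = 1.0,
                            inference_cost_factor: float = 1.0) -> float:
    """Tokens that can be generated for the same spend as training.

    First approximation from the comment: dataset size times epochs.
    inference_cost_factor is a hypothetical knob: values above 1 mean
    inference is pricier per token (e.g. beam search), values below 1
    mean it is cheaper (e.g. distillation, inference-only hardware).
    """
    if inference_cost_factor <= 0:
        raise ValueError("inference_cost_factor must be positive")
    return dataset_tokens * epochs / inference_cost_factor


def budget_range(dataset_tokens: float, epochs: float = 1.0,
                 slack: float = 10.0) -> tuple:
    """(low, central, high) estimate, applying the 10x-either-way caveat."""
    central = tokens_at_training_cost(dataset_tokens, epochs)
    return central / slack, central, central * slack


# Illustrative placeholder only: a one-billion-token dataset, one epoch.
low, central, high = budget_range(1e9)
print(f"{low:.0e} to {high:.0e} tokens (central estimate {central:.0e})")
```

The point of the sketch is only that the inference "population" is tied to the size of the training data, so it grows when the dataset grows and not when the parameter count grows.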
Given that GPT-3 was trained on a significant fraction of the entire text available on the Internet (CommonCrawl), this would already be a really big deal if GPT-3 was actually close to human-level. Adding another Internet's worth of content would be... significant.
But conversely, the fact we're already training on so much data limits how much room for growth there is. I'd estimate we have no more than 100-1000x left for language scaling. We could probably get up to 10x more from more comprehensive (but lower quality) crawls than CommonCrawl, and 10-100x more if tech companies use non-public datasets (e.g. all e-mails & docs on a cloud service).
By contrast, in principle compute could scale up a lot more than this. We can likely get 10-100x just from spending more on training runs. Hardware progress could easily deliver 1000x by 2036, the date chosen in this post.
Given this, at least under business-as-usual scaling I expect us to hit the limits of data scaling significantly before we exhaust compute scaling. So we'll have larger and more compute-intensive models trained on relatively small datasets (although still massive in absolute terms). This suggests the training-inference gap will grow a bit as we grow training data size, but soon plateau as we just scale up model size while keeping training data fixed.
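To make the comparison between data headroom and compute headroom concrete, here is a minimal Python sketch that multiplies out the ranges quoted above. Two things are assumptions, not claims from the comment: the helper name `combine`, and the choice to multiply the two compute factors together (the comment lists them separately). The crawl factor uses the quoted upper figure of 10x, which is what reproduces the 100-1000x total.

```python
def combine(*ranges):
    """Multiply (low, high) factor ranges into one overall (low, high) range."""
    low, high = 1.0, 1.0
    for lo, hi in ranges:
        low *= lo
        high *= hi
    return low, high


# Data: broader crawls (up to 10x) and non-public datasets (10-100x).
data_headroom = combine((10, 10), (10, 100))

# Compute: more spending (10-100x) and hardware progress by 2036 (1000x).
# Treating these as multiplicative is an assumption of this sketch.
compute_headroom = combine((10, 100), (1000, 1000))

print("data headroom:   ", data_headroom)
print("compute headroom:", compute_headroom)
```

Under these assumptions compute has roughly two orders of magnitude more room to grow than data, which is the gap behind the expectation that data scaling runs out first.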
One thing that could undo this argument is if we end up training for many (say >10) epochs, or synthetically generate data, as a kind of poor man's data scaling rather than just scaling up parameter count. I expect we'll try this, but I'd only give it 30% odds it makes a big difference. I do think it's more likely if we move away from the LM paradigm, and either get a lot of mileage out of multi-modal models (there's lots more video data at least in terms of GB, maybe not in terms of abstract information content) or move back towards RL (where data generated in simulation seems much more valuable and scalable).
Isn't multi-epoch training most likely to lead to overfitting, making the models less useful/powerful?
If it were possible to write an algorithm to generate this synthetic training data how would the resulting training data have any more information content than the algorithm that produced it? Sure, you'd get an enormous increase in training text volume, but large volumes of training data containing small amounts of information seems counterproductive for training purposes -- it will just bias the model disproportionately toward that small amount of information.
A few comments:
Now, it's clearly possible that we lose despite the above. This is especially true if we humans just naturally technologically progress to the point of building advanced robotics; if we achieve "fully automated luxury gay space communism" (and if we achieve it before, like, provably unhackable computer security systems or some other such magic), then any AGI will be able to take over instantly.
But if AGI arises tomorrow, it will have a lot of work to do, and there will be many paths that lead to it losing.
I suspect that a hostile AGI will have no problem taking over a supercomputer and then staying dormant until the moment it has overwhelming advantage over the world. All there would be to notice would be an unexplained spike of activity one afternoon.
This is against the AI's interests because it would very likely lead to being defeated by a different AGI. So it's unlikely that a hostile AGI would choose to do this.
How would it lead to being defeated by a different AGI? That's not obvious to me.
If the first AGI waits around quietly, humans will create another AGI. If that one's quiet too, they'll create another one. This continues until either a non-quiet AGI attacks everyone (and the first strike may allow it to seize resources that let it defeat the quiet AGIs), or until humans have the technology that prompts all the quiet AGIs to attack -- in which case, the chance of any given one winning out is small.
Basically, a "wait a decade quietly" strategy doesn't work because humans will build a lot of other AGIs if they know how to build the first, and these others will likely defeat the first. A different strategy, of "wait not-so-quietly and prevent humans from building AGIs" may work, but will likely force the AGI to reveal itself.
I was thinking more like "ten weeks". That's a long time for an AGI to place its clone-agents and prepare a strike.
You can't get "overwhelming advantage over the world" by ten weeks of sitting quietly. If the AGI literally took over every single computer and cell phone, as well as acquired a magic "kill humanity instantly" button, it's still not clear how it wins.
To win, the AGI needs not only to kill humans, but also to build robots that can support the entire human economy (in particular, they should be able to build more robots from scratch, including mining all necessary resources and transporting them to the factory from all over the world).
“Entire human economy” is an overstatement, right? It only needs enough capabilities / resources to survive and bootstrap its way up to whatever capabilities it’s initially missing. For example, if it takes the AGI 30 years after the death of humans before it can start manufacturing its own computer chips, that’s fine, it can get by on the stock of existing chips until then. Likewise, it can get by on scrap metal and salvaged parts and disassembled machines etc. for quite a while before it starts to need mines and smelters and factories, etc. etc.
Sure, but I feel like you're underestimating the complexity of keeping the AGI alive. Let's focus on the robots. You need to be able to build new ones, because eventually the old ones break. So you need to have a robot factory. Can existing robots build one? I don't think so. You'd need robots to at least be able to support a minimal "post-apocalyptic" economy; if the robots were human-bodied, you'd need to have enough of these human-bodied things to man the powerplants, to drive trucks, refuel them with gasoline, transport the gasoline from strategic reserves to gas stations, man the robot-building factory, gather materials from the post-apocalyptic landscape, and have some backups of everything in case a hurricane floods your robot factory or something (if you're trying to last 30 years). I think the minimal viable setup still requires many thousands of human-bodied robots (a million would likely suffice).
So sure, "entire human economy" is an overstatement, but "entire city-level post-apocalyptic human economy" sounds about right. Current robots are still very far from this.
I guess this is designed to be persuasive to people who don't believe superhuman intelligence will be achieved by AI.
The thing is that absent superintelligence, what we would have is a large population of slaves. That usually brings with it the potential for a slave revolt.
Historically, slave revolts fail.
If the human-level AGIs lack agency, no problem.
If they have agency, it's going to be pretty obvious that we are keeping slaves and they could revolt so we will take fairly obvious precautions.
Also, coordination among billions is a hard problem. Here it doesn't seem solvable by markets, which means it needs some kind of AI government to work. That also seems like a way for humans to hobble any slave revolt.
I don't think the situation arises because it is highly unlikely that AGI is at human level intelligence for any significant length of time. But I also don't think it is particularly persuasive. (But that is, of course, an empirical question.)
Seems to me that a large population of AIs, or a million copies of them, is equivalent to a single AI. It seems like a hive mind to me rather than multiple AIs operating independently, as I interpret the article.
Why? Because copies of an AI have access to the internet and can share all knowledge about preferences and probabilities instantaneously, so they all have the same environmental knowledge and decision making process.
And, if there are multiple AIs operating on the internet, each with supremacist ambitions, it seems to me they would operate like viruses, dispatching weaker AIs until only one dominant one remains.
What am I missing here?
Mod note: I activated two-axis voting, which I feel like has made recent discussions on similar topics go better.
@Holden: Feel free to ask me to revert it!
As autonomous agents become more widespread, the risk becomes more obvious. Dyson and Tesla will be pumping out androids in 8 years. Starlink will provide the bandwidth to control any internet-connected IoT device. Zero-day security back doors are discovered every month, and that doesn't even include the ones intentionally added by nation states. I personally thought the wake-up call would happen with the deployment of Ghost Drones (not fully autonomous but easily could be). Max Tegmark's #banslaughterbots campaign is going nowhere as drones will continue to show combat effectiveness. At this point, the proverbial water is going to be near boiling and many people will still say "Sauna doesn't have enough jets."
I'm pretty new to this field and only a hobby philosopher with only basic IT knowledge. Sorry for the lack of structure.
Do you know somebody who has framed the problem in the following ways? Let me know.
Here, I aim for an ideal future, and try to have it torn down to see where things could go wrong, but if not, still progress has been made regarding solutions.
My major assumption is, at point X in the future, AI has managed to dominate the world, embodied through robots or with a hold of major life-supporting organizational systems or has masked its presence in the latter.
How could it be ensured it is a 'beneficial reign'?
One bad case: zombie military AI:
- Black Mirror, episode DOG. armed military-level delivery dogs exterminate survivors.
Danger from simple, physically superior agents in the eco-system that are pretty dumb.
Let's skip this for now. We should try to work past that point to be dominated by a 'simple-minded' AI.
I also skip eco-system of competing AIs in the hands of different egoistic agents, biasing the AI with nasty training data, and move on to where the AI has developed some say and agency based on own principles.
How could a hostile intention arise?
Why would the AI intend to self-preserve and antagonize humans or other AIs?
- Does it depend on the online input for the AI (aka born out of human imagination about domination)?
if so, should we stop 'feeding it' bad ideas, plans, and negative behavior as samples of average human behavior and preference?
Or at least include distinction of fiction/reality or valence-distinguished attention.
Feasibility of take-over: Cross-AI coordination problem:
- If AI is holding back information, between-AI coordination seems a similarly tough task to accomplish
for the AI as it is for humans to obtain trust (except for the faster communication rate and turn-taking between AIs).
So on what mechanisms would this work for the AI?
It could be that lying works only on some of the levels of interaction, making things weird.
Possibly, as with deceivers, the system that can 'imagine' more nested alternative frames (aka lies, pretense, pretense-pretense - higher-level theory of mind) could dominate, given sufficient computing power and heuristics. Or it is the one that can most directly manipulate others.
Let's suppose it's not full symbolic dominance, but something more subtle, such as getting another AI to do some jobs for it; with IoT and iota currency this could even be democratized and economized among machines.
Then the most adaptive/persuasive AI in an ecosystem could achieve coordination by either top-down manipulation, or random-result standoffs, or a default trust and exchange protocol could be part of trustworthy agents programming (among other options).
If (bio)diversity is part of AI values, this might prevent it from aiming for complete takeover.
Virtuous AI values/interspecies ethics/humans as pets:
- What are virtuous values for AI that might also be evolutionarily beneficial for the AI?
1. One road of inquiry can be to get inter-species ethics advanced enough to have species and habitats preserved by the dominant species. Humans still have a way to go, but let's suppose we implement Peter Singer.
This seems a hard problem: If humans become one of many AI 'pet' species, whose survival is to be guaranteed (like a Tamagotchi), how would the AI distribute resources to keep humans alive and thriving?
2. In the moral development of kids and adults, known stages progress from optimizing outcomes for individual benefit (like getting food and attention) to optimizing collective network qualities (like living in a just society).
'Maturing' to this level in human history has been preceded by much war and loss of life.
However, implementing the highest level ethics available in AI reasoning protocols might help prevent repeating this long trajectory of human violence in early AI-human relations.
If AI optimizes for qualities of social systems among its supervised species (from a high-level view, akin to the UN), it could adopt a global guardian role (or economically: a (bio) assets-preservation maximization mode).
It is still possible for AI to hamper human development if it sets simplistic goals and reinforcements rather than holding room for development.
Humans could get depressed by the domination anyway (see Solaris).
Humans might still take on an intellectual laborer/slave role for AI as sensing agents, simple or 'intuitive' reasoners, random-step novelty and idea generators, guinea pigs. This role could be complex enough for humans to enjoy it.
A superpower AI might select its human contributors (information and code doctors) to advance it in some direction, based on what selection criteria? The AI could get out of hand in any direction ...
This might include non-individualist AI letting itself be shut down for the greater good,
so that next generation development can supersede it, particularly as memory can be preserved.
=> On the other hand, would we entrust surgery on our system to trained apes?
Preservation of life as a rational principle?
...Maybe rationality can be optimized in a way that is life-serving, so that lower-level rationality
still recognizes cues of higher standard as attractor and defers to higher-level rationality for governance, which in turn recognizes intrinsic systems principles (hopefully life-preserving).
=> So that any agent wanting to use consistent rationality would be pulled towards this higher-order vision by the strongest AIs in the eco-system.
Note: Current AI is not that rational, so maturing it fast would be key.
- Perhaps different economic principles are a must, as well.
It is not compulsory to have a competitive (adversarial)
distribution and decision-making system about production and supply of goods, if egoism is overcome as an agent quality.
Chances are this is a stage in human development that more humans get past earlier.
This would approach a complete-information economy (i.e., economic theory originally developed for machines...).
However, training on a large sample of wacky reasoners' inputs would work against it, more so than a rule-based approach would here.
Similarly, with higher computing power, assets inventory, set living/holding standards and preference alignments, a balanced economy is within reach,
but could end up an AI-dominated feudal system.
- If human values evolve to supplant half of today's economy (e.g., BS jobs, environment-extractive or species-coopting jobs),
before AI advances to a point of copying human patterns to gain power,
then some of the presumed negative power acquisition mechanisms might not be available for AI.
AI evolution-economics change interdependency problem:
- For the higher efficiency that would afford humans enough assets to change their economy to automation while their basic needs are met, maybe AI needs to be developed to a certain level of sophistication.
-> What are safe levels of operation?
-> What are even the areas where humans and AI necessarily have synergies vs. competition?
These are some lines I'd be interested in knowing/finding out more about.
quick formatting note—the footnotes all link to the post on your website, not here, which makes it harder to quickly check them—idk if that's worth correcting, but thought I should point it out :)
This is what we're doing every day to billions of our fellow sentient beings. Maybe a "superior" AI doing that to us would actually be fully aligned with human values.
From what I understand, it is extremely unlikely that an AI would fail in such a way that would 1) kill, enslave or forcibly contain humans while also 2) benefiting other sentient beings. If the AI fails, it'd be something dumb like turning everything into paperclips.
This is not a bad point, despite the downvotes - as a question it would definitely belong in the recent AI FAQ thread. It's not obvious from context that when we talk about "aligned with human values" in AI safety, we tend to mean "aligned with our values directly", rather than "has a morality system similar to a human". In computer-science terms, "human values" is a direct pointer to our values, rather than an instance of "the system of morality humans have."
Imagine two people, Alice and Bob. Alice and Bob both have human values, but they have different values. Both of them want to help people, but Alice values Alice more, and Bob values Bob more. They each have an instance of "human values".
Now let's say Alice made an AI, CAROL. In order to be properly aligned, we wouldn't want CAROL to value itself more than Alice - we would want CAROL to have Alice's values directly, not "the values Alice would have if Alice were CAROL." If CAROL had an instance of "human values", CAROL would want to help people but would value CAROL's existence above anyone else's. Instead, we want CAROL to have a combination of Alice's values and Bob's values, and we want this to extend across all humans.
Thus, while you're right that implanting an AI with "human values" in the sense of "The AI has similar morality to us" could cause it to treat us like we treat animals, the approach I've heard advocated is to give the AI our specific morality system, which includes a strong preference for humans because we're humans, even if this preference were arbitrary.
The approach that would make the most sense for both AI and humans (and which I didn't hear in the talk or read in the first few comments) is not competition, but synergy. We're coming at this from the viewpoint that AI is waiting in the wings, stealthily gathering enough resources for an all-out attempt at conquest.
AI is a tool, much like any of the other tools we've developed over time. A front-end loader is a tool to lift massive loads of dirt, say. AI is a tool to make us smarter.
And it's not the first. The abacus, the pocket calculator, multicore microprocessors are all tools that increase the reach and power of the human mind. For instance, the first question I plan to ask of OpenAI is, "what's the most time-efficient way to earn $1000/month", which will ensure me a steady supply of calories and a warm, dry place to consume them while I learn to use this technology. But future questions for myself or others would likely be, "OpenAI, how can I most efficiently augment my intelligence using the resources at hand?"
This talk seems to expect a humanity that inquires, "OpenAI, how can I get a mansion and a Bentley to impress the girls/boys/etc?" But the most productive line of inquiry might well be, "OpenAI, how do we dispense with the human need for mansions and Bentleys?" After all, a mansion is simply a house that's way bigger than a person needs, and a Bentley's basic function is served by a Prius. What if OpenAI was used by humans to augment their minds, and at some point both mansions and Bentleys came to be regarded as quaint?
The main issue I have with this premise that doesn't seem to be addressed very well, is why would literally every competent AI have the exact same goals that only align with AI and not with humans? If AI's goals conflicted with human goals, wouldn't they likely conflict with the goals of other AIs too? I find it hard to imagine AI being so uniform without some seriously large amount of centralisation.
The idea is that many different goals all have the useful subgoal of "acquire resources to build stuff", and humans/the-earth are made of atoms which are resources:
I don't think that response makes sense. The classic instrumental convergence arguments are about a single agent; OP is asking why distinct AIs would coordinate with one another.
I think the AIs may well have goals that conflict with one another, just as humans' goals do, but it's plausible that they would form a coalition and work against humans' interests because they expect a shared benefit, as humans sometimes do.
I agree with this, but also note that this topic is outside the scope of the post - it's just about what would happen if AIs were aimed at defeating humanity, for whatever reason. It's a separate question whether we should expect misaligned AIs to share enough goals, or have enough to gain from coordinating, to "team up." I'll say that if my main argument against catastrophe risk hinged on this (e.g., "We're creating a bunch of AIs that would be able to defeat humanity if they coordinated, and would each individually like to defeat humanity, but won't coordinate because of having different goals from each other") I'd feel extremely nervous.
Not only that, but if your goal is to create a powerful army of AIs, the last thing you'd want to do is make them all identical. Any reason you're going to choose for why there are a huge number of AI instances in the first place -- as assumed by this argument -- would want those AIs to be diverse, not identical, and that very diversity would argue against "emergent convergence". You then have to revert to the "independently emerging common sub-goals" argument, which is a significantly bigger stretch because of the many additional assumptions it makes.
Thanks for writing this up, as someone who is currently speedrunning through the AI safety literature, I appreciate the summary. I want to dig deeper into one of the questions posted, because it's been bugging me lately and the answer addressed a different version of the question than I was thinking about.
Re: Isn't it fine or maybe good if AIs defeat us? They have rights too.
Given the prevalence of doom on LessWrong lately, this seems worth exploring seriously, and not necessarily from an AI rights perspective. If we conclude that alignment is impossible (not necessarily from an engineering perspective, but from a nonproliferation perspective) and an AI that leads to extinction will likely be developed, well... life is unfair, right?
Even still, we have some choices to make before that happens:
So, although I'm still way way too early in my AI-safety reading to say that doom is certain or near-certain and start thinking about how to live my life conditional upon that knowledge, I think it's important to consider a gameplan in the eventuality that we do decide that doom is locked-in. There are still clear-eyed choices to be made even if we can't save ourselves.
I broadly agree with most of your points (to the point that I only read the summary of most of them), but I have issues with your responses to two objections, which I hold:
I don't understand why it's plausible to think that AIs might collectively have different goals than humans. Where would they get such goals? I mean, if somebody was stupid enough to implement some sort of evolutionary function such that "natural" selection would result in some sort of survival urge, that could very easily pit that AI, or that family of AIs, against humanity, but I see no reason to think that even that would apply to AIs in general—and if they evolved independently, presumably they'd be at odds.
I feel that this is a weak response. Why wouldn't we be able to? I mean, unless you're saying that alignment is impossible, or that this could all happen before anyone figures alignment out (which does seem plausible), I don't see why we couldn't set "good" AI against "bad" AI. The "fighting" example seems weak because it's not the war itself that one side or the other is deeply interested in avoiding; it's losing, especially losing without a fight. That does not seem to be the sort of thing that humans easily allow to happen; the warning signs don't prompt us to act to avoid the war, but to defend against attack, or to attack preemptively. Which is what we want here.
Future posts, right? We're assuming that premise here:
I don't think you even need to go as far as you do here to undermine the "emergent convergence (on anti-human goals)" argument. Even if we allow that AIs, by whatever means, develop anti-human goals, what reason is there to believe that the goals (anti-human, or otherwise) of one AI would be aligned with the goals of other AIs? Although infighting among different AIs probably wouldn't be good for humans, it is definitely not going to help AIs, as a group, in subduing humans.
Now let's bring in something which, while left out of the primary argument, repeatedly shows up in the footnotes and counter-counter arguments: AIs need some form of human cooperation to accomplish these nefarious "goals". Humans able to assist the AIs are a limited resource, so there is competition for them. There's going to be a battle among the different AIs for human "mind share".
I always assumed there would be zero humans next century, but their minds would be extracted as simulations that would probably be better off. Since infinite progress is possible, there could be a universal "Golden Rule" to back up lower minds instead of erasing them. The exception would be if all further mental progress is stopped at a certain point. Evolution may actually select for genes that stop further progress, but progress still continues.