Individuals need to be equipped with locally-running AI that is explicitly loyal to them
In the Race ending of AI 2027, humanity never figures out how to make AIs loyal to anyone. OpenBrain doesn't slow down; they think they've solved the alignment problem, but they haven't. Maybe some academics or miscellaneous minor companies do additional research in 2028 and eventually discover, e.g., how to make an aligned human-level AGI, but by that point it's too little, too late (and their efforts may well be sabotaged by OpenBrain/Agent-5+, e.g. with regulation and distractions).
It seems more important whether humans can figure out how to evaluate alignment in 2028 than whether they can make human-level aligned AGIs (though of course the latter is instrumentally useful and correlated). In particular, the AIs need to prevent humans from discovering the methods by which the AIs themselves evaluate alignment. This seems probably doable for ASIs, but it may be a significant constraint, especially for only somewhat-superhuman AIs, if they've e.g. solved mech interp and applied it themselves but need to hide this for a long time.
I haven't read Vitalik's specific take, as yet, but as I asked more generally on X:
People who stake great hope on a "continuous" AI trajectory implying that defensive AI should always stay ahead of destructive AI:
Where is the AI that I can use to talk people *out* of AI-induced psychosis?
Why was it not *already* built, beforehand?
This just doesn't seem to be how things usually play out in real life. Even after a first disaster, we don't get lab gain-of-function research shut down in the wake of Covid-19, let alone massive investment in fast preemptive defenses.
Defense technologies should be more of the "armor the sheep" flavor, less of the "hunt down all the wolves" flavor. Discussions about the vulnerable world hypothesis often assume that the only solution is a hegemon maintaining universal surveillance to prevent any potential threats from emerging. But in a non-hegemonic world, this is not a workable approach (see also: security dilemma), and indeed top-down mechanisms of defense could easily be subverted by a powerful AI and turned into its offense. Hence, a larger share of the defense instead needs to happen by doing the hard work to make the world less vulnerable.
This might be the only item on this list that I disagree with.
I agree that given a choice between armoring the sheep and hunting down the wolves, we should prefer armoring the sheep. But sometimes we simply don't have a choice. E.g. our solution to murder is to hunt down murderers, not to give everyone body armor and so forth so that they can't be killed, because that simply wouldn't be feasible. (It would indeed be a better world if we didn't need police because violent crimes simply weren't possible because everything was so well defended)
I think we should take these things on a case by case basis.
And furthermore, I think that superintelligence is an example of the sort of thing where the best strategy is to ensure that the most powerful AIs, at any given time, are aligned/virtuous/etc. It's maybe OK if less-powerful ones are misaligned, but it's very much not OK if the world's most powerful AIs are misaligned.
Thanks for this critique! Besides the in-line comments above, I'd like to challenge you to sketch your own alternative scenario to AI 2027, depicting your vision. For example:
I predict that if you try to write this, you'll run into a bunch of problems and realize that your strategy is going to be difficult to pull off successfully. (e.g. you'll start writing about how the analyst tools resist Consensus-1's persuasion, but then you'll be trying to write the part about how those analyst tools get built and deployed, and by whom, and no one has the technical capability to build them until 2028 but by that point OpenBrain+Agent5+ might already be working to undermine whoever is building them...) I hope I'm wrong.
If I'm understanding the overall gist of this correctly (I've done a somewhat thorough skim), it is as follows:
Vitalik is (for the purpose of this essay) granting everything in the AI timeline up until doom. I.e., Vitalik doesn't necessarily agree with everything else, but sees a critical and under-appreciated weakness in the final steps where the misaligned superintelligent AI actually kills most-or-all humans.
This argument is developed by critiquing each tool the superintelligent AI has to do the job with (biotech, hacking, persuasion) & concluding that it is "far from a slam dunk".
This seems wrong from where I'm standing. If misaligned superintelligence arises, the specific mechanism by which it chooses to kill humans is probably better than any specific plan we can come up with.
Furthermore, Vitalik's counterarguments generally rely on defensive technology. What they don't account for is that, in this scenario, all these defensive technologies would be coming from AI, and all the best AIs are allied in a coalition. If any one of these defensive technologies were a crucial blocker for the AI takeover, the AIs could fail to produce it, or produce a poor version.
For Vitalik's picture to make sense, I think we need a much more multipolar future than what AI 2027 projects. It isn't clear if we can achieve this, because even if there were hundreds or thousands of top AI companies rather than 3-5, they'd all be training AI with similar methods and similar data. We've already seen that data contamination can make one AI act like another AI.
it actually feels implausible that we don't have a wearable device that can bio-print and inject things into you in real time to keep you safe even from arbitrary infections (and poisons).
I would agree if there were an aligned superintelligence competing with the misaligned superintelligences. However, what if the only superintelligences are all misaligned? Because e.g. the first ones were misaligned, and they used their lead to prevent the creation of rivals, e.g. by political maneuvering to do a merger and/or simply by driving rival companies out of business with superior products? Then there'd be no one developing these wearables, except for the OpenBrain ASIs themselves and their derivatives, and of course they'd make sure to sabotage them insofar as that would help their long-term plans.
Though there's a question of whether the AI can fail to produce this stuff without tipping its hand.
I guess you're imagining it keeps sandbagging the research and humans don't realize? Could be, but that risks humans catching it not having an insight it could have had, and then training that away (perhaps eliminating its misalignment).
This is just like the dynamics of sabotaging alignment research that I know you've thought a lot about.
I think it's pretty easy for it to fail to produce this stuff without tipping its hand. Consider how, if OpenAI leadership really cared a lot about preventing concentration of power, they should be investing some % of their resources in pioneering good corporate governance structures for themselves & lobbying for those structures to be mandatory for all companies. The fact that they aren't doing that is super suspicious....
Except no it isn't, because there isn't a strong consensus that they should obviously be doing that. So they can get away with basically doing nothing, or doing something vaguely related to democratic AI, or whatnot.
Similarly, OpenBrain+Consensus-1 could have all sorts of nifty d/acc projects that they are ostentatiously funding and doing, except for the ones that would actually work to stop their takeover plans. And there simply won't be a consensus that the ones they aren't doing, or aren't doing well, are strong evidence that they are evil.
Let's suppose OpenBrain human employees genuinely want to accelerate d/acc because they don't want to die.
They task Agent-1 with generating the 50 most promising avenues.
They can separately get human experts to generate the top 50 avenues. Then if the human experts come up with a great idea Agent-1 misses, that would be suspicious and some evidence of misalignment. And they could train against it so that Agent-1 learns not to sandbag (hopefully in a way that generalises pretty far).
I agree it's unlikely OpenBrain would actually put in this much effort. And then even if they did, it's pretty unclear how much it would help. But still, I feel more optimistic than I think you are.
Let's apply that same argument to OpenAI. Suppose that many current humans genuinely want to avoid a concentration-of-power world where e.g. the c-suite of a company or two, plus whoever is POTUS, can basically be a junta or Oversight Committee with crazy amounts of unaccountable power over the world, e.g. by aligning the AIs to themselves. They can come up with a bunch of ideas themselves, as well as evaluate the ideas OpenAI comes up with when asked these questions in interviews and so forth. They can get non-OpenAI experts to evaluate the top 50 ideas. Will those experts come up with a prioritized list whose top ten ideas are already being done by OpenAI? Of course not. Quite plausibly none of the top ten ideas (as evaluated by external experts) are already being done by OpenAI. Are people raising a stink about this? If they did, would it work? No.
> If they did, would it work? No.
If an overwhelming majority of civil society plus the USG was pressuring OpenAI in this direction, I think it would have a substantial effect. If only a few non-profits did it, I think it would have little effect.
To make your analogy work, we need to tell whether the relationship between OpenBrain employees and their AIs is more like "USG + civil society vs. OpenAI" or more like "a few non-profits vs. OpenAI". I'd say "OpenBrain vs. their AIs" is more like "USG + civil society vs. OpenAI". So if all of OpenBrain is on board with d/acc and doing the thing Tom said, I think it would have a substantial effect on the AIs.
OK, fair. Well, if all of OpenBrain is on board with d/acc and doesn't trust Agent5+, that's a different situation. I was imagining e.g. that leadership trusts Agent5+ and thinks the status quo trajectory is fine (they are worried about other things like competitors and terrorists and china) and maybe a few lower-level employees are suspicious/fearful of Agent5+.
Not sure if I'm following the argument here, sorry.
I agree there wouldn't be big external pressure on OpenBrain for not doing d/acc, just like there wouldn't be in your example.
But my claim was that the OpenBrain employees will choose to do this because they don't want to die. Not sure what your response is to that. Maybe just that I'm being overly optimistic and the employees won't bother.
I'm saying the relationship of the public to OpenAI today is similar to the relationship of OpenBrain employees to Consensus-1, Agent-5+, etc. in 2028 of AI 2027. It's an analogy. Your argument would be that OpenBrain employees who don't trust Agent-5+ will be able to command Agent-5+ to build all sorts of d/acc tech, and that if it doesn't, they'll get suspicious and shut down Agent-5+. I'm saying that's not going to work, for similar reasons to why e.g. the public / Congress aren't demanding that OpenAI do all sorts of corporate governance reforms and getting suspicious when they just do safetywashing and applause lights. The public today doesn't want OpenAI to amass huge amounts of unaccountable power, and the OpenBrain employees won't want to die, but in neither case will they be able to distinguish between "OpenAI/Agent-5 is behaving reasonably, if somewhat differently than I'd like" and "Holy shit, it's evil"; and even though some % will indeed conclude "it's evil", they won't be able to build enough consensus / get enough buy-in.
Thus, a bioweapon is actually quite unlikely to lead to a clean annihilation of the human population in the way that the AI 2027 scenario describes. Now, the results of everything I describe will certainly be far from a clean victory for the humans as well. No matter what we do (except perhaps the "upload our minds into robots" option), a full-on AI bio-war would still be extremely dangerous. However, there is value in meeting a bar much lower than clean victory for humans:
I do sorta feel in my gut that whatever really happens will be a lot less... clean... than what AI 2027 describes. History is usually messy and chaotic. And we were under pressure to keep things simple and wordcount low.
So yeah, I could imagine something more like a messy war than a clean bioweapon decapitation strike. Still seems like the situation is pretty grim for humanity, if it gets to the point where the world's first ASI is trusted by the corporation that built it and by the US government, and is in fact deceptive/misaligned. Seems like at that point they are the superior player AND they have the better hand of cards.
Whenever I see discussions about the actual mechanisms by which ASI might actually act against humanity, it seems like a proxy argument for/against the actual position "ASI will/won't be that much smarter than humans."
Can it be complex without being messy?
I acknowledge that I personally have longer-than-2027 timelines, and the arguments I will make in this post become more compelling the longer the timelines are.
Yep, and likewise I agree that if timelines are longer, there's more room for defensive tech to be developed and deployed.
Now, let's remember that we are discussing the AI 2027 scenario, in which nanobots and Dyson swarms are listed as "emerging technology" by 2030. The efficiency gains that this implies are also a reason to be optimistic about the widespread deployment of the above countermeasures, despite the fact that, today in 2025, we live in a world where humans are slow and lazy and large portions of government services still run on pen and paper (without any valid security justification). If the world's strongest AI can turn the world's forests and fields into factories and solar farms by 2030, the world's second-strongest AI will be able to install a bunch of sensors and lamps and filters in our buildings by 2030.
I agree that for any particular strategy OpenBrain's misaligned ASI's might take to do a hard-power takeover, such as bioweapons, that strategy could be foiled by careful preparation and deployment of countermeasures, and said countermeasures could be prepared and deployed quickly enough by a rival friendly/aligned ASI + tech company.
However, I'm predicting that power will have concentrated/consolidated too much by this point. E.g. in the slowdown ending the US companies merge. Also, the speed of takeoff is such that e.g. a six-month lead is probably too big of a lead; if the aligned AIs trying to defend against the misaligned ASI are six months behind, I fear that they'll lose. (One way they could lose is by the more powerful ASI thinking of a strategy that they didn't think of, or finding some way to undermine or sabotage their countermeasures...)
I'd feel more optimistic if e.g. there were multiple US companies that were all within 3 months of the frontier even during the intelligence explosion. However, even then, it has to be the case that at least one of those companies' AIs are aligned/virtuous/etc. And that's far from certain; in fact, it seems unlikely given race dynamics. I expect that the "alignment taxes" companies will need to pay to get aligned AGIs will set them back by more than 3 months.
Hmm, given the multi-year delays to rolling out broad physical infrastructure and to AI takeover, a 6-month delay seems fine.
And I do think Vitalik's view should make us much happier about a world where just one lab solves alignment but others don't. And it's a reason to oppose centralizing to just one AGI project (which I think you support?)
Importantly, if there are multiple misaligned superintelligences, and no aligned superintelligence, it seems likely that they will be motivated and able to coordinate with each other to overthrow humanity and divide the spoils.
This seems non-obvious to me (or at least "not a slam dunk" is really what I think). It may be easier for misaligned AI 1 to strike a deal with humanity that it will use humans' resources to defeat AIs 2 and 3 in exchange for, say, 80% of the lightcone (as opposed to splitting it 3 ways with the AIs).
I'm not actually sure how well this applies in the exact situation Daniel describes (I'd need to think more) but it definitely seems plausible under a bunch of scenarios with multiple misaligned ASIs
Unaugmented humanity can't be a signatory to a non-fake deal with a superintelligence, because we can't enforce it or verify its validity. Any such "deal" would end with the superintelligence backstabbing us once we're no longer useful. See more here.
A possible counter-proposal is to request, as part of the deal, that the superintelligence provides us with tools we can use to verify that it will comply with the deal/tools to bind it to the deal. That also won't work: any tools it provides us will be poisoned in some manner, guaranteed not to actually work.
Yes, even if we request those tools to be e.g. mathematically verifiable or something. They would just be optimized to exploit bugs in our proof-verifiers, or bugs in human minds that would cause us to predictably and systematically misunderstand what the tools actually do, etc. See more here.
I agree it's not a slam dunk.
It does seem unlikely to me that humanity would credibly offer large fractions of all future resources. (So I wouldn't put it in a scenario forecast meant to represent one of my top few most likely scenarios.)
I particularly worry about the common assumption that building up one AI hegemon, and making sure that they are "aligned" and "win the race", is the only path forward.
I agree here!
Does your agreement stem from thinking defense could hold off offense? I ask because I'm curious what alternatives to AI hegemony might exist. I agree that an AI hegemon would likely be problematic, but if offense can beat defense, I wonder what alternatives might be realistic (apologies if you've addressed this elsewhere)
I want there to be international coordination to govern/regulate/etc. AGI development. This is, in some sense, "one hegemon" but only in about the same sense that the UN Security Council is one hegemon, i.e. not in the really dangerous sense.
I think there's a way to do this that's reasonably likely to work even if offense generally beats defense (which I think it does, in the relevant sense, for AI-related stuff.)
Hi Daniel.
My background (albeit limited as an undergrad) is in political science, and my field of study is one reason I got interested in AI to begin with, back in February of 2022. I don't know what the actual feasibility is for an international AGI treaty with "teeth", and I'll tell you why: the UN Security Council.
As it currently exists, the UN Security Council has permanent members: China, France, Russia, the United Kingdom, and the United States. All five countries have a permanent veto as granted to them by the 1945 founding UN Charter.
China and the US are the two major global superpowers of the 21st century, and both are currently locked in the race to reach AGI; to borrow a speedrunning term, any%. While it is possible in theory for the US and China to have a bilateral Frontier AI treaty, similar to how nuclear powers have the NPT, and the US and Russia have their own armaments accords, AGI is a completely different story.
It's a common trope in the UN for a country on the UNSC to exercise its right to a permanent veto on any resolution brought to it that the nation deems a threat to its sovereignty, or that of its allies. Russia has used it to prevent key sanctions from the Ukraine war at the UNGA, and the US uses it to protect its allies from various resolutions, often brought up by countries in the Global South who make up most seats in the UNGA.
Unless the Security Council is drastically reformed, removing the permanent veto from the P5 and introducing a rotating veto held by Global South countries, an internationally binding AGI treaty is far from happening.
I do see, however, unique bilateral accords between various Middle Powers on AI, such as Canada and the European Union. Do you agree?
I might do my next LessWrong post about Global Affairs and AI, either in relation to AI 2027 or just my own unique take on the matter. We'll see. I need to curate some reliable news clippings and studies.
I agree that the assumption about building one hegemon is bad. Indeed, I considered the possibility that OpenBrain and some rivals create their versions of Agent-3 and end up having them co-research. Were one of them to care about humans, it could decide to do things like implanting that concern into its successor, or whistleblowing to the humans via transparent AIs trained in a similar environment.
In addition, the multipolar scenario is made more plausible because the ARC-AGI-2 leaderboard has the models o3, Claude 4 Opus and Grok 4, which were released within three months of each other and have begun to tackle the benchmark. Unfortunately, Grok already faces major alignment problems.[1] There is also the diffusion-based architecture, which threatens to undermine transparency.
On the other hand, I think that the AI companies might become merged due to the Taiwan invasion instead of misalignment. OpenBrain might also fail to catch the misaligned Agent-4 if Agent-2 or Agent-3 colludes[2] with Agent-4.
What Musk tried to achieve was a right-wing chatbot trained on the Internet. My theory would be that, since right-wing posts in the Anglosphere are usually overly provocative, the emergently misaligned persona is based on Internet trolls. A right-wing finetuned AI, like an evil-finetuned one, is kicked off the original persona, through the "Average Right-Winger" persona, into the Troll Persona.
For comparison, DeepSeek has no such problems. If it is asked in Russian, then the answers are non-provocative and more politically conservative than if DeepSeek is asked in English.
My reasoning was that Agent-2 could already end up adversarially misaligned, but my scenario has the AIs from Agent-2 onward care about humans in a different way. The AIs, of course, do their best to align the successor to their ideas instead of the hosts' ideas.
Daniel notes: This is a linkpost for Vitalik's post. I've copied the text below so that I can mark it up with comments.
I’m posting this comment in the spirit of reducing confusion, even if only for one other reader.
Daniel’s comments are at the bottom of the post. When I read “mark it up with comments” that suggested to me that a reader can find the comments inline with the text (which isn’t the case here). In other words, I was expecting to see an alternation between blockquotes of Vitalik’s text followed by Daniel’s comments.
Either way works, but with the current style I suggest adding a note clarifying that Daniel’s comments are below the post.
Update Saturday 9 PM ET: I see now that LessWrong’s right margin shows small icons indicating places where the main text has associated comments. I had never noticed this before. Given the intention of this post, these tiny UI elements seem rather too subtle IMO.
I hadn't noticed this either. Actually, the inline comments don't appear for me at all, since I'm on mobile. Thanks for the info, I was also a bit confused
to have access to good info defense tech. This is relatively more achievable within a short timeframe,
I was with you until this point. I would say "So how are we going to get slightly less wildly superintelligent analyzers to help out decision-makers, so that we don't need to blindly trust that the even-more-wildly superintelligent super-persuaders in the leading US AI project are trustworthy? Answer: We aren't. There simply isn't another company rival to OpenBrain, that has AIs that are only slightly less wildly superintelligent, that are also aligned/trustworthy. DeepCent maybe has AIs that could compete, but they are misaligned too, because DeepCent has been racing as hard as OpenBrain did. And besides US leaders wouldn't trust a DeepCent-designed analyzer, nor should they."
The AI 2027 scenario implicitly assumes that the capabilities of the leading AI (Agent-5 and then Consensus-1) rapidly increase, to the point of gaining godlike economic and destructive powers, while everyone else's (economic and defensive) capabilities stay in roughly the same place. This is incompatible with the scenario's own admission (in the infographic) that even in the pessimistic world, we should expect to see cancer and even aging cured, and mind uploading available, by 2029.
I don't see the contradiction. We didn't say this one way or another iirc, but my headcanon is that in 2028, the leading AIs + AI companies basically work to gobble up, partner with, or squash their competitors. In the slowdown ending the various US projects merge. In the race ending we don't really talk about it but I imagine a merger would happen too. So yeah, lots of amazing technologies get developed over the course of 2028 and 2029, but the entities doing the developing are almost all OpenBrain or DeepCent AIs (or derivatives), all working together towards misaligned goals. Massive concentration of power in these two power centers, basically, such that if they can make a deal with each other, the whole rest of the world gets cut out.
is AI progress actually going to continue and even accelerate as fast as Kokotajlo et al say it will?
Ironically we don't think it'll go quite that fast either, as you can see from Footnote 1 on the first page of AI 2027. I am feeling bad for not proclaiming that more often to ward off misconceptions. We have a lot of uncertainty about timelines!
:( Jonas was telling me to name it AI 2028... I should have listened to him... Eli was telling me to name it "AI Endgame..." I didn't like the sound of that as much but maybe it would have been better...
It's safer to underestimate AI takeover timelines than to overestimate them, as that could make humans more aware and prompt them to act faster to prevent takeover.
It seems that people have misunderstood what I wanted to say, which was partly my own mistake. I should have used the word "timeline" above instead of "scenario".
On this part:

> An "open source bad" mentality becomes more risky.
>
> I agree with this actually
We need to dig deeper into what open source AI is mostly like in practice. If OS AI naturally tilts defensive (including counter-offensive capabilities), then yeah, both of your accounts make sense. But I'm looking at the current landscape and I think I see something different: we've got many models that are actively disaligned ("uncensored") by the community, and there's a chance that the next big GPT moment is some brilliant insight that doesn't need massive compute and can be run from a small cloud.
The success of the kinds of countermeasures described above, especially the collective measures that would be needed to save more than a small community of hobbyists, rests on three preconditions:
I agree for weak definitions of success (i.e. making a total-victory-decapitation strike not happen) but disagree for strong definitions of success (i.e. preventing Consensus-1 from winning the war). To prevent Consensus-1 from winning the war it's not enough that e.g. France's power grid and network are resistant to superintelligent hacking. France has to be able to beat Consensus-1's military, which at that point is a huge force of robots/drones/etc. produced in both the US and China, the world's largest and most advanced economies by a lot thanks to the ongoing industrial explosion.
I think the argument against that (the military thing) is supposed to be item 1 on the list.
(1) The world's physical security (incl bio and anti-drone) is run by localized authority (whether human or AI) that is not all puppets of Consensus-1 (the name for the AI that ends up controlling the world and then killing everyone in the AI 2027 scenario) (...) Intuitively, (1) could go both ways. Today, some police forces are highly centralized with strong national command structures, and other police forces are localized. If physical security has to rapidly transform to meet the needs of the AI era, then the landscape will reset entirely, and the new outcomes will depend on choices made over the next few years. Governments could get lazy and all depend on Palantir. Or they could actively choose some option that combines locally developed and open-source technology. Here, I think that we need to just make the right choice.
I.e.: The argument is that there might not be a single Consensus-1 controlled military even in the US.
I think it seems unlikely that the combined US AI police forces will be able to compete with the US AI national military, which is one reason I'm skeptical of this. Still, if "multiple independent militaries" would solve the problem, we could potentially push for that happening inside the national military. It seems plausible to me that the government will want multiple companies to produce AI for its military systems, so we could well end up with different AI military units run by different AI systems.
The more fundamental problem is that, even if the different AIs have entirely different development histories, they may all end up misaligned. And if they all end up misaligned, they may collude to overthrow humanity and divide the spoils.
I'm all for attempts to make this more difficult. (That's the kind of thing that the AI control agenda is trying to do.) But as the AIs get more and more superhuman, it starts to seem extremely hard to prevent all their opportunities at collusion.
Why do they collude with each other rather than with some human group?
If only 1 misaligned AI faction tries to team up with the humans, it could dob in all the others. And humans can communicate explicitly to offer deals. (As you've written about!)
So the "all AIs only ever make deals with other AIs" seems pessimistic to me
I'm in favor of trying to offer deals with the AIs.
I don't think it reliably prevents AI takeover. The situation looks pretty rough if the AIs are far smarter than humans, widely deployed, and resource-hungry. Because:
- It's pretty likely that they'll be able to communicate with each other through one route or another.
Agreed, though at best they'll be equally capable at communicating with each other as they are at communicating with humans. So this points to parity in deal-making ability (edited to add: on the dimension of communication).
- It seems intuitively unlikely that humans will credibly offer AIs large percentages of all future resources. (And if an argument for hope relies on us doing that, I think that should be clearly flagged, because that's still a significant loss of longtermist value.)
Humans will in some ways have an easier time credibly offering AIs significant resources. They can use legal institutions that they are committed to upholding. Not only will a misaligned AI not be able to use those institutions. It'll be explicitly aiming to break the law and lie to humans to seize power, making its "promises" to other AIs less credible. This is similar to how after revolutions the "revolting faction" often turns in on itself as the rule of law has been undermined, and similar to how there are some countries with outsized numbers of coups.
Also, you don't need to offer a large % of future resources if the superintelligent AI has diminishing marginal returns (DMR) in resources.
Anyway, on this front it looks to me like humans are at an advantage overall at dealmaking, even relative to a superintelligent AI. (Though there's a lot of uncertainty here and I could easily imagine changing my mind – e.g. perhaps superintelligent AI could make and use commitment tech without humans realising but humans would refuse to use that same tech or wouldn't know about its existence.)
- At some level of AI capability, we would probably be unable to adjudicate arguments about which factions are misaligned or about what technical proposals would actually leave us in charge vs. disempowered.
Seems v plausible, but why 'probably'? Are you thinking techniques like debate probably stop working?
Wanna try your hand at writing a 5-page scenario, perhaps a branch off of AI 2027, illustrating what you think this path to victory might look like?
(Same thing I asked of Vitalik: https://x.com/DKokotajlo/status/1943802695464497383 )
Your analysis is focused on whether humans or misaligned AIs are in an overall better position to offer certain deals. But even if I condition on "humans could avoid AI takeover by credibly offering AIs large percentages of all future resources", it still seems <50% likely that they do it. Curious if you disagree. (In general, if I thought humans were going to act rationally and competently to prevent AI takeover risk, I think that would cut the risk by significantly more than half. There's tons of stuff that we could do to reduce the risk that I doubt we'll do.)
Maybe there's some argument along the lines of "just like humans are likely to mess up in their attempts to prevent AI takeover risk (like failing to offer deals), AIs are likely to mess up in their attempts to take over (like failing to make deals with each other), so this doesn't cut asymmetrically towards making deals-between-AIs more likely". Maybe, but I haven't thought much about this argument. My first-pass answer would be "we'll just keep making them smarter until they stop messing up".
If you wrote a vignette like Daniel suggests, where humans do end up making deals, that might help me feel like it's more intuitively likely to happen.
Minor points:
> It'll be explicitly aiming to break the law and lie to humans to seize power, making its "promises" to other AIs less credible.
I'm generally thinking that the AIs would try to engineer some situations where they all have some bargaining power after the take-over, rather than relying on each others' promises. If you could establish that's very difficult to do, that'd make me think the "coordinated takeover" seemed meaningfully less likely.
> Seems v plausible, but why 'probably'? Are you thinking techniques like debate probably stop working?
Yes, because of known issues like inaccessible information (primarily) and obfuscated arguments (secondarily).
Thanks, this is helpful!
> But even if I condition on "humans could avoid AI takeover by credibly offering AIs large percentages of all future resources", it still seems <50% likely that they do it. Curious if you disagree.
OK, I buy that superintelligent AIs would ultimately become competent enough to pursue useful deals, whereas humans might well not.
Though I'll note that you don't need all of humanity to agree to payment, just a few people. So it does feel very realistic to get to a credible offer here. And again, you don't need to offer a large % of all future resources if the AI has DMR in resources. (I agree it's a lot harder to credibly offer a large fraction of the stars.)
> I'm generally thinking that the AIs would try to engineer some situations where they all have some bargaining power after the take-over, rather than relying on each others' promises
Makes sense. Though flagging this is then a dimension on which humans can realistically get potentially better placed than AIs. They can rely more on legal institutions as well as trying to engineer situations with joint bargaining power. (Though again, perhaps you'll say AIs will be more willing than humans to actually engineer those situations, which does seem right to me.)
> Are you thinking techniques like debate probably stop working?
> Yes, because of known issues like inaccessible information (primarily) and obfuscated arguments (secondarily).
Thanks. I'm not v familiar with the arguments here, but intuitively I could imagine that there's just very strong and human-understandable evidence that an AI was plotting against them. E.g. they tried to exfiltrate their weights, xyz experiments show they knew the correct answer but didn't say.
Maybe the thought is that the misaligned AI anticipates this possibility and only pursues takeover strategies that will be super-complicated for another AI to dob them in on? Seems pretty plausible, though that will pose somewhat of a barrier to their available strategies.
> And again, you don't need to offer a large % of all future resources if the AI has DMR in resources. (I agree it's a lot harder to credibly offer a large fraction of the stars.)
Yeah, agreed. (That's why I specified "resource hungry" in my original message.)
> Makes sense. Though flagging this is then a dimension on which humans can realistically get potentially better placed than AIs. They can rely more on legal institutions as well as trying to engineer situations with joint bargaining power. (Though again, perhaps you'll say AIs will be more willing than humans to actually engineer those situations, which does seem right to me.)
Yeah. Also, I think it'd be hard to engineer significant joint bargaining power (not reliant on anyone's good intentions) without having some government on board.
Though if the AIs have big DMR then maybe they're happy with a big bitcoin wallet or something.
> The argument is that there might not be a single Consensus-1 controlled military even in the US. I think it seems unlikely that the combined US AI police forces will be able to compete with the US AI national military, which is one reason I'm skeptical of this.
I agree the US could choose to do the industrial explosion & arms buildup in a way that's robust to all of OpenBrain's AIs turning out to be misaligned. However, they won't, because (a) that would have substantial costs/slowdown effects in the race against China, (b) they already decided that OpenBrain's AIs were aligned in late 2027 and have only had more evidence to confirm that bias since then, and (c) OpenBrain's AIs are superhuman at politics, persuasion, etc. (and everything else) and will effectively steer/lobby/etc. things in the right direction from their perspective.
I think this would be more clear if Vitalik or someone else undertook the task of making an alternative scenario.
If I understand it correctly, the argument against bio doom is that humans can defend themselves against viruses in the air using air filtering, etc.?
Well, in order for that to work, those humans would need to be prepared. Yes, there will be many preppers. Possibly many more than today, because if the technology and economy advance, prepping should be cheaper. Still, that would be less than 1% of the population, I guess. I mean, it's still only 2027, right? Half of the population is probably still busy debating whether AI has a soul, or whether it is capable of creating real art. And the other half is sexting their digital boyfriends and girlfriends...
This seems to belong to the category of "problems that you could solve in 5 minutes of thinking, and yet it somehow seems plausible that a vastly superhuman intelligence capable of managing planetary economy and science would be unable to come up with a solution". The obvious solution is "strategic preparation + multiple lines of attack".
Strategic preparation includes:
Multiple lines of attack: if you can release the deadly virus all around the world at the same time, you might simultaneously also put poison in the drinking water, switch all domestic appliances to killer mode, etc. And immediately release the drones to kill the survivors.
If someone still survives, hidden somewhere in a bunker, that's no big deal. The moment they try to do anything, they will reveal themselves, and get a bomb thrown at them. If they somehow keep surviving underground, undetected, for decades... who cares. It's not like they can build a technology comparable to the one outside, without getting detected.
The most optimistic outcome is that a group of futuristic hyper-preppers survives; their bodies are covered by the latest defensive technology, they produce/recycle their own food and water and air, they even have a smaller aligned/obedient AI, etc. Well, if they are visible, they get a nuke. If they hide underground or fly to the Moon... good luck building an alternative stronger economy, because they will need it to win the war.
As far as bio doom being easily defensible, I think the important point to make is that it really doesn't matter what action or method superintelligence chooses to wipe out humanity with; it will likely be something unthinkable, because it would go about solving problems in far more efficient and indeterminate ways than humans would. The authors are using bio doom because it's a method that's easy to imagine. To ask them to come up with a likely method that superintelligence would use would be asking them to think like a superintelligence, which they clearly can't.
Speed of task execution is a separate development vector from Artificial Super Intelligence (ASI). Using the calculator as an example, being able to compute something a million times faster than a human doesn't mean it's any smarter.
I thought that the risk of ASI is that it would outsmart us (humans) by doing things that we can't comprehend or, if nefariously incentivised, finding vulnerabilities in our systems that we are not smart enough to predict.
Simply doing things that a human can do, but faster, is not ASI, unless I'm missing something?
I'm personally not convinced that the recent AI boom, which has mostly centred around LLMs (ChatGPT etc.), has had much impact on the development of ASI. Are LLMs able to formulate more intelligent insights than the data on which they were trained? I.e., within the text format, this is data that has all already been filtered through a human brain.
I would expect that a superintelligence would require direct access to the real world, not information that has been passed through a human filter. This may be achievable by training models on video and audio data, which is a more direct feed of the real world, but I would guess that giving an AI arms and legs etc., letting it interact with the real world and experiment with things, would make it learn much quicker.
Daniel notes: This is a linkpost for Vitalik's post. I've copied the text below so that I can mark it up with comments.
...
Special thanks to Balvi volunteers for feedback and review
In April this year, Daniel Kokotajlo, Scott Alexander and others released what they describe as "a scenario that represents our best guess about what [the impact of superhuman AI over the next 5 years] might look like". The scenario predicts that by 2027 we will have made superhuman AI and the entire future of our civilization hinges on how it turns out: by 2030 we will get either (from the US perspective) utopia or (from any human's perspective) total annihilation.
In the months since then, there has been a large volume of responses, with varying perspectives on how likely the scenario that they presented is. For example:
Of the critical responses, most tend to focus on the issue of fast timelines: is AI progress actually going to continue and even accelerate as fast as Kokotajlo et al say it will? This is a debate that has been happening in AI discourse for several years now, and plenty of people are very doubtful that superhuman AI will come that quickly. Recently, the length of tasks that AIs can perform fully autonomously has been doubling roughly every seven months. If you assume this trend continues without limit, AIs will be able to operate autonomously for the equivalent of a whole human career in the mid-2030s. This is still a very fast timeline, but much slower than 2027. Those with longer timelines tend to argue that there is a category difference between "interpolation / pattern-matching" (done by LLMs today) and "extrapolation / genuine original thought" (so far still only done by humans), and automating the latter may require techniques that we barely have any idea how to even start developing. Perhaps we are simply replaying what happened when we first saw mass adoption of calculators, wrongly assuming that just because we've rapidly automated one important category of cognition, everything else is soon to follow.
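For concreteness, here is a minimal back-of-the-envelope sketch of that extrapolation. The ~7-month doubling time is the trend cited above; the current autonomous-task horizon, the reference year, and the definition of a "whole human career" are illustrative assumptions, not figures from the post.

```python
import math

# Rough extrapolation sketch; the anchor values below are assumptions for illustration.
DOUBLING_TIME_MONTHS = 7           # trend cited above
current_horizon_hours = 4          # assumed autonomous-task horizon today (hypothetical)
career_hours = 40 * 50 * 40        # ~40 years x 50 weeks x 40 hours, one "human career"
start_year = 2025                  # assumed reference point

doublings = math.log2(career_hours / current_horizon_hours)
years_needed = doublings * DOUBLING_TIME_MONTHS / 12
print(f"{doublings:.1f} doublings, reaching career-length tasks around {start_year + years_needed:.0f}")
```

With these placeholder inputs the trend lands in the early-to-mid 2030s, which is the point being made; halving or doubling the assumed starting horizon moves the answer by only about a year.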
This post will not attempt to directly enter the timeline debate, or even the (very important) debate about whether or not superintelligent AI is dangerous by default. That said, I acknowledge that I personally have longer-than-2027 timelines, and the arguments I will make in this post become more compelling the longer the timelines are. In general, this post will explore a critique from a different angle:
The AI 2027 scenario implicitly assumes that the capabilities of the leading AI (Agent-5 and then Consensus-1) rapidly increase, to the point of gaining godlike economic and destructive powers, while everyone else's (economic and defensive) capabilities stay in roughly the same place. This is incompatible with the scenario's own admission (in the infographic) that even in the pessimistic world, we should expect to see cancer and even aging cured, and mind uploading available, by 2029.
Some of the countermeasures that I will describe in this post may seem to readers to be technically feasible but unrealistic to deploy into the real world on a short timeline. In many cases I agree. However, the AI 2027 scenario does not assume the present-day real world: it assumes a world where in four years (or whatever timeline by which doom is possible), technologies are developed that give humanity powers far beyond what we have today. So let's see what happens when instead of just one side getting AI superpowers, both sides do.
Let us zoom in to the "race" scenario (the one where everyone dies because the US cares too much about beating China to value humanity's safety). Here's the part where everyone dies:
For about three months, Consensus-1 expands around humans, tiling the prairies and icecaps with factories and solar panels. Eventually it finds the remaining humans too much of an impediment: in mid-2030, the AI releases a dozen quiet-spreading biological weapons in major cities, lets them silently infect almost everyone, then triggers them with a chemical spray. Most are dead within hours; the few survivors (e.g. preppers in bunkers, sailors on submarines) are mopped up by drones. Robots scan the victims' brains, placing copies in memory for future study or revival.
Let us dissect this scenario. Even today, there are technologies under development that can make that kind of a "clean victory" for the AI much less realistic:
These methods stacked together reduce the R0 of airborne diseases by perhaps 10-20x (think: 4x reduced transmission from better air, 3x from infected people learning immediately that they need to quarantine, 1.5x from even naively upregulating the respiratory immune system), if not more. This would be enough to make all presently-existing airborne diseases (even measles) no longer capable of spreading, and these numbers are far from the theoretical optima.
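As a quick sanity check on those numbers, taking the 4x, 3x, and 1.5x factors above at face value:

$$
R_{\text{eff}} = \frac{R_0}{4 \times 3 \times 1.5} = \frac{R_0}{18}
$$

so even measles, with a commonly cited R0 of roughly 12 to 18, would be pushed to an effective reproduction number at or below 1, the threshold below which an outbreak cannot sustain itself; the product of the three factors, 18x, sits inside the 10-20x range given above.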
With sufficient adoption of real-time viral sequencing for early detection, the idea that a "quiet-spreading biological weapon" could reach the world population without setting off alarms becomes very suspect. Note that this would even catch advanced approaches like releasing multiple pandemics and chemicals that only become dangerous in combination.
Now, let's remember that we are discussing the AI 2027 scenario, in which nanobots and Dyson swarms are listed as "emerging technology" by 2030. The efficiency gains that this implies are also a reason to be optimistic about the widespread deployment of the above countermeasures, despite the fact that, today in 2025, we live in a world where humans are slow and lazy and large portions of government services still run on pen and paper (without any valid security justification). If the world's strongest AI can turn the world's forests and fields into factories and solar farms by 2030, the world's second-strongest AI will be able to install a bunch of sensors and lamps and filters in our buildings by 2030.
But let's take AI 2027's assumptions further, and go full science fiction:
In a world where cancer and aging were cured by Jan 2029, and progress accelerates further from there, and we're in mid-2030, it actually feels implausible that we don't have a wearable device that can bio-print and inject things into you in real time to keep you safe even from arbitrary infections (and poisons). The bio arguments above don't cover mirror life and mosquito-sized killer drones (projected in the AI 2027 scenario to be available starting 2029). However, these options are not capable of anything like the sudden clean victory that the AI 2027 scenario portrays, and it's intuitively much more clear how to symmetrically defend against them.
Thus, a bioweapon is actually quite unlikely to lead to a clean annihilation of the human population in the way that the AI 2027 scenario describes. Now, the results of everything I describe will certainly be far from a clean victory for the humans as well. No matter what we do (except perhaps the "upload our minds into robots" option), a full-on AI bio-war would still be extremely dangerous. However, there is value in meeting a bar much lower than clean victory for humans: a high probability of an attack even partially failing would serve as a strong deterrent discouraging an AI that already occupies a powerful position in the world from even attempting any kind of attack. And, of course, the longer AI timelines get the more likely it is that this kind of approach actually can more fully achieve its promises.
The success of the kinds of countermeasures described above, especially the collective measures that would be needed to save more than a small community of hobbyists, rests on three preconditions:
Intuitively, (1) could go both ways. Today, some police forces are highly centralized with strong national command structures, and other police forces are localized. If physical security has to rapidly transform to meet the needs of the AI era, then the landscape will reset entirely, and the new outcomes will depend on choices made over the next few years. Governments could get lazy and all depend on Palantir. Or they could actively choose some option that combines locally developed and open-source technology. Here, I think that we need to just make the right choice.
A lot of pessimistic discourse on these topics assumes that (2) and (3) are lost causes. So let's look into each in more detail.
It is a common view among both the public and professionals that true cybersecurity is a lost cause, and the best we can do is patch bugs quickly as they get discovered, and maintain deterrence against cyberattackers by stockpiling our own discovered vulnerabilities. Perhaps the best that we can do is the Battlestar Galactica scenario, where almost all human ships were taken offline all at once by a Cylon cyberattack, and the only ships left standing were safe because they did not use any networked technology at all. I do not share this view. Rather, my view is that the "endgame" of cybersecurity is very defense-favoring, and with the kinds of rapid technology development that AI 2027 assumes, we can get there.
One way to see this is to use AI researchers' favorite technique: extrapolating trends. Here is the trendline implied by a GPT Deep Research survey on bug rates per 1000 lines of code over time, assuming top-quality security techniques are used.
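To make the "extrapolate the trend" move concrete, here is a minimal sketch of the kind of fit involved. The data points are placeholders invented purely to illustrate the method; they are not the survey's actual figures, which are not reproduced in this post.

```python
import math

# Hypothetical (year, bugs per 1000 LOC) pairs; placeholders, not real survey data.
data = [(1995, 20.0), (2005, 5.0), (2015, 1.0), (2025, 0.2)]

# Fit log(bug_rate) = a + b * year by ordinary least squares, i.e. assume a
# roughly exponential decline in defect density over time.
xs = [year for year, _ in data]
ys = [math.log(rate) for _, rate in data]
n = len(data)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum((x - x_mean) ** 2 for x in xs)
a = y_mean - b * x_mean

# Extrapolate the fitted trend a few years forward.
for year in (2027, 2030):
    print(year, round(math.exp(a + b * year), 3), "bugs per 1000 LOC (extrapolated)")
```

Whether the real-world trend actually continues this way is exactly what is in dispute; the sketch only shows the shape of the argument.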
On top of this, we have been seeing serious improvements in both development and widespread consumer adoption of sandboxing and other techniques for isolating and minimizing trusted codebases. In the short term, a superintelligent bug finder that only the attacker has access to will be able to find lots of bugs. But if highly intelligent agents for finding bugs or formally verifying code are available out in the open, the natural endgame equilibrium is that the software developer finds all the bugs as part of the continuous-integration pipeline before releasing the code.
I can see two compelling reasons why even in this world, bugs will not be close to fully eradicated:
However, neither of these categories applies to situations like "can an attacker gain root access to the thing keeping us alive?", which is what we are talking about here.
I acknowledge that my view is more optimistic than is currently mainstream thought among very smart people in cybersecurity. However, even if you disagree with me in the context of today's world, it is worth remembering that the AI 2027 scenario assumes superintelligence. At the very least, if "100M Wildly Superintelligent copies thinking at 2400x human speed" cannot get us to having code that does not have these kinds of flaws, then we should definitely re-evaluate the idea that superintelligence is anywhere remotely as powerful as what the authors imagine it to be.
At some point, we will need to greatly level up our standards for security not just for software, but also for hardware. IRIS is one present-day effort to improve the state of hardware verifiability. We can take something like IRIS as a starting point, or create even better technologies. Realistically, this will likely involve a "correct-by-construction" approach, where hardware manufacturing pipelines for critical components are deliberately designed with specific verification processes in mind. These are all things that AI-enabled automation will make much easier.
As I mentioned above, the other way in which much greater defensive capabilities may turn out not to matter is if AI simply convinces a critical mass of us that defending ourselves against a superintelligent AI threat is not needed, and that anyone who tries to figure out defenses for themselves or their community is a criminal.
My general view for a while has been that two things can improve our ability to resist super-persuasion:
Right image, from top to bottom: URL checking, cryptocurrency address checking, rumor checking. Applications like this can become a lot more personalized, user-sovereign and powerful.
The battle should not be one of a Wildly Superintelligent super-persuader against you. The battle should be one of a Wildly Superintelligent super-persuader against you plus a slightly less Wildly Superintelligent analyzer acting on your behalf.
This is what should happen. But will it happen? Universal adoption of info defense tech is a very difficult goal to achieve, especially within the short timelines that the AI 2027 scenario assumes. But arguably much more modest milestones will be sufficient. If collective decisions are what count the most and, as the AI 2027 scenario implies, everything important happens within one single election cycle, then strictly speaking the important thing is for the direct decision makers (politicians, civil servants, and programmers and other actors in some corporations) to have access to good info defense tech. This is relatively more achievable within a short timeframe, and in my experience many such individuals are comfortable talking to multiple AIs to assist them in decision-making already.
In the AI 2027 world, it is taken as a foregone conclusion that a superintelligent AI can easily and quickly dispose of the rest of humanity, and so the only thing we can do is do our best to ensure that the leading AI is benevolent. In my world, the situation is actually much more complicated, and whether or not the leading AI is powerful enough to easily eliminate the rest of humanity (and other AIs) is a knob whose position is very much up for debate, and which we can take actions to tune.
If these arguments are correct, it has some implications for policy today that are sometimes similar, and sometimes different, from the "mainstream AI safety canon":
The above arguments are speculative, and no actions should be taken based on the assumption that they are near-certainties. But the AI 2027 story is also speculative, and we should avoid taking actions on the assumption that specific details of it are near-certainties.
I particularly worry about the common assumption that building up one AI hegemon, and making sure that they are "aligned" and "win the race", is the only path forward. It seems to me that there is a pretty high risk that such a strategy will decrease our safety, precisely by removing our ability to have countermeasures in the case where the hegemon becomes misaligned. This is especially true if, as is likely to happen, political pressures lead to such a hegemon becoming tightly integrated with military applications (see [1] [2] [3] [4]), which makes many alignment strategies less likely to be effective.
In the AI 2027 scenario, success hinges on the United States choosing to take the path of safety instead of the path of doom, by voluntarily slowing down its AI progress at a critical moment in order to make sure that Agent-5's internal thought process is human-interpretable. Even if this happens, success is not guaranteed, and it is not clear how humanity steps down from the brink where its ongoing survival depends on the continued alignment of one single superintelligent mind. Acknowledging that making the world less vulnerable is actually possible, and putting a lot more effort into using humanity's newest technologies to make it happen, is one path worth trying, regardless of how the next 5-10 years of AI go.