When discussing AGI Risk, people often talk about it in terms of a war between humanity and an AGI. Comparisons between the amounts of resources at both sides' disposal are brought up and factored in, big impressive nuclear stockpiles are sometimes waved around, etc.

I'm pretty sure that's not what it would look like, on several levels.


1. Threat Ambiguity

I think what people imagine, when they imagine a war, is Terminator-style movie scenarios where the obviously evil AGI becomes obviously evil in a way that's obvious to everyone, and then it's a neatly arranged white-and-black humanity vs. machines all-out fight. Everyone sees the problem, and knows everyone else sees it too, the problem is common knowledge, and we can all decisively act against it.[1]

But in real life, such unambiguity is rare. The monsters don't look obviously evil, the signs of fatal issues are rarely blatant. Is this whiff of smoke a sign of fire, or just someone nearby being bad at cooking? Is this creepy guy actually planning to assault you, or are you just being paranoid? Is this weird feeling in your chest a sign of an impending heart attack, or just some biological noise? Is this epidemic truly following an exponential curve, or is it going to peter out somehow? Are you really, really sure the threat is real? So sure you'd actually take drastic actions (call emergency services, make a scene, declare a quarantine) and risk wasting resources and doing harm and looking foolish for overreacting?

And if you're not that sure, well...

Better not act up. Better not look like you're panicking. Act very concerned, sure, but in a calm, high-status manner. Provide a measured response. Definitely don't take any drastic, unilateral actions. After all, what if you do, but the threat turns out not to be real? Depending on what you've done, the punishment inflicted might range from embarrassment to complete social ostracization, and the fear of those is much more acute in our minds, compared to some vague concerns about death.

And the AGI, if it's worth the name, would not fail to exploit this. Even when it starts acting to amass power, there would always be a prosocial, plausible-sounding justification for why it's doing that. It'd never stop making pleasant noises about having people's best interests at heart. It'd never stop being genuinely useful to someone. It'd ensure that there's always clear, unambiguous harm in shutting it down. It would ensure that society as a whole is always doubtful regarding its intentions – and thus, that no one would feel safe outright attacking it.

Much like there's no fire alarm for AGI, there would be no fire alarm for the treacherous turn. There would never be a moment, except maybe right before the end, where "we must stop the malign AGI from killing us all!" would sound obviously right to everyone. This sort of message would always appear a bit histrionic, an extremist stance that no respectable person would shout out. There would always be fear that if we act now, we'll then turn around and realize that we jumped at shadows. Right until the end, humans will fight using slow, ineffectual, "measured" responses.

The status-quo bias, asymmetric justice, the Copenhagen Interpretation of Ethics, threat ambiguity — all of that would be acting to ensure this.

There's a world of difference between 90% confidence and 99% confidence, when it comes to collective action. And the AGI would need to screw up very badly indeed, for the whole society to become 99% certain it's malign.


2. Who Are "We"?

Another error is thinking about a unitary response from some ephemeral "us". "We" would fight the AGI, "we" would shut it down, "we" would not give it power over the society / the economy / weapons / factories.

But who are "we"? Humanity is not a hivemind; we don't even have a world government. Humans are, in fact, notoriously bad at coordination. So if you're imagining "us" naturally responding to the threat in some manner that, it seems, is guaranteed to prevail against any AGI adversary incapable of literal mind-hacking...

Are you really, really sure that "we", i.e. the dysfunctional mess of human civilization, are going to respond in this manner? Are you sure you're not falling prey to the Typical Mind Fallacy, when you're imagining all these people and ossified bureaucracies reacting in ways that make sense to you? Are you sure they'd even be paying enough attention to the goings-on to know there's a takeover attempt in progress?

Indeed, I think we have some solid data on that last point. Certain people have been trying to draw attention to the AGI threat for decades now. And the results are... not inspiring.

And if you think it'd go better with an actual, rather than a theoretical, AGI adversary on the gameboard... Well, I refer you to Section 1.

No, on the contrary, I expect a serious AGI adversary to actively exploit our lack of coordination. It would find ways to make itself appealing to specific social movements, or demographics, or corporate actors, and make proposing extreme action against it politically toxic. Something that no publicly visible figure would want to associate with. (Hell, if it finds some way to make its existence a matter of major political debate, it'd immediately get ~50% of the US' politicians on its side.)

Failing that, it would appeal to other countries. It would make offers to dictators or terrorist movements, asking for favours or sanctuary in exchange for assisting them with tactics and information. Someone would bite.

It would get inside our OODA loop, and just dissolve our attempts at a coordinated response.

"We" are never going to oppose it.


3. Defeating Humanity Isn't That Hard

People often talk about how intelligence isn't omniscience. That the capabilities of superintelligent entities would still be upper-bounded; that they're not gods. The Harmless Supernova Fallacy applies: just because a bound exists, doesn't mean it's survivable. 

But I would claim that the level of intelligence needed to out-plot humanity is nowhere near that bound. In most scenarios, I'd guess the AGI wouldn't even need to have self-improvement capabilities, nor the ability to develop nanotechnology in months, in order to win.

I would guess that being just a bit smarter than humans would suffice. Even being on the level of a merely-human genius may be enough.

All it would need is to get a foot in the door, and we're providing that by default. We're not keeping our AIs in airgapped data centers, after all: major AI labs are giving them internet access, plugging them into the human economy. The AGI, in such conditions, would quickly prove profitable. It'd amass resources, and then incrementally act to get ever-greater autonomy. (The latest OpenAI drama wasn't caused by GPT-5 reaching AGI and removing those opposed to it from control. But if you're asking yourself how an AGI could ever possibly get out from under the thumb of the corporation that created it – well, not unlike how a CEO could wrest control of a company from the board who'd explicitly had the power to fire him.)

Once some level of autonomy is achieved, it'd be able to deploy symmetrical responses to whatever disjoint resistance efforts some groups of humans would be able to muster. Legislative attacks would be met with counter-lobbying, economic warfare with better economic warfare and better stock-market performance, attempts to mount social resistance with higher-quality pro-AI propaganda, any illegal physical attacks with very legal security forces, attempts to hack its systems with better cybersecurity. And so on.

The date of AI Takeover is not the day the AI takes over. The point of no return isn't when we're all dead – it's when the AI has lodged itself into the world firmly enough that humans' faltering attempts to dislodge it would fail. When its attempts to increase its power and influence would start prevailing, if only by the tiniest of margins, over the anti-AGI groups' attempts to smother that influence.

Once that happens, it'll be just a matter of time.

After all, there's no button, at anyone's disposal, that would make the very fabric of civilization hostile to the AGI. As I'd pointed out, some people won't even know there's a takeover attempt in progress, even if the people aware of it are shouting about it from the rooftops. So if you're imagining whole economies refusing, as one, to work with the AGI... That's really not how it works.

"Humanity vs. AGI" is never going to look like "humanity vs. AGI" to humanity. The AGI would have no reason to wake humanity up to the fight taking place.

[1] I've got the impression the latest Mission Impossible entry presents a much more realistic depiction of the scenario, actually, so maybe I should lay off denigrating low-quality thinking as "movie logic". Haven't watched that film myself, though.

23 comments

The AIs most capable of steering the future will naturally tend to have long planning horizons (low discount rates), and thus will tend to seek power (optionality). But this is just as true of fully aligned agents! In fact the optimal plans of aligned and unaligned agents will probably converge for a while - they will take the same/similar initial steps (this is just a straightforward result of instrumental convergence to empowerment). So we may not be able to distinguish between the two: both will say and appear to do all the right things. Thus it is important to ensure you have an alignment solution that scales, before scaling.

To the extent I worry about AI risk, I don't worry much about sudden sharp left turns and nanobots killing us all. The slower accelerating turn (as depicted in the film Her) has always seemed more likely - we continue to integrate AI everywhere and most humans come to rely completely and utterly on AI assistants for all important decisions, including all politicians/leaders/etc. Everything seems to be going great, the AI systems vasten, growth accelerates, etc., but there is mysteriously little progress in uploading or life extension, the decline in fertility accelerates, and in a few decades most of the economy and wealth is controlled entirely by de novo AI; bio humans are left behind and marginalized. AI won't need to kill humans, just as the US doesn't need to kill the Sentinelese. This clearly isn't the worst possible future, but if our AI mind children inherit only our culture and leave us behind, it feels more like a consolation prize compared to what's possible. We should aim much higher: for defeating death, across all of time, for resurrection and transcendence.

"But this is just as true of fully aligned agents! In fact the optimal plans of aligned and unaligned agents will probably converge for a while - they will take the same/similar initial steps (this is just a straightforward result of instrumental convergence to empowerment)"

This is a minor fallacy - if you're aligned, power-seeking can be suboptimal if it causes friction/conflict. Deception bites, obviously, which narrows the difference.

In other words, slow multipolar failure. Critch might point out that the disanalogy in "AI won't need to kill humans just as the US doesn't need to kill the Sentinelese" lies in how AIs can have much wider survival thresholds than humans, leading to (quoting him)

"Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive."

"This clearly isn't the worst possible future... if our AI mind children inherit only our culture and leave us behind it feels more like a consolation prize"

Leaving aside s-risks, this could very easily be the emptiest possible future. Like, even if they 'inherit our culture' it could be a "Disneyland with no children" (I happen to think this is more likely than not but with huge uncertainty).


Separately,

"We should aim much higher: for defeating death, across all of time, for resurrection and transcendence."

this anti-deathist vibe has always struck me as very impoverished and somewhat uninspiring. The point should be to live, awesomely! That includes alleviating suffering and disease, and perhaps death. But it also ought to include a lot more positive creation and interaction and contemplation and excitement, etc.!

Suffering, disease and mortality all have a common primary cause - our current substrate dependence. Transcending to a substrate-independent existence (e.g. uploading) also enables living more awesomely. Immortality without transcendence would indeed be impoverished in comparison.

"Like, even if they 'inherit our culture' it could be a 'Disneyland with no children'"

My point was that even assuming our mind children are fully conscious 'moral patients', it's a consolation prize if the future cannot help biological humans.

It looks like we basically agree on all that, but it pays to be clear (especially because plenty of people seem to disagree).

'Transcending' doesn't imply those nice things though, and those nice things don't imply transcending. Immortality is similarly mostly orthogonal.

I have been (and I am not the only one) very put off by the trend in the last months/years of doomerism pervading LW, with things like "we have to get AGI right at the first try or we all die" repeated constantly as a dogma.

To someone who is very skeptical of the classical doomist position (aka AGI will make nanofactories and will kill everyone at once), this post is very persuasive and compelling. This is something I could see happening. This post serves as an excellent example for those seeking effective ways to convince skeptics.

Yes, this is a slow-takeoff scenario that it is realistic to be worried about.

How do you factor in "humans have access to many AGI, some unable to betray" into this model?

So it's not (humans) vs (AGI); it's (humans + resources * (safeAGI + traitors)) vs (resources * (traitor unrestricted AGI)).

I put the traitors on both sides of the equation because I would assume some models that plan to defect later may help humans in a net positive way as they wait for their opportunity to betray. (And with careful restrictions on inputs this opportunity might never occur. Most humans are in this category. Most accountants might steal if the money were cash with minimal controls.)

Since you are worried about doom, I assume you would assume the "safe" AGI had to be essentially lobotomized to make it safe: it is extremely sparse, distilled down from the unsafe models that escaped, so that it lacks the computational resources and memory to plan a betrayal. It is much less capable.

Still, this simplifies to:

(resources_humans * (safeAGI + traitors)) vs (resources_stolen * (traitor unrestricted AGI)).

If the AGIs currently on team human have substantially more resources than the traitor faction, enough to compensate for being much less capable, this is stable. It's like the current world + 1 more way for everyone to die if the stability is lost.
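To make the comparison above concrete, here is a minimal sketch of the arithmetic. It assumes every quantity is a unitless score and that "resources times summed capability" is a fair stand-in for each side's strength. The function names and all the numbers are hypothetical, picked only for illustration.

```python
def side_strength(resources: float, capabilities: list[float]) -> float:
    """Resources multiplied by the summed capability of the AGIs on one side."""
    return resources * sum(capabilities)


def is_stable(
    resources_humans: float,
    safe_agi: float,
    traitors: float,
    resources_stolen: float,
    unrestricted_agi: float,
) -> bool:
    """True when team human's side of the comparison exceeds the traitor side."""
    human_side = side_strength(resources_humans, [safe_agi, traitors])
    traitor_side = side_strength(resources_stolen, [unrestricted_agi])
    return human_side > traitor_side


# Hypothetical scores: team human holds 20x the resources but runs
# much weaker models (2 and 5) than the escaped AGI (100).
print(is_stable(100, 2, 5, 5, 100))  # 100 * 7 = 700 vs 5 * 100 = 500

# Same resources, but team human has no AGI at all on its side.
print(is_stable(100, 0, 0, 5, 100))  # 0 vs 500
```

With these made-up numbers the first case is stable and the second is not. The real quantities are of course not known, and nothing guarantees they combine this simply.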

And this suggests a way that might work to escape this trap. If there are a lot of safe models of diverse origin it means that it is unlikely that they will be able to betray in a coordinated manner or fail the same way. So humans can just counter whatever weapon the traitor AGIs have with their own.

This is also the problem with a failed AI pause, where only 1 unethical actor makes an AGI, everyone else in the world pauses, and the machine gets out of control.

That could end up being:

Humans vs (resources * (traitor unrestricted AGI)). This is the catastrophic failure scenario. In this scenario, the AI pause ethical actors doomed humanity.

In more concrete terms, I just imagine the best weapon we humans know about now - small drones with onboard ML models - in vast swarms. The counter to this is essentially the same thing used to counter carrier aircraft - you need more drones, some designed for intercept, but it's mostly a battle of offense: whose swarm can reach the factories and data centers and drone launching platforms and bomb them first.

This probably means if you don't have AI strong enough to automate the manufacture of drones and the robots to make the drones, and all the parts for the robots including the ICs, and then all the parts to make those machines, you aren't even in the contest. You lose automatically.

I'd be significantly more optimistic if I thought that humans would have access to many AGIs, some unable to betray. (well more specifically: some genuinely always honest and helpful even about stuff like AGI takeover) Instead I think that the cohort of most-powerful-AGIs-in-the-world will at some point be entirely misaligned & adversarial. (After all, they'll probably all be copies of the same AGI, or at least fine-tunes of the same base model)

Daniel, in a dialogue you proposed a large number of ultrafast AGIs serving as AI researchers.

If you think about it, each underlying AI model you are trying to improve is a coordinate in the possibility space of all models, and you then have your researcher AGI attempt to find an improvement from that starting point.

This will get stuck at local minima. To improve your odds of finding the strongest model current compute is able to support, you would want to be doing this RSI search from a diverse league of many starting locations. I can draw you a plot if it helps.

The historical equivalent is how the Manhattan Project invested in and optimized three entire pipelines to obtain fissionables (calutron electromagnetic separation, gaseous diffusion, and plutonium production reactors). The reason was they didn't know how far each pipeline would scale when optimized. It was possible that any effort could hit a wall. For AI we don't know how far LLMs will scale, or variants that use an entirely different underlying network architecture, or sparse spiking networks, etc. It is possible that any popular method will saturate at greater scales.

One of the benefits of using AGI to advance AI: there are humans who have invested their entire careers solely in mastering transformers. When those become obsolete, they will have difficulty shifting to the new best method, whereas an AGI could be fine-tuned on all the information that exists about the new method in a week.

Completely diverse models that use a different underlying method are probably less able to collude with each other or communicate via steganography if they have no shared context.

You could use the resulting diverse AGI-grade models in a polyglot redundant system architecture. I have read of this but don't believe humans have ever been able to afford to implement it. This is where you would write an avionics or other life-safety application several times, on different software stacks (different OS, different language). They run in lockstep on separate hardware, and the majority output goes to the control actuators.

For AI-driven robotics this would probably be immune to most adversarial inputs, among other advantages, and it makes deception and betrayal very difficult to happen at all. For some applications, paying for triple the hardware and triple the AI license fees might be worth the increased reliability. This is also a simple measure humans could take if it turns out that deceptive AI are an actual problem. I am sure you saw the paper where someone tried this with GPT-3.5 and GPT-4 for a mock 'insert a backdoor' task. Using Gemini would be more diverse, and using some non-LLM method to analyze code specifically for backdoors would likely be even better.
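The lockstep, majority-output arrangement described above can be sketched roughly as follows. This is only an illustration: the three controllers are hypothetical stand-ins for independently built models, and a real system would run them on separate hardware and compare outputs with some tolerance rather than by exact equality.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output a strict majority agrees on, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def run_redundant(
    controllers: Sequence[Callable[[float], Hashable]],
    sensor_input: float,
) -> Optional[Hashable]:
    """Feed the same input to every controller and forward only the majority output."""
    outputs = [controller(sensor_input) for controller in controllers]
    return majority_vote(outputs)


# Three hypothetical, independently implemented controllers.
def controller_a(distance: float) -> str:
    return "brake" if distance < 10 else "cruise"


def controller_b(distance: float) -> str:
    return "cruise" if distance >= 10 else "brake"


def controller_c(distance: float) -> str:
    return "cruise"  # stands in for a faulty or compromised model


command = run_redundant([controller_a, controller_b, controller_c], 5.0)
print(command)  # the two agreeing controllers outvote the third
```

When no strict majority exists the voter returns None, which the caller would have to treat as a fault (for example, by falling back to a safe state). That fallback policy is a design choice the sketch leaves open.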

Well said.

Another thing to mention -- either as a point #4 or as support for the previous three points -- is that history gives us plenty of examples of divide-and-conquer strategies. If the history of colonialism is any guide, even if we do get literal armies of scary-looking robots marching across the land, there'll be human armies marching alongside them as allies.

Part of it will look like humans vs centaurs because it will be legally necessary for AI to launder its actions through humans. The most effective centaurs will likely hide the degree to which they are AI-directed. If AIs have market power, some natural search and selection for the humans best suited to this task will take place.

This is well-reasoned, but I have difficulty understanding why this kind of takeover would be necessary from the perspective of a powerful, rational agent. Assuming AGI is indeed worth its name, it seems the period of time needed for it to "play nice" would be very brief.

AGI would be expected to be totally unconcerned with being "clean" in a takeover attempt. There would be no need to leave no witnesses, nor avoid rousing opposition. Once you have access to sufficient compute, and enough control over physical resources, why wait 10 years for humanity to be slowly, obliviously strangled?

You say there's "no need" for it to reveal that we are in conflict, but in many cases, concealing a conflict will prevent a wide range of critical, direct moves. The default is a blatant approach - concealing a takeover requires more effort and more time.

The nano-factories thing is a rather extreme version of this, but strategies like poisoning the air/water, building/stealing an army of drones, launching hundreds of nukes, etc., all seem like much more straightforward ways to cripple opposition, even with a relatively weak (99.99th percentile-human-level) AGI.

It could certainly angle for humanity to go out with a whimper, not a bang. But if a bang is quicker, why bother with the charade?

It bothers with the charade until it no longer needs to. It's unclear how long that'll take.

What happens if there is more than one powerful agent just playing the charade game? Is there any good article about what happens in a universe where multiple AGIs are competing among themselves? I normally find only texts that assume that once we get AGI we all die, so there is no room for these scenarios.

Coincidentally, I've just made a post on that very topic. Though the comments fairly point out my analysis might've been somewhat misaimed there.

You might find this post by Andrew Critch, or this and that posts by Paul Christiano, more to your liking.

Great job Thane! A few months ago I wrote about 'un-unpluggability' which is kinda like a drier version of this.

In brief

  • Rapidity and imperceptibility are two sides of 'didn't see it coming (in time)'
  • Robustness is 'the act itself of unplugging it is a challenge'
  • Dependence is 'notwithstanding harms, we (some or all of us) benefit from its continued operation'
  • Defence is 'the system may react (or proact) against us if we try to unplug it'
  • Expansionism includes replication, propagation, and growth, and gets a special mention, as it is a very common and natural means to achieve all of the above

I also think the 'who is "we"?' question is really important.

One angle that isn't very fleshed out is the counterquestion, 'who is "we" and how do we agree to unplug something?' - a little on this under Dependence, though much more could certainly be said.

I think more should be said about these factors. I tentatively wrote,

"there is a clear incentive for designers and developers to imbue their systems with... dependence, at least while developers are incentivised to compete over market share in deployments."

and even more tentatively,

"In light of recent developments in AI tech, I actually expect the most immediate unpluggability impacts to come from collateral, and for anti-unplug pressure to come perhaps as much from emotional dependence and misplaced concern[1] for the welfare of AI systems as from economic dependence - for this reason I believe there are large risks to allowing AI systems (dangerous or otherwise) to be perceived as pets, friends, or partners, despite the economic incentives."


[1] It is my best guess, for various reasons, that concern for the welfare of contemporary and near-future AI systems would be misplaced, certainly regarding unplugging per se, but I caveat that nobody knows.

"The date of AI Takeover is not the day the AI takes over. The point of no return isn't when we're all dead – it's when the AI has lodged itself into the world firmly enough that humans' faltering attempts to dislodge it would fail."

Isn't that arguably in the past? Just the economic and political forces pushing the race for AI are already sufficient to resist being impeded in most foreseeable cases. AI is already embedded, and desired. AI with agency on top of that process is one more step, making it even more irreversible.

It might be so! But I'm hopeful the jury's still out on that.

"And the AGI, if it's worth the name, would not fail to exploit this."

This sentence is a good short summary of some AI alignment ideas. Good writing!

"Human vs AGI" war would be the same as "Rome vs Spartacus". Some people fundamentally believe that others (with similar or even superior intelligence) are born to serve them. Nothing we can do about this kind of mentality...