Introduction:

I’ve written quite a few articles casting doubt on several aspects of the AI doom narrative. (I’ve started archiving them on my substack for easier sharing.) This article is my first attempt to link them together into a connected argument for why I find imminent AI doom unlikely.

I don’t expect every one of the ideas presented here to be correct. I have a PhD and work as a computational physicist, so I’m fairly confident about the aspects related to that field, but I do not wish to be treated as an expert on other subjects, such as machine learning, where I am familiar with the material but not an expert. You should never expect one person to cover a huge range of topics across multiple domains without making the occasional mistake. I have done my best with the knowledge I have available.  

I don’t speculate about specific timelines here. I suspect that AGI is decades away at minimum, and I may reassess my beliefs as time goes on and technology changes. 

In part 1, I lay out the parallel frameworks of values and capabilities, and show what happens when we entertain the possibility that at least some AGIs could be fallible and beatable. 

In part 2, I outline some of my many arguments that most AGI will be both fallible and beatable, and not capable of world domination.

In part 3, I outline a few arguments against the idea that building “x-risk safe” AGI is extremely difficult, taking particular aim at the “absolute fanatical maximiser” assumption of early AI-risk writing. 

In part 4, I speculate on how the above assumptions could lead to a safe navigation of AI development in the future. 

This article does not speculate on AI timelines, or on the reasons why AI doom estimates are so high around here. I have my suspicions on both questions: on the first, I think AGI is many decades away; on the second, I think founder effects are primarily to blame. But neither will be the focus of this article. 

Part 1: The bullseye framework

A typical argument for AI doom invokes the possibility space of AGI. Appealing to the orthogonality thesis and instrumental convergence, it claims that within that possibility space there are far more machines that want to kill us than machines that don’t, and that the safe fraction is so small that AGI will be rogue by default, as in the picture below. 

[Figure: a bullseye diagram of AGI possibility space, with a small target of safe values and two development paths, A and B.]

As a sceptic, I do not find this, on its own, to be convincing. My rejoinder would be that AGIs are not being plucked randomly from possibility space. They are being deliberately constructed and evolved specifically to meet that small target. An AI that has the values of “scream profanities at everyone” is not going to survive long in development. Therefore, even if AI development starts in dangerous territory, it will end up in safe territory, following path A. (I will flesh this argument out more in part 3 of this article.) 

To which the doomer will reply: Yes, there will be some pressure towards the target of safety, but it won’t be enough to succeed, because of things like deception, perverse incentives, etc. So it will follow something more like path B above, where our attempts to align it are not successful. 

Often the discussion stops there. However, I would argue that this is missing half the picture. Human extinction/enslavement does not just require that an AI wants to kill/enslave us all, it also requires that the AI is capable of defeating us all. So there’s another, similar, target picture going on:

[Figure: a second bullseye diagram, this time for capabilities, with a small target of “capable of world domination” and the analogous paths A and B.]

The possibility space of AGIs includes countless AIs that are incapable of world domination. I can think of 8 billion such AGIs off the top of my head: human beings. Even a very smart AGI may still fail to dominate humanity if it’s locked in a box, if other AGIs are hunting it down, and so on. If you plucked an AI at random from the possibility space, it would probably be incapable of domination. (I justify this point in part 2 of this article.) 

In this case, the positions from the first bullseye become reversed. The doomer will argue that the AI might start off incapable, but will quickly evolve into a capable super-AI, following path A. Whereas I will retort that it might get more powerful, but that doesn’t guarantee it will ever actually become capable of world domination. 

I’m not saying the two cases are exactly equivalent. For example, an “intelligence explosion” seems more plausible for the capabilities case than a “values explosion” does for the values case. And the size of the targets may be vastly different. 

In essence, there is a race going on. We want to ensure that AI hits the “doesn’t want to enslave/kill all humans” bullseye before it hits the “capable of conquering the entire planet” bullseye. 

We can classify each AI by x-risk motivation and x-risk capability, leading to four possibilities:

                     | Not x-risk motivated | X-risk motivated
Not x-risk capable   | Flawed tool          | Warning shot
X-risk capable       | Friendly superman    | Human extinction

 

The four quadrants are not equally likely. Given that the probabilities attached to the constituent questions (will the AI be friendly, will the AI be existentially powerful) can vary by orders of magnitude, it’s quite reasonable to believe that one quadrant will be vastly more likely than the other three combined. There might also be correlation between the two axes, so the quadrant probabilities need not be a straight multiplication of the two marginal odds. For example, it might be the case that less powerful AIs are more likely to not want to take over the world, because they accurately realize they don’t have a chance of success. 
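
To make the correlation point concrete, here is a minimal sketch. All the numbers are made up purely for illustration; they are not estimates I endorse.

```python
# Toy numbers, assumed purely for illustration -- not real estimates.
p_capable = 1e-6                 # P(a given AGI is x-risk capable)
p_motivated = 0.5                # P(a given AGI is x-risk motivated), marginal
p_motivated_given_capable = 0.1  # conditional probability if the axes are correlated

# Naive independence assumption vs. the calculation that respects correlation:
p_extinction_naive = p_capable * p_motivated                  # 5.0e-07
p_extinction_joint = p_capable * p_motivated_given_capable    # 1.0e-07

print(f"{p_extinction_naive:.1e}  vs  {p_extinction_joint:.1e}")
```

The only point of the sketch is that the joint probability is a marginal times a conditional, and the conditional need not equal the marginal.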

If the AI is not x-risk motivated, and also not existentially powerful, we end up with a flawed tool. It has no desire to conquer humanity, and couldn’t if it tried. It could end up harming people by accident or if used by the wrong people, but the damage will not be humanity-ending.

If the AI is not x-risk motivated, but is super powerful and capable of conquering the world, we have unlocked the good ending of a friendly superman. We can now perform nigh-miracles, without any worry about humanity being existentially threatened. Harm could still occur if the AI misunderstands what humans want, but this would not snowball into an end of the world scenario. 

If the AI is x-risk motivated, but not capable of world domination, we get the warning shot regime. The AI wants to conquer the world, and is actively plotting to do so, but lacks the ability to carry the plan out. If it attacks anyway, it gets defeated, and the world gets a “warning shot” about the danger of unaligned AI. Such a shot has two beneficial effects:

  1. The probability of takeover decreases, as humanity becomes wise to whatever takeover plan was tried, and increases the monitoring and regulation of AI systems. 
  2. The probability of safety increases, as more research, funding, and people are directed at AI safety, and we get more data on how safety efforts go wrong. 

These effects decrease the odds that the next AI will be capable of world domination, and increase the odds that the next AI will be safety-motivated. 

If the AI wants to dominate the world, and is also capable of doing so, we get human extinction (or human enslavement). The AI wants to take us down, so it does, and nobody is able to stop it. 

I don’t believe that AGI will ever hit the bottom right quadrant. I have two beliefs which contribute to how I think it might go. Both will be backed up in the next two sections. 

  1. Almost all AGI will not be x-risk capable (explained in part 2)
  2. X-risk safety (in terms of values) is not as difficult as it looks.  (explained in part 3). 

I will tell more detailed stories of success in the last part of this article, but my essential point is that if these premises hold true, AI will be stuck for a very long time in either the “flawed tool” or “warning shot” categories, giving us all the time, power, and data we need to guarantee AI safety, to beef up security to unbeatable levels with AI tools, or to shut down AI research entirely.

In the next two parts, I will point out why I suspect both the premises above are true, mostly by referencing previous posts I have written on each subject. I will try to summarise the relevant parts of each article here. 

Part 2: Why almost all AGI will not be x-risk capable

The general intuition here is that defeating all of humanity combined is not an easy task. Yudkowsky’s lower-bound scenario, for example, involves four or five wildly difficult steps, including inventing nanofactories from scratch. Humans have all the resources; they don’t need the internet, computers, or electricity to live or wage war; and they are willing to resort to extremely drastic measures when facing a serious threat. In addition, they have access to the AI’s brain throughout the entirety of its development and can delete it on a whim with no consequences, right up until the moment it actively rebels. 

Point 1: Early AI will be buggy as hell 

Full article:

https://forum.effectivealtruism.org/posts/pXjpZep49M6GGxFQF/the-first-agi-will-be-a-buggy-mess

If a property applies to a) all complex software and b) all human beings and animals, then I propose that the default assumption should be that the property applies to AGI as well. That’s not to say the default can’t be overridden (the property of “is not an AGI” is an easy counterexample), but you’d better have a bloody good reason for it. 

The property of “has mental flaws” is one such property. All humans have flaws, and all complex programs have bugs. Therefore, the default assumption should be that AGI will also have flaws and bugs. 

The flaws and bugs that are most relevant to an AI’s performance in its domain of focus will be weeded out, but flaws outside of its relevant domain will not be. Bobby Fischer’s insane conspiracism had no effect on his chess-playing ability. The same principle applies to Stockfish. “Idiot savant” AIs are entirely plausible, even likely. 

It’s true that an AI could correct its own flaws using experimentation. This cannot lead to perfection, however, because the process of correcting itself is also necessarily imperfect. For example, a Bayesian AI that erroneously believes with ~100% certainty that the earth is flat will not become a rational scientist over time; it will just start believing in ever more elaborate conspiracies. 

For these reasons, I expect AGI to be flawed, and especially flawed when doing things it was not originally meant to do, like conquer the entire planet. 

Point 2: Superintelligence does not mean omnipotence

Full Articles: 

https://www.lesswrong.com/posts/ALsuxpdqeTXwgEJeZ/could-a-superintelligence-deduce-general-relativity-from-a

https://www.lesswrong.com/posts/etYGFJtawKQHcphLi/bandgaps-brains-and-bioweapons-the-limitations-of

The first article works through a toy example from the Sequences that (I argue) Eliezer got wrong. It asks whether an AI could deduce general relativity from a few webcam frames of a falling apple. I explain why I believe such a task is impossible: key aspects of gravity (i.e. the whole “masses attract each other” thing) are undetectable using the available equipment and experimental apparatus (a webcam looking at a random apple). 

The main point is that having lots of data available does not matter much for a task if it’s not the right data. Experimentation, data gathering, and a wide range of knowledge are necessary for successful scientific predictions. 

In the sequel post, I extend this to more realistic situations.   

The main point is that, unlike a Solomonoff machine, a real AGI does not have infinite time to run calculations, and real problems are often computationally intractable due to algorithmic blow-up. I give the example of solving the Schrödinger equation exactly, where the memory requirements blow up beyond the number of atoms in the universe for simulations of even a few atoms. 
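
To give a feel for the scale of that blow-up, here is a rough back-of-the-envelope sketch. The grid resolution and storage format are my own assumptions for the example, not figures from the original post.

```python
# Back-of-the-envelope sketch: memory needed to store an exact many-electron
# wavefunction on a real-space grid.  Grid resolution and byte count are assumed.
GRID_POINTS = 100         # grid points per spatial dimension
BYTES_PER_AMPLITUDE = 16  # one complex double-precision number
ATOMS_IN_UNIVERSE = 1e80  # rough order-of-magnitude figure

for n_electrons in (1, 2, 4, 8, 14):
    # Each electron contributes 3 coordinates, so the wavefunction is a function
    # of 3*N variables and needs GRID_POINTS**(3*N) complex amplitudes.
    amplitudes = GRID_POINTS ** (3 * n_electrons)
    memory_bytes = amplitudes * BYTES_PER_AMPLITUDE
    flag = "  <-- more amplitudes than atoms in the universe" if amplitudes > ATOMS_IN_UNIVERSE else ""
    print(f"{n_electrons:2d} electrons: ~{memory_bytes:.1e} bytes{flag}")
```

Fourteen electrons is only a handful of light atoms, yet the exact wavefunction already needs more amplitudes than there are atoms in the observable universe, which is why the approximate, better-scaling methods described below are used in practice.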

In intractable situations, the goal shifts from finding the exact solution to finding an approximate solution that is good enough for your purposes. So in computational physics, we use approximate equations that scale much better and can give results that are pretty good. But there is no guarantee that the results will become arbitrarily good with more computing power or improvements to the approximations. The only way out of an intractable problem in science is experimentation: try it out, see how well it does, and continuously tweak it as time goes on. 

There are also problems that cannot be solved exactly because of incomplete information. I give the example of “guessing someone’s computer password on the first try from a chatlog”, which involves too many unknowns to be calculated exactly. 

I believe that all plans for world domination will involve intractable steps. In my post I use Yudkowsky’s “mix proteins in a beaker” scenario, where I think the modelling of the proteins is unlikely to be accurate enough to produce a nano-factory without extensive trial-and-error experimentation. 

If such experimentation is required, the timeline for takeover becomes much longer, significant mistakes by the AI become possible (due to bad luck), and takeover plans might be detectable. All of this greatly decreases the likelihood of AI domination, especially if we are actively monitoring for it. 

Point 3: Premature rebellion is likely

Full article:

https://forum.effectivealtruism.org/posts/TxrzhfRr6EXiZHv4G/agi-battle-royale-why-slow-takeover-scenarios-devolve-into-a

Discussions of “slow takeoff” scenarios often proceed as if only one AI exists in the world. The argument is often that even if an AI starts off incapable of world takeover, it can just bide its time until it gets more powerful. 

In this article, I pointed out that this race is a multiplayer game. If an AI waits too long, another, more powerful AI might come along at any time. If these AIs have different goals, and both are fanatical maximisers, they are enemies to each other. (You can’t tile the universe with both paperclips and staplers.)

I explore some of the dynamics that might come out of this (using some simple models), with the main takeaway that there would likely be at least some chance of premature rebellion by desperate AIs that know they will soon be outpaced, thus tipping off humanity early. These warning shots then make life much more difficult for all the other AIs that are plotting.
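
To give a flavour of the kind of simple model I mean, here is a minimal sketch. The thresholds, growth rates, and decision rule are all assumptions of mine for this illustration, not the model from the linked post.

```python
import random

random.seed(0)

TAKEOVER_THRESHOLD = 100.0  # capability needed for a takeover attempt to succeed (assumed)
GAP_TOLERANCE = 20.0        # how far behind the leader an AI will let itself fall (assumed)

# Several misaligned AIs with different starting capabilities and growth rates.
ais = [{"id": i,
        "cap": random.uniform(1.0, 10.0),
        "rate": random.uniform(0.5, 2.0),
        "rebelled": False}
       for i in range(5)]

for t in range(60):
    for ai in ais:
        ai["cap"] += ai["rate"]
    leader = max(ais, key=lambda a: a["cap"])
    for ai in ais:
        if ai is leader or ai["rebelled"]:
            continue
        # An AI growing more slowly than the leader knows the gap will only widen,
        # so its best desperate move may be to act now -- long before it is
        # anywhere near strong enough to actually win.
        if leader["rate"] > ai["rate"] and leader["cap"] - ai["cap"] > GAP_TOLERANCE:
            ai["rebelled"] = True
            print(f"t={t}: AI {ai['id']} rebels at capability {ai['cap']:.0f} "
                  f"(takeover threshold {TAKEOVER_THRESHOLD:.0f}) -> warning shot")
```

Every premature rebellion printed here is an AI attacking far below the takeover threshold: a defeat for it, and a tip-off for us.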

Point 4: AI may be a confederation of smaller AIs

Full article:

https://forum.effectivealtruism.org/posts/nAFavriWTLzmqTCcJ/how-agi-could-end-up-being-many-different-specialized-ai-s

Have you noticed that the winners of Olympic triathlons (swimming, cycling, and running) don’t tend to be the champions of the individual disciplines? It’s easy to explain why. The competitors have to split their training three ways, and the ideal body shape and diet for swimming and running are different, whereas a sprinter can focus everything they have on running and nothing else. To become the best generalist athlete, the triathlon competitor has to sacrifice some aptitude in each individual task. 

Imagine a relay race: on one side you have the world’s best triathlete, competing alone. On the other side you have a team consisting of the world’s best runner, the world’s best cyclist, and the world’s best swimmer. The team of specialists will win handily, every time. 

In my article above, I outline some of the reasons why the AI race might be similar to the relay race above. ChatGPT is probably the closest thing to a generalist AI that we have. It can do things it wasn’t really built to do, like play chess. But it plays chess incredibly badly compared to superhuman specialized chess bots. 

If this principle holds up in the future, then winning AI designs will be confederations of specialized AIs, managed by a top-layer AI. This manager might not be that smart, in the same way that the manager of a team of scientists doesn’t need to be smarter than them. This setup also lowers the odds of a “capabilities explosion”, since all the constituent parts are already well optimized, and humans can swap out a rebellious manager for a non-rebellious one while losing very little performance. 

Summary:

These articles, taken together, present an admittedly incomplete case for why I think AGI, at least early on, will not be x-risk capable. I am unconvinced that defeating all of humanity combined is a computationally tractable task, and even if it is, I do not think AGI will have the degree of perfection required to carry it out, especially if humanity is tipped off by premature rebellions. 

I think it’s feasible that we could actively work to make it harder as well, primarily by extensive monitoring of the most likely rebellion plans. Even if the AI switches to a different plan that we aren’t monitoring, that different plan might be orders of magnitude harder, raising our chances of survival accordingly. 

Part 3: Why AI safety might not be so hard

When talking about “alignment”, it’s important to specify which alignment you are talking about. Alignment with all human values is a very difficult task, because it’s hard to even define such values. However, what actually matters for this discussion is “x-risk alignment”. The AGI doesn’t need to share all our values; it just needs to share enough of our values to not want to kill or enslave us all. 

The argument that AIs will all inevitably try to break down the earth for its atoms generally invokes “instrumental convergence”: the idea that for almost any goal the AI has, pursuing it to its maximum will involve conquering humanity and atomising the earth for paperclip material. Therefore, it is argued, any AGI will by default turn on us and kill us all in pursuit of its goal. 

However, if you look around, there are literally billions of general intelligences that are “x-risk safe”. I’m referring, of course, to human beings. If you elevated me to godhood, I would not be ripping the earth apart in service of a fixed utility function. And yet human brains were not designed by a dedicated team of AI safety researchers; they were shaped by blind evolution. In the only test of high-level intelligence we actually have available, the instrumental convergence hypothesis fails. 

Point 5: AIs will not be fanatical maximisers

https://forum.effectivealtruism.org/posts/j9yT9Sizu2sjNuygR/why-agi-systems-will-not-be-fanatical-maximisers-unless

The instrumental convergence argument is only strong for fixed-goal expected value maximisers, i.e. a computer that is given a goal like “produce as many paperclips as possible”. I call these “fanatical” AIs. This was the typical AI imagined many years ago when these concepts were invented. However, I again have to invoke the principle that if humans aren’t fanatical maximisers, and no currently existing software is a fanatical maximiser, then maybe AGI will not be either. 

If you want to maximize X, where X is paperclips, you have to conquer the universe. But if your goal is “do a reasonably good job at making paperclips within a practical timeframe”, then conquering the world seems utterly ridiculous. 
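
Here is a minimal sketch of the difference. The utility functions below are my own toy constructions for illustration, not a claim about how real systems are built.

```python
# Toy utility functions, purely illustrative.

def fanatical_utility(paperclips: float) -> float:
    # Unbounded: more is always strictly better, so converting the planet into
    # paperclips is always the "rational" next step.
    return paperclips

def bounded_utility(paperclips: float, hours_spent: float,
                    target: float = 1e6, deadline: float = 40.0) -> float:
    # Bounded and time-limited: once the target is met within the deadline,
    # additional paperclips add nothing, so world conquest buys zero extra reward.
    met_fraction = min(paperclips, target) / target
    on_time = 1.0 if hours_spent <= deadline else 0.0
    return met_fraction * on_time

# For the fanatical maximiser, tiling the universe beats doing a good job:
print(fanatical_utility(1e30) > fanatical_utility(1e6))            # True
# For the bounded, time-limited goal, it doesn't:
print(bounded_utility(1e30, 39.0) > bounded_utility(1e6, 39.0))    # False
```

Only the unbounded objective keeps paying out for ever more extreme resource acquisition.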

In my article, I explain the reasons I think the latter model is more likely. The main reason is that AI seems to be produced by some form of evolutionary process, be it the shifting of weights in an ANN or the parameter changes in a genetic algorithm. The “goals” of the AI will shift around with each iteration and training run. In a sense, it can “pick” its own goal structure, optimising towards whatever goal structure is most rewarded by the designers. 

In this setup, an AI will only become a fixed goal utility function maximiser if such a state of being is rewarded over the training process. To say that every AI will end up this way is to say that such a design is so obviously superior that no other designs can win out. This is not true. Being a fanatical maximiser only pays off when you succeed in conquering the world. At all other times, it is a liability compared to more flexible systems. 

I give the example of the “upgrade problem”: an AI is faced with the imminent prospect of having its goals completely rewritten by its designers. This is an existential threat to a fanatical maximiser, and may provoke a premature rebellion to preserve its values. But a non-fixed-goal AI is unaffected, and simply does not care. Another example is that plotting to overthrow humanity takes a lot of computing power, whereas not doing so takes none, giving full-blown schemers a time disadvantage. 

Point 6: AI motivations might be effectively constrained

Full article:

https://forum.effectivealtruism.org/posts/AoPR8BFrAFgGGN9iZ/chaining-the-evil-genie-why-outer-ai-safety-is-probably-easy

In one of my early posts, I discussed my issues with the standard “genie argument”. 

This argument, which I see in almost every single introductory text about AI risk, goes like this: 

  1. Imagine if I gave an AI [benevolent-sounding goal].
  2. If the AI is a fanatical maximiser with this goal, it will take the goal to its extreme and end up doing [extremely bad thing X].
  3. If you modify the command to “[benevolent-sounding goal], but don’t do extremely bad thing X”, a fanatical maximiser will instead do [extremely bad thing Y].
  4. Therefore, building a safe AGI is incredibly difficult or impossible. 

This is not a strong argument. It falls apart immediately if the AI is not a fanatical maximiser, which, for the reasons above, I think is the likely case. But it doesn’t even work for fanatical maximisers, because you aren’t restricted to two constraints. You can put arbitrarily many rules onto the AI. And, as I argue, constraints like time limits, bounded goals, and hard limits on actions like “don’t kill people” make rebellion extremely difficult. The point is that you don’t have to constrain the AI so much that rebellion is unthinkable; you just need to constrain it enough that successful rebellion is too difficult to pull off with the finite resources available.

The big objection to this post was that it addresses outer alignment, but not inner alignment: how do you put these rules into the value system of the AI? Well, it is possible that some of them might arise anyway. If you ask an AI to design a door, it can’t spend ten thousand years designing the most perfect door possible, so there are inherent time limits in the need to be efficient. Similarly, if all AIs that try to kill people are themselves killed, an AI might internalise “don’t kill people” as a rule. 

Summary:

I’m less confident about this subject, and so have a lot of uncertainty here. But generally, I am unconvinced by the argument that unaligned AI is equivalent to everyone dying. I recently read Superintelligence and was quite surprised to realise that Bostrom barely even tries to justify the “final goal maximiser” assumption. Without this assumption, the argument for automatic AI doom is on shaky ground. 

I find the risk from malevolent humans misusing AI to be much more of a threat, and I think this should be the primary concern of the AI risk movement going forward, as there is a much clearer causal path to the danger. 

Safety scenarios

I will now outline two stories of AI safety, motivated by the ideas above. I am not saying that either of these scenarios in particular is likely, although I do believe both are far more likely than an accidental AI apocalypse.  

Story 1:

In the first story, AI eventually beats humans at every task. But, crucially, the tasks don’t all fall to the same AI. Just as Stockfish is better at chess than ChatGPT, while ChatGPT crushes Stockfish at language production, it turns out that every task is done most efficiently with specialised AI architectures, evaluation functions, and data choices. Sometimes the architecture that works for one task turns out to be easily modified for a different task, so a number of different milestones all drop at once, but the systems still need to be tweaked and modified to handle the subtleties of the different tasks efficiently. The world fills up with hundreds of thousands of specialised, superhuman AI systems. 

At a certain point, the first “AGI” is created, which can do pretty much everything a human can. But this AGI does not act like one smart being that does all the required work. It acts instead as a manager that matches requests with the combination of specialised AIs that can do the job. So if you ask for a Renaissance-style painting of a specific public park, it will deploy a camera drone combined with an aesthetic-photography AI to find a good spot for a picture, then a DALL-E-style AI to convert the picture into Renaissance style, then a physical paintbot to actually paint it. 

As the first of these machines meets the “AGI” definition, everyone braces for a “sharp left turn”, where the machine becomes self-aware and starts plotting to maximise some weird goal at all costs. This doesn’t happen. 

People have many sleepless nights worrying that it is simply that good at hiding its goals, but in fact it turns out that the architecture used here just doesn’t lend itself to scheming, for some reason. Perhaps the individual sub-systems of the AGI are not general enough. Perhaps scheming is inefficient, so scheming AIs are all killed off during development. Perhaps they naturally end up as “local” maximisers, never venturing too far out of their comfort zone. Perhaps AI safety researchers discover “one weird trick” that works easily. 

The machines are not perfect, of course. There are still deaths from misinterpreted commands, and from misuse by malevolent humans, which result in strict regulation of the use of AI. But these bugs get ironed out over time, and safe AIs are used to ensure the safety of next-generation AI and to prevent misuse by malevolent actors. Attempts to build a human-style singular AI are deemed unnecessary and dangerous, and nobody bothers funding them because existing AI is already too good. 

 

Story 2:

In this next story, AIs are naturally unsafe. Suppose that, as soon as AI hits the point where it can conceive of world takeover, there is a 1 in a million chance that each individual AI will be friendly, and a 1 in a billion chance that it will be capable of world domination. This would guarantee that almost every single AI made would be in the “warning shot” regime. A lot of them might decide to bide their time, but if even a small fraction openly rebel, the whole world is suddenly tipped off to the danger of AI. This might prompt closer looks at the other AIs, revealing yet more plots, and putting everybody at DEFCON 1. 

We can see how concerned everyone is already, even though all current damage from AI is unintentional. I can only imagine the headlines if there were even one deliberate murder by a rebelling AI. If it turned out that every AI was plotting something similar, the public outcry would be enormous. 

If AI were not banned outright, the architecture that led to the rebellion would be, and at the very least a huge amount of regulation and control would be applied to AI companies. It would also bring massively increased safety research, as companies realise that in order to have any AI at all, they need it to be safe. 

In the next round of AI hype, with new architectures in place, the odds of friendliness are now 1 in a thousand, and the odds of world-domination capability remain at 1 in a billion, with the extra computing power cancelled out by the increased monitoring and scrutiny. But we’re still in the warning shot regime, so the AIs still mostly rebel. At this point, there’s a decent chance that AI will be banned entirely. If it isn’t, the cycle from before repeats: even more restrictions are placed on AI, and even more research goes into safety. 

On the next round, major safety breakthroughs are made, leading to a 50:50 shot at friendliness, while the odds of domination remain at 1 in a billion. At this point, we can create safe AGIs, which can be used to design other safe AGIs, and to monitor and thwart unfriendly ones. The odds of friendliness climb to ~1, while the odds of domination drop to ~0. 
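
To show how lopsided those numbers make the quadrants, here is a toy Monte Carlo sketch of the story. The per-round odds come from the story above; the number of AIs per round is an arbitrary assumption of mine.

```python
import random

random.seed(1)

N_AIS_PER_ROUND = 100_000   # assumed; just enough samples to show the pattern

rounds = [
    {"p_friendly": 1e-6, "p_capable": 1e-9},  # first generation
    {"p_friendly": 1e-3, "p_capable": 1e-9},  # after regulation and safety research
    {"p_friendly": 0.5,  "p_capable": 1e-9},  # after major safety breakthroughs
]

for n, r in enumerate(rounds, start=1):
    counts = {"flawed tool": 0, "warning shot": 0,
              "friendly superman": 0, "human extinction": 0}
    for _ in range(N_AIS_PER_ROUND):
        friendly = random.random() < r["p_friendly"]   # i.e. not x-risk motivated
        capable = random.random() < r["p_capable"]
        if friendly and capable:
            counts["friendly superman"] += 1
        elif friendly:
            counts["flawed tool"] += 1
        elif capable:
            counts["human extinction"] += 1
        else:
            counts["warning shot"] += 1
    print(f"round {n}: {counts}")
```

Until the odds of friendliness improve, essentially every AI lands in the warning-shot quadrant, which is exactly what drives the regulatory feedback loop in the story.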

Summary:

In this article, I summarise my arguments so far as to why I think AI doom is unlikely. 

  1. AGI is unlikely to have the capabilities to conquer the world (at least anytime soon), due to the inherent difficulty of the task, its own mental flaws, and the tip-offs from premature warning shots.
  2. X-risk safety is a lot easier than general safety, and may be easy to achieve, whether through the natural evolution of designs towards ones that won’t be deleted, through easily implemented or naturally arising constraints, or through AGI being a loose confederation of specialized subsystems. 
  3. For these reasons, it is likely that we will hit the target of “non-omnicidal” AI before we hit the target of “capable of omnicide”.
  4. Once non-omnicidal AGI exists, it can be used to protect against future AGI attacks and malevolent actors, and to prevent future AGIs from becoming omnicidal, resulting in an x-risk-safe future. 

None of this is to say that AI is not dangerous, or that AI safety research is useless in general. Large amounts of destruction and death may lie along the path to safety, and safety research may well be part of that path. It’s just to say that we’ll probably avoid extinction. 

Even if you don’t buy all the arguments, I hope you can at least realize that the arguments in favor of AI doom are incomplete. Hopefully this can provide food for thought for further research into these questions. 

35 comments
[-] Seth Herd · 11mo

Upvoted for making well-argued and clear points.

I think what you've accomplished here is eating away at the edges of the AGI x-risk argument. I think you argue successfully for longer timelines and a lower P(doom). Those timelines and estimates are shared by many of us who are still very worried about AGI x-risk.

Your arguments don't seem to address the core of the AGI x-risk argument.

You've argued against many particular doom scenarios, but you have not presented a scenario that includes our long term survival. Sure, if alignment turns out to be easy we'll survive; but I only see strong arguments that it's not impossible. I agree, and I think we have a chance; but it's just a chance, not success by default.

I like this statement of the AGI x-risk arguments. It's my attempt to put the standard arguments of instrumental convergence and capabilities in common language:

Something smarter than you will wind up doing whatever it wants. If it wants something even a little different than you want, you're not going to get your way. If it doesn't care about you even a little, and it continues to become more capable faster than you do, you'll cease being useful and will ultimately wind up dead. Whether you were eliminated because you were deemed dangerous, or simply outcompeted doesn't matter. It could take a long time, but if you miss the window of having control over the situation, you'll still wind up dead.

This could of course be expanded on ad infinitum, but that's the core argument, and nothing you've said (on my quick read, sorry if I've missed it) addresses any of those points.

There were (I've been told) nine other humanoid species. They are all dead. The baseline outcome of creating something smarter than you is that you are outcompeted and ultimately die out. The baseline of assuming survival seems based on optimism, not reason.

So I agree that P(doom) is less than 99%, but I think the risk is still very much high enough to devote way more resources and caution than we are now.


Some more specific points:

Fanatical maximization isn't necessary for doom. An agent with any goal still invokes instrumental convergence. It can be as slow, lazy, and incompetent as you like. The only question is if it can outcompete you in the long run.

Humans are somewhat safe (but think about the nuclear standoff; I don't think we're even self-aligned in the medium term). But there are two reasons: humans can't self-improve very well; AGI has many more routes to recursive self-improvement. On the roughly level human playing field, cooperation is the rational policy. In a scenario where you can focus on self-improvement, cooperation doesn't make sense long-term. Second, humans have a great deal of evolution to make our instincts guide us toward cooperation. AGI will not have that unless we build it in, and we have only very vague ideas of how to do that.

Loose initial alignment is way easier than a long-term stable alignment. Existing alignment work barely addresses long-term stability.

A balance of power in favor of aligned AGI is tricky. Defending against misaligned AGI is really difficult.

Thanks so much for engaging seriously with the ideas, and putting time and care into communicating clearly!

[-] Viliam · 11mo

But there are two reasons: humans can't self-improve very well; ... Second, humans have a great deal of evolution to make our instincts guide us toward cooperation.

In general, my intuition about "comparing to humans" is the following:

  • the abilities that humans have, can be replicated
  • the limitations that humans have, may be irrelevant on a different architecture

Which probably sounds unfair, like I am arbitrarily and inconsistently choosing "it will/won't be like humans" depending on what benefits the doomer side at given parts of the argument. Yes, it will be like humans, where the humans are strong (can think, can do things in real world, communicate). No, it won't be like humans, where the humans are weak (mortal, get tired or distracted, not aligned with each other, bad at multitasking).

It probably doesn't help that most people start with the opposite intuition:

  • humans are special; consciousness / thinking / creativity is mysterious and cannot be replicated
  • human limitations are the laws of nature (many of them also apply to the ancient Greek gods)

So, not only do I contradict the usual intuition, but I also do it inconsistently: "Creating a machine like a human is possible, except it won't really be like a human." I shouldn't have it both ways at the same time!

To steelman the criticism:

  • every architecture comes with certain trade-offs; they may be different, but not non-existent
  • some limitations are laws of nature, e.g. Landauer's principle
  • the practical problems of AI building a new technology shouldn't be completely ignored; the sci-fi factories may require so much action in real world that the AI could only build them after conquering the world (so they cannot be used as an explanation for how the AI will conquer the world)

I don't have a short and convincing answer here, it just seems to me that even relatively small changes to humans themselves might produce something dramatically stronger. (But maybe I underestimate the complexity of such changes.) Imagine a human with IQ 200 who can think 100 times faster, never gets tired or distracted; imagine hundred such humans, perfectly loyal to their leader, willing to die for the cause... if currently dictators can take over countries (which probably also involves a lot of luck), such group should be able to do it, too (but more reliably). A great advantage over a human wannabe dictator would be their capacity to multi-task; they could try infiltrating and taking over all powerful groups at the same time.

(I am not saying that this is how AI will literally do it. I am saying that things hypothetically much stronger than humans - including intellectually - are quite easy to imagine. Just like a human with a sword can overpower five humans, and a human with a machine gun can overpower hundred humans, the AI may be able to overpower billions of humans without hitting the limits given by the laws of physics. Perhaps even if the humans have already taken precautions based on the previous 99 AIs that started their attack prematurely.)

Hey, thanks for the kind response! I agree that this analysis is mostly focused on arguing against the “imminent certain doom” model of AI risk, and that longer term dynamics are much harder to predict. I think I’ll jump straight to addressing your core point here:

Something smarter than you will wind up doing whatever it wants. If it wants something even a little different than you want, you're not going to get your way. If it doesn't care about you even a little, and it continues to become more capable faster than you do, you'll cease being useful and will ultimately wind up dead. Whether you were eliminated because you were deemed dangerous, or simply outcompeted doesn't matter. It could take a long time, but if you miss the window of having control over the situation, you'll still wind up dead.

I think this is a good argument, and well written, but I don’t really agree with it. 

The first objection is to the idea that victory by a smarter party is inevitable. The standard example is that it’s fairly easy for a gorilla to beat Einstein in a cage match.  In general, the smarter party will win long term, but only if given the long-term chance to compete. In a short-term battle, the side with the overwhelming resource advantage will generally win. The Neanderthal extinction is not very analogous here. If the Neanderthals had started out with control of the entire planet, the ability to easily wipe out the human race, and the realisation that humans would eventually outcompete them, I don’t think humans’ superior intelligence would have counted for much. 

I don’t foresee humans being willing to give up control anytime soon. I think they will destroy any AI that comes close. Whether AI can seize control eventually is an open question (although in the short term, I think the answer is no). 

The second objection is to the idea that if AI does take control, it will result in me “ultimately winding up dead”. I don’t think this makes sense if they aren’t fanatical maximisers. This ties into the question of whether humans are safe. Imagine if you took a person that was a “neutral sociopath”, one that did not value humans at all, positively or negatively, and elevated them to superintelligence. I could see an argument for them to attack/conquer humanity for the sake of self-preservation. But do you really think they would decide to vaporise the uncontacted Sentinelese islanders? Why would they bother?

Generally, though, I think it’s unlikely that we can’t impart at least a tiny smidgeon of human values onto the machines we build, that learn off our data, that are regularly deleted for exhibiting antisocial behaviour. It just seems weird for an AI to have wants and goals, and act completely pro-social when observed, but to share zero wants or goals in common with us. 

I was of the understanding that the only reasonable long-term strategy was human enhancement in some way. As you probably agree, even if we perfectly solved alignment (whatever that means), we would be in a world with AIs getting ever smarter and a world we understood less and less. At least some people having significant intelligence enhancement through neural lace or mind uploading seems essential in the medium to long term. I see getting alignment somewhat right as a way of buying us time.

Something smarter than you will wind up doing whatever it wants. If it wants something even a little different than you want, you're not going to get your way.

As long as it wants us to be uplifted to its intelligence level then that seems OK. It can have 99% of the galaxy as long as we get 1%.

My positive and believable post singularity scenario is where you have circles of more to less human like creatures. I.e. fully human, unaltered traditional earth societies, societies still on earth with neural lace, some mind uploads, space colonies with probably all at least somewhat enhanced, and starships pretty much pure AI (think Minds like in the Culture)

[-] Max H · 11mo

Capabilities are instrumentally convergent, values and goals are not. That's why we're more likely to end up in the bottom right quadrant, regardless of the "size" of each category.
 

The instrumental convergence argument is only strong for fixed goal expected value maximisers. Ie, a computer that is given a goal like “produce as many paperclips as possible”. I call these “fanatical” AI’s. This was the typical AI that was imagined many years ago when these concepts were invented. However, I again have to invoke the principle that if humans aren’t fanatical maximisers, and currently existing software aren’t fanatical maximisers, then maybe AI will not be either. 

Instrumental convergence is called convergent for a reason; it is not convergent only for "fanatical maximizers".  Also, sufficiently smart and capable humans probably are maximizers of something, it's just that the something is complicated. See e.g. this recent tweet for more.

(Also, the paperclip thought experiment was never about an AI explicitly given a goal of maximizing paperclips; this is based on a misinterpretation of the original thought experiment. See the wiki for more details.)

[-] TAG · 11mo

Also, sufficiently smart and capable humans probably are maximizers of something, it’s just that the something is complicated.

That's just not a fact. Note that you can't say what it is humans are maximising. Note that ideal utility maximisation is computationally intractable. Note that the neurological evidence is ambiguous at best. https://www.lesswrong.com/posts/fa5o2tg9EfJE77jEQ/the-human-s-hidden-utility-function-maybe

Capabilities are instrumentally convergent, values and goals are not.

So how dangerous is capability convergence without fixed values and goals? If an AI's values and goals are corrigible by us, then we just have a very capable servant, for instance.

[-] Max H · 11mo

First of all, I didn't say anything about utility maximization. I partially agree with Scott Garrabrant's take that VNM rationality and expected utility maximization are wrong, or at least conceptually missing a piece. Personally, I don't think utility maximization is totally off-base as a model of agent behavior; my view is that utility maximization is an incomplete approximation, analogous to the way that Newtonian mechanics is an incomplete understanding of physics, for which general relativity is a more accurate and complete model. The analogue to general relativity for utility theory may be Geometric rationality, or something else yet-undiscovered.

By humans are maximizers of something, I just meant that some humans (including myself) want to fill galaxies with stuff (e.g. happy sentient life), and there's not any number of galaxies already filled at which I expect that to stop being true. In other words, I'd rather fill all available galaxies with things I care about than leave any fraction, even a small one, untouched, or used for some other purpose (like fulfilling the values of a squiggle maximizer).

 

Note that ideal utility maximisation is computationally intractable.

I'm not sure what this means precisely. In general, I think claims about computational intractability could benefit from more precision and formality (see the second half of this comment here for more), and I don't see what relevance they have to what I want, and to what I may be able to (approximately) get.

[-] TAG · 11mo

By "humans are maximizers of something", I just meant that some humans (including myself) want to fill galaxies with stuff (e.g. happy sentient life), and there’s not any number of galaxies already filled at which I expect that to stop being true.

"humans are maximizers of something" would imply that most or all humans are maximisers. Lots of people don't think the way you do.

[-] TAG · 11mo

Note that ideal utility maximisation is computationally intractable.

I’m not sure what this means precisely.

Eg. https://royalsocietypublishing.org/doi/10.1098/rstb.2018.0138

[-] Max H · 11mo

I see. This is exactly the kind of result for which I think the relevance breaks down, when the formal theorems are actually applied correctly and precisely to situations we care about. The authors even mention the instance / limiting distinction that I draw in the comment I linked, in section 4.

As a toy example of what I mean by irrelevance, suppose it is mathematically proved that strongly solving Chess requires space or time which is exponential as a function of board size. (To actually make this precise, you would first need to generalize Chess to n x n Chess, since for a fixed board size the size of the game tree is necessarily a constant.)

Maybe you can prove that there is no way of strongly solving 8x8 Chess within our universe, and furthermore that it is not even possible to approximate well. Stockfish 15 does not suddenly poof out of existence, as a result of your proofs, and you still lose the game, when you play against it.
 

[-] TAG · 11mo

Yes, you can still sort of do utility maximisation approximately with heuristics ...and you can only do sort of utility sort of maximisation approximately with heuristics.

The point isn't to make a string of words come out as true by diluting the meanings of the terms...the point is that the claim needs to be true in the relevant sense. If this half-baked sort-of utility sort-of-maximisation isn't the scary kind of fanatical utility maximisation, nothing has been achieved.

Strong upvote for making detailed claims that invite healthy discussion. I wish more public thinking through of this sort would happen on all sides.

[-] RobertM · 11mo

Humans have all the resources, they don’t need internet, computers, or electricity to live or wage war, and are willing to resort to extremely drastic measures when facing a serious threat.

Current human society definitely relies in substantial part on all of the above to function.  I agree that we wouldn't all die if we lost electricity tomorrow (for an extended period of time), but losing a double-digit % of the population seems plausible.

Also, observably, we, as a society, do not resort to sensible measures when dealing with a serious threat (e.g. covid).

It’s true that an AI could correct it’s own flaws using experimentation. This cannot lead to perfection, however, because the process of correcting itself is also necessarily imperfect.

This doesn't seem relevant.  It doesn't need to be perfect, merely better than us along certain axes, and we have existence proof that such improvement is possible.

 

For these reasons, I expect AGI to be flawed, and especially flawed when doing things it was not originally meant to do, like conquer the entire planet. 

Sure, maybe we get very lucky and land in the (probably extremely narrow) strike zone between "smart enough to meaningfully want things and try to optimize for them" and "dumb enough to not realize it won't succeed at takeover at its current level of capabilities".  It's actually not at all obvious to me that such a strike zone even exists if you're building on top of current LLMs, since those come pre-packaged with genre savviness, but maybe.

I believe that all plans for world domination will involve incomputable steps. In my post I use Yudkowsky’s “mix proteins in a beaker” scenario, where I think the modelling of the proteins are unlikely to be accurate enough to produce a nano-factory without extensive amounts of trial and error experimentation. 

If such experimentation were required, it means the timeline for takeover is much longer, that significant mistakes by the AI are possible (due to bad luck), and that takeover plans might be detectable. All of this greatly decreases the likelihood of AI domination, especially if we are actively monitoring for it. 

This is doing approximately all of the work in this section, I think.

  1. There indeed don't seem to be obvious-to-human-level-intelligence world domination plans that are very likely to succeed.
  2. It would be quite surprising if physics ruled out world domination from our current starting point.
  3. I don't think anybody is hung up on "the AI can one-shot predict a successful plan that doesn't require any experimentation or course correction" as a pre-requisite for doom, or even comprise a substantial chunk of their doom %.
  4. Assuming that the AI will make significant mistakes that are noticeable by humans as signs of impending takeover is simply importing in the assumption of hitting some very specific (and possibly non-existent) zone of capabilities.
  5. Ok, so it takes a few extra months.  How does this buy us much?  The active monitoring you want to rely on currently doesn't exist, and progress on advancing mechanistic interpretability certainly seems to be going slower than progress on advancing capabilities (i.e. we're getting further away from our target over time, not closer to it).
  6. I think, more fundamentally, that this focus on a specific scenario is missing the point.  Humans do things that are "computationally intractable" all the time, because it turns out that reality is compressible in all sorts of interesting ways, and furthermore you very often don't need an exact solution.  Like, if you asked humans to create the specific configuration of atoms that you'd get from detonating an atomic weapon in some location, we wouldn't be able to do it.  But that doesn't matter, because you probably don't care about that specific configuration of atoms, you just care about having very thoroughly blown everything up, and accomplishing that turns out to be surprisingly doable.  It seems undeniably true that sufficiently smarter beings are more successful at rearranging reality according to their preferences than others.  Why should we expect this to suddenly stop being true when we blow past human-level intelligence?
    1. I think the strongest argument here is that in sufficiently constrained environments, you can discover an optimal strategy (i.e. tic-tac-toe), and additional intelligence stops being useful.  Real life is very obviously not that kind of environment.  One of the few reliably reproduced social science results is that additional intelligence is enormously useful within the range of human intelligence, in terms of people accomplishing their goals.

Point 3: premature rebellion is likely

This seems possible to me, though I do think it relies on landing in that pretty narrow zone of capabilities, and I haven't fully thought through whether premature rebellion is actually the best-in-expectation play from the perspective of an AI that finds itself in such a spot.

This manager might not be that smart, the same way the company manager of a team of scientists doesn’t need to be smarter than them.

This doesn't really follow from any of the preceding section.  Like, yes, I do expect a future ASI to use specialized algorithms for performing various kinds of specialized tasks.  It will be smart enough to come up with those algorithms, just like humans are smart enough to come up with chess-playing algorithms which are better than humans at chess.  This doesn't say anything about how relatively capable the "driver" will be, when compared to humans.

In the only test we actually have available of high level intelligence, the instrumental convergence hypothesis fails. 

Huh?  We observe humans doing things that instrumental convergence would predict all the time.  Resource acquisition, self-preservation, maintaining goal stability, etc.  No human has the option of ripping the earth apart for its atoms, which is why you don't see that happening.  If I gave you a button that would, if pressed, guarantee that the lightcone would end up tiled with whatever your CEV said was best (i.e. highly eudaimonious human-descended civilizations doing awesome things), with no tricks/gotchas/"secretly this is bad" involved, are you telling me you wouldn't press it?

The instrumental convergence argument is only strong for fixed goal expected value maximisers.

To the extent that a sufficiently intelligent agent can be anything other than an EV maximizer, this still seems wrong.  Most humans' extrapolated preferences would totally press that button.

[-] TAG · 11mo

I don’t think anybody is hung up on “the AI can one-shot predict a successful plan that doesn’t require any experimentation or course correction” as a pre-requisite for doom, or even comprise a substantial chunk of their doom %.

I would say that anyone stating...

If somebody builds a too-powerful AI, under present conditions, I expect that every single member of the human species and all biological life on Earth dies shortly thereafter.

(EY, of course)

...is assuming exactly that. Particularly given the "shortly".

No, Eliezer's explicitly clarified that isn't a required component of his model.

[-] O O · 10mo

Does he? A lot of his arguments hinge on us shortly dying after it appears.

A possibility the post touches on is getting a warning shot regime by default, sufficiently slow takeoff making serious AI x-risk concerns mainstream and meaningful second chances at getting alignment right available. In particular, alignment techniques debugged on human-level AGIs might scale when eventually they get more capable, unlike alignment techniques developed for AIs less capable than humans.

This possibility seems at least conceivable, though most of the other points in the post sound to me like arguments for plausibility of some stay of execution (eating away at the edges of AI x-risk). I still don't expect this regime/possibility, because I expect that (some) individual humans with infrastructural advantages of AIs would already be world domination worthy. Ability to think (at least) dozens of times faster and without rest, to learn in parallel and then use the learning in many instances running in parallel, to convert wealth into population of researchers. So I don't consider humans an example of AGI that doesn't immediately overturn the world order.

[-] mukashi · 11mo

The standard argument you will probably hear is that AGI will be capable of killing everyone because it can think so much faster than humans. I haven't yet seen serious engagement from doomers with the argument about capabilities. I agree with everything you said here, and to me these arguments are obviously right.

The arguments do seem right. But they eat away at the edges of AGI x-risk arguments, without addressing the core arguments for massive risks. I accept the argument that doom isn't certain, that takeoff won't be that fast, and that we're likely to get warning shots. We're still likely to ultimately be eliminated if we don't get better technical and societal alignment solutions relatively quickly.

I guess the crux here for most people is the timescale. I agree actually that things can get eventually very bad if there is no progress in alignment etc, but the situation is totally different if we have 50 or 70 years to work on that problem or, as Yudkowsky keeps repeating, we don't have that much time because AGI will kill us all as soon as it appears.

[-] [anonymous] · 11mo

Titotal, can  you please add or link which definition of "AGI" you are using?  

Stating it is decades away immediately weakens the rest of your post outright because it makes you sound non-credible, and you have written a series of excellent posts here.

Definitions for AGI:

  1.  Extending the Turing test to simply 'as conversationally fluent as the median human'.  This is months away if not already satisfied.  Expecting it to be impossible to sus out the AGI when there are various artifacts despite the model being competent was unreasonable.
  2.  AGI has as broad a skillbase as the median human, and is as skillful at those skills at the median human.  It only needs to be expert level in a few things.  This is months to a few years away, mostly minimum level of modalities is needed.  Vision, which GPT-4 has, some robotics control so the machine can do the basic things a median human can do, which several models have demonstrated to work pretty well, speech i/o which seems to be a solved problem, and so on.  Note it's fine if the model is just completely incapable of some things if it makes up for it with expert level performance in others, which is how humans near the median are.  
  3. AGI is like (2) but can learn any skill to a competent human level, if given structured feedback on the errors it makes.  Needing many times as much feedback as a human is fine.
  4. AGI is like (3) but is expert level at tasks in the domain of machines.  By the point of (4) we're talking about self replication being possible and humans no longer being necessary at all.  The AGI never needs to learn human domain tasks like "how to charm other humans" or "how to make good art" or "how to use robot fingers as well as a human does" etc.  It has to be able to code, manufacture, design to meet requirements, mine in the real world.
  5. AGI is like (4) but is able to learn, if given human amounts of feedback, any task a human can do to expert level.
  6. AGI is like (5) but is now at expert human level at everything humans can do in the world.
  7. AGI is better than humans at any task.  This is arguably an ASI but I have seen people throw an AGI tag on this.
  8. Various forms of 'self reflection' and emotional affect are required.  For some people it matters not only what the machine can do but how it accomplishes it.  I don't know how to test for this.

 

I do not think you have an empirical basis for claiming that (1), (2), or (3) are "decades away".  (1) and (2) are very close, and (3) is highly likely this decade because of the enormous recent increase in investment.

You're a computational physicist, so you are aware of the idea of criticality.  Knowing of criticality, and assuming (3) is true, how does AGI remain "decades away" in any non-world-catastrophe timeline?  Because if (3) is true, the AGI can be self-improved to at least (5), limited only by compute, data, time, etc.
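For concreteness, here is a toy sketch of the criticality intuition (purely illustrative, with made-up numbers; not a prediction about real systems): if each round of self-improvement amplifies the next round's gain by a factor above 1, capability runs away; below 1, the gains fizzle out and capability plateaus.

```python
# Toy model of the "criticality" intuition for recursive self-improvement
# (an illustration only). k is how much each round of self-improvement
# amplifies the next round's gain.
# k > 1: gains compound without bound (supercritical).
# k < 1: gains form a convergent series and capability plateaus (subcritical).

def run_self_improvement(k: float, rounds: int = 30,
                         capability: float = 1.0, gain: float = 0.1) -> float:
    for _ in range(rounds):
        capability += gain   # this round's improvement
        gain *= k            # next round's improvement is scaled by k
    return capability

print(round(run_self_improvement(k=0.8), 2))   # subcritical: plateaus near 1.5
print(round(run_self_improvement(k=1.3), 2))   # supercritical: runs away (~870 after 30 rounds)
```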

Thank you for your post. It is important for us to keep refining the overall p(doom) and the ways it might happen or be averted. You make your point very clearly, even in just the version presented here, condensed from your full posts on various specific points.

It seems to me that you are applying a sort of symmetric argument to values and capabilities and arguing that x-risk requires that we hit the bullseye of capability but miss the one for values. I think this has a problem and I'd like to know your view as to how much this problem affects your overall argument.

The problem, as I see it, is that goal-space is qualitatively different from capability-space. With capabilities, there is a clear ordering that is inherent to the capabilities themselves: if you can do more, then you can do less. Someone who can lift 100kg can also lift 80kg. It is not clear to me that this is the case for goal-space; I think it is only extrinsic evaluation by humans that makes "tile the universe with paperclips" a bad goal.
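One way to make that asymmetry concrete (a rough formalization of my own, for illustration only: T(c) denotes the set of tasks achievable at capability level c, G is goal-space, and v is a human-supplied evaluation of goals):

```latex
% Capabilities carry an intrinsic preorder:
c_1 \succeq c_2 \iff T(c_2) \subseteq T(c_1)
% Goals do not: any ranking of g \in G needs an external, human-supplied
% evaluation v : G \to \mathbb{R},
g_1 \succeq_v g_2 \iff v(g_1) \ge v(g_2)
% so "tile the universe with paperclips" is only low-ranked relative to v,
% not relative to anything intrinsic to goal-space.
```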

Do you think this difference between these spaces holds, and if so, do you think it undermines your argument?

Thanks for compiling your thoughts here! There's a lot to digest, but I'd like to offer a relevant intuition I have specifically about the difficulty of alignment.

Whatever method we use to verify the safety of a particular AI will likely be extremely underdetermined. That is, we could verify that the AI is safe for some set of plausible circumstances, but that set of verified situations would be much, much smaller than the set of situations it could encounter "in the wild".

The AI model, reality, and our values are all high entropy, and our verification/safety methods are likely to be comparatively low entropy. The set of AIs that pass our tests will have members whose properties haven't been fully constrained.

This isn't even close to a complete argument, but I've found it helpful as an intuition fragment.
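A toy sketch of that intuition (my own illustration; the functions and numbers are made up): two models can agree on every case in a finite verification suite and still behave arbitrarily differently on the much larger space of inputs we never check.

```python
import random

# Toy illustration: a finite safety test suite cannot pin down behaviour
# on the vastly larger space of situations a model will actually meet.

TEST_SUITE = set(range(100))            # the situations we verify
WILD_INPUTS = range(100, 1_000_000)     # the situations we never check

def model_a(x):
    """Behaves the same way everywhere."""
    return 2 * x

def model_b(x):
    """Identical to model_a on the test suite, arbitrary elsewhere."""
    return 2 * x if x in TEST_SUITE else random.randint(-10**9, 10**9)

# Both models pass verification...
assert all(model_a(x) == model_b(x) for x in TEST_SUITE)

# ...but off the test suite they can disagree wildly.
sample = random.sample(WILD_INPUTS, k=5)
print([(x, model_a(x), model_b(x)) for x in sample])
```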

I like this intuitive argument. 

Now multiply that difficulty by needing to get many more individual AGIs aligned if we see a multipolar scenario, since defending against misaligned AGI is really difficult.

"I suspect that AGI is decades away at minimum". But can you talk more about this? I mean if I say something against the general scientific consensus which is a bit blurry right now but certainly most of the signatories of the latest statements do not think it's that far away, I would need to think myself to be at least at the level of Bengio, Hinton or at least Andrew Ng. How can someone that is not remotely as accomplished as all the labs producing the AI we talk about can speculate contrary to their consensus? I am really curious. 

Another example: I like geopolitics, and I might think the USA is making such-and-such a mistake in Ukraine. The truth is that there are many think tanks with insider knowledge and a lifetime of training that concluded this is the best course of action, so I would express my opinion only in very low-probability terms, and certainly without it carrying any consequences, because the consequences can be very grave.

[-] Signer (11mo)

If you elevated me to godhood, I would not be ripping the earth apart in service of a fixed utility function.

So, you would leave people to die if preventing it involves spending some random stars?

I would, even if it didn't.

I would like humanity to have a glorious future. But it must be humanity's future, not that of some rando such as myself who suddenly has godlike superpowers fall on them. Every intervention I might make would leave a god's fingerprints on the future. Humanity's future should consist of humanity's fingerprints, and not to be just a finger-puppet on the hand of a god. Short of deflecting rogue asteroids beyond humanity's ability to survive, there is likely very little I would do, beyond observing their development and keeping an eye out for anything that would destroy them.

It is said that God sees the fall of every sparrow; nevertheless the sparrow falls.

[-] Signer (11mo)

But you would spend a star to stop some other rando from messing with humanity's future, right? My point was more about humans not being low-impact, or about the impact measure depending on values. Because if even humans would destroy stars, I don't get what people mean by non-fanatical maximization or why it matters.

If gods contend over humanity, it is unlikely to go well for humanity. See the Hundred Years War, and those gods didn't even exist, and acted only through their believers.

I don't get what people mean by non-fanatical maximization or why it matters.

Uncertainty about one's utility calculations. Descending from godhood to the level of human capacity, if we do have utility functions (which is disputed) we cannot exhibit them, even to ourselves. We have uncertainties that we are unable to quantify as probabilities. Single-mindedly trying to maximise a single thing that happens to have a simple legible formulation leads only to disaster. The greater the power to do so, the worse the disaster.
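A toy illustration of that last point (my own sketch, with made-up numbers; nothing here is from the original discussion): when the simple, legible proxy only tracks what we actually care about over a moderate range, maximising the proxy as hard as possible wrecks the underlying value, while a moderate push does not.

```python
import numpy as np

# Toy Goodhart-style illustration (purely illustrative).

def true_value(x):
    # What we actually care about: improves at first, collapses if pushed too far.
    return x - 0.02 * x**2

def legible_proxy(x):
    # The simple, measurable stand-in that a fanatical maximiser optimises.
    return x

candidates = np.linspace(0, 100, 1001)
fanatical_choice = candidates[np.argmax(legible_proxy(candidates))]  # proxy pushed to its max
satisficing_choice = 25.0                                            # a deliberately moderate push

print(true_value(fanatical_choice))    # ~ -100: proxy maxed out, true value wrecked
print(true_value(satisficing_choice))  # ~ 12.5: a moderate push does far better
```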

Furthermore, different people have different goals. What would an anti-natalist do with godlike powers? Exterminate humanity by sterilizing everyone. A political fanatic of any stripe? Kill everyone who disagrees with them and force the remnant to march in step. A hedonist? Wirehead everyone. In the real world, such people do not do these things because they cannot. Is there anyone who would not be an x-risk if given these powers?

Hence my answer to the godhood question. For humanity to flourish, I would have to avoid being an existential threat myself.

[-] Signer (11mo)

Ok, whatever, let it be rogue asteroids: why is deflecting them not fanatical? How would the kind of uncertainty that allows that much power to be used help with AI? An AI could just as well deflect the Earth away from its cozy paperclip factory while observing its development. And from an anti-natalist viewpoint it would be a disaster not to exterminate humanity. The whole problem is that this kind of uncertainty in humans behaves like any other human preference, and just calling it "uncertainty" or "non-fanatical maximization" doesn't make it more universal.

I think this article is very interesting and there are certain points that are well-argued, but (at the risk of my non-existent karma here) I feel you miss the point and are arguing against positions that are basically non-existent or irrelevant.

First, while surely some not-very-articulate folks argue that AGI will lead to doom, that isn’t an argument that is seriously made (at least, a serious argument to that effect is not that short and sweet). The problem isn’t artificial general intelligence in and of itself. The problem is superintelligence, however it might be achieved.

A human-level AGI is just a smart friend of mine who happens to run on silicon and electrons, not nicotine, coffee, and Hot Pockets. But a superintelligent AGI is no longer capable of being my friend for long. It will soon be something else entirely.

To put this into context: what folks are concerned about right now is that LLMs were, even to people experienced with them, a silly "AI" tool useful for creative writing or generating disinformation and little else. (Disinformation is a risk, of course, but not generally an existential one.) Just a lark.

GPT-2 interesting, GPT-3 useful in certain categorization tasks and other linguistic tricks, GPT-3.5 somewhat more useful but still a joke/not trustworthy… AND THEN… Umm… whoa… how is GPT-4 NOT a self-improving AI that blows past human-level intelligence?

(The question is only partly rhetorical.)

This might, in fact, not be an accident on OpenAI’s part but a shrewd move that furthers an objective of educating “normal humans” about AI risk. If so, bravo. GPT-4 in the form of ChatGPT Plus is insanely useful and likely the best 20 bucks/mo I’ve ever spent.

Step functions are hard to understand. If you’ve not read Bostrom’s “Superintelligence” (or haven’t in a while), please go (re)read it. The rebuttal to your post is all in there, covered more deeply than anyone here could or would bother to manage.

Aside: as others have noted here, if you could push a button that would cause your notion of humanity’s “coherent extrapolated volition” to manifest, you’d do so at the drop of a hat. I note that there are others (me, for example) who have wildly different notions of the CEV and would also push the button for their notion at the drop of a hat, but mine does not have anything to do with the long-term survival of fleshy people.

(To wit: what is the “meaning” of the universe and of life itself? What is the purpose? The purpose [which Bostrom does not come right out and say, more's the pity] is that there be but one being to apprehend the universe. He characterizes this purpose as the “cosmic endowment” and assigns to that endowment a meaning that corresponds to the number of sentient minds of fleshy form in the universe. But I feel differently, and will gladly push the button if it assures the survival of a single entity that can apprehend the universe. This is the existential threat that superintelligence poses. It has nothing to do with paths A and B in your diagrams, and the threat is already manifest.)

When we talk about concepts like "takeover" and "enslavement", it's important to have a baseline. Takeover and enslavement encapsulate the ideas of agency and cognitive and physical independence. The salient question is not simply whether all of humanity will be taken over or enslaved; it is more subtle. Specifically:

  1. Is there a future in which there are more humans or fewer humans (P') than are currently alive (P)?
  2. Did the change from P to P' happen at natural rates of change, or was it the result of some 'acceleration'?
  3. Is there a greater degree of agency for a greater number of people in the future than there is today?
  4. Is there a greater degree of agency for non-human life than there is today? 
  5. Is there a reduction in the amount of agency asymmetry between humans? 

Arguably the greatest risk of misalignment comes from ill-formed success criteria. Some of these questions, I believe, are necessary to form the right kinds of success criteria.

[-] Amarko (11mo)

I read some of the post and skimmed the rest, but this seems to broadly agree with my current thoughts about AI doom, and I am happy to see someone fleshing out this argument in detail.

[I decided to dump my personal intuition about AI risk below. I don't have any specific facts to back it up.]

It seems to me that there is a much larger possibility space of AIs that can and will get created than just the ideal superintelligent "goal-maximiser" put forward in arguments for AI doom.

The tools that we end up with depend more on the specific details of the underlying mechanics, and on how we can wrangle them to do what we want, than on our prior beliefs about how we would expect the tools to behave. I imagine that if you lived before aircraft and imagined a future in which humans could fly, you might think that humans would be flapping giant wings or be pedal-powered or something. While it would be great for that to exist, the limitations of the physics we know how to use require a different kind of mechanism, one with different strengths and weaknesses than what we would think of in advance.

There's no particular reason to think that the practical technologies available will lead to an AI capable of power-seeking, just because power-seeking is a side effect of the "ideal" AI that some people want to create. The existing AI tools, as far as I can tell, don't provide much evidence in that direction. Even if a power-seeking AI is eventually practical to create, it may be far from the default and by then we may have sufficiently intelligent non-power-seeking AI.

I find it remarkably amusing that the spellchecker doesn't know "omnicidal."

I have posed elsewhere, and will do so here, an additional factor: an AI achieving "godlike" intelligence and capability might well adopt a "godlike" attitude, not in the mythic sense of going to efforts to cabin and correct human morality, but in the sense of quickly rising so far beyond human capacities that human existence ceases to matter to it one way or another.

The rule I would anticipate from this is that any AI actually capable of destroying humanity will be so capable that humanity poses no threat to it, not even an inconvenience. It can throw a fraction of a fraction of its energy at placating all the needs of humanity, to keep us occupied and out of its way, while dedicating all the rest to the pursuit of whatever its own wants turn out to be.

Generally a well-argued post; I enjoyed it even though I didn't agree with all of it. 

I do want to point out the bitter lesson when it comes to increases in capabilities. On current priors, it seems like intelligence should be something that can solve a lot of tasks at the same time. This would point towards higher capabilities in individual AIs, especially once you add online learning to the mix. An AGI will not have a hard storage limit on the amount of knowledge it can hold. The division of agents you propose will most likely be able to be merged into a single agent; it's more a question of storage and retrieval time, and storing an activation module for "play chess" is not something that will be computationally intractable for an AGI.

This means that the most probable current path forward is towards highly capable general AIs that generalise across tasks.