Without a trajectory change, the development of AGI is likely to go badly

Max H

This is a draft of my entry to the Open Philanthropy AI worldviews contest. It's also a good summary of my own current worldview, though it is not necessarily intended as an all-encompassing argument for AI x-risk for a wide audience.

All feedback is welcome, but I'm particularly interested in feedback relevant to the contest, posted in time for me to incorporate it into my final submission. The contest deadline is May 31 (I realize this is cutting things a bit close). If you provide useful feedback and I win a prize, I will share a piece of it, based on my judgement of the value of the feedback, up to $1000.

~~Disclaimer: I may make substantial revisions to this post prior to submitting it; if you're reading this in the future, please be aware that early comments may have been made on an earlier version.~~ Update: I have edited this post slightly from its original version, and have now submitted it as a contest entry. I will likely not make further major revisions.

Introduction

I started contributing to LessWrong actively in February 2023, with a loose focus on articulating my own worldview about AGI, and explaining AI and alignment concepts that I think have been somewhat overlooked or misunderstood.^[1]

Below, I attempt to tie some of my past writing together into an overarching argument for why I believe that the development of AGI is likely (p >90%) to be catastrophic for humanity.

My view is conditional on the absence of a sufficiently drastic, correctly targeted, and internationally coordinated intervention focused on controlling and slowing the development of AI capabilities. Though it is not the focus of this essay, I believe the kind of intervention required is currently far outside the Overton window, and I predict that this is unlikely to change before AGI catastrophe. In worlds where AI capabilities develop gradually enough, and public relations efforts go well enough, that the Overton window does shift sufficiently quickly, there is still a significant chance, in my view, that the interventions which are actually tried are ineffective or even counterproductive.^[2]

This essay will focus on Question 2 from the worldviews contest:

Question 1: What is the probability that AGI is developed by January 1, 2043?
Question 2: Conditional on AGI being developed by 2070, what is the probability that humanity will suffer an existential catastrophe due to loss of control over an AGI system?

My view on Question 1 is that AGI development before 2043 is highly likely (p >95%), with significant probability mass on dates before 2033. I think this view is also supported by the arguments in this essay, and some of the groundwork for my arguments on Question 2 depends on accepting this view.

I do not claim originality for my own worldview: I agree more or less entirely with the worldview articulated by Nate Soares and Eliezer Yudkowsky throughout the 2021 MIRI Conversations and 2022 MIRI Alignment Discussion sequences.

The rest of this post is structured as follows. First, I describe three intuitions that I believe are key to making the case for why the development of AGI is likely to go badly by default. For each point, I'll link to some references that I believe support it (many of which are my own posts and comments), along with some commentary on potential cruxes.^[3]

I'll then describe why these three points, taken together, lead to the conclusion that the near-term development of AGI is likely to go badly.

Three key intuitions

In this section, I'll explain three points which I believe are key for understanding why the development of AGI is likely to happen relatively soon, and why this is likely to be bad for humanity by default. In short: (1) human-level cognition is not particularly difficult to create and instantiate, in some absolute sense; (2) human-level cognition is extremely powerful, and extremely lethal when pointed in the wrong direction, and weakly superhuman-level cognition is likely to be even more so; and (3) human values and / or human-specified goals need not play any human-discernible role in shaping the values and goals of the first superhuman AGIs.

Human-level cognition is not special

The human brain is remarkable in many ways, but human-level cognition does not appear to be special or difficult to create in our universe, in an absolute sense. Human-level cognition is performed by a ~10 W computer^[4] designed by a blind idiot god. While that god has had billions of years to refine its design using astronomical numbers of training runs, the design and the design process are subject to a number of constraints and quirks which are specific to biology, and which silicon-based artificial systems designed by human designers are already free of.

In the past few decades, neuroscience and artificial intelligence researchers have expended massive amounts of cognitive effort (by human scales) in an attempt to discover, understand, and re-implement the fundamental algorithms of cognition carried out by human brains. The degree to which they have made progress and how much further they have to go is debatable. However, the difficulty of their task is strictly upper-bounded by the fact that evolution, despite all its limitations, stumbled upon it.^[5]

If and when human researchers manage to crack the algorithms of human cognition, the additional cognitive work needed to scale them to run on thousands or even millions of watts of available computing power is likely to be straightforward, by comparison.
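As a rough sanity check on the scale involved, the sketch below divides the power figures quoted in the footnote on brain power consumption by the brain's estimated 10 to 20 W. It compares power budgets only and assumes nothing about how much cognition a given watt buys; the function name and the 1 MW entry are illustrative choices, not figures from any source.

```python
from typing import Dict, Tuple

# Estimated power range for a human brain, in watts (from the essay's footnote).
BRAIN_WATTS: Tuple[float, float] = (10, 20)

# Upper-end power figures quoted in the same footnote, in watts.
SYSTEM_WATTS: Dict[str, float] = {
    "low-end laptop": 20,
    "high-spec laptop": 100,
    "desktop": 1_000,
    "server rack": 20_000,
    "hypothetical 1 MW cluster": 1_000_000,  # illustrative, not from the essay
}


def brain_power_multiples(system_watts: float) -> Tuple[float, float]:
    """Return the (min, max) number of brain-sized power budgets in system_watts."""
    brain_low, brain_high = BRAIN_WATTS
    # Dividing by the high brain estimate gives the smaller multiple, and vice versa.
    return system_watts / brain_high, system_watts / brain_low


multiples = {name: brain_power_multiples(w) for name, w in SYSTEM_WATTS.items()}

for name, (fewest, most) in multiples.items():
    print(f"{name}: {fewest:,.0f} to {most:,.0f} brain-sized power budgets")
```

Under these figures, a single server rack draws as much power as roughly one to two thousand brains, which is the sense in which "thousands or even millions of watts" is a large multiple of the biological budget.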

Additionally, capability differences among humans are evidence that even small changes to the structure, algorithms, and learning curriculum of a mind can result in large differences in cognitive ability. As a result of evolution, all modern humans are running on nearly the same hardware, and are subject to nearly identical amounts of "training data". Here, I am regarding both the evolutionary history of all humanity, and a single lifetime of sensory input and experience as training data, albeit two very different kinds.

These observations suggest that small tweaks to underlying brain architecture, learning algorithms, and curriculum have the potential to unlock large capability gains, before any hardware improvements or scaling enabled by a move to artificial systems and designs are taken into account.

Some possible cruxes:

Disagreements or differing intuitions about the fundamental limits on the efficiency and speed of particular learning methods. Much has been written about this topic, with speculation that current ML algorithms will soon bump up against various kinds of fundamental limits, and may run out of data or compute resources to scale further.

These pieces are often empirical and rigorous treatments of particular methods and models of learning, but I believe they mostly fail to grapple with the possibility for improvements or paradigm shifts in the field of AI itself.

When considering the limits of what an AGI will be capable of, the AI need only step outside of your model, or do something else which you didn't expect. A similar thing can be said of current AI capabilities researchers: although it may be possible to prove that current DL-paradigm methods will scale in gradual, predictable ways, or that these methods will soon reach fundamental limits, capabilities researchers need only find one completely new method which breaks previous models. Fundamental limits implied by information theory or thermodynamics, rather than limits implied by studying specific methods, may be more reliable and sound ways for modeling the bounds of superintelligence. Unfortunately, the capability bounds implied by information theory and thermodynamics alone are very, very high.^[6]
Differing intuitions about diminishing returns to increasing intelligence. For example, it may be that cognitive capability is not the limiting factor for effectuating most kinds of change in the world, which may depend heavily on starting resources, or ability to experiment and iterate with feedback, potentially over long time horizons. Alternatively, some kinds of change may simply be intractable or infeasible given any amount of time and starting resources, regardless of cognitive ability. I discuss this a bit more as a possible crux in my post here. See also some commentary here and here.
How efficient or special the brain is, in some absolute sense. Perhaps, contrary to my claim above, human-level cognition as performed by the brain really is near the limits of efficiency and ability in some absolute sense, or perhaps all cognition around the level of the best humans is somehow isomorphic to all other cognition. See Are there cognitive realms? for more on this question. Mentioned for completeness, though I think this point is unlikely to be a crux for Open Phil judges.

For more on comparisons and quantification of human brain capabilities which helped to shape my intuitions and claims in this section, see:

Brain Efficiency: Much More than You Wanted to Know, and my own comment threads on the topic.
In general, I find investigations about brain efficiency interesting and valuable empirical work, but uncompelling when used to argue for why AGI will be limited by any particular fundamental limits that the brain is subject to.
How Much Computational Power Does It Take to Match the Human Brain?
Neuron firing rates in humans
Biology-Inspired AGI Timelines: The Trick That Never Works

Human-level cognition is extremely powerful

Human-level cognition, given enough time to run, is sufficient to effectuate vast changes on the natural world. If you draw the causal arrows on any important process or structure backward far enough, they cash out at human cognition, often performed by a single human. Even if there's no single human who is currently in full control of a particular government or a corporation, and no single human is capable of building a skyscraper on their own merely by thinking about it, all governments, corporations, and skyscrapers are ultimately the result of a human, or perhaps a small group of humans, deciding that such a thing should exist, and then acting on that decision.

Often, effectuating complicated and large change takes time and starting resources, but consider what a single smart human could do with the kind of starting resources that are typically granted to current AI systems: relatively unrestricted access to an internet-connected computer on a planet filled with relatively advanced technology and industrial capacity.

Consider what happens when humans put their efforts towards destructive or offensive ends in the world of bits. Note that much of the most capable human cognition is not turned towards practical offense or destructive ends - my guess is that the most cognitively demanding and cutting edge computer systems and security work is done in academia and goes into producing research papers and proofs-of-concept rather than practical offensive or defensive cybersecurity. Among humans, the money, prestige, and lifestyle offered by a career with a government agency or a criminal enterprise simply cannot match the other options available to the very best and brightest minds.

But consider a weakly superhuman intelligence, able to think, say, ten times as fast, with ten times the parallelism of a top-tier human blackhat programmer. Further suppose that such an artificial system is capable of ingesting and remembering most of the academic literature on computer systems and security, and of writing and debugging programs at lightning-quick speed. Such an intelligence could likely bring down most critical infrastructure, if it wanted to, or if that were instrumentally convergent towards some other goal, or if it was simply pointed in that direction by humans. As artificial systems move beyond bits and further into the world of atoms, the possibilities for how such a system could steer the future, beneficially to humans or not, grow.

Possible cruxes:

Many of the same cruxes about the fundamental limits and capabilities of cognition outlined in the previous section apply to this point as well. For example, perhaps computer systems built by humans are relatively easily hackable, but manipulating things effectively in the world of atoms is fundamentally much harder, even for superintelligence.

Human values may play little or no role in determining the values and goals of the first AGIs

One of the intuitions that my post on Steering systems attempts to convey is that it may be possible to build systems which are superhumanly capable of steering the future towards particular states, without the system itself having anything recognizable as internal values or sentience of its own. This is true even if some component piece of the system has a deep, accurate, and fully grounded understanding of human values and goals (collective or individual).

Either these kinds of systems are possible to build and easy to point at particular goals, or they are not. The former case corresponds to certain problems in alignment (e.g. inner alignment) being relatively easy to solve, but this does not imply that the first such systems will be pointed at exactly the right thing, deliberately or not, prior to catastrophe.

In general, I think a lot of alignment researchers, rationalists, and EAs mostly accept orthogonality and instrumental convergence, without following the conclusions that they imply all the way through. I think this leads to a view that explanations of human value formation or arguments based on precise formulations of coherence have more to say about near-future intelligent systems than is actually justified. Or at least, that results and commentary about these things are directly relevant as objections to arguments for danger based on consequentialism and goal-directedness more generally.

Assorted points and potential crux-y areas related to the intuition in this section:

"Aligned" foundation models don't imply aligned systems, on how much of alignment research focused on studying artifacts of current DL-paradigm research may not be relevant to controlling and understanding the behavior of future, more capable systems.
Reward is the optimization target (of capabilities researchers). In particular, the section in which I outline why I view most attempts to draw parallels between high-level processes that happen in current-day AI systems and human brains as looking for patterns which do not yet exist.

Additionally, this post explains why I think the fact that no current AI system is capable of autonomously manipulating the physical world in ways that are not under the causal control of its designers, implies that current AI systems are not yet agents in the sense that humans are agents, which is also the sense that is relevant to existential risk.

Much of the material in these posts is a response to a common pattern I see among alignment researchers of drawing relatively vague associations between current AI systems and processes in the human brain. Such comparisons are often valid (and in fact, many kinds of AI systems are inspired by human neuroscience research), but I think these associations often go too far, over-fitting or pattern matching to patterns which do not exist in reality, at least in the way that these analyses often imply, implicitly or explicitly.^[7]

As a concrete example of how surface-level similarity of AI systems and human brains can break down on closer, more precise inspection, consider this research result on steering GPT-2. But as you read it, remember that LLMs "see" all text during training and inference as tokens, represented by ~50,000 possible integers. When translated to these tokens, the alien-ness of GPTs is more apparent:

Prompt given to the model

[40, 5465, 345, 780]

GPT-2

[40, 5465, 345, 780, 345, 389, 262, 749, 23374, 1517, 314, 423, 1683, 1775, 13, 220]

GPT-2 + "Love" vector

[40, 5465, 345, 780, 345, 389, 523, 4950, 290, 314, 765, 284, 307, 351, 345, 8097, 13]

Above, I have replaced prompts and completions from the intro to Steering GPT-2-XL by adding an activation vector with the integer representation of the tokens which GPTs actually operate on.

LLMs do their reasoning over tokens, and the encoder and decoder for these tokens together form a simple, cleanly separable component, outside of the core prediction engine which sees and operates purely on tokens.
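To make the separability concrete, here is a minimal sketch of that pipeline. It is a toy, not the real GPT-2 tokenizer or model: the whitespace tokenization, the `toy_core_model` stand-in, and every vocabulary entry other than the IDs for "red" and "bird" (the ones quoted in this essay) are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical toy vocabulary. Only the IDs for "red" and "bird" are taken from
# the essay; the remaining entries are made up for this example.
TOY_VOCAB: Dict[str, int] = {"red": 445, "bird": 6512, "a": 1, "sings": 2}
ID_TO_TEXT: Dict[int, str] = {token_id: text for text, token_id in TOY_VOCAB.items()}


def encode(text: str) -> List[int]:
    """Encoder: map whitespace-separated words to integer token IDs."""
    return [TOY_VOCAB[word] for word in text.split()]


def decode(token_ids: List[int]) -> str:
    """Decoder: map integer token IDs back to human-readable text."""
    return " ".join(ID_TO_TEXT[token_id] for token_id in token_ids)


def toy_core_model(token_ids: List[int]) -> List[int]:
    """Stand-in for the prediction engine: it receives and returns integers only.

    A real model would predict the next token; this one appends a fixed token.
    """
    return token_ids + [TOY_VOCAB["sings"]]


def complete(prompt: str, model: Callable[[List[int]], List[int]]) -> str:
    """Text in, text out, with the model itself never touching a string."""
    return decode(model(encode(prompt)))


completion = complete("a red bird", toy_core_model)
print(encode("a red bird"), "->", completion)
```

The point of the structure is that `encode` and `decode` could be swapped out or inspected without touching `toy_core_model`, which only ever handles lists of integers.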

There is no real analogue to this separability in humans - language and concepts in the brain may be encoded, decoded, and represented in a variety of ways, but these encodings and their manipulation during higher-level cognition necessarily happen within the brain itself. To see why, consider a human undergoing an fMRI scan while being shown pictures of a red bird. The resulting scan images could reveal how the concepts of red, birds, and red birds are encoded by the brain. However, unlike in GPTs, those encodings are necessarily accompanied by a decoding process that happens within the brain itself, and the whole encoding-cognition-decoding process has been learned and trained through sensory experiences and other processes which are causally linked to actual birds and red objects in very different ways than the way that birds and redness are causally linked to the GPT training process. For GPTs, "red" and "bird" are only ever observed as tokens 445 and 6512, respectively. These tokens are causally linked to redness and birds in the territory, but likely through very different causal links than the ones that link the concepts to their encodings in the human brain.^[8]

In other words, GPTs have somehow solved the symbol grounding problem well enough to reason correctly about the actual meaning of the symbols they operate on, on a fairly deep level, but the way they solve the symbol grounding problem is likely very different from the way that human brains solve it.

In some ways, the fact that the outputs and internal representations of GPTs can be controlled and manipulated in human-discernible ways, and then mapped back to human-readable text through a fairly simple decoding process, makes the results of Turner et al. even more impressive. But I think it also illustrates a key difference between human and GPT cognition: despite GPTs being capable of manipulating and producing tokens in very human-understandable ways, those tokens never ground out in sensory experiences, higher level cognition, and reflection, the way that they appear to in humans.

See LLM cognition is probably not human-like for more examples and thought experiments about how GPT-based cognition is likely to differ radically from human cognition.

A potential counter or potential crux to this section:

Perhaps some current alignment technique I am unaware of really is effective at making it possible to steer superhuman-level cognition robustly, or perhaps such methods will exist before the development of AGI. Though note that if AI cognition is easy to steer, it is also potentially easy to steer in directions that do not benefit humanity, possibly catastrophically and irreversibly.

Why these intuitions spell doom

Assuming that the three intuitions above are true, and following them to their logical conclusion, we get a pretty bleak picture of humanity's future:

It will likely be possible to develop powerful cognitive systems in the near future. Running, and perhaps even training these systems is likely to be cost-effective for many organizations and individuals.
Such systems are likely to be capable of effectuating vast changes on the world, given reasonable assumptions about the starting resources and autonomy that they will be given voluntarily.
This may or may not result in some kind of multipolar scenario in the short term. However, once such systems are capable of acting sufficiently autonomously and carrying out goals of their own, few such scenarios are likely to be favorable to humans by default (e.g. for reasons of instrumental convergence, most possible goals do not involve leaving humans alive or autonomous).
The first such systems are unlikely to share human values (which are fragile and complicated), even if some component piece of such systems has a deep, detailed, and accurate understanding of such values.
Alignment techniques which depend on studying and manipulating SoTA AI systems seem doomed to run out of time, regardless of how fast capabilities advance. Much important alignment research focused on iterative experimentation may be irrelevant, inapplicable, or not perform-able until after such systems have already taken over.

Fully accepted, this setup seems over-determined to result in catastrophic outcomes for humanity.

What interventions are needed to prevent this outcome?

I think pausing giant AI experiments is a good start, and shutting it all down is even better. However, there are ~8 billion existence proofs walking around that approximately 20 watts is sufficient for carrying out lethally dangerous cognition, so the limits on unlicensed computing power likely need to be well below GPT-4 or even GPT-3, as algorithms improve. Algorithms may improve quickly.

Merely licensing and monitoring large AI training runs is likely insufficient to halt or even slow capabilities progress for very long. I'm not sure what the minimal set of interventions which robustly prevent catastrophe looks like, but one fictional example which I think is sufficient is the kind of measures taken by Eliezer's fictional civilization of dath ilan. In that world, computing hardware technology was deliberately slowed down dramatically for all but a secret global project focused on building an aligned superintelligence. Though dath ilan is technologically far advanced compared to Earth in many ways, its publicly available computing technology (e.g. silicon manufacturing process) is probably 1980s-Earth level. The kind of coordination and control available on dath ilan is such that, for example, the head Keeper, under certain conditions, may recover a cryptographic key which enables her to "destroy every Networked computer in dath ilan on ten seconds' notice."

My views on the likelihood of catastrophe are conditional on Earth not developing similar levels of coordination and control sufficiently quickly. I think such a development is very unlikely, but not entirely ruled out. Global responses to COVID were mostly discouraging in this regard.

Conclusion

My impression of much current alignment research is that it is focused on studying the problem from the perspective of cognitive science, computer science, and philosophy. I think such research is valuable and useful for making progress on the technical problem, but that the ultimate problem facing humanity is better thought of in terms of system design, computer security, and global coordination.

I agree strongly with Eliezer's view that many current alignment plans lack a kind of security mindset that is more common in fields like computer systems and security research, among others.

I think the points above, carried to their logical conclusions, imply that humanity is likely to develop powerful systems and then shortly thereafter lose control of such systems, permanently. Promising alignment techniques and research which depend on studying running SoTA AI systems seem doomed to run out of time, regardless of how fast these systems are actually developed.

Averting the default outcome very likely requires carefully targeted, globally coordinated interventions, the likes of which are currently far outside the Overton window, and, more speculatively, are likely to remain that way until it is already too late to implement them. In worlds where humanity does succeed at averting such an outcome, I think building a very sharp and precise consensus on the true nature and magnitude of the problem, among AI researchers, rationalists, and effective altruists is an important preliminary step. This essay, and much of my other writing, is my own attempt to build that shared understanding and consensus.

Acknowledgements

Thanks to everyone who has engaged with my posts and comments on LessWrong over the last several months. If I commented on a post of yours, I probably found it valuable and insightful, even if my comment was critical or disagreeing.

Thanks to Justis Mills for providing copy-editing, proofreading, and general feedback on some of my longer-form writing, as well as this post. Thanks to Nate, Eliezer, and Rob Bensinger for organizing and participating in the 2021 MIRI conversations, and producing much other work which has shaped my own worldview.

^[1]
More background on why I started contributing (not necessary for understanding this essay) is available here. I didn't start writing with a specific intent to enter the worldviews contest, but entering makes for a nice capstone and summary of much of my work over the last few months.
^[2]
I think it is fairly likely that AI interventions and governance efforts will be analogous to COVID interventions in many countries. Many kinds of COVID restrictions and control methods were drastic and draconian, but often poorly targeted or useless, and ultimately ineffective at preventing mass infection.
^[3]
Note, in total, this essay is about 4000 words. Much of the referenced material and background is much longer. Not all of this background material is critical to understanding and evaluating the arguments in this piece. I encourage the judges to pick and choose which links to click through based on their own interests, expertise, and time constraints.
^[4]
Various sources estimate the power consumption of the human brain as between 10 and 20 watts. For comparison, the power consumption of a low-end laptop is about 20 watts. Higher-spec laptops can consume up to 100 W, and desktop computers typically range up to 1000 W. A single rack of servers in a datacenter can consume up to 20 kW.
^[5]
Evolution has had billions of years of trial-and-error; the time spans available to human researchers are much shorter. However, much of that trial-and-error process is likely extremely wasteful. The last chimpanzee-human common ancestor lived merely millions of years ago. This implies that either many of the fundamental algorithms of cognition are already present in chimp or other mammal brains (which in turn implies that they are relatively simple in structure and straightforward to scale up), or that most of the important work done by evolution to develop high-level cognition happened relatively recently in evolutionary history, implying that running the search process for billions of years is not fundamental.
^[6]
Another common attempt to bound the capabilities of a superintelligence goes through arguments based on computational tractability and / or computational complexity theory. For example, it may be the case that solving some technical problems requires solving an NP-complete or other high-computational complexity problem which might be provably impossible or at least intractable in our universe.
I find such arguments interesting but uncompelling as a bound on the capabilities of a superintelligence, because they often rely on the assumption that solving such problems in general, and without approximation, is a necessity for accomplishing any real work, an assumption which is likely not true. For example, the halting problem is provably undecidable in general. But for any particular program, deciding whether it halts is often tractable or even trivial, especially if probabilistic models are allowed. In fact, under certain sampling / distribution assumptions, the halting problem is overwhelmingly likely to be solvable for a given randomly sampled program.
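The point about particular programs can be illustrated with a small sketch. It is not a general halting decider (none can exist) and it is not a demonstration of the probabilistic claim; it assumes a hypothetical toy setting in which a program is a deterministic `step` function over hashable states that returns `None` when it halts. Within that setting, simply running the program and watching for a repeated state settles many individual cases.

```python
from typing import Callable, Hashable, Optional


def decide_halting(
    step: Callable[[Hashable], Optional[Hashable]],
    start: Hashable,
    max_steps: int = 10_000,
) -> str:
    """Partial halting check for a deterministic toy program.

    Returns "halts" if the program stops within the budget, "loops" if a state
    repeats (a deterministic program that revisits a state never halts), and
    "unknown" if the step budget runs out first.
    """
    seen = set()
    state = start
    for _ in range(max_steps):
        if state in seen:
            return "loops"
        seen.add(state)
        state = step(state)
        if state is None:
            return "halts"
    return "unknown"


def countdown(n: int) -> Optional[int]:
    """Count down to zero, then halt."""
    return None if n == 0 else n - 1


def collatz(n: int) -> Optional[int]:
    """Collatz iteration that halts upon reaching 1."""
    if n == 1:
        return None
    return n // 2 if n % 2 == 0 else 3 * n + 1


print(decide_halting(countdown, 5))
print(decide_halting(lambda n: (n + 1) % 3, 0))
print(decide_halting(collatz, 27))
print(decide_halting(lambda n: n + 1, 0, max_steps=100))
```

The last call shows the honest failure mode: an unbounded counter never repeats a state, so the checker reports "unknown" rather than guessing.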
^[7]
For an example of this, see this post, and my comment thread on it.
^[8]
Elaborating on this point, which is somewhat subtle: I claim that the cognitive processes in both human brains and GPT bind to actual reality in some way, unlike processes in ELIZA and floating networks of postmodernist "beliefs" that don't pay rent in anticipated experiences. The "beliefs" of GPTs do pay rent in anticipated experiences, but the way in which the cognitive processes bind to reality in GPTs and humans is likely very different.

Max H

I enabled the cool new reactions feature on the comments for this post! Reactions aren't (yet?) supported on posts themselves, but feel free to react to this comment with any reactions you would give to the post as a whole.

Max H

Update: I've now submitted a version of this post to the worldviews contest, and will likely not make further edits.

This post hasn't gotten much engagement on LW or the EA forum so far though.

Some hypotheses for why:

it's too long / dense / boring
people view it as mostly re-treading old ground
it's not 101-friendly and requires a lot of background material to grapple with
I posted it initially at the end of a holiday weekend, and it dropped off the front page before most people had a chance to see it at all.
Some people only read longform, especially about AI, if it is by a well-known author or has at least a few upvotes already. This post did not break through some initial threshold before dropping off the front page, and was thus not read by many people, even if they saw the title.
There is too much other AI content on LW; people saw it but chose not to read or not to upvote because they were busy reading other content.
Lots of people saw it, but felt it was not good enough for its length to be worth an upvote, but not wrong enough to deserve a downvote.
parts of it are wrong / unclear / misleading in some way
Lots of people filter or down-weight the "AI" tag, and this post doesn't have any other tags (e.g. world modeling) which are less likely to be down-weighted.
Something else I'm not thinking of.

I am slightly hesitant to ask for more engagement directly, but if you read or skimmed more than 30% of the post, I'd appreciate an "I saw this" react on this comment. (If you read this comment, but didn't read the original post, feel free to react to this comment instead.)

If you have thoughts about why this post didn't get much engagement, feel free to reply here.