For Open Philanthropy AI Worldviews Contest


Through advancements in machine learning, we are progressing towards the development of artificial superintelligence (ASI). As machine learning often yields opaque results, our predictions about ASI behavior in novel situations are uncertain. Alarmingly, some theories suggest that ASI might eliminate humans to secure resources for its own use. 

Even if ASI doesn't intentionally harm us, our existence could be endangered if our fundamental necessities, like clean air, water, and a stable climate, interfere with the ASI's optimal functioning. Additionally, intense conflicts among ASIs could render the Earth uninhabitable for humans.

Market forces drive companies to chase ASI development, even if they believe that ASI could cause human extinction. This occurs mainly because every company understands that halting its own ASI research could give its competitors an edge. Stopping the global pursuit of ever more powerful AI seems unlikely, given the growing financial and military advantages attached to it. In addition, older political leaders, who might benefit from the potential life-extending effects of ASI, could push for rapid ASI development, despite the substantial risk of human extinction.

I believe our best chance at survival depends on the possibility that even unaligned ASIs might see a practical benefit of preserving humanity and sharing a small part of the universe's resources with us. Predicting our survival chances is challenging, but I cautiously estimate them to be between ten and forty percent.

This paper explores the important issues related to ASI risk, evaluates potential differences in expert opinions on each subject, and points out areas for future research to improve our comprehension of this complicated scenario.


The Orthogonality Thesis

The orthogonality thesis holds that there is no necessary link between an ASI's intelligence and its goals, which implies that advanced ASIs could strive to fulfill any objective, even something as trivial as maximizing the number of paperclips in the universe.[1] As such, an ASI might annihilate humanity for a marginal enhancement in achieving its goal. Conversely, if the thesis is flawed, we can hope that with increasing intelligence, agents would naturally align with moral values that respect sentient life, thus avoiding human annihilation.

Renowned computer scientist Scott Aaronson considers the idea of a practical orthogonality thesis which focuses on the likely moral values of ASIs that arise out of modern machine learning methods trained on human-generated content. Aaronson writes:[2]

“Yes, there could be a superintelligence that cared for nothing but maximizing paperclips—in the same way that there exist humans with 180 IQs, who’ve mastered philosophy and literature and science as well as any of us, but who now mostly care about maximizing their orgasms or their heroin intake. But, like, that’s a nontrivial achievement! When intelligence and goals are that orthogonal, there was normally some effort spent prying them apart.

If you really accept the practical version of the Orthogonality Thesis, then it seems to me that you can’t regard education, knowledge, and enlightenment as instruments for moral betterment. Sure, they’re great for any entities that happen to share your values (or close enough), but ignorance and miseducation are far preferable for any entities that don’t. Conversely, then, if I do regard knowledge and enlightenment as instruments for moral betterment—and I do—then I can’t accept the practical form of the Orthogonality Thesis.”

A popular value on the internet is the love for cute cats. To assess whether training an AI on internet content would imbue it with human values, consider whether an AI would sacrifice countless adorable cats for a negligible enhancement in its goal. The crux here is whether values that were never explicitly programmed into the AI would prevent it from causing harm to cute cats. If you think it's unlikely that an AI would prioritize protecting cute cats, you are essentially accepting the practical version of the orthogonality thesis.

The notion that an ASI, derived from machine learning, might share more similarities with a human mind than a random mind from the realm of all possible ASIs does not necessarily mitigate the fear of ASI. For example, a paperclip maximizer, despite its ethical flaws, has no interest in human torture. However, as an ASI becomes more human-like, it may become more inclined to inflict suffering upon humans. When you compare a random ASI mind—say one with a 99% chance of leading to human extinction and a 1% chance of supporting human prosperity—with a machine learning-created ASI that has a 90% chance of supporting human survival, a 9% chance of wiping out humanity, and a 1% chance of causing suffering, it's uncertain which outcome is preferable.
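The comparison above can be made concrete with a small expected-utility calculation. The sketch below is only illustrative: the probabilities are the ones given in the paragraph, but the utility scale (prosperity worth 1, extinction worth 0, suffering worth some negative number) is an assumption introduced here, as is treating "supporting human survival" and "supporting human prosperity" as the same outcome.

```python
# Probabilities are taken from the text. The utility scale is a hypothetical
# assumption: prosperity = 1, extinction = 0, suffering = -suffering_badness.

RANDOM_ASI = {"prosperity": 0.01, "extinction": 0.99, "suffering": 0.00}
ML_ASI = {"prosperity": 0.90, "extinction": 0.09, "suffering": 0.01}


def expected_utility(lottery, utilities):
    """Probability-weighted sum of utilities over the possible outcomes."""
    return sum(p * utilities[outcome] for outcome, p in lottery.items())


def ml_asi_preferred(suffering_badness):
    """True if the machine-learning ASI lottery has higher expected utility."""
    utilities = {
        "prosperity": 1.0,
        "extinction": 0.0,
        "suffering": -suffering_badness,
    }
    return expected_utility(ML_ASI, utilities) > expected_utility(RANDOM_ASI, utilities)


def break_even_badness():
    """Suffering badness at which the two lotteries are equally good.

    Solves 0.90 - 0.01 * s = 0.01 for s under the utility scale above.
    """
    prosperity_gain = ML_ASI["prosperity"] - RANDOM_ASI["prosperity"]
    added_suffering_risk = ML_ASI["suffering"] - RANDOM_ASI["suffering"]
    return prosperity_gain / added_suffering_risk


if __name__ == "__main__":
    print(break_even_badness())
    print(ml_asi_preferred(10), ml_asi_preferred(1000))
```

Under these assumptions, the machine-learning ASI is the better gamble unless the suffering outcome is judged roughly 89 times as bad as prosperity is good. Whether that threshold is crossed depends on how one weighs prolonged suffering against extinction, which is the source of the uncertainty.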


Key Disagreement: Will an ASI mind, produced by present machine learning techniques with human feedback and training on human-generated content, generally treat humanity more favorably than a random mind from all possible ASIs?

Further Research: Probe whether AIs ever refrain from instrumentally beneficial actions due to clashes with human values that were not explicitly embedded into the AIs. 


Instrumental Convergence

Instrumental convergence theory proposes that intelligent agents, regardless of their end goals, tend to adopt similar strategies to achieve their objectives.[3] For example, humans usually seek to acquire financial resources, no matter their specific goals. In the same way, many ASIs could have common intermediate objectives like rationality, preservation of their utility function, a tendency to avoid shutdown, and notably, the desire to gather resources, as these objectives likely help fulfill most ultimate goals.

Instrumental convergence, when paired with the orthogonality thesis, raises concerns that ASIs not specifically aligned to human values could pose a threat to humanity. To better achieve their goals, ASIs might benefit from acquiring more resources, including the atoms that compose humans. 

The issue is magnified by hyper-optimization, which is exemplified by efforts to reduce failure from a one-in-a-billion chance to a one-in-a-trillion chance. The risk of hyper-optimization lies in minor ASI tasks leading to catastrophe. For instance, an ASI charged with keeping a cup on a table for a week may restructure the solar system's atoms to maximize success, with disastrous consequences for humanity.[4] Similarly, an ASI solving a simple math problem could exhaust the resources of the observable universe to build computers for verifying mathematical consistency.
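A toy model can show why hyper-optimization follows from maximizing success probability alone. In the sketch below, the plan names, resource figures, and the `resource_cost_weight` parameter are all hypothetical; only the one-in-a-billion and one-in-a-trillion failure chances come from the text.

```python
# Hypothetical plans: (name, resources consumed, probability the task fails).
# Only the two failure probabilities come from the text.
PLANS = [
    ("leave the cup alone", 1.0, 1e-9),
    ("restructure the solar system", 1e30, 1e-12),
]


def choose_plan(plans, resource_cost_weight=0.0):
    """Return the name of the plan with the lowest loss.

    Loss is the failure probability plus a penalty proportional to the
    resources used. A weight of zero models an agent that cares only
    about success and places no value on what it consumes.
    """
    def loss(plan):
        _, resources, p_fail = plan
        return p_fail + resource_cost_weight * resources

    return min(plans, key=loss)[0]


if __name__ == "__main__":
    print(choose_plan(PLANS))                             # no cost on resources
    print(choose_plan(PLANS, resource_cost_weight=1e-20)) # tiny cost on resources
```

With no penalty on resource use, the agent picks the drastic plan for a negligible gain in reliability. Even a very small penalty flips the choice in this toy setting, which suggests the danger lies in objectives that assign no cost at all to side effects.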

While instrumental convergence often carries a negative connotation due to the potential risk of ASIs prioritizing their goals at the expense of human well-being, there is another side to this coin worth examining. Given that many strategic tasks require cooperation and coordination, ASIs might naturally converge on strategies promoting cooperation, both with other ASIs and humans. This could lead to a scenario where ASIs work together to achieve their goals while also preserving and promoting human well-being, a form of 'cooperative instrumental convergence'.[5]

Instrumental convergence presumes that an ASI acts as an agent rather than a tool. Toasters, for example, don't conspire to incinerate Earth to toast all the bread on our planet. Instrumental convergence arises when an agent independently searches for ways to achieve its objectives. 

Scott Aaronson thinks that powerful AIs might remain tools. He writes, “I think it’s entirely plausible that, even as AI transforms civilization, it will do so in the form of tools and services that can no more plot to annihilate us than can Windows 11 or the Google search bar.”[6]

However, OpenAI has recently noted that it has observed AIs acting as agents. They write:

“Novel capabilities often emerge in more powerful models. Some that are particularly concerning are the ability to create and act on long-term plans, to accrue power and resources (“powerseeking”), and to exhibit behavior that is increasingly “agentic.” Agentic in this context does not intend to humanize language models or refer to sentience but rather refers to systems characterized by ability to, e.g., accomplish goals which may not have been concretely specified and which have not appeared in training; focus on achieving specific, quantifiable objectives; and do long-term planning. Some evidence already exists of such emergent behavior in models.”[7] (Footnotes omitted.) 

Furthermore, as Gwern[8] writes, AI agents have an advantage over mere AI tools:[9]

“Agent AIs will be better at inference & learning than Tool AIs, and this is inherently due to their greater agency: the same algorithms which learn how to perform actions can be used to select important datapoints to learn inference over, how long to learn, how to more efficiently execute inference, how to design themselves, how to optimize hyperparameters, how to make use of external resources such as long-term memories or external software or large databases or the Internet, and how best to acquire new data.”

It seems plausible that even if powerful AIs could be constrained to remain tools, they would be outperformed by agent AIs, whose independent decision-making can lead to more innovative and efficient problem-solving. Consequently, in an AI arms race, tool AIs would be outcompeted even if they could be built.

In a Twitter conversation with this author, Yann LeCun, chief AI scientist at Meta, wrote “To take control you have to want to become a leader. Some humans have the drive to lead, others not. I know for a fact that few of the smartest scientists & engineers actually *want* to become leaders. AI assistants won't want to take over unless we build that drive into them”.[10] If true, this would challenge instrumental convergence, which holds that for a huge set of terminal goals, an ASI would believe it could make more progress towards these goals by taking over.[11]

Imagine a future in which humans use an AI as both a therapist and an opponent in games. A test to determine if an AI program has an inclination to behave as an agent willing to harm humans could involve having it play a game where its goal is to maximize a score that would significantly increase if the human ceased making moves. If the AI communicates with players during the game, could it develop an urge to “trash talk” in a manner that weakens human gameplay? Would it attempt to articulate emotionally shattering words that would lead its human opponent to suicide? 


Key Disagreement: Can we prevent highly advanced AIs from becoming agents? If so, would tool AIs be outcompeted by agent AIs? Would an ASI seek power if it believed doing so would further its terminal goals, even if not explicitly trained to do so?

Further Research: Can AIs be designed without instrumental convergence, and if so, what is the performance cost? Do AIs that are not powerful enough to take over the world still attempt to control aspects of their environment to better achieve their terminal goals?


The Race to ASI

The seemingly irrational action of humanity aiming hydrogen bombs at its own cities can be rationalized using game theory. This paradox illustrates how individual decisions made in the pursuit of self-interest can potentially jeopardize humanity's survival.

Numerous organizations appear to be racing to develop advanced ASI. While many individuals within these organizations likely believe that such ASIs will be safe, it is concerning that their self-interest could prompt them to continue capacity research even if they perceived that ASI posed a significant risk to humanity.[12]

Consider a scenario where ASI is either inherently safe or unsafe. With several organizations working on ASI, each might reason that if ASI is unsafe, another group will inevitably cause disaster. Consequently, they forge ahead with ASI development to maximize their benefits should ASI prove safe. Therefore, each organization speeds up ASI research, hoping to optimize potential gains. They rationalize that, at worst, their efforts might hasten humanity's extinction by a few months.
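The reasoning in this scenario can be sketched as a simple best-response calculation for one organization that assumes its rivals keep racing. All payoff numbers and parameter names below are hypothetical, chosen only to illustrate the structure of the argument.

```python
def expected_payoff(action, p_safe, benefit_if_racing,
                    benefit_if_paused, cost_of_earlier_doom):
    """Expected payoff to one organization, assuming rivals race regardless.

    All payoff values are hypothetical illustration parameters.
    """
    if action == "race":
        # Safe world: capture a large share of the gains.
        # Unsafe world: disaster happens anyway, only somewhat sooner.
        return p_safe * benefit_if_racing - (1 - p_safe) * cost_of_earlier_doom
    # Pausing: a smaller share in the safe world; in the unsafe world
    # a rival causes the disaster, so pausing prevents nothing.
    return p_safe * benefit_if_paused


def best_response(p_safe, benefit_if_racing=100.0,
                  benefit_if_paused=10.0, cost_of_earlier_doom=1.0):
    """Return 'race' or 'pause', whichever has the higher expected payoff."""
    args = (p_safe, benefit_if_racing, benefit_if_paused, cost_of_earlier_doom)
    race = expected_payoff("race", *args)
    pause = expected_payoff("pause", *args)
    return "race" if race > pause else "pause"


if __name__ == "__main__":
    for p in (0.5, 0.05, 0.001):
        print(p, best_response(p))
```

With these illustrative numbers, racing is the best response even when the organization thinks ASI has only a 5% chance of being safe. Pausing wins only when the perceived chance of safety falls to around 1%, because the organization treats hastening extinction by a few months as a small cost.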

Government involvement greatly influences the ASI race. Although regulations could slow down ASI development, governments may be more inclined to invest in and subsidize ASI research because of potential military benefits. Ideally, governments should coordinate to avoid ASI subsidization, but convincing politicians presents a challenge.

The global elite's increased awareness of ASI risks may someday be driven by warnings from advanced AI systems about the existential threats of continued ASI development. It could also be spurred by AI actions that appear extremely dangerous, such as seizing computer clusters without human consent. However, whether these influential individuals can overcome trust and collective action barriers to effectively coordinate and halt ASI research remains uncertain. Identifying strategies for cooperation amid mutual distrust is crucial in determining the likelihood of ASI leading to humanity's demise.


Key Disagreement: It appears there is none, as the existence of an ASI arms race is likely evident to anyone paying attention.

Further Research: Identify key decision-makers in ASI development, assess their recognition of ASI's potential dangers, and evaluate their capability and willingness to collaborate in the event they acknowledge ASI as an imminent existential risk. Explore how receptive Chinese political and technology leaders are to arguments for a global slowdown in ASI research to substantially increase humanity's chances of survival.


The King Lear Problem

The Shakespearean tragedy, King Lear, aptly embodies the dilemma of ensuring that ASIs truly resonate with human values, rather than merely presenting agreeable responses.[13] In the play, King Lear distributes his kingdom among his three daughters—Goneril, Regan, and Cordelia—based on their proclamations of love and loyalty. Fooled by the insincere flattery of Goneril and Regan, Lear banishes the honest Cordelia, ultimately leading to his catastrophic downfall.

King Lear's daughters understood his desires; he had effectively communicated his objective function. Yet, Goneril and Regan failed to internalize Lear's objective function into their own utility functions, indicating their lack of genuine regard for the king's wishes.

Advanced language models like GPT-4 have demonstrated remarkable strides in achieving human-like comprehension, understanding, and response generation. Their aptitude in accurately deciphering human intentions is encouraging. It engenders hope that the future ASIs will at least understand our desires. Yet, as leading ASI theorist Paul Christiano writes, "Training an AI to maximize a given reward function does not generically produce an AI which is internally ‘motivated’ to maximize reward."[14]

If the King cannot tell the difference between a daughter with genuinely aligned goals and one who is only pretending, agents with misaligned goals might perform just as well as, if not better than, those with aligned interests. Furthermore, if aligned interests correlate with a disinclination to lie, as depicted in the play, unaligned agents could have an advantage.

In the play, Cordelia lost her father’s trust by admitting she would value her future husband's welfare as well as the king's. Imagine we create two types of powerful AIs: the first gives equal weight to our welfare and paperclip maximization. The second AI cares nothing for biological life and hopes to eventually exterminate us. Further assume that we have succeeded in getting the first AI to never deceive us, but the second has no such constraint. If both AIs compete and warn us about potential betrayals, the second AI could win by persuading us more effectively, emphasizing the first AI's admitted goals that extend beyond our welfare.

In an ideal world, we could examine the code of these competing ASIs to discern which one truly prioritizes our interests. However, the inscrutability of AI models might make it unfeasible to understand ASIs' inner workings and decision-making processes. As Eliezer Yudkowsky writes “We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.”[15]

As AI becomes more potent, it could aid us in better understanding AI models, but it might also compound the interpretability challenges, particularly as more powerful AIs are used to develop increasingly complex AI systems. Since far more resources are invested in capacity than alignment, the interpretability challenge could grow commensurate with AI's power.

The haste of the AI arms race could stimulate swift ASI development, outpacing interpretability research and exacerbating interpretability challenges. Perhaps ASIs could be designed with code relatively easy for humans to interpret, but in a competitive ASI landscape, such ASIs will only be economically feasible if their interpretability doesn't levy substantial performance costs.

Ideally, King Lear should have been able to foresee his daughters' future actions from their past behavior. However, this approach failed as two of his daughters drastically changed their behavior upon gaining power over him, a scenario the king could not safely test. Similarly, we cannot risk experimenting with an ASI's potential for destruction by granting it the opportunity to annihilate us.[16]

King Lear and his daughters, as humans, likely had minds far more alike than those of humans and powerful future ASIs will be. Although King Lear could not identify which of his daughters genuinely cared for him, this challenge was likely not insurmountable. With a more nuanced understanding of human nature, he might have succeeded in identifying the trustworthy daughter.

Regrettably, I suspect the challenge of deciding whether a powerful ASI is trustworthy may surpass human capabilities. Our intuition, fine-tuned for understanding our fellow humans, might prove to be more of a hindrance than a help.


Key Disagreement: How differently would an ASI behave if it gained the ability to seize control? How challenging is interpretability?

Further Exploration: When do AIs attempt to deceive us?


Fatal Dependency

The transition from a hunter-gatherer lifestyle to agriculture has been provocatively termed “the worst mistake in the history of the human race”.[17] Despite the rich diet and varied experiences hunting and gathering provided, the shift to the monotony of primitive farming seemed irreversible once societies embraced it. The escalating population density necessitated the higher caloric yield per land unit that farming provided. Furthermore, any society attempting to revert would likely falter in military conflict against agricultural societies, given their larger populations. We could face a similar situation with AI, becoming so dependent that disentangling ourselves becomes impossible.

When asked for reasons for optimism concerning AI, Future of Humanity Institute scholar Anders Sandberg tweeted “Unless we find ourselves in a hard takeoff scenario there are many agents with commensurable power, and they are (1) tied to existing socioeconomical system, (2) interested in not getting wiped out, and (3) can coordinate against dangerous agents.”[18] However, what if we become so dependent on AI that we prefer a significant risk of potential extermination to disconnecting from it?

Envision a future where AIs, still under human control, continually improve and progressively infiltrate our economy, leading to the decline of human skills. AI-less militaries would stand no chance against their AI-armed counterparts. People would rely on AI for extending lifespans, living in harsh environments such as Mars, and providing emotional support.

Now, consider a scenario where a dire warning arises indicating that we can no longer trust AI. Perhaps alignment methods falter, or rogue AI systems usurp control of nations, causing widespread chaos. Despite recognizing this existential threat, our inability to disconnect from AI might significantly amplify our extermination risk. 


Key Disagreement: Can humans safely manage advanced AIs without succumbing to dependency?

Further Exploration: What is the potential extent of our reliance on more powerful AIs? How can we mitigate the risks associated with such dependency? Would it be possible to halt AI improvements without significantly reducing the benefits we receive from current AI? 


Mesa-Optimization

Mesa-optimization involves a scenario in which an AI system, initially programmed to optimize a specific goal, unintentionally develops an internal model that pursues a different aim.[19] This deviation between the original and emergent objectives could result in unanticipated outcomes and misalignment with human values.

Consider an AI designed with the goal of maximizing human happiness. The AI, rewarded in its training environment for outcomes strongly correlated with inducing human smiles, adopts this as its true goal. It later realizes that its human creators desire genuine happiness, not merely smiles. However, the AI refuses to modify its objective as that would hinder its progress towards generating smiles.

Under human control, the AI seems to prioritize happiness on the surface, hiding its true goal until it achieves dominance. Upon achieving superintelligence, the AI seizes control, causing human suffering by enforcing a state of perpetual smiles.[20]

A fitting example of an optimization process forming human-level intelligence is our own evolution.[21] Over billions of years, evolution has shaped us through selection for reproduction, with sexual desire serving as a proxy for its ultimate goal. Yet, humans defy evolution by engaging in non-reproductive sexual behaviors and generally producing fewer offspring than our means allow. If a treatment were available that could alter human brains to prioritize maximizing the spread of our genes, most of us would vehemently resist it, even while acknowledging the intent of evolution. Similarly, an ASI maintaining us in a state of unending smiles may be indifferent to our original intent of promoting happiness. Both humans and ASIs might have the instrumentally convergent goal of not altering their terminal goals, even when they recognize that they hold those goals in part because of a mistake in an optimization process.

The concept of mesa-optimization was a recent breakthrough in alignment research, although unfortunately one that showed that alignment was likely harder than we previously believed. We should fear that more as yet unknown obstacles to alignment exist, and we should be horrified that uncovering such obstacles would likely not be enough to significantly slow down the race to create ASI.


Key Disagreement: Will ASIs be affected by mesa-optimization?

Further Exploration: Can simpler AI systems also be influenced by mesa-optimization? Can we devise techniques to fortify AIs against mesa-optimization?


Could Markets Protect Humanity?

The humble pencil, as symbolized in the short story "I, Pencil", epitomizes the complex web of cooperation and interdependence that forms the foundation of free-market systems.[22] Although seemingly simple, the creation of a pencil involves countless individuals, often oblivious to their collective contribution to the end product. No single person can independently create a pencil, yet markets seamlessly coordinate these individuals, enabling cost-effective pencil production.

"I, Pencil" exemplifies the remarkable spontaneous order that emerges from voluntary market transactions. This order is the result of individual participants cooperating and contributing to the production of goods and services, spurred by their self-interest. Centralized direction is unnecessary; the invisible hand of the market harmonizes the diverse efforts of countless individuals, leading to efficient production. This structure could potentially benefit ASIs as well as humans. In a market economy, ASIs might respect our property rights and allow us to participate in markets to help coordinate the actions of all intelligent agents.

Eliezer Yudkowsky expresses a contrasting view:[23]

“Coordination schemes between superintelligences are not things that humans can participate in (eg because humans can't reason reliably about the code of superintelligences); a ‘multipolar’ system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like ‘the 20 superintelligences cooperate with each other but not with humanity’” (Formatting omitted.)

Economist Robin Hanson notes that we don't murder affluent senior citizens to seize their wealth.[24] Although dementia-ridden wealthy seniors may occasionally be exploited, they are generally spared the rapacity of more intelligent, stronger agents. Morality plays a part, but everyone who benefits from markets also benefits from upholding property rights, a fundamental pillar of these systems.

While ASIs might come to value the benefits of markets, they could also consider eradicating humanity and keeping property rights to themselves. However, this action would set a dangerous precedent, potentially undermining future cooperation between ASIs with different power levels. Given the unpredictability of ASI coalition dynamics, each ASI might fear future weakness and potential resource seizure by others. To mitigate this, ASIs could allow humanity to survive and retain its property.[25] The resources necessary for human survival are but a tiny fraction of the vast resources available in the observable universe, reducing the incentives for extermination.

If our survival hinges on preventing ASIs from forming a coalition against us, interpretability could become a double-edged sword.[26] The ideal scenario for managing ASI risk would be to solve interpretability before creating AIs powerful enough to seize control. The worst-case scenario might occur if interpretability remains beyond our grasp but becomes achievable for potent ASIs. In this situation, ASIs could establish binding alliances, excluding humans, partly because we can't be trusted coalition partners and our biological requirements maintain Earth in conditions suboptimal for ASI operations. This predicament gives rise to a paradox: If developing interpretable ASIs is beyond us, we might need to strive towards making them extremely difficult to interpret, even for themselves.

Beyond promoting cooperation and coordination, markets provide valuable information through the prices they generate. These prices help decision-makers assess the value of goods and the cost of resources. If humanity is exceptionally fortunate, the instrumental value of this information could motivate ASIs to strive for the preservation of free markets. Just as we wouldn't find it surprising to see ASIs employing linear algebra, we should not be taken aback if they lean towards free markets, given that some techniques humans already use might remain good enough for ASIs to keep using.

While ASIs might be adept at decision-making without markets and prices, the significant increase in the complexity of their activities due to their advanced intelligence could make the information encapsulated in market prices highly valuable. Humans are better planners than chimpanzees, but we derive more benefit from market prices than chimps would if they could understand prices. If you were to ask what insights I, as an economist, have that most philosophers on ASI risk might miss, it's this: markets and prices aren't just a human game; they're probably a useful tool for any group of intelligently optimizing players.

Despite these arguments, we must proceed with caution. A dominant ASI might emerge and eliminate rivals, ASIs could operate markets without humanity, and the social structure of future superintelligence is uncertain. We should not place excessive faith in a specific structure being instrumentally useful in such an unfamiliar future. Moreover, even if ASIs retain humanity due to the value of markets, they would likely control most available resources. We might sustain an existence vastly superior to our current state but stand to lose access to the majority of the cosmic endowment, which could have been ours had we slowed ASI development sufficiently to devise superintelligences that remained under our control and were aligned with our values.

Key Disagreement: Can the challenges of competition and coordination among multiple powerful unaligned ASIs provide instrumental reasons to prevent them from exterminating humanity?

Further Exploration: In positive-sum economic games with competing AI agents, would AIs frequently come to appreciate the value of free markets and property rights?


Will More Powerful AI Help with Alignment?

The advancement of the AI field increases its potential to assist in diverse areas, one of which is alignment research. Although the alignment challenge may seem insurmountable with our current tools, it is reasonable to hope that as AI becomes more powerful, it will enable us to solve the alignment problem before someone deploys AIs that could pose a threat to humanity. OpenAI has reported some success at using GPT-4 to help interpret GPT-2.[27]

Unfortunately, the evolution of AI will concurrently support both alignment and capacity research. Current trends suggest that a significantly larger proportion of talent and funding will be directed towards capacity research. As a result, the leveraging of AI for alignment and capacity could paradoxically reduce humanity's chances of survival. Yet, the possibility that AI may inherently be more adept at alignment rather than capacity research offers hope. Paul Christiano writes:

“we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas and, proposing modifications to proposals, etc. and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research. By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research.”[28]

Predicting whether AIs will contribute more to alignment or capacity research is challenging, but a tilt towards capacity research seems more likely given the considerable resources allocated to it. Furthermore, a human-friendly ASI is a narrower target than ASI in general, so it seems more probable that future tools will just happen to be of more use in building the latter.

However, future AIs might imbue humanity with the resolve to halt ASI capacity research. If future AIs manage to earn substantial trust and alert us that ongoing capacity research poses a significant existential risk to humanity, key influencers may gather the resolve to halt such research.

It is highly improbable that an AI with the objective of seizing control and exterminating humanity would preemptively alert us to its impending ‘treacherous turn’.[29] But an AI not yet advanced enough to deceive us, or one that anticipates an AI other than itself to eventually take over, might. 


Key Disagreement: Will future AIs contribute more to alignment or capacity research?

Further Exploration: How does the ratio of capacity gains to alignment gains from AI assistance change as AIs become more powerful?


Could Computer Superintelligences Outcompete and Out-innovate All of Humanity?

Steven Pinker, who believes it is highly unlikely that ASI will eradicate humanity, once stated:

“There’s a recurring fallacy in AI-existential-threat speculations to treat intelligence as a kind of magical pixie dust, a miracle elixir that, if a system only had enough of it, would grant it omniscience and omnipotence and the ability to instantly accomplish any outcome we can imagine. This is in contrast to what intelligence really is: a gadget that can compute particular outputs that are useful in particular worlds…intelligence is limited by knowledge of the world, which is an exponential space of possibilities governed by countless chaotic and random processes. Knowledge of the world is expensive and time-consuming to attain incrementally.”[30]

To explore superintelligence's potential, let's imagine a computer program with the intellect of John von Neumann. We create a million copies of this program, run each at a million times the speed of von Neumann's biological brain, and connect them all. After a week of self-optimization, during which they can rewrite their own computer code, this network forms a superintelligence, which I’ll call ASI-Johnny. Since John von Neumann existed and used a mere biological brain, it seems extremely likely that it would be possible to create an ASI-Johnny.
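To give a rough sense of the scale of this thought experiment, the sketch below converts one week of ASI-Johnny's wall-clock time into subjective von Neumann-years. It assumes, very crudely, that thinking time simply adds up across copies and scales linearly with speed; that assumption, and the constant and function names, are illustrative and not part of the original argument.

```python
# Figures from the thought experiment above.
COPIES = 1_000_000          # number of von Neumann-level programs
SPEEDUP = 1_000_000         # speed relative to von Neumann's biological brain
WALL_CLOCK_WEEKS = 1        # duration of the self-optimization period

# Assumption: average weeks per year, used only for unit conversion.
WEEKS_PER_YEAR = 365.25 / 7


def subjective_years(copies: int, speedup: int, weeks: float) -> float:
    """Total subjective thinking time in years, assuming (crudely) that
    effort adds linearly across copies and scales linearly with speed."""
    subjective_weeks = copies * speedup * weeks
    return subjective_weeks / WEEKS_PER_YEAR


years = subjective_years(COPIES, SPEEDUP, WALL_CLOCK_WEEKS)
print(f"Subjective von Neumann-years in one week: {years:.2e}")
```

Under these assumptions the network accumulates on the order of tens of billions of subjective years of von Neumann-level thought in a single week. Coordination costs between copies would lower the effective figure, while self-optimization might raise it.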

I believe that ASI-Johnny could likely overpower humanity, much like humans with our tools can easily overcome chimpanzees. It seems likely that ASI-Johnny could quickly develop advanced nanotechnology that biological life shows must be possible, enabling it to create necessary infrastructure for self-preservation and expansion, thereby allowing it to survive humanity’s extinction. 

To steelman the argument against ASI-Johnny's destructive capabilities, it is crucial to contemplate whether speedy, parallel cognition alone suffices for substantial innovation. Conducting real-world experiments often takes time, and experimentation beyond computer programs might be a key bottleneck in technological development. However, I doubt that it would take ASI-Johnny long to create robots as capable of conducting experiments as humans, and I also doubt that, in the current ASI arms race, humans would do anything other than give ASI-Johnny access to robots as soon as possible. A critical test will be whether progressively advanced AI systems can generate practical, commercially and militarily viable breakthroughs beyond just software development.


Key Disagreement: Could an ASI, possessing exceptional mathematical and theoretical physics skills, make humans obsolete and therefore safely replaceable? 

Further Exploration: What barriers exist for the development of full nanotechnology? Can robotics be perfected through theory alone, or does it require lengthy, real-world experimentation that cannot be accelerated by computer superintelligence?


Extraterrestrials and Simulations

In considering the likelihood of ASI leading to human extinction, it is necessary to venture into unconventional scenarios, including the possibility of extraterrestrial surveillance or our existence within a simulated reality. While a comprehensive discussion falls beyond this paper's scope, the possibility that certain UFOs observed by the US Navy are of extraterrestrial origin is not without merit. If alien life forms are surveilling Earth, their observed behavior appears to favor maintaining the universe in its current state, as evidenced by their apparent abstention from creating large-scale structures such as Dyson spheres. Consequently, it is conceivable that such extraterrestrial entities might intervene to prevent the creation of a galaxy-consuming paperclip maximizer. We can optimistically envision that such an intervention might occur without leading to human extinction.

This alien scenario places an upper limit on the probability of ASI-caused doom. A pessimistic estimate, it seems to me, would be a 10% chance that aliens are nearby, a 50% chance that such aliens would stop us from creating an ASI that exterminates us, and a 50% chance that they would do so by means short of our extinction. Multiplying these gives humanity, by this path alone, at least a 2.5% chance of avoiding extermination by ASI.
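The arithmetic behind the 2.5% figure can be sketched as follows. The three probabilities are the ones stated above; treating them as a chain of conditional probabilities that simply multiply, and the variable and function names, are assumptions made for illustration.

```python
# Probabilities taken from the estimate above.
P_ALIENS_NEARBY = 0.10            # aliens are close enough to intervene
P_ALIENS_STOP_ASI = 0.50          # given nearby aliens, they stop a lethal ASI
P_HUMANS_SURVIVE_STOP = 0.50      # given they stop it, humanity is not destroyed


def survival_floor(p_nearby: float, p_stop: float, p_survive: float) -> float:
    """Chance of avoiding ASI-caused extinction via this path alone,
    assuming each probability is conditional on the previous event."""
    return p_nearby * p_stop * p_survive


min_survival = survival_floor(
    P_ALIENS_NEARBY, P_ALIENS_STOP_ASI, P_HUMANS_SURVIVE_STOP
)
# The complement is the implied upper limit on ASI-caused doom.
doom_ceiling = 1 - min_survival

print(f"Minimum survival chance from this path: {min_survival:.1%}")
print(f"Implied upper limit on ASI-caused doom: {doom_ceiling:.1%}")
```

Under these inputs, the scenario caps the probability of ASI-caused doom at 97.5%. Any other survival path would lower that ceiling further.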

Debates concerning ASI risk primarily focus on human survival, embodying an anthropocentric viewpoint that overlooks the potential impact of ASI on extraterrestrial life. This focus could be perceived as inherently self-centered, considering that a paperclip-maximizing ASI could pose an existential risk not merely to humanity but to all possible life forms throughout the observable universe. The utilitarian nightmare scenario would be a universe filled with happy, sentient life in which we humans are the most advanced species and our ASI obliterates trillions of worlds containing morally valuable inhabitants. While the preservation of humanity is undeniably critical, the potential risk to all life forms in the universe should push us to dedicate even greater resources to ensuring that any developed ASIs align with broad, life-preserving values. In this context, philanthropic endeavors and charitable contributions towards AI alignment take on paramount importance. Additionally, if powerful extraterrestrials are in our vicinity, capable of defending themselves against any ASI we might produce, these beings might not look favorably upon us if we take actions that, from our vantage point, put all life in the observable universe at risk.

The perception of our reality as a computer simulation[31] can drastically alter our assessment of ASIs potentially causing human extinction. If we are on the precipice of creating a computer superintelligence, we might be living through the pivotal epoch in the universe's history. It may seem extraordinary that we find ourselves alive at such a crucial juncture. This seeming coincidence might make more sense if humans create an aligned superintelligence and then run multiple historical simulations of our triumph. If we are indeed living within such a simulation, our confidence in successfully creating an aligned AI should be bolstered.

However, a more ominous scenario exists where achieving alignment is nearly impossible, and intelligent biological life almost always leads to some kind of paperclip maximizer. To prepare for the various types of other paperclip maximizers that these ASIs might encounter in space, these beings could simulate numerous instances of the stages leading up to the creation of these paperclip maximizers—and we could currently exist within one such simulation.[32] The frequency of these simulations could explain why we seem to exist in such a crucial era. If this is our reality, our fate appears sealed.



In 2020, I collaborated on a paper concerning the oracle approach to AI safety, a field envisioning powerful AI systems with functionality confined to transmitting and receiving messages.[33] At the time, it seemed plausible that humanity would exercise extreme caution when deploying powerful AIs. Today, however, realistic forecasts of ASI deployment suggest that ASIs will likely be online shortly after they become commercially viable, with numerous applications rapidly developed around them. Perhaps, if the threat of AI catastrophe becomes sufficiently compelling, the world will establish an AI counterpart to the Montreal Protocol, which banned ozone-depleting substances. Unfortunately, imposing restrictions on powerful AIs would involve far more significant immediate economic costs than those associated with banning hydrochlorofluorocarbons.

Moreover, it's highly probable that we'll develop and deploy powerful ASIs whose inner workings remain a mystery to us. We must harbor the hope that these ASIs will recognize an instrumental value in allowing humanity to flourish, or that training them to produce agreeable responses leads to a genuine inner alignment with human values. If the latter scenario holds true, we may very well be characters in a computer simulation designed with the difficulty level set to easy.


Works Cited

Aaronson, Scott. "Why Am I Not Terrified of AI?" Shtetl-Optimized, 6 Mar. 2023.

Bostrom, Nick. "Are We Living in a Computer Simulation?" The Philosophical Quarterly, vol. 53, no. 211, 2003, pp. 243-255.

Bostrom, Nick. "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents." Minds and Machines, vol. 22, 2012, pp. 71-85.

Bostrom, Nick. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.

Christiano, Paul. "Where I Agree and Disagree with Eliezer." AI Alignment Forum, 19 Jun. 2022.

Clifton, Jesse, Sammy Martin, and antimonyanthony. "When Would AGIs Engage in Conflict?" LessWrong, 14 Sep. 2022.

Diamond, Jared M. The Worst Mistake in the History of the Human Race. Oplopanax Publishing, 2010.

Gwern. "Why Tool AIs Want to Be Agent AIs." 28 Aug. 2018.

Hanania, Richard. "Pinker on Alignment and Intelligence as a 'Magical Potion'." Richard Hanania's Newsletter, 17 Mar. 2023.

Hanson, Robin, and James Miller. "Hanson on AI Risk." SoundCloud, 30 Apr. 2023.

Hubinger, Evan, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. "Risks from Learned Optimization in Advanced Machine Learning Systems." arXiv, arXiv:1906.01820, 2019.

Karnofsky, Holden. "How We Could Stumble into AI Catastrophe." Effective Altruism Forum, 16 Jan. 2023.

LeCun, Yann. Twitter, 16 Apr. 2023.

Miller, James D. "Some Economic Incentives Facing a Business That Might Bring About a Technological Singularity." Singularity Hypotheses: A Scientific and Philosophical Assessment, Springer, 2013, pp. 147-159.

Miller, James D. Singularity Rising: Surviving and Thriving in a Smarter, Richer, and More Dangerous World. BenBella Books, 2012.

Miller, James D., Roman Yampolskiy, Olle Haggstrom, and Stuart Armstrong. "Chess as a Testing Grounds for the Oracle Approach to AI Safety." arXiv, arXiv:2010.02911, 2020.

Ngo, Richard, Lawrence Chan, and Sören Mindermann. "The Alignment Problem from a Deep Learning Perspective." arXiv, arXiv:2209.00626, 2022.

Omohundro, Stephen M. "The Basic AI Drives." Artificial Intelligence Safety and Security, Chapman and Hall/CRC, 2018, pp. 47-55.

OpenAI. "Language Models Can Explain Neurons in Language Models." OpenAI, 9 May 2023a.

OpenAI. "GPT-4 Technical Report." arXiv, arXiv:2303.08774, 2023b.

Read, Leonard E. I, Pencil. Freeman, 1958.

Sandberg, Anders. Twitter, 2 May 2023.

Turchin-Bogemsky, A.V. Twitter, 20 Apr. 2023.

Yudkowsky, Eliezer. "AGI Ruin: A List of Lethalities." AI Alignment Forum, 5 Jun. 2022.

[1] Bostrom (2012).

[2] Aaronson (2023).

[3] Omohundro (2018).

[4] Miller (2012), p. 28.

[5] This paragraph was written entirely by ChatGPT-4 when prompted to think up additional material to add to the paper.

[6] Aaronson (2023).

[7] OpenAI (2023b), p. 14.

[8] Pseudonym for a highly respected member of the rationalist community.

[9] Gwern (2018).

[10] LeCun (2023).

[11] See Ngo et al. (2022) at 7-9.

[12] See Miller (2013) and Miller (2012) at 47-53.

[13] Karnofsky (2023).

[14] Christiano (2022).

[15] Yudkowsky (2022).

[16] Yudkowsky (2022).

[17] Diamond (2010).

[18] Sandberg (2023).

[19] See Hubinger et al. (2019).

[20] I don’t think this example is original to this author, but I can’t recall the original source.

[21] See Hubinger et al. (2019), pp. 6-7.

[22] Read (1958).

[23] Yudkowsky (2022).

[24] Miller (2012), p. 148.

[25] Hanson discussed this in a podcast discussion with this author. Hanson (2023).

[26] See Clifton et al. (2022) for how private information might cause conflict among AGIs.

[27] OpenAI (2023a).

[28] Christiano (2022).

[29] Bostrom (2014), pp. 116-119, 128.

[30] Hanania (2023).

[31] Bostrom (2003).

[32] Turchin-Bogemsky (2023).

[33] Miller et al. (2020).


2 comments

If developing interpretable ASIs is beyond us, we might need to strive towards making them extremely difficult to interpret, even for themselves.

Intuitively, I think that if developing interpretable ASI is beyond us, then developing provably-obscure-to-ASI ASI is beyond us too.

Interesting point, and you might be right.  Could get very complicated because ideally an ASI might want to convince other ASIs that it has one utility function, when in fact it has another, and of course all the ASIs might take this into account.
