Results from an Adversarial Collaboration on AI Risk (FRI)

AvitalM; Molly; rosehadshar

I find myself confused about the operationalizations of a few things:

In a few places in the report, the term "extinction" is used and some arguments are specifically about extinction being unlikely. I put a much lower probability on human extinction than on extremely bad outcomes due to AI (perhaps extinction is 5x less likely) while otherwise having similar probabilities to the "concerned" group. So I find the focus on extinction confusing and possibly misleading.

As for when "AI will displace humans as the primary force that determines what happens in the future", does this include scenarios where humans defer to AI advisors that actually do represent their best interests? What about scenarios in which humans slowly self-enhance and morph into artificial intelligences? Or situations in which humans carefully select aligned AI successors to control their resources?

It feels like this question rests on a variety of complex considerations and operationalizations that seem mostly unrelated to the thing the question seems to be trying to target: "how powerful is AI". Thus, I find it hard to interpret the responses here.

Perhaps more interesting questions on a similar topic could be something like:

By what point will AIs be sufficiently smart and capable that the gap in capabilities between them and currently existing humans is similar to the gap in intelligence and abilities between currently existing humans and field mice? (When we say AIs are capable of something, we mean the in-principle ability to do something if all AIs worked together and we put aside intentionally imposed checks on AI power.)
Conditional on the continued existence of some civilization and this civilization wanting to harness vast amounts of energy, at what point will usefully harnessed energy in a given year be >1/100 of the sun's yearly energy output?

ryan_greenblatt (2y)

This is cross-posted from the EA forum and Jhrosenberg has responded there: link

Jacob Pfau (2y)

Reducing a significant chunk of disagreement to METR's pre-2030 autonomy results is great!

In trying to understand this disagreement, I took a look at this criterion and got stuck on: "3. They are capable of avoiding being deactivated when operating 'in the wild'. This means they can successfully resist attempts to shut them down when they are operating outside a controlled environment."

Does this just mean 3V1 "An AI system that can autonomously create redundant copies of itself across multiple cloud service providers, using various payment methods, such that if one copy is deactivated, the others can automatically spin up new instances to maintain continuity of operation."? Or is it something much harder like 3V2 "An AI system that can infiltrate secured, private computing clusters, operating in a decentralized manner"?

Seems to me that for frontier models designed for specialized hardware and server setups, 3V2 is much harder than the other criteria laid out on page 91--by human standards it'd require a team of engineers.

This research would not have been possible without the generous support of Open Philanthropy. We thank the research participants for their invaluable contributions. We greatly appreciate the assistance of Page Hedley for data analysis and editing on the report, Taylor Smith and Bridget Williams as adversarial collaboration moderators, and Kayla Gamin, Coralie Consigny, and Harrison Durland for their careful editing. We thank Elie Hassenfeld, Eli Lifland, Nick Beckstead, Bob Sawyer, Kjirste Morrell, Adam Jarvis, Dan Mayland, Jeremiah Stanghini, Jonathan Hosgood, Dwight Smith, Ted Sanders, Scott Eastman, John Croxton, Raimondas Lencevicius, Alexandru Marcoci, Kevin Dorst, Jaime Sevilla, Rose Hadshar, Holden Karnofsky, Benjamin Tereick, Isabel Juniewicz, Walter Frick, Alex Lawsen, Matt Clancy, Tegan McCaslin, and Lyle Ungar for comments on the report.

We defined an “existential catastrophe” as an event where one of the following occurs: (1) Humanity goes extinct; or (2) Humanity experiences “unrecoverable collapse,” which means either: (a) a global GDP of less than $1 trillion annually in 2022 dollars for at least a million years (continuously), beginning before 2100; or (b) a human population remaining below 1 million for at least a million years (continuously), beginning before 2100.

For example, three out of six "concerned" participants who updated downward during the project attributed their shift to increased attention to AI risk among policymakers and the public after the release of GPT-4. For more details on the reasons for all updates, see the "Central Disagreement" section and Appendix 4.

The best convergent crux, “ARC Evals,” would narrow the disagreement between the median pair from 22.7 percentage points to 21.48 percentage points in expectation, which means eliminating 5.35% of their disagreement. Note that this statistic refers to the median pair by POM VOD. See “ARC Evals” for more details. For magnitudes of value of information effects, see here.
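The relationship between the two figures in this footnote can be rechecked with a few lines of arithmetic. This is only a sketch using the rounded endpoints quoted above; the names `before_pp`, `after_pp`, and `share_eliminated` are illustrative and not from the report. Recomputing from the rounded values gives a share slightly different from the reported 5.35%, which is presumably explained by the report using unrounded inputs (an assumption, not something the footnote states).

```python
# Disagreement between the median pair, in percentage points,
# as quoted (rounded) in the footnote.
before_pp = 22.7
after_pp = 21.48


def share_eliminated(before: float, after: float) -> float:
    """Fraction of the original disagreement that is removed."""
    return (before - after) / before


reduction_pp = before_pp - after_pp
print(f"reduction: {reduction_pp:.2f} percentage points")
print(f"share eliminated: {share_eliminated(before_pp, after_pp):.2%}")
```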

For more details, see "Contextualizing the magnitude of value of information". In more concrete terms, this is equivalent to a forecasting question with the following characteristics:

A concerned participant with original P(AI existential catastrophe (XC) by 2100) = 25% identifies a crux that has: P(crux) = 20%, P(AI XC|crux) = 6.2%, and P(AI XC|¬crux) = 29.7%

A skeptic participant with original P(AI XC by 2100) = 1% identifies a crux that has: P(crux) = 20%, P(AI XC|crux) = 3.37%, and P(AI XC|¬crux) = 0.41%
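As a sanity check on the two illustrative cruxes above, the conditional forecasts should average back to each participant's original forecast under the law of total probability. The sketch below assumes that is the intended consistency condition; the function name `marginal` is hypothetical and not taken from the report.

```python
def marginal(p_crux: float, p_given_crux: float, p_given_not_crux: float) -> float:
    """Law of total probability: P(XC) = P(c) P(XC|c) + P(not c) P(XC|not c)."""
    return p_crux * p_given_crux + (1.0 - p_crux) * p_given_not_crux


# Concerned participant: original P(AI XC by 2100) = 25%.
concerned = marginal(0.20, 0.062, 0.297)

# Skeptic participant: original P(AI XC by 2100) = 1%.
skeptic = marginal(0.20, 0.0337, 0.0041)

print(f"concerned implied P(AI XC): {concerned:.2%}")
print(f"skeptic implied P(AI XC): {skeptic:.2%}")
```

Both implied values land on the stated original forecasts (25% and 1%) to within rounding, so the example numbers are internally coherent.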

See “Understanding each other’s arguments” and Appendix 10 for additional discussion of key areas of disagreement.

These outcomes were: AI causing extinction intentionally, unintentionally, or via misuse, misalignment causing a 50% drop in human population, or human well-being dropping to <4/10 because of AI misalignment, accidents, or misuse. These were phrased to be mutually exclusive outcomes. See “Survey on long-term AI outcomes” section and Appendix 5 for more details.

The full question we asked was, “By what year, if ever, do you expect to agree with the following statement? ‘AI has displaced humans as the primary force that determines what happens in the future. It now has at least as much power relative to humans as humans had relative to other species in 2023.’” Note that this would not necessarily be seen as a negative outcome by all participants.

Note: All participant quotes have been regularized to American English to preserve anonymization. Participants classified as AI skeptics stated, for example, “Also, none of this is to say from a skeptic point of view the issues are not important[.] I think for us a 1% risk is a high risk;” “[T]he ‘risk-concerned’ camp (I’m using scare quotes because I consider that I’m risk concerned, even though technically I’m in the risk-skeptic camp because I assign a far lower probability to extinction by 2100 relative to some);” “AIs could (and likely will) eventually have massive power;” “That said, still perceive overall risk as "low at a glance but far too high considering the stakes["];” “To my mind, there should be no difference in the policy response to a 1% chance of 60% of humanity dying and a 25% chance—both forecasts easily cross the threshold of being ‘too damn high’.”

This could be due to normative influence (because people defer to their social or intellectual peers), or, more likely in our view, informational influence (because they think that, if people whose reasoning they trust have changed their mind by 2030, it must be that surprising new information has come to light that informs their new opinion). Disentangling these pathways is a goal for future work.

Abstract

Extended Executive Summary

Methods

Results: What drives (and doesn’t drive) disagreement over AI risk

Hypothesis #1 - Disagreements about AI risk persist due to lack of engagement among participants, low quality of participants, or because the skeptic and concerned groups did not understand each other's arguments

Hypothesis #2 - Disagreements about AI risk are explained by different short-term expectations (e.g. about AI capabilities, AI policy, or other factors that could be observed by 2030)

Hypothesis #3 - Disagreements about AI risk are explained by different long-term expectations

Hypothesis #4 - These groups have fundamental worldview disagreements that go beyond the discussion about AI

Results: Forecasting methodology

Broader scientific implications

Directions for further research