As part of the 2020 Effective Altruism Student Summit, AI safety researcher Buck Shlegeris advised newcomers to "think through the entire problem," determining what the concrete dangers of unsafe AI actually are and how to address them. In an earnest effort to do just that, Nicholas L. Turner, Aidan Kierans, Morpheus, Quinn Dougherty, Jeffrey Yun, and others started a small AI Safety Forecasting Workshop to practice defining, estimating, and forecasting different outcomes. We've perceived a small degree of success in making questions more concrete, and so we decided to share our thoughts to help other aspirants. The following is adapted from our notes for this week's meeting, and we've added context or elaboration when necessary.

Specifically, Nick, Aidan, Morpheus, and Quinn pulled up Elicit's x-risk tag and did a deep dive into one of the questions. Source.

Chance that "AI safety basically [doesn't need] to be solved, we’ll just solve it by default unless we’re completely completely careless"

As one might imagine, this is a very useful question to consider for a group of newbies. The engagement on Elicit with this question was modest (with two participants, the average estimate of the chance currently sits at 17.5%), but it captured our intuitions well enough to start getting our hands dirty.


We started by defining some terms, as resolving at least some of the ambiguities of this prompt allows us to better understand the relationship between the prompt, our intuitions about it, and states of the world.

  • "solving AI safety"
  • "default"
  • "careless"
  • "we"

Solving AI safety

First, what is the goal that we're trying to reach by "solving" AI safety? At which point would we consider AI safety solved or not? Moreover, how should we define AI safety?

Option 1. An AI is safe when it is transformative without being catastrophic.

When thinking about future applications of AI, people often imagine a point in time when the technology fundamentally alters society. Some have used the term Transformative AI (TAI) to describe this state of the technology. We borrow that definition here, though this opens up a bit more ambiguity. What separates TAI from slightly less "transformative" applications?

One helpful approach that we've learned from our discussions is to make these definitions more precise by comparison. There are several other points in time where human society has been transformed, and we can use those points as benchmarks for AI as well. As long as a decent proportion of commentators (X%) can make an intuitive comparison between the two events, then we're reasonably close to a line between "transformative" and "not transformative".
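The comparison test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `passes_comparison_test`, the boolean encoding of each commentator's judgment, and the example inputs are ours for this sketch, and the threshold X is left as a parameter because we have not fixed a value for it.

```python
def passes_comparison_test(judgments, threshold_x):
    """Return True if at least `threshold_x` of commentators accept the comparison.

    `judgments` is a list of booleans, one per commentator (a hypothetical
    encoding). `threshold_x` is a fraction in [0, 1] standing in for the
    unspecified X%.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    if not 0.0 <= threshold_x <= 1.0:
        raise ValueError("threshold_x must be a fraction between 0 and 1")
    share = sum(judgments) / len(judgments)
    return share >= threshold_x


# Made-up example, not survey data: 3 of 4 commentators accept the comparison.
example_judgments = [True, True, False, True]
print(passes_comparison_test(example_judgments, threshold_x=0.5))
```

Leaving X as a parameter makes it easy to see how sensitive a classification is to where the line is drawn.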

With this technique in hand, we define TAI as AI that is roughly as impactful as the industrial revolution. We imagine a point where the relationship between humans and their work is fundamentally changed.

The crux of AI safety, then, is reaching TAI without also creating a significant catastrophe. This opens up one more sub-definition: how significant does this catastrophe need to be in order to count? Using the same line of reasoning as above, and following a suggestion from Quinn, we define a significant catastrophe as something that's "worse than COVID-19" on one or more metrics, such as its toll on lives or economies.

All of these considerations result in the following refined definition:

Option 2. AI safety has been solved when AI fundamentally alters our society without directly bringing about negative effects worse than COVID-19 on key metrics.
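To make the refined definition concrete, here is a minimal Python sketch of it as a yes/no test. Everything named here is hypothetical: the function names, the metric names, and the choice to normalize the COVID-19 baseline to 1.0 per metric are assumptions for illustration, not real data. We also assume the outcome only counts harms directly attributable to AI.

```python
def is_significant_catastrophe(outcome, covid_baseline):
    """True if the outcome is worse than the baseline on one or more metrics.

    Both arguments map metric names to harm magnitudes (higher is worse).
    Only metrics present in both dictionaries are compared.
    """
    shared_metrics = outcome.keys() & covid_baseline.keys()
    return any(outcome[m] > covid_baseline[m] for m in shared_metrics)


def ai_safety_solved(is_transformative, outcome, covid_baseline):
    """Refined definition: transformative AI without a significant catastrophe."""
    return is_transformative and not is_significant_catastrophe(
        outcome, covid_baseline
    )


# Placeholder units, not real data: the baseline is normalized to 1.0 per metric.
baseline = {"lives": 1.0, "economy": 1.0}
mild_outcome = {"lives": 0.2, "economy": 0.5}
severe_outcome = {"lives": 0.2, "economy": 1.5}

print(ai_safety_solved(True, mild_outcome, baseline))
print(ai_safety_solved(True, severe_outcome, baseline))
```

The use of `any` mirrors the "one or more metrics" wording: exceeding the baseline on a single metric is enough to count as a significant catastrophe.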


Default

What it means to solve AI safety "by default" is a core ambiguity of this particular Elicit prompt. The ambiguity can be slightly resolved by trying to identify precisely whose behavior defines the default. Can all AI developers follow their current incentives without any interruption? Can some AI safety research be considered a part of the default?

Partially motivated by our position, we decided to take the perspective of a relative outsider with at least some awareness of the problems ahead. We considered all current AI research to be part of the default, including the current level of AI safety and investment. Instead, our own impacts on the ecosystem of AI development and safety would be the "non-default" that perturbs the system. What does this imply?

Current AI development is largely driven by capitalist incentives. Companies race to develop the best AI technology in order to offer competitive products to their customers. Nations fuel AI research (at least partially) to become as economically competitive as possible at a broader scale. To a first-order approximation then, we can consider these incentives to be dominant, and consider "default" behavior to follow them.

Option 1: A problem is solved by default if solved by "normie capitalism" without coordination problems.

Quinn defines "normie capitalism" as a black box that takes people and time as input and returns a class of products as output. This class includes whatever emerges from commercial incentives, and excludes things that require additional altruism or coordination. Andrew Critch has discussed at length how AI safety divides into two classes of subproblems along these lines.

We also attempt to account for this second class of problems, as the first-order approximation felt overly restrictive.

Option 2: A problem can also be solved by default if nonprofit and academic research is maintained at the same relative level of involvement in the field.

We should note that a fixed relative level of involvement still allows for a fixed growth rate in absolute terms, but not for growth in its share of GWP or of total AI research expenditure, and it does not require unusual coordination.
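The difference between relative and absolute levels can be seen in a small Python sketch. The function name `safety_spending_path` and all numbers are made up for illustration, and the sketch only models the share of total AI research expenditure, not the share of GWP.

```python
def safety_spending_path(total_ai_spending, safety_share):
    """Absolute safety spending per period when its share of total AI
    research spending is held fixed (the 'default' in Option 2)."""
    if not 0.0 <= safety_share <= 1.0:
        raise ValueError("safety_share must be a fraction between 0 and 1")
    return [total * safety_share for total in total_ai_spending]


# Made-up numbers: total AI research spending doubles each period.
totals = [100.0, 200.0, 400.0]
path = safety_spending_path(totals, safety_share=0.05)

for total, safety in zip(totals, path):
    # Absolute spending grows with the field while the share stays constant.
    print(f"total={total:.1f} safety={safety:.1f} share={safety / total:.2f}")
```

Under this reading, safety research that grows faster than the field as a whole would count as a departure from the default.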


Careless

In the framing of the question, it seems that carelessness is something that leads to worse-than-default outcomes. We see two possible interpretations:

Option 1: Nonprofit and academic research into AI safety is shrinking as a proportion of overall AI research.

Option 2: We somehow misuse the information we have.

This second option could imply that we have ideas of how to make safe AIs but they get disregarded. This might happen if safety solutions are not competitive, or if safe AIs are unreliable. Alternatively, it could mean that we don't have ideas of how to make safe AIs, but TAI comes about anyway without enough foresight to protect against potential problems.


We

Who exactly is the "we" in question? Amongst our group, we discussed relevant populations of individuals, institutions, and their different constraints and opportunities. We finally came up with two possible definitions:

Option 1: Everyone who is currently thinking about this problem

Option 2: Relevant institutional actors

Notably, our definition of "default" leans much more heavily on the second definition (institutional actors) than the first, though we also wanted to allow for some recruitment with "default" dynamics as described for nonprofit and academic research.


With these definitions in mind, how can we finally assign a likelihood to this Elicit prompt? We made some progress towards this within a one-hour session, but we didn't give any numbers at all this week. Instead, we started to formalize our thoughts and present our reasoning below.

First, we want to massage the question as it appears on Elicit into the chance that we solve AI safety by default, given our definitions. We figure that carelessness is more or less folded into the meaning of default, since default almost by definition is a lower bound on carelessness.

Having done this, we decided that the likelihood would take the form of a product or interaction variable:

The chance we solve AI safety by default = the chance that a solution exists * the chance that the solution is on the default path.
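As a sanity check on this structure, here is a minimal Python sketch of the product. The function name and the inputs are placeholders, not estimates; we have not assigned any numbers yet. The sketch also assumes that the second factor is read as conditional on a solution existing, which is our interpretation of the formula above.

```python
def chance_solved_by_default(p_solution_exists, p_on_default_path):
    """Product form of the estimate.

    `p_on_default_path` is assumed to mean: the chance the solution is on
    the default path, given that a solution exists.
    """
    for name, p in (
        ("p_solution_exists", p_solution_exists),
        ("p_on_default_path", p_on_default_path),
    ):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {p}")
    return p_solution_exists * p_on_default_path


# Placeholder inputs only, chosen to show the arithmetic.
print(chance_solved_by_default(0.5, 0.5))
```

One consequence of the product form is that the overall chance can never exceed either factor, so a low estimate for either one caps the final answer.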

A crucial assumption we make is that any intervention attempting to put us on a "better than careless" path does not destroy society's ability to find a solution on the default path. We assume that no matter how large a share of AI expenditure is better than careless, there will always be more careless actors stumbling upon whatever exists on the default path. We assume that the more conscientious actors among the population do not cause a failure on the default path. Moreover, we seem to assume that non-careless actors will be able to see what lies on the default path from their vantage point.


We haven't reached our goal to estimate the chance that a solution is on the default path yet, but we still hope that describing our progress might be at least interesting to others. In particular, we're interested to get feedback on (1) our technique of definition by comparison, (2) our first steps into formalizing the problem, and (3) the assumptions and definitions that we've made so far. With any luck, we'll complete this line of reasoning and give an update with a full estimate in the upcoming sessions.

