AI Forecasting Dictionary (Forecasting infrastructure, part 1)

Bird Concept; Ben Goldhaber

This post introduces the AI Forecasting Dictionary, an open-source set of standards and conventions for precisely interpreting AI and auxiliary terms. It is the first part in a series of blog posts which motivate and introduce pieces of infrastructure intended to improve our ability to forecast novel and uncertain domains like AI.

The Dictionary is currently in beta, and we're launching early to get feedback from the community and quickly figure out how useful it is.

Background and motivation

1) Operationalisation is an unsolved problem in forecasting

A key challenge in (AI) forecasting is to write good questions. This is tricky because we want questions which both capture important uncertainties, and are sufficiently concrete that we can resolve them and award points to forecasters in hindsight. For example:

Will there be a slow take-off?

is a question that’s important yet too vague.

Will there be a 4-year doubling of world output before the first 1-year doubling of world output?

is both important and concrete, yet sufficiently far-out that it’s unclear if standard forecasting practices will be helpful in resolving it.

Will there be a Starcraft II agent by the end of 2020 which is at least as powerful as AlphaStar, yet uses <$10,000 of publicly available compute?

is more amenable to standard forecasting practices, but at the cost of being only tangentially related to the high-level uncertainty we initially cared about. And so on.

Currently, forecasting projects reinvent this wheel of operationalisation all the time. There are usually idiosyncratic and time-consuming processes for writing and testing questions (this might take many hours for a single question) [1], and best practices tend to evolve organically without being systematically recorded and built upon [2].

2) The future is big, and forecasting it might require answering a lot of questions

This is an empirical claim in which we’ve become more confident by working in this space over the last year.

One way of seeing this is by attempting to break down an important high-level question into pieces. Suppose we want to get a handle on AI progress by investigating key inputs. We might branch those into progress on hardware, software, and data (including simulations). We might then branch hardware into economics and algorithmic parallelizability. To understand the economics, we must branch it into the supply and demand side, and we must then branch each of those to understand how they interface with regulation and innovation. This involves thousands of actors across academia, industry and government, and hundreds of different metrics for tracking progress of various kinds. And we’ve only done a brief depth-first search on one of the branches of the hardware-software-data tree, which in turn is just one way of approaching the AI forecasting problem.

Another way of guesstimating this: the AI Impacts archives contain roughly 140 articles. Suppose this is 10% of the number of articles they’d need to accomplish their mission. If each article contains 1-30 uncertain claims that we’d ideally like to gather estimates on, that’s 1400 to 42000 uncertainties -- each of which would admit many different ways of being sufficiently operationalised. For reference, over the 4 years of the Good Judgment Project, roughly 500 questions were answered.
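The guesstimate above can be reproduced in a few lines. This is only a sketch of the arithmetic: the 10% share and the 1-30 claims per article are the assumptions stated in the text, and the variable names are ours.

```python
# Back-of-the-envelope count of uncertain claims, using the figures from the text.
ARTICLES_WRITTEN = 140        # rough size of the AI Impacts archives
SHARE_PERCENT = 10            # assumed: the archives are 10% of what is needed
CLAIMS_PER_ARTICLE = (1, 30)  # assumed range of uncertain claims per article
GJP_QUESTIONS = 500           # rough number of questions over 4 years


def estimate_uncertainties():
    """Return (articles_needed, low, high) under the assumptions above."""
    articles_needed = ARTICLES_WRITTEN * 100 // SHARE_PERCENT
    low = articles_needed * CLAIMS_PER_ARTICLE[0]
    high = articles_needed * CLAIMS_PER_ARTICLE[1]
    return articles_needed, low, high


articles_needed, low, high = estimate_uncertainties()
print(f"articles needed: {articles_needed}")
print(f"uncertainties: {low} to {high}")
print(f"multiple of the ~500 reference questions: {low / GJP_QUESTIONS} to {high / GJP_QUESTIONS}")
```

Even the low end is almost three times the reference figure of roughly 500 questions, before counting the many ways each uncertainty could be operationalised.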

We’d of course be able to prune this space by focusing on the most important questions. Nonetheless, there seems to be a plausible case that scaling our ability to answer many questions is important if we want our forecasting efforts to succeed.

We see some evidence of this from the SciCast project, a prediction tournament on science and technology that ran from 2013-2015. The tournament organizers note the importance of scaling question generation through templates and the creation of a style guide. (See the 2015 Annual report, p. 86.)

3) So in order to forecast AI we must achieve economies-of-scale – making it cheap to write and answer the marginal question by efficiently reusing work across them.

AI Forecasting Dictionary

As a piece of the puzzle to solve the above problems, we made the AI Forecasting Dictionary. It is an open-source set of standards and conventions for precisely interpreting AI and auxiliary terms.

Here’s an example entry:

Automatable

See also: Job

A job is automatable at a time t if a machine can outperform the median-skilled employee, with 6 months of training or less, at 10,000x the cost of the median employee or less. Unless otherwise specified, the date of automation will be taken to be the first time this threshold is crossed.

Examples:

*As of 2019, Elevator Operator

Non-examples:

*As of 2019, Ambulance Driver

*As of 2019, Epidemiologist

(This definition is based on Luke Muehlhauser’s here.)
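To show how an entry like this turns into a mechanical resolution rule, here is a minimal sketch. The `MachineBenchmark` record and its field names are hypothetical, the yearly numbers in the usage example are made up for illustration, and reading the 6-month limit as applying to the machine's training is our assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

MAX_TRAINING_MONTHS = 6   # "with 6 months of training or less"
MAX_COST_RATIO = 10_000   # "at 10,000x the cost of the median employee or less"


@dataclass(frozen=True)
class MachineBenchmark:
    """Hypothetical record of the best machine performance on one job at one time."""
    outperforms_median_employee: bool
    training_months: float
    cost_ratio: float  # machine cost divided by median employee cost


def is_automatable(benchmark: MachineBenchmark) -> bool:
    """Apply the three conditions of the 'Automatable' entry."""
    return (
        benchmark.outperforms_median_employee
        and benchmark.training_months <= MAX_TRAINING_MONTHS
        and benchmark.cost_ratio <= MAX_COST_RATIO
    )


def date_of_automation(history: Mapping[int, MachineBenchmark]) -> Optional[int]:
    """Return the first year in which the threshold is crossed, or None."""
    for year in sorted(history):
        if is_automatable(history[year]):
            return year
    return None


# Made-up history for an unnamed job, purely for illustration.
history = {
    2020: MachineBenchmark(False, 3, 50_000),
    2021: MachineBenchmark(True, 12, 20_000),   # too much training, too costly
    2022: MachineBenchmark(True, 6, 10_000),    # exactly at both limits
    2023: MachineBenchmark(True, 2, 500),
}
print(date_of_automation(history))
```

Both limits are inclusive ("or less"), so a machine sitting exactly on the thresholds counts, and later years do not change the date once the threshold has been crossed.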

There are several mechanisms whereby building a dictionary helps solve the problems outlined above.

Less overhead for writing and forecasting questions

The dictionary reduces overhead in two ways: writers don’t have to reinvent the wheel whenever they operationalise a new thought, and forecasters can reduce the drag of constantly interpreting and understanding new resolutions. This makes it cheaper to both generate and answer the marginal question.

A platform for spreading high initial costs over many future use cases

There are a number of common pitfalls that can make a seemingly valid question ambiguous or misleading. For example, positively resolving the question:

Will an AI lab have been nationalized by 2024?

on the grounds that the US government nationalised GM in response to a financial crisis, and GM happened to have a self-driving car research division.

Or forecasting:

When will there be a superhuman Angry Birds agent using no hardcoded knowledge?

and realizing that there seems to be little active interest in the yearly benchmark competition (with performance even declining over the years). This means that the probability depends entirely on whether anyone with enough money and competence decides to work on it, as opposed to what key components make Angry Birds difficult (e.g. physics-based simulation and planning) and how fast progress is in those domains.

Carefully avoiding such pitfalls comes with a high initial cost when writing the question. We can make that cost worth it by ensuring it is amortized across many future questions, and broadly used and built upon. A Dictionary is a piece of infrastructure that provides a standardised way of doing this. If someone spends a lot of time figuring out how to deal with a tricky edge case or a “spurious resolution”, there is now a Schelling point where they can store that work, and expect future users to read it (as well as where future users can expect them to have stored it).

Version management

When resolving and scoring quantitative forecasting questions, it’s important to know exactly what question the forecaster was answering. This need for precision often conflicts with the need to improve the resolution conditions of questions as we learn and stress-test them over time. For the Dictionary, we can use best practices for software version management to help solve this problem. As of this writing, the Dictionary is still in beta, with the latest release being v0.3.0.
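As a rough illustration of why numeric version handling matters here, the sketch below parses release names such as v0.3.0 and compares them as tuples of integers. The function names are ours, and v0.10.0 is a hypothetical future release used only to show the ordering problem; the Dictionary itself does not prescribe any tooling.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn 'v0.3.0' (or '0.3.0') into (0, 3, 0) for numeric comparison."""
    cleaned = text.strip()
    if cleaned.startswith("v"):
        cleaned = cleaned[1:]
    return tuple(int(part) for part in cleaned.split("."))


def is_newer(candidate: str, baseline: str) -> bool:
    """True if `candidate` is a later release than `baseline`."""
    return parse_version(candidate) > parse_version(baseline)


# The release a forecast was made against stays pinned for scoring,
# even if later releases refine the definitions.
pinned = "v0.3.0"
hypothetical_later = "v0.10.0"

print(parse_version(pinned))
print(is_newer(hypothetical_later, pinned))
# Plain string comparison gets this wrong, because "1" sorts before "3".
print(hypothetical_later > pinned)
```

Resolving a question against the pinned release, rather than whatever is latest, is what keeps the forecaster's original question fixed while the definitions continue to improve.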

Open-source serendipity

The Dictionary might be useful not just for forecasting, but also for other contexts where precisely defined AI terms are important. We open-sourced it in order to allow people to experiment with such use cases. If you do so in a substantial way, please let us know.

How to use the dictionary

If you use the Dictionary for forecasting purposes, please reference it to help establish it as a standard of interpretation.

One way of doing this is by appending the tag [ai-dict-vX.Y.Z] at the end of the relevant string. For example:

I predict that image classification will be made robust against unrestricted adversarial examples by 2023. [ai-dict-v2]

Will there be a superhuman Starcraft agent trained using less than $10,000 of publicly available compute by 2025? [ai-dict-v0.4]

In some cases you might want to tweak or change the definitions of a term to match a particular use case, thereby departing from the Dictionary convention. If so, then you SHOULD mark the terms receiving a non-standard interpretation with the “^” symbol. For example:

I expect unsupervised language models to be human-level^ by 2024. [ai-dict-v1.3]

You might also want to add the following notice:

For purposes of resolution, these terms are interpreted in accordance with the Technical AI Forecasting Resolution Dictionary vX.Y.Z, available at ai-dict.com. Any term whose interpretation deliberately departs from this standard has been marked with a ^.
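For anyone who wants to process tagged forecasts automatically, the conventions above are simple enough to parse with regular expressions. The following is a sketch under our own assumptions: the function name and patterns are ours, the tag is expected at the very end of the string, and only single-token terms (letters, digits, hyphens) directly followed by "^" are detected.

```python
import re
from typing import List, Optional, Tuple

# Matches a trailing tag such as [ai-dict-v2], [ai-dict-v0.4] or [ai-dict-v0.3.0].
TAG_PATTERN = re.compile(r"\[ai-dict-v(\d+(?:\.\d+){0,2})\]\s*$")
# Matches a single token immediately followed by the "^" marker.
CARET_PATTERN = re.compile(r"([\w-]+)\^")


def parse_forecast(text: str) -> Tuple[Optional[str], List[str]]:
    """Return (dictionary version or None, terms marked as non-standard)."""
    match = TAG_PATTERN.search(text)
    if match is None:
        return None, CARET_PATTERN.findall(text)
    body = text[: match.start()]
    return match.group(1), CARET_PATTERN.findall(body)


example = (
    "I expect unsupervised language models to be human-level^ by 2024. "
    "[ai-dict-v1.3]"
)
version, nonstandard_terms = parse_forecast(example)
print(version)
print(nonstandard_terms)
```

Multi-word terms would need an explicit delimiter to be recovered reliably, which this sketch does not attempt.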

How to contribute to the dictionary

The AI Forecasting Dictionary is open-source, and you can contribute by making pull-requests to our GitHub or suggestions in the Google Doc version (more details here). We especially welcome:

Attempts to introduce novel definitions that capture important terms in AI (current examples include: “module”, “transformative AI” and “compute (training)”)
Examples of forecasting questions which you wrote and which ended up solving/making progress on some tricky piece of operationalisation, such that others can build on that progress

Footnotes

[1] Some people might be compelled by an analogy to mathematics here: most of the work often lies in setting up the right formalism and problem formulation rather than in the actual proof (for example, Nash’s original fixed point theorems in game theory aren’t that difficult once the set-up is in place, but realising why and how this kind of set-up was applicable to a large class of important problems was highly non-trivial).

[2] English Common Law is the clear example of how definitions and policies evolve over time to crystallize judgements and wisdom.