Weak vs Quantitative Extinction-level Goodhart's Law

VojtaKovarik; Ida Mattsson

This post overlaps with our recent paper Extinction Risk from AI: Invisible to Science?.

tl;dr: With claims such as "optimisation towards a misspecified goal will cause human extinction", we should be more explicit about the order of quantifiers (and the quantities) of the underlying concepts. For example, do we mean that for every misspecified goal, there exists a dangerous amount of optimisation power? Or that there exists an amount of optimisation power that is dangerous for every misspecified goal? (Also, how much optimisation? And how misspecified are the goals?)

Central to the worries about AI risk is the intuition that if we even slightly misspecify our preferences when giving them as input to a powerful optimiser, the result will be human extinction. We refer to this conjecture as Extinction-level Goodhart's Law^[1].

Weak version of Extinction-level Goodhart's Law

To make Extinction-level Goodhart's Law slightly more specific, consider the following definition:

Definition 1: The Weak Version of Extinction-level Goodhart's Law is the claim that: "Virtually any^[2] goal specification, pursued to the extreme, will result in the extinction^[3] of humanity."^[4]

Here, the "weak version" qualifier refers to two aspects of the definition. The first is the limit nature of the claim --- that is, the fact that the law only makes claims about what happens when the goal specification is pursued to the extreme. The second is best understood by contrasting Definition 1 with the following claim:

Definition 2: The Uniform Version of Extinction-level Goodhart's Law is the claim that: "Beyond a certain level of optimisation power, pursuing virtually any goal specification will result in the extinction of humanity."

In other words, the difference between Definitions 1 and 2 is the difference between

( $\forall$ goal G s.t. [conditions]) ( $\exists$ opt. power O) : Optimise(G, O) $\rightsquigarrow$ extinction
( $\exists$ opt. power O) ( $\forall$ goal G s.t. [conditions]) : Optimise(G, O) $\rightsquigarrow$ extinction.
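
To make the quantifier order concrete, here is a minimal Python sketch. It rests on a purely hypothetical toy model that is not part of the post: goals are indexed by positive integers, and goal n leads to extinction exactly when the optimisation power reaches n. In this family, every goal has its own dangerous power level (the first statement holds), yet any proposed universal level is safe for some goal (the second statement fails). On a finite sample of goals, taking the maximum threshold does give a universal level.

```python
# Hypothetical toy model (an assumption for illustration only):
# goal n causes extinction exactly when optimisation power >= n.

def extinction_threshold(goal: int) -> float:
    """Smallest power at which pursuing this goal causes extinction."""
    return float(goal)


def causes_extinction(goal: int, power: float) -> bool:
    return power >= extinction_threshold(goal)


def weak_witness(goal: int) -> float:
    """For one given goal, return a power level that is dangerous for it."""
    return extinction_threshold(goal)


def uniform_counterexample(power: float) -> int:
    """For any proposed universal power level, return a goal it is safe for."""
    return int(power) + 1


sample_goals = range(1, 1001)

# "For every goal there exists a dangerous power": holds goal by goal.
weak_holds = all(causes_extinction(g, weak_witness(g)) for g in sample_goals)

# "There exists a power dangerous for every goal": each candidate level
# is refuted by a goal from the unbounded family.
candidate_powers = [0.0, 10.0, 1e6]
uniform_fails = all(
    not causes_extinction(uniform_counterexample(p), p) for p in candidate_powers
)

# Restricted to a finite sample, the maximum threshold is a universal level.
finite_universal = max(extinction_threshold(g) for g in sample_goals)
finite_uniform_holds = all(
    causes_extinction(g, finite_universal) for g in sample_goals
)
```

The contrast only shows up because the thresholds in this toy family are unbounded. Whether realistic goal specifications behave this way is exactly the kind of question the two definitions leave open.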

Quantitative version of Goodhart's Law

While the distinction^[5] between the weak and uniform versions of Extinction-level Goodhart's Law is important, it is too coarse to be useful for questions such as:

For a given goal specification, what are the thresholds at which (i) additional optimisation becomes undesirable, (ii) we would have been better off not optimising at all, and (iii) the optimisation results in human extinction?
Should we expect warning shots?
Does iterative goal specification work? That is, can we keep optimising until things start to get worse, then spend a bit more effort on specifying the goal, and repeat?

This highlights the importance of finding the correct quantitative version of the law:

Definition 3: A Quantitative Version of Goodhart's Law is any claim that describes the relationship between a goal specification, optimisation power used to pursue it, and the (un)desirability of the resulting outcome.

We could also look specifically at the quantitative version of extinction-level Goodhart's law, which would relate the quality of the goal specification to the amount of optimisation power that can be used before pursuing this goal specification results in extinction.
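
As an illustration of what a quantitative version might look like, the following Python sketch computes the three thresholds from the questions above in a toy model. Everything in it is an assumption made for illustration: the outcome function `true_value(power, eps) = power - eps * power**2`, the misspecification parameter `eps`, and the `extinction_level` are hypothetical and not taken from the post or the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Thresholds:
    diminishing: float  # (i) beyond this, extra optimisation makes things worse
    net_harm: float     # (ii) beyond this, not optimising would have been better
    extinction: float   # (iii) at this point the outcome hits the extinction level


def true_value(power: float, eps: float) -> float:
    """Hypothetical outcome: linear benefit minus a misspecification cost."""
    return power - eps * power ** 2


def thresholds(eps: float, extinction_level: float) -> Thresholds:
    """Closed-form thresholds for the toy model (eps > 0, extinction_level < 0)."""
    if eps <= 0 or extinction_level >= 0:
        raise ValueError("need eps > 0 and extinction_level < 0")
    diminishing = 1 / (2 * eps)   # maximiser of true_value
    net_harm = 1 / eps            # positive root of true_value(power) = 0
    # Larger root of eps * power**2 - power + extinction_level = 0.
    extinction = (1 + math.sqrt(1 - 4 * eps * extinction_level)) / (2 * eps)
    return Thresholds(diminishing, net_harm, extinction)


coarse = thresholds(eps=0.1, extinction_level=-20.0)
refined = thresholds(eps=0.05, extinction_level=-20.0)
```

In this toy model, halving `eps` (a better goal specification) doubles thresholds (i) and (ii) and pushes the extinction threshold further out. A real quantitative law would need to justify the shape of the outcome function rather than assume it.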

Implications for AI-risk discussions

Finally, note that the weak version of Extinction-level Goodhart's Law is consistent with AI not posing any actual danger of human extinction. This could happen for several reasons, including (i) because the dangerous amounts of optimisation power are unreachable in practice, (ii) because we adopt goals which are not amenable to using unbounded (or any) amounts of optimisation^[6], or (iii) because iterative goal specification is in practice sufficient for avoiding catastrophes. As a result, we view the weak version of Extinction-level Goodhart's Law as a conservative claim, which the proponents and sceptics of AI risk might agree on while still disagreeing on the quantitative details of Goodhart's law.
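
Reason (iii) can be sketched with a small simulation. The model is again hypothetical and chosen only for illustration: the same quadratic toy outcome `power - eps * power**2`, an assumed extinction level of -20, and a re-specification rule that halves `eps` whenever the outcome gets worse. With small increases in optimisation power, the decline is noticed before it becomes fatal. With a large jump, extinction arrives before any warning shot.

```python
def true_value(power: float, eps: float) -> float:
    """Hypothetical outcome: linear benefit minus a misspecification cost."""
    return power - eps * power ** 2


def run_iterative_specification(
    step: float,
    eps: float = 0.1,
    extinction_level: float = -20.0,
    rounds: int = 50,
) -> str:
    """Raise optimisation power by `step` each round; re-specify on decline."""
    power = 0.0
    previous_value = 0.0
    for _ in range(rounds):
        power += step
        value = true_value(power, eps)
        if value <= extinction_level:
            return "extinction"  # no chance to react
        if value < previous_value:
            # Warning shot: things got worse, so improve the specification
            # (modelled here, as an assumption, by halving eps).
            eps /= 2
            value = true_value(power, eps)
        previous_value = value
    return "survived"


gradual = run_iterative_specification(step=1.0)
abrupt = run_iterative_specification(step=25.0)
```

In this sketch the outcome depends entirely on how the step size compares with the gap between the "things start to get worse" threshold and the extinction threshold, which is the kind of quantitative detail the weak version says nothing about.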

^[1]
Admittedly, "Extinction-level Goodhart's Law" is a very cumbersome name, so a better name is definitely desirable. We also considered the more catchy "Catastrophic Goodhart's Law", but we were afraid of misinterpretations where "catastrophe" is interpreted metaphorically. (Perhaps somebody should get Yudkowsky and Goodhart to co-author something on this topic, and then we could call it "Goodkowsky's Law".)
^[2]
There are some obvious counterexamples that need to be included in the "virtually any" exception if any version of Extinction-level Goodhart's Law is to hold. The first is trivial goals which do not incentivise the AI to do much of anything --- for example, "take the shut-down action now" or "any sequence of actions is as good as any other". I also expect that with enough effort, one can actually specify human values perfectly (with all their quirks and inconsistencies), in a way that avoids Goodhart's Law. But "enough effort" could easily prove to be prohibitively high in practice.

Either way, the goal of this post is to hint at the differences between the different variants of Goodhart's Law, not to provide an authoritative formulation of it.
^[3]
A common reaction is that the definition should also include outcomes that are similarly bad or worse than extinction. While we agree that such a definition makes sense, we would prefer to refer to that version as "existential", and reserve "extinction" for the less ambiguous notion of literal extinction.
^[4]
Definition 1, as well as Definition 2, is clearly still too vague to constitute a proper claim which can be evaluated as true or false. For example, making the definitions less vague would require operationalising what we mean by "goal specification", "virtually any goal specification", and "optimisation power".
However, the key purpose of the definitions is to contrast the different versions of Extinction-level Goodhart's Law.
^[5]
One could argue that the weak version of Extinction-level Goodhart's Law automatically implies the uniform version, because we can set $O_{\mathrm{universal}} := \max_{G} O_{G}$, where $O_{G}$ denotes the smallest amount of optimisation power at which pursuing $G$ causes extinction. (This is similar to how a continuous function defined on a closed, bounded interval is automatically uniformly continuous.)
Apart from the possibility that the maximum could be infinite, this argument is valid. However, it misses the fact that we care about the smallest amount of optimisation power that is dangerous for a given goal. The quantitative version of Extinction-level Goodhart's Law is meant to remedy this issue.
^[6]
For example, one might argue that an AI system that pursues a goal formulated on the basis of a rule or in fulfillment of a virtue is not well-described as optimising a (terminal) goal. Extreme applications of Goodhart's Law might in such cases simply not apply, because not all sources of value for advancing the goal can be subject to comparison, and thus to tradeoffs. (Of course, the question of how to actually build an AI system that robustly does this is a different matter.)
