Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for http://arxiv.org/abs/2403.05540

Abstract: In an effort to inform the discussion surrounding existential risks from AI, we formulate Extinction-level Goodhart's Law as "Virtually any goal specification, pursued to the extreme, will result in the extinction[1] of humanity", and we aim to understand which formal models are suitable for investigating this hypothesis. Note that we remain agnostic as to whether Extinction-level Goodhart's Law holds or not. As our key contribution, we identify a set of conditions that are necessary for a model that aims to be informative for evaluating specific arguments for Extinction-level Goodhart's Law. Since each of the conditions seems to significantly contribute to the complexity of the resulting model, formally evaluating the hypothesis might be exceedingly difficult. This raises the possibility that, whether the risk of extinction from artificial intelligence is real or not, the underlying dynamics might be invisible to current scientific methods.


Together with Chris van Merwijk and Ida Mattsson, I have recently written a philosophy-venue version of some of our thoughts on Goodhart's Law in the context of powerful AI [link].[2] This version of the paper has no math in it, but it attempts to point at one aspect of "Extinction-level Goodhart's Law" that seems particularly relevant for AI advocacy – namely, that the fields of AI and CS would have been unlikely to come across evidence of AI risk using the methods popular in those fields, even if the law did hold in the real world.

Since commenting on link-posts is inconvenient, I split off some of the ideas from the paper into the following separate posts:

We have more material on this topic, including writing with math[3] in it, but this is mostly not yet in a publicly shareable form. The exception is the post Extinction-level Goodhart's Law as a Property of the Environment (which is not covered by the paper). If you are interested in discussing anything related to this, definitely reach out.

  1. ^

    A common comment is that the definition should also include outcomes that are comparably bad to, or worse than, extinction. While we agree that such a definition makes sense, we would prefer to refer to that version as "existential", and to reserve the "extinction" version for the less ambiguous notion of literal extinction.

  2. ^

    As an anecdote, it seems worth mentioning that I tried, and failed, to post the paper to arXiv: by now, it has been stuck there with "on hold" status for three weeks. Given that the paper is called "Existential Risk from AI: Invisible to Science?", there must be some deeper meaning to this. [EDIT: After ~2 months, the paper is now on arXiv.]

  3. ^

    Or rather, it has pseudo-math in it, by which I mean that it looks like math but is built on top of vague concepts such as "optimisation power" and "specification complexity". While I hope that we will one day be able to formalise these, I don't know how to do so at this point.

7 comments

I think literal extinction from AI is a somewhat odd outcome to study, as it heavily depends on difficult-to-reason-about properties of the world (e.g. the probability that aliens would trade substantial sums of resources for emulated human minds, and the way acausal trade works in practice).

For more discussion see here, here, and here.

[I haven't read the sequence; I'm just responding to the focus on extinction. In practice, my claims in this comment mean that I disagree with the arguments under "Human extinction as a convergently instrumental subgoal".]

That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point isn't so much to figure out whether the particular arguments go through or not, but to ask which properties your model must have if you want to be able to evaluate those arguments rigorously.

What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?

(My current take is that, for the majority of readers, sticking to "literal extinction" is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)

I would say "catastrophic outcome (>50% chance the AI kills >1 billion people)" or something, and then add a footnote. Not sure though. The standard approach is to say "existential risk".

Quick reaction:

  • I didn't want to use the ">1 billion people" formulation, because that is compatible with scenarios where a catastrophe or an accident happens, but we still end up in control of the future.
  • I didn't want to use "existential risk", because that includes scenarios where humanity survives but has net-negative effects (say, bad versions of Age of Em or humanity spreading factory farming across the stars).
  • And for the purpose of this sequence, I wanted to look at the narrower class of scenarios where a single misaligned AI/optimiser/whatever takes over and does its thing, which probably includes getting rid of literally everyone, modulo some important (but probably not decision-relevant?) questions about anthropics and negotiating with aliens.

Maybe "full loss-of-control to AIs"? Idk.

What about the term "uncaring AI"? That is, an AI that would keep humans alive if offered resources to do so. This can be contrasted with a Suffering-Reducing AI (SRAI), which would not keep humans alive in exchange for resources. SRAI is an example of successfully hitting a bad alignment target, which is an importantly different class of dangers from an aiming failure that leads to an uncaring AI. While an uncaring AI would happily agree to leave Earth alone in exchange for resources, this is not the case for SRAI, because killing humans is inherent in the core concept of reducing suffering. Any reasonable set of definitions simply leads to a version of SRAI that rejects all such offers (assuming that the AI project aiming for the SRAI alignment target manages to successfully hit it).

The term "uncaring AI" is not meant to imply that the AI does not care about anything, just that it does not care about anything that humans care about, such as human lives. This means that the question of extinction (and everything else that humans care about) is entirely determined by strategic considerations. The dangers stemming from the case where an aiming failure leads to an uncaring AI by accident are importantly different from the dangers stemming from a design team that successfully hits a bad alignment target. How about including a footnote saying that you use Extinction as a shorthand for an outcome where humans are completely powerless, and where the fate of every living human is fully determined by an AI that does not care about anything that any human cares about? (And perhaps capitalise Extinction in the rest of the text, and mention in that same footnote that if a neighbouring AI would pay such an uncaring AI to keep humans alive, then it would happily do so.)

If an AI project succeeds, then it matters a lot which alignment target the project was aiming for. Different bad alignment targets imply different behaviours. Consider a Suffering-Maximising AI (SMAI). If an SMAI project successfully hits the SMAI alignment target, the resulting AI would create a lot of people. This is again importantly different from the case where the project fails to provide a reasonable definition of Suffering, and the resulting AI goes on to create little molecular pictures of sad faces or something similar. This molecular-sad-face AI is, from a human perspective, basically the same as an AI that creates a lot of molecular squiggles due to some other type of failure, unrelated to the definition of Suffering. They are both uncaring, and will both treat humans in a fully strategic manner. And both of these uncaring AIs lead to the same outcome that a failed SRAI project would lead to (whether it fails through definitional issues or through something else that produces a lot of squiggles). They all treat everything that humans care about in a fully strategic manner (they all lead to Extinction, as defined in the proposed footnote above). But a successful SMAI project would be importantly different from a successful SRAI project (which would actually lead to extinction in the standard usage of that word, but would not lead to Extinction as defined above). In the case of SMAI, different sets of reasonable definitions also lead to importantly different outcomes (if the definition of suffering is reasonable, then the outcome of a successful SMAI project will be bad, but different definitions still lead to importantly different outcomes). It is important to separate these dangers from the dangers stemming from an uncaring AI, because doing so allows us to explain why it matters which alignment target an AI project is aiming for.

Let's say that two well-intentioned designers, named Dave and Bill, both share your values. Dave is proposing an AI project aiming for SRAI, and Bill is proposing an AI project aiming for SMAI. Both projects have a set of very clever safety measures. Both Dave and Bill say that if some specific attempt to describe SRAI / SMAI fails and leads to a bad outcome, then their safety measures will keep everyone safe, and they will find a better description (and they are known to be honest, and known to share your definition of what counts as a bad outcome; but it is of course not certain that their safety measures will actually hold). If you can influence the decision of which project gets funded, then it seems very important to direct the funding away from Bill, because the idea that clever safety measures make the alignment target irrelevant is just straightforwardly false. An actual project, started in the actual real world, that is fully dedicated to the idea of carefully trying things out with soft maximisation might of course still result in non-soft maximisation of the alignment target being aimed at (and it is better if this happens to Dave's proposed project than to Bill's). Conditional on an SMAI project being inevitable, safety measures might be very useful (they might hold up until Bill finally discovers that he is wrong, until Bill finally sees that SMAI is a bad alignment target). Using your example safety measure, we can phrase this as: even if Bill sincerely intends to use iterative goal specification, it is still of course possible that his proposed SMAI project will end in a successful implementation of the aimed-for alignment target, SMAI. It is possible that Bill will successfully use iterative goal specification to avoid catastrophe, but this is not guaranteed. Thus, Bill should still not aim for the SMAI alignment target. This conclusion is completely independent of the specifics of the set of safety measures involved. In other words, it is important to separate the issue of safety measures from the question of whether or not a project is aiming for a bad alignment target.
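To make the "soft maximisation" terminology above slightly more concrete, here is a minimal toy sketch of one common reading of it (quantilizer-style selection over a proxy utility, rather than taking the argmax). The function, its parameters, and the stand-in proxy below are illustrative assumptions only, not anything specified in the paper or in this thread.

```python
import random

def soft_maximise(actions, proxy_utility, q=0.1, rng=None):
    """Toy quantilizer-style 'soft maximisation'.

    Instead of returning the single action with the highest proxy utility
    (hard maximisation, the case that Goodhart-style arguments worry about),
    sample uniformly from the top q-fraction of actions. Smaller q means
    more optimisation pressure is applied to the (possibly misspecified) proxy.
    """
    rng = rng or random.Random()
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))  # always keep at least one action
    return rng.choice(ranked[:cutoff])

# Hypothetical usage with a stand-in proxy utility.
candidate_actions = list(range(1000))
chosen = soft_maximise(candidate_actions, proxy_utility=lambda a: a, q=0.05)
print(chosen)  # some action from the top 5% of candidates, as ranked by the proxy
```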

In yet other words: it seems important to be able to separate dangers related to successfully hitting bad alignment targets from dangers related to aiming failures. In the case of a project that successfully hits the alignment target it is aiming for, it matters a lot which alignment target that was (and in the case of SMAI, the details of the definitions also influence the outcome in an important way). We need to maintain the ability to, for example, differentiate the type of extinction outcome that SRAI implies from the type of Extinction outcome that you are discussing.

For more on the dangers involved with successfully hitting the wrong alignment target, see:

A problem with the most recently published version of CEV 

(In the context of the linked philpapers paper, this is a word-choice comment about the word "extinction". It is not a comment on the set of conditions that the paper identifies as necessary for the purpose of evaluating arguments about these things. If taken extremely literally, your current word choices would imply that you should add at least one more condition to your set: specifically, the condition that the model must be able to talk about the strategic incentives that an uncaring AI would face (for example from a neighbouring AI that the uncaring AI expects it might run into in the distant future). Specifically, the additional condition is necessary for evaluating A2 and/or A3 (depending on how the text is interpreted). A model without this additional condition is useless for evaluating extinction arguments, in the same sense that a model with a static moon is useless for evaluating arguments about rockets hitting the moon (they both fail on the first step). But I think that the footnote above is probably more in line with what you are trying to do with your paper. In other words: since you presumably do not want to add a strategic-environment condition to your set of conditions, you will presumably prefer to add a footnote and switch from extinction to Extinction (since this new condition would be necessary for evaluating arguments about extinction, but would not be necessary for evaluating arguments about Extinction).)