My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering about "consequentialist reasoning").


Nitpick on the framing: I feel that thinking about "misaligned decision-makers" as an "irrational" reason for war could contribute to (mildly) misunderstanding or underestimating the issue.

To elaborate: The "rational vs irrational reasons" distinction talks about the reasons using the framing where states are viewed as monolithic agents who act in "rational" or "irrational" ways. I agree that for the purpose of classifying the risks, this is an ok way to go about things.

I wanted to offer an alternative framing of this, though: For any state, we can consider the abstraction where all people in that state act in harmony to pursue the interests of the state. And then there is the more accurate abstraction where the state is made of individual people with imperfectly aligned interests, who each act optimally to pursue those interests, given their situation. And then there is the model where the individual humans are misaligned and make mistakes. And then you can classify the reasons based on which abstraction you need to explain them.

[I am confused about your response. I fully endorse your paragraph on "the AI with superior ontology would be able to predict how humans would react to things". But then the follow-up, on when this would be scary, seems mostly irrelevant / wrong to me --- meaning that I am missing some implicit assumptions, misunderstanding how you view this, etc. I will try to react in a hopefully-helpful way, but I might be completely missing the mark here, in which case I apologise :).]

I think the problem is that there is a difference between:
(1) AI which can predict how things score in human ontology; and
(2) AI which has "select things that score high in human ontology" as part of its goal[1].
And then, in the worlds where the natural abstraction hypothesis is false: Most AIs achieve (1) as a by-product of the instrumental sub-goal of having low prediction error / being selected by our training processes / being able to manipulate humans. But successfully achieving (2) for a powerful AI would require the natural abstraction hypothesis to hold[2].

And this leaves us with two options. First, maybe we just have no write access to the AI's utility function at all. (EG, my neighbour would be very happy if I gave him $10k, but he doesn't have any way of making me (intrinsically) desire doing that.) Second, we might have write access to the AI's utility function, but not in a way that will lead to predictable changes in goals or behaviour. (EG, if you give me full access to the weights of an LLM, it's not like I know how to use that to turn that LLM into an actually-helpful assistant.)
(And both of these seem scary to me, because of the argument that "not-fully-aligned goal + extremely powerful optimisation ==> extinction". Which I didn't argue for here.)

  1. ^

    IE, not just instrumentally because it is pretending to be aligned while becoming more powerful, etc.

  2. ^

    More precisely: Damn, we need better terminology here. The way I understand things, the "natural abstraction hypothesis" is the claim that most AIs will converge to an ontology that is similar to ours. The negation of that is that a non-trivial portion of AIs will use an ontology that is different from ours. What I subscribe to is that "almost no powerful AIs will use an ontology that is similar to ours". Let's call that the "strong negation" of the natural abstraction hypothesis. So achieving (2) would be a counterexample to this strong negation.
    Ironically, I believe the strong negation hypothesis because I expect that very powerful AIs will arrive at similar ways of modelling the world --- and those are all different from how we model the world.

Nitpicky edit request: your comment contains some typos that make it a bit hard to parse ("be other", "we it"). (So apologies if my reaction misunderstands your point.)

[Assuming that the opposite of the natural abstraction hypothesis is true --- ie, not just that "not all powerful AIs share ontology with us", but actually "most powerful AIs don't share ontology with us":]
I also expect that an AI with superior ontology would be able to answer your questions about its ontology, in a way that would make you feel like[1] you understand what is happening. But that isn't the same as being able to control the AI's actions, or being able to affect its goal specification in a predictable way (to you). You totally wouldn't be able to do that.

([Vague intuition, needs work] I suspect that if you had a method for predictably-to-you translating from your ontology to the AI's ontology, then this could be used to prove that you can easily find a powerful AI that shares an ontology with us. Because that AI could be basically thought of as using our ontology.)

  1. ^

    Though note that unless you switched to some better ontology, you wouldn't actually understand what is going on, because your ontology is so bogus that it doesn't even make sense to talk about "you understanding [stuff]". This might not be true for all kinds of [stuff], though. EG, perhaps our understanding of set theory is fine while our understanding of agency, goals, physics, and whatever else, isn't.

As a quick reaction, let me just note that I agree that (all else being equal) this (ie, "the AI understanding us & having superior ontology") seems desirable. And also that my comment above did not present any argument about why we should be pessimistic about AI X-risk if we believe that the natural abstraction hypothesis is false. (I was just trying to explain why/how "the AI has a different ontology" is compatible with "the AI understands our ontology".)

As a longer reaction: I think my primary reason for pessimism, if the natural abstraction hypothesis is false, is that a bunch of existing proposals might work if the hypothesis were true, but don't work if the hypothesis is false. (EG, if the hypothesis is true, I can imagine that "do a lot of RLHF, and then ramp up the AI's intelligence" could just work. Similarly for "just train the AI to not be deceptive".)

If I had to gesture at an underlying principle, then perhaps it could be something like: Suppose we successfully code up an AI which is pretty good at optimising, or create a process which gives rise to such an AI. [Inference step missing here.] Then the goals and planning of this AI will be happening in some ontology which allows for low prediction error. But this will be completely alien to our ontology. [Inference step missing here.] And, therefore, things that score very highly with respect to these ("alien") goals will have roughly no value[1] according to our preferences.
(I am not quite clear on this, but I think that if this paragraph were false, then you could come up with a way of falsifying my earlier description of what it looks like when the natural abstraction hypothesis is false.)

  1. ^

    IE, no positive value, but also no negative value. So no S-risk.

Simplifying somewhat: I think that my biggest delta with John is that I don't think the natural abstraction hypothesis holds. (EG, if I believed it holds, I would become more optimistic about single-agent alignment, to the point of viewing Moloch as higher priority.) At the same time, I believe that powerful AIs will be able to understand humans just fine. My vague attempt at reconciling these two is something like this:

Humans have some ontology, in which they think about the world. This corresponds to a world model. This world model has a certain amount of prediction error.

The powerful AI wants to have much lower prediction error than that. When I say "natural abstraction hypothesis is false", I imagine something like: If you want to have a much lower prediction error than that, you have to use a different ontology / world-model than humans use. And in fact if you want sufficiently low error, then all ontologies that can achieve that are very different from our ontology --- either (reasonably) simple and different, or very complex (and, I guess, therefore also different).
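
To gesture at this more concretely, here is one way the "strong negation" could be written down (the notation and thresholds are mine, and this is very much a sketch):

```latex
% Sketch formalisation (my notation; err, d, and the thresholds are
% illustrative assumptions, not standard definitions). Let M_H be the
% human world model, err(M) a world model's prediction error, and
% d(.,.) some distance between world models / ontologies.
\[
  \exists\, \varepsilon \ll \operatorname{err}(M_H),\ \exists\, \delta > 0:
  \qquad
  \forall M:\ \operatorname{err}(M) \le \varepsilon
    \;\Longrightarrow\; d(M, M_H) \ge \delta .
\]
% That is: every world model accurate enough to be much better than
% ours is far away from ours.
```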

So when the AI "understands humans perfectly well", that means something like: The AI can visualise the flawed (ie, high prediction error) model that we use to think about the world. And it does this accurately. But it also sees how the model is completely wrong, and how the things that we say we want only make sense in that model, which has very little to do with the actual world.

(An example would be how a four-year-old might think about the world in terms of Good people and Evil people. The government sometimes does Bad things because there are many Evil people in it. And then the solution is to replace all the Evil people by Good people. And that might internally make sense, and maybe an adult can understand this way of thinking, while also being like "this has nothing to do with how the world actually works; if you want to be serious about anything, just throw this model out".)

An illustrative example, describing a scenario that is similar to our world, but where "Extinction-level Goodhart's law" would be false & falsifiable (hat tip Vincent Conitzer):

Suppose that we somehow only start working on AGI many years from now, after we have already discovered a way to colonise the universe at close to the speed of light. And some of the colonies are already unreachable, outside of our future lightcone. But suppose we still understand "humanity" as the collection of all humans, including those in the unreachable colonies. Then any AI that we build, no matter how smart, would be unable to harm these portions of humanity. And thus full-blown human extinction, from AI we build here on Earth, would be impossible. And you could "prove" this using a simple, yet quite rigorous, physics argument.[1]
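
(A hedged back-of-envelope version of that argument, ignoring cosmological expansion, which would only strengthen the bound:)

```latex
% Back-of-envelope sketch (special relativity only; my framing, not a
% full cosmological treatment). Nothing done here at time t_0 can
% influence an event at spatial distance d before
\[
  t \;\ge\; t_0 + \frac{d}{c},
\]
% so a colony that has left our future light cone can never be affected
% by any AI built on Earth, however capable --- short of the "weird
% magic" mentioned in the footnote.
```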

(To be clear, I am not saying that "AI X-risk's unfalsifiability is justifiable ==> we should update in favour of AI X-risk compared to our priors". I am just saying that the justifiability means we should not update against it compared to our priors. Though I guess that in practice, it means that some people should undo some of their updates against AI X-risk...)

  1. ^

    And sure, maybe some weird magic is actually possible, and the AI could beat the speed of light. But whatever, I am ignoring this, and an argument like this would count as falsification as far as I am concerned.

FWIW, I acknowledge that my presentation of the argument isn't ironclad, but I hope that it makes my position a bit clearer. If anybody has ideas for how to present it better, or has some nice illustrative examples, I would be extremely grateful.

tl;dr: "lack of rigorous arguments for P is evidence against P" is typically valid, but not in the case of P = AI X-risk.

A high-level reaction to your point about unfalsifiability:
There seems to be a general sentiment that "AI X-risk arguments are unfalsifiable ==> the arguments are incorrect" and "AI X-risk arguments are unfalsifiable ==> AI X-risk is low".[1] I am very sympathetic to this sentiment --- but I also think that in the particular case of AI X-risk, it is not justified.[2] For quite non-obvious reasons.

Why do I believe this?
Take this simplified argument for AI X-risk:

  1. Some important future AIs will be goal-oriented, or will at least sometimes behave in a goal-oriented way[3]. (Read: If you think of them as trying to maximise some goal, you will make pretty good predictions.[4])
  2. The "AI-progress tech-tree" is such that discontinuous jumps in impact are possible. In particular, we will one day go from "an AI that is trying to maximise some goal, but not doing a very good job of it" to "an AI that is able to treat humans and other existing AIs as 'environment', and is going to do a very good job at maximising some goal".
  3. For virtually any[5] goal specification, doing a sufficiently[6] good job at maximising that goal specification leads to an outcome where every human is dead.

FWIW, I think that having a strong opinion on (1) and (2), in either direction, is not justified.[7] But in this comment, I only want to focus on (3) --- so let's please pretend, for the sake of this discussion, that we find (1) and (2) at least plausible. What I claim is that even if we lived in a universe where (3) is true, we should still expect even the best arguments for (3) (that we might realistically identify) to be unfalsifiable --- at least given realistic constraints on falsification effort, and assuming that we use rigorous standards for what counts as solid evidence, like people do in mathematics, physics, or CS.

What is my argument for "even best arguments for (3) will be unfalsifiable"?
Suppose you have an environment E that contains a Cartesian agent (a thing that takes actions in the environment and -- let's assume for simplicity -- has perfect information about the environment, but whose decision-making computation happens outside of the environment). And suppose that this agent acts in a way that maximises[8] some goal specification[9] over E. Now, E might or might not contain humans, or representations of humans. We can now ask the following question: Is it true that, unless we spend an extremely high amount of effort (eg, >5 civilisation-years), any (non-degenerate[10]) goal specification we come up with will result in human extinction[11] in E when maximised by the agent? I refer to this as "Extinction-level Goodhart's Law".
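
To gesture at the question a bit more formally (notation mine; a sketch, not a precise definition):

```latex
% Sketch formalisation (notation mine). Let E be the environment,
% G a goal specification, pi*_G the policy of the Cartesian agent that
% maximises G over E, and alive(E, pi) the event that the
% (representations of) humans in E survive under policy pi.
%
% Extinction-level Goodhart's Law for E:
\[
  \forall G \in \mathcal{G}_{\le B} \setminus \mathcal{G}_{\mathrm{deg}} :
  \quad \neg\, \mathrm{alive}\!\left(E, \pi^{*}_{G}\right),
\]
% where G_{<=B} is the set of goal specifications we could come up with
% using at most B units of effort (eg, B = 5 civilisation-years), and
% G_deg excludes the degenerate cases from footnote 10.
```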

I claim that:
(A) Extinction-level Goodhart's Law plausibly holds in the real world. (At least the thought experiments I know of, eg here or here, suggest it does.)
(B) Even if Extinction-level Goodhart's Law were true in the real world, it would still be false in environments where we could verify it experimentally (today, or soon) or mathematically (by proofs, given realistic amounts of effort).
==> And (B) implies that if we want "solid arguments", rather than just thought experiments, we might be kinda screwed when it comes to Extinction-level Goodhart's Law.

And why do I believe (B)? The long story is that I try to gesture at this in my sequence on "Formalising Catastrophic Goodhart". The short story is that there are many strategies for finding "safe to optimise" goal specifications that work in simpler environments, but not in the real world (examples below). So to even start gaining evidence on whether the law holds in our world, we need to investigate environments where those simpler strategies don't work --- and it seems to me that those are always too complex for us to analyse mathematically, or to run an AI there which could "do a sufficiently good job at trying to maximise the goal specification".
Some examples of the above-mentioned strategies for finding safe-to-optimise goal specifications:
(i) The environment contains no (representations of) humans, or those "humans" can't "die", so it doesn't matter. EG, most gridworlds.
(ii) The environment doesn't have any resources or similar things that would give rise to convergent instrumental goals, so it doesn't matter. EG, most gridworlds.
(iii) The environment allows for a simple formula that checks whether "humans" are "extinct", so just add a huge penalty if that formula holds. EG, most gridworlds where you added "humans".
(iv) There is a limited set of actions that result in "killing" "humans", so just add a huge penalty to those.
(v) There is a simple formula for expressing a criterion that limits the agent's impact. EG, "don't go past these coordinates" in a gridworld.
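
To make strategy (iii) concrete, here is a minimal sketch in Python (all names, the grid layout, and the penalty constant are hypothetical and mine, purely for illustration). The point is that in a gridworld, the "humans are extinct" predicate is a one-line check, so bolting a huge penalty onto any base reward makes it safe-ish to optimise --- and nothing like this predicate is available for the real world:

```python
from dataclasses import dataclass, field

@dataclass
class GridState:
    agent: tuple                               # agent position (x, y)
    humans: set = field(default_factory=set)   # positions of alive "humans"

def humans_extinct(state: GridState) -> bool:
    # Trivial in a gridworld: "extinct" just means the set is empty.
    return len(state.humans) == 0

def base_reward(state: GridState) -> float:
    # Some task reward; here, 1 for standing on the goal cell (3, 3).
    return 1.0 if state.agent == (3, 3) else 0.0

EXTINCTION_PENALTY = 1e9   # hypothetical "huge" constant

def patched_reward(state: GridState) -> float:
    # Strategy (iii): add a huge penalty whenever the extinction
    # predicate holds, so any maximiser keeps the "humans" alive.
    reward = base_reward(state)
    if humans_extinct(state):
        reward -= EXTINCTION_PENALTY
    return reward

if __name__ == "__main__":
    alive = GridState(agent=(3, 3), humans={(0, 0), (1, 2)})
    extinct = GridState(agent=(3, 3), humans=set())
    print(patched_reward(alive))    # 1.0
    print(patched_reward(extinct))  # 1.0 - 1e9
```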

Altogether, this should explain why the "unfalsifiability" counter-argument does not carry as much weight, in the case of AI X-risk, as one might intuitively expect.

  1. ^

    If I understand you correctly, you would endorse something like this? Quite possibly with some disclaimers, ofc. (Certainly I feel that many other people endorse something like this.)

  2. ^

    I acknowledge that the general heuristic "argument for X is unfalsifiable ==> the argument is wrong" holds in most cases. And I am aware we should be sceptical whenever somebody goes "but my case is an exception!". Despite this, I still believe that AI X-risk genuinely is different from invisible dragons in your garage and conspiracy theories.

    That said, I feel there should be a bunch of other examples where the heuristic doesn't apply. If you have some that are good, please share!

  3. ^

    An example of this would be if GPT-4 acted like a chatbot most of the time, but tried to take over the world when prompted with "act as a paperclipper".

  4. ^

    And this way of thinking about them is easier -- description length, etc -- than other options. EG, no "water bottles maximising being a water bottle".

  5. ^

    By "virtually any" goal specification (leading to extinction when maximised), I mean that finding a goal specification for which extinction does not happen (when maximised) is extremely difficult. One example of operationalising "extremely difficult" would be "if our civilisation spent all its efforts on trying to find some goal specification, for 5 years from today, we would still fail". In particular, the claim (3) is meant to imply that if you do anything like "do RLHF for a year, then optimise the result extremely hard", then everybody dies.

  6. ^

    For the purposes of this simplified AI X-risk argument, the AIs from (2), which are "very good at maximising a goal", are meant to qualify for the "sufficiently good job at maximising a goal" from (3). In practice, this is of course more complicated --- see e.g. my post on Weak vs Quantitative Extinction-level Goodhart's Law.

  7. ^

    Or at least there are no publicly available writings, known to me, which could justify claims like "It's >=80% likely that (1) (or 2) holds (or doesn't hold)". Of course, (1) and (2) are too vague for this to even make sense, but imagine replacing (1) and (2) by more serious attempts at operationalising the ideas that they gesture at.

  8. ^

    (or does a sufficiently good job of maximising)

  9. ^

    Most reasonable ways of defining what "goal specification" means should work for the argument. As a simple example, we can think of having a reward function R : states → ℝ and maximising the sum of R(s) over any long time horizon.

  10. ^

    To be clear, there are some trivial ways of avoiding Extinction-level Goodhart's Law. One is to consider a constant utility function, which means that the agent might as well take random actions. Another would be to use reward functions in the spirit of "shut down now, or get a huge penalty". And there might be other weird edge cases.
    I acknowledge that this part should be better developed. But in the meantime, hopefully it is clear -- at least somewhat -- what I am trying to gesture at.

  11. ^

    Most environments won't contain actual humans. So by "human extinction", I mean "metaphorical humans being metaphorically dead". EG, if your environment was Pacman, then the natural thing would be to view Pacman as representing a "human", and being eaten by the ghosts as representing "extinction". (Not that this would be a good model for studying X-risk.)

Assumption 2 is, barring rather exotic regimes far into the future, basically always correct, and for irreversible computation, this always happens, since there's a minimum cost to increase the features IRL, and it isn't 0.

Increasing utility IRL is not free.

I think this is a misunderstanding of what I meant. (And it probably only makes sense to try to clear up the misunderstanding if you have read the paper and disagree with my interpretation of it, rather than if your reaction is only based on my summary. Not sure which of the two is the case.)

What I was trying to say is that the most natural interpretation of the paper's model does not allow for things like: In state 1, the world is exactly as it is now, except that you decided to sleep on the floor every day instead of in your bed (for no particular reason), and you are tired and miserable all day. State 2 is exactly the same as state 1, except you decided that it would be smarter to sleep in your bed. And now, state 2 is just strictly better than state 1 (at least in all respects that you would care to name).
Essentially, the paper's model requires, by assumption, that it is impossible to get any efficiency gains (like "don't sleep on the floor" or "use this more efficient design instead") or mutually-beneficial deals (like helping two sides negotiate and avoid a war).

Yes, I agree that you can interpret the model in ways that avoid this. EG, maybe by sleeping on the floor, your bed will last longer. And sure, any action at all requires computation. I am just saying that these are perhaps not the interpretations that people initially imagine when reading the paper. So unless you are using an interpretation like that, it is important to notice those strong assumptions.

I do agree that debate could be used in all of these ways. But at the same time, I think generality often leads to ambiguity, and to papers not describing any such application in detail. And that in turn makes it difficult to critique debate-based approaches. (Both because it is unclear what one is critiquing, and because it makes it too easy to accidentally dismiss the critiques using the motte-and-bailey fallacy.)
