VojtaKovarik

My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering about "consequentionalist reasoning").

Sequences

Formalising Catastrophic Goodhart

Wiki Contributions

Comments

I agree that "we can't test it right now" is more appropriate. And I was looking for examples of things that "you can't test right now even if you try really hard".

Good point. Also, for the purpose of the analogy with AI X-risk, I think we should be willing to grant that the people arrive at the alternative hypothesis through theorising. (Similarly to how we came up with the notion of AI X-risk before having any powerful AIs.) So that does break my example somewhat. (Although in that particular scenario, I imagine that sceptic of Newtonian gravity would came up with alternative explanations for the observation. Not that this seems very relevant.)

I agree with all of this. (And good point about the high confidence aspect.)

The only thing that I would frame slightly differently is that:
[X is unfalsifiable] indeed doesn't imply [X is false] in the logical sense. On reflection, I think a better phrasing of the original question would have been something like: 'When is "unfalsifiability of X is evidence against X" incorrect?'. And this amended version often makes sense as a heuristic --- as a defense against motivated reasoning, conspiracy theories, etc. (Unfortunately, many scientists seem to take this too far, and view "unfalsifiable" as a reason to stop paying attention, even though they would grant the general claim that [unfalsifiable] doesn't logically imply [false].)

I don't think it's foolish to look for analogous examples here, but I guess it'd make more sense to make the case directly.

That was my main plan. I was just hoping to accompany that direct case by a class of examples that build intuition and bring the point home to the audience.

Answer by VojtaKovarik10

Some partial examples I have so far:
 

Phenomenon: For virtually any goal specification, if you pursue it sufficiently hard, you are guaranteed to get human extinction.[1]
Situation where it seems false and unfalsifiable: The present world.
Problems with the example: (i) We don't know whether it is true. (ii) Not obvious enough that it is unfalsifiable.


Phenomenon: Physics and chemistry can give rise to complex life.
Situation where it seems false and unfalsifiable: If Earth didn't exist.
Problems with the example: (i) if Earth didn't exist, there wouldn't be anybody to ask the question, so the scenario is a bit too weird. (ii) The example would be much better if it was the case that if you wait long enough, any planet will produce life.

Phenomenon: Gravity -- all things with mass attract each other. (As opposed to "things just fall in this one particular direction".)
Situation where it seems false and unfalsifiable: If you lived in a bunker your whole life, with no knowledge of the outside world.[2]
Problems with the example: The example would be even better if we somehow had some formal model that: (a) describes how physics works, (b) where we would be confident that the model is correct, (c) and that by analysing that model, we will be able to determine whether the theory is correct or false, (d) but the model would be too complex to actually analyse. (Similarly to how chemistry-level simulations are too complex for studying evolution.)

Phenomenon: Eating too much sweet stuff is unhealthy.
Situation where it seems false and unfalsifiable: If you can't get lots of sugar yet, and only rely on fruit etc.
Problems with the example: The scenario is a bit too artificial. You would have to pretend that you can't just go and harvest sugar from sugar cane and have somebody eat lots of it.

  1. ^

    See here for comments on this. Note that this doesn't imply AI X-risk, since "sufficiently hard" might be unrealistic, and also we might choose not to use agentic AI, etc.

  2. ^

    And if you didn't have any special equipment, etc.

Nitpick on the framing: I feel that thinking about "misaligned decision-makers" as an "irrational" reason for war could contribute to (mildly) misunderstanding or underestimating the issue.

To elaborate: The "rational vs irrational reasons" distinction talks about the reasons using the framing where states are viewed as monolithic agents who act in "rational" or "irrational" ways. I agree that for the purpose of classifying the risks, this is an ok way to go about things.

I wanted to offer an alternative framing of this, though: For any state, we can consider the abstraction where all people in that state act in harmony to pursue the interests of the state. And then there is the more accurate abstraction where the state is made of individual people with imperfectly aligned interests, who each act optimally to pursue those interests, given their situation. And then there is the model where the individual humans are misaligned and make mistakes. And then you can classify the reasons based on which abstraction you need to explain them.

[I am confused about your response. I fully endorse your paragraph on "the AI with superior ontology would be able to predict how humans would react to things". But then the follow-up, on when this would be scary, seems mostly irrelevant / wrong to me --- meaning that I am missing some implicit assumptions, misunderstanding how you view this, etc. I will try react in a hopefully-helpful way, but I might be completely missing the mark here, in which case I apologise :).]

I think the problem is that there is a difference between:
(1) AI which can predict how things score in human ontology; and
(2) AI which has "select things that score high in human ontology" as part of its goal[1].
And then, in the worlds where natural abstraction hypothesis is false: Most AIs achieve (1) as a by-product of the instrumental sub-goal of having low prediction error / being selected by our training processes / being able to manipulate humans. But us successfully achieving (2) for a powerful AI would require the natural abstraction hypothesis[2].

And this leaves us two options. First, maybe we just have no write access to the AI's utility function at all. (EG, my neighbour would be very happy if I gave him $10k, but he doesn't have any way of making me (intrinsincally) desire doing that.) Second, we might have a write access to the AI's utility function, but not in a way that will lead to predictable changes in goals or behaviour. (EG, if you give me full access to weights of an LLM, it's not like I know how to use that to turn that LLM into an actually-helpful assistant.)
(And both of these seem scary to me, because of the argument that "not-fully-aligned goal + extremely powerful optimisation ==> extinction". Which I didn't argue for here.)

  1. ^

    IE, not just instrumentally because it is pretending to be aligned while becoming more powerful, etc.

  2. ^

    More precisely: Damn, we need a better terminology here. The way I understand things, "natural abstraction hypothesis" is the claim that most AIs will converge to an ontology that is similar to ours. The negation of that is that a non-trivial portion of AIs will use an ontology that is different from ours. What I subscribe to is that "almost no powerful AIs will use an ontology that is similar to ours". Let's call that "strong negation" of the natural abstraction hypothesis. So achieving (2) would be a counterexample to this strong negation.
    Ironically, I believe the strong negation hypothesis because I expect that very powerful AIs will arrive at similar ways of modelling the world --- and those are all different from how we model the world.

Nitpicky edit request: your comment contains some typos that make it a bit hard to parse ("be other", "we it"). (So apologies if my reaction misunderstands your point.)

[Assuming that the opposite of the natural abstraction hypothesis is true --- ie, not just that "not all powerful AIs share ontology with us", but actually "most powerful AIs don't share ontology with us":]
I also expect that an AI with superior ontology would be able to answer your questions about its ontology, in a way that would make you feel like[1] you understand what is happening. But that isn't the same as being able to control the AI's actions, or being able to affect its goal specification in a predictable way (to you). You totally wouldn't be able to do that.

([Vague intuition, needs work] I suspect that if you had a method for predictably-to-you translating from your ontology to the AI's ontology, then this could be used to prove that you can easily find a powerful AI that shares an ontology with us. Because that AI could be basically thought of as using our ontology.)

  1. ^

    Though note that unless you switched to some better ontology, you wouldn't actually understand what is going on, because your ontology is so bogus that it doesn't even make sense to talk about "you understanding [stuff]". This might not be true for all kinds of [stuff], though. EG, perhaps our understanding of set theory is fine while our understanding of agency, goals, physics, and whatever else, isn't.

As a quick reaction, let me just note that I agree that (all else being equal) this (ie, "the AI understanding us & having superior ontology") seems desirable. And also that my comment above did not present any argument about why we should be pessimistic about AI X-risk if we believe that the natural abstraction hypothesis is false. (I was just trying to explain why/how "the AI has a different ontology" is compatible with "the AI understands our ontology".)

As a longer reaction: I think my primary reason for pessimism, if natural abstraction hypothetis is false, is that a bunch of existing proposals might work if the hypothesis were true, but don't work if the hypothesis is false. (EG, if the hypothesis is true, I can imagine that "do a lot of RLHF, and then ramp up the AIs intelligence" could just work. Similarly for "just train the AI to not be deceptive".)

If I had to gesture at an underlying principle, then perhaps it could be something like: Suppose we successfully code up an AI which is pretty good at optimising, or create a process which gives rise to such an AI. [Inference step missing here.] Then the goals and planning of this AI will be happening in some ontology which allows for low prediction error. But this will be completely alien to our ontology. [Inference step missing here.] And, therefore, things that score very highly with respect to these ("alien") goals will have roughly no value[1] according to our preferences.
(I am not quite clear on this, but I think that if this paragraph was false, then you could come up with a way of falsifying my earlier description of how it looks like when the natural abstraction hypothesis is false.)

  1. ^

    IE, no positive value, but also no negative value. So no S-risk.

Simplifying somewhat: I think that my biggest delta with John is that I don't think the natural abstraction hypothesis holds. (EG, if I believed it holds, I would become more optimistic about single-agent alignment, to the point of viewing Moloch as higher priority.) At the same time, I believe that powerful AIs will be able to understand humans just fine. My vague attempt at reconciling these two is something like this:

Humans have some ontology, in which they think about the world. This corresponds to a world model. This world model has a certain amount of prediction errors.

The powerful AI wants to have much lower prediction error than that. When I say "natural abstraction hypothesis is false", I imagine something like: If you want to have a much lower prediction error than that, you have to use a different ontology / world-model than humans use. And in fact if you want sufficiently low error, then all ontologies that can achieve that are very different from our ontology --- either (reasonably) simple and different, or very complex (and, I guess, therefore also different).

So when the AI "understands humans perfectly well", that means something like: The AI can visualise the flawed (ie, high prediction error) model that we use to think about the world. And it does this accurately. But it also sees how the model is completely wrong, and how the things, that we say we want, only make sense in that model that has very little to do with the actual world.

(An example would be how a four-year old might think about the world in terms of Good people and Evil people. The government sometimes does Bad things because there are many Evil people in it. And then the solution is to replace all the Evil people by Good people. And that might internally make sense, and maybe an adult can understand this way of thinking, while also being like "this has nothing to do with how the world actually works; if you want to be serious about anything, just throw this model out".)

An illustrative example, describing a scenario that is similar to our world, but where "Extinction-level Goodhart's law" would be false & falsifiable (hat tip Vincent Conitzer):

Suppose that we somehow only start working on AGI many years from now, after we have already discovered a way to colonize the universe at the close to the speed of light. And some of the colonies are already unreachable, outside of our future lightcone. But suppose we still understand "humanity" as the collection of all humans, including those in the unreachable colonies. Then any AI that we build, no matter how smart, would be unable to harm these portions of humanity. And thus full-blown human extinction, from AI we build here on Earth, would be impossible. And you could "prove" this using a simple, yet quite rigorous, physics argument.[1]

(To be clear, I am not saying that "AI X-risk's unfalsifiability is justifiable ==> we should update in favour of AI X-risk compared to our priors". I am just saying that the justifiability means we should not update against it compared to our priors. Though I guess that in practice, it means that some people should undo some of their updates against AI X-risk... )

  1. ^

    And sure, maybe some weird magic is actually possible, and the AI could actually beat speed of light. But whatever, I am ignoring this, and an argument like this would count as falsification as far as I am concerned.

Load More