Introduction to Reducing Goodhart

by Charlie Steiner, 26th Aug 2021



Goodhart's Law, Value Learning, AI
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I - Prologue

Two months ago, I wanted to write about AI designs that evade Goodhart's law. But as I wrote that post, I became progressively more convinced that framing things that way was leading me to talk complete nonsense. I want to explore why that is and try to find a different (though not entirely original, see Rohin et al., Stuart 1, 2, 3) framing of core issues, which avoids assuming that we can model humans as idealized agents.

This post is the first in a sequence of five. In this introduction I'll make the case that we should expect problems when Goodhart's law is applied straightforwardly to value learning. I'm interested in hearing from you if you remain unconvinced or think of things I missed.

II - Introduction

Goodhart's law tells us that even when there are normally only small divergences between what we optimize for and our real standards, the outcome can be quite bad by our real standards. To use Scott Garrabrant's terminology from Goodhart Taxonomy, suppose that we have some true preference function V (for "True Values") over worlds, and U is some proxy that has been correlated with V in the past. Then there are several reasons given in Scott's post why maximizing U may score poorly according to V.
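As a rough illustration of this standard framing (and only of the framing, since it assumes a V exists), here is a minimal Python sketch of one of the simplest cases: the proxy U is V plus independent noise, and we pick the world with the highest U. The function names, the Gaussian distributions, and the parameters are all assumptions made for the example, not anything taken from Scott's post.

```python
import random


def simulate_regressional_goodhart(n_worlds=1000, noise_scale=1.0, seed=0):
    """Pick the world with the best proxy score U and report how it does on V."""
    rng = random.Random(seed)
    # V: a hypothetical "true" score for each world (assumed to exist, for illustration only).
    true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_worlds)]
    # U: the proxy, equal to V plus independent noise, so U and V are correlated.
    proxy_scores = [v + rng.gauss(0.0, noise_scale) for v in true_scores]
    # Optimize the proxy: choose the index of the world with the highest U.
    picked = max(range(n_worlds), key=lambda i: proxy_scores[i])
    return {
        "proxy_of_pick": proxy_scores[picked],
        "true_of_pick": true_scores[picked],
        "best_true": max(true_scores),
    }


def mean_overestimate(n_trials=200, **kwargs):
    """Average of U - V at the proxy-optimal world, across independent trials."""
    gaps = []
    for seed in range(n_trials):
        result = simulate_regressional_goodhart(seed=seed, **kwargs)
        gaps.append(result["proxy_of_pick"] - result["true_of_pick"])
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    print(simulate_regressional_goodhart())
    print("mean U - V at the chosen world:", mean_overestimate())
```

Under these assumptions the chosen world can never beat the best available world on V, and on average its proxy score overstates its true score, because selecting on U also selects for favorable noise.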

But here's the problem: humans have no such V (see also Scott A., Stuart 1, 2). Inferring human preferences depends on:

  • what state the environment is in.
  • what physical system to infer the preferences of.
  • how to make inferences from that physical system.
  • how to resolve inconsistencies and conflicting dynamics.
  • how to extrapolate the inferred preferences into new and different contexts.

There is no single privileged way to do all these things, and different choices can give very different results. And yet the framing of Goodhart's law, as well as much of our intuitive thinking about value learning, rests on the assumption that the True Values are out there.


Goodhart's law is important - we use it all over the place (e.g. 1, 2, 3). In AI alignment we want to use Goodhart's law to crystallize a pattern of bad behavior in AI systems (e.g. 1, 2, 3, 4), and to design powerful AIs that don't have this bad behavior (e.g. 1, 2, 3, 4, 5, 6). But if you try to use Goodhart's law to design solutions to these problems, it has a single prescription for us: find V (or at least bound your error relative to it). Since there is no such V, not only is that advice useless, it actually denies the possibility of success.

The goal, then, is deconfusion. We still want to talk about the same stuff, the same patterns, but we want a framing of what-we-now-call-Goodhart's-law that helps us think about what successful AI could look like in the real world.

III - Preview of the sequence

We'll start the next post with the classic question: "Why do I think I know what I do about Goodhart's law?"

The obvious answers to this question involve talking about how humans model each other. But this raises yet more questions, like "why can't the AI just model humans that way?" This requires two responses: first, breaking down what we mean when we casually say that humans "model" things, and second, talking about the limitations of such models compared to the utility-maximization picture. The good news is that we can rescue some version of common sense; the bad news is that this doesn't solve our problems.

Next we'll take a deeper look at some typical places to use Goodhart's law that are related to value learning:

  • Curve fitting and overfitting.
  • Hard-coded utility functions.
  • Adversarial examples.
  • Hard-coded human models.

Goodhart's law reasoning is used both in the definition of these problems and in talking about proposed solutions such as quantilization. I plan to talk at excessive length about all of these details, with the object of building up pictures of our reasoning in these cases that never need to mention the word "Goodhart" because they're at a finer level of magnification.
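For readers who haven't met quantilization, here is a minimal Python sketch of the basic idea under simplifying assumptions: rather than taking the single candidate with the highest proxy score, choose at random from the top q fraction of candidates drawn from some trusted base distribution. The function name `quantilize`, the finite candidate list, and the uniform choice are assumptions of this sketch, not a definitive implementation.

```python
import math
import random


def quantilize(candidates, proxy_utility, q, rng):
    """Return a uniformly random member of the top-q fraction of candidates.

    `candidates` stands in for samples from a trusted base distribution
    (an assumption of this sketch); `proxy_utility` scores each candidate.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not candidates:
        raise ValueError("candidates must be non-empty")
    # Rank candidates from highest to lowest proxy score.
    ranked = sorted(candidates, key=proxy_utility, reverse=True)
    # Keep at least one candidate, and otherwise the top q fraction.
    top_count = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:top_count])


if __name__ == "__main__":
    rng = random.Random(0)
    actions = list(range(8))
    # Hypothetical proxy: larger numbers score higher.
    print(quantilize(actions, lambda a: a, 0.25, rng))
```

Setting q to 1 over the number of candidates recovers plain maximization of the proxy, while q = 1 just samples from the base distribution; values in between trade proxy score against staying close to the base distribution.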

However, these pictures aren't all going to be consistent, because what humans think of as success or failure can depend on the context, and extrapolating beyond that context will bring our intuitions into conflict. Thus we'll have to revisit the abstract notion of human preferences and really hash out what happens (or what we think happens) at the boundaries and interchanges between human models of the world.

Finally, the hope is to conclude with some sage advice. Not a solution, because I haven't got one. But maybe some really obvious-seeming sage advice can tie together the concepts introduced in the sequence into something that feels like progress.

We'll see.




I appreciate how much detail you've used to lay out why you think a lack of human agency is a problem -- compared to our earlier conversations, I now have a better sense of what concrete problem you're trying to solve and why that problem might be important. I can imagine that, e.g., it's quite difficult to tell how well you've fit a curve if the context in which you're supposed to fit that curve is vulnerable to being changed in ways whose goodness or badness is difficult to specify. I look forward to reading the later posts in this sequence so that I can get a sense of exactly what technical problems are arising and how serious they are.

That said, until I see a specific technical problem that seems really threatening, I'm sticking by my opinion that it's OK that human preferences vary with human environments, so long as (a) we have a coherent set of preferences for each individual environment, and (b) we have a coherent set of preferences about which environments we would like to be in. Right, like, in the ancestral environment I prefer to eat apples, in the modern environment I prefer to eat Doritos, and in the transhuman environment I prefer to eat simulated wafers that trigger artificial bliss. That's fine; just make sure to check what environment I'm in before feeding me, and then select the correct food based on my environment. What do you do if you have control over my environment? No big deal, just put me in my preferred environment, which is the transhuman environment. 

What happens if my preferred environment depends on the environment I'm currently inhabiting, e.g., modern me wants to migrate to the transhumanist environment, but ancestral me thinks you're scary and just wants you to go away and leave me alone? Well, that's an inconsistency in my preferences -- but it's no more or less problematic than any other inconsistency. If I prefer oranges when I'm holding an apple, but I prefer apples when I'm holding an orange, that's just as annoying as the environment problem. We do need a technique for resolving problems of utility that are sensitive to initial conditions when those initial conditions appear arbitrary, but we need that technique anyway -- it's not some special feature of humans that makes that technique necessary; any beings with any type of varying preferences would need that technique in order to have their utility fully optimized. 

It's certainly worth noting that standard solutions to Goodhart's law won't work without modification, because human preferences vary with their environments -- but at the moment such modifications seem extremely feasible to me. I don't understand why your objections are meant to be fatal to the utility of the overall framework of Goodhart's Law, and I hope you'll explain that in the next post.

Thanks for the comment :)

I don't agree it's true that we have a coherent set of preferences for each environment.

I'm sure we can agree that humans don't have their utility function written down in FORTRAN on the inside of our skulls. Nor does our brain store a real number associated with each possible state of the universe (and even if we did, by what lights would we call that number a utility function?).

So when we talk about a human's preferences in some environment, we're not talking about opening them up and looking at their brain, we're talking about how humans have this propensity to take reasonable actions that make sense in terms of preferences. Example: You say "would you like Doritos or an apple?" and I say "apple," and then you use this behavior to update your model of my preferences.

But this action-propensity that humans have is sometimes irrational (bold claim I know) and not so easily modeled as a utility function, even within a single environment.
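To make one narrow version of this concrete, here is a minimal Python sketch, assuming preferences are given as strict pairwise choices. It tries to assign each option a number so that every preferred option scores higher, and reports failure when the choices contain a cycle. The function name and the pairwise representation are hypothetical, and real human inconsistency is of course messier than cycles.

```python
from graphlib import CycleError, TopologicalSorter


def utility_from_preferences(strict_preferences):
    """Given (better, worse) pairs, return {option: utility}, or None if cyclic.

    A returned utility satisfies utility[better] > utility[worse] for every
    pair. If the pairs contain a cycle, no such assignment exists.
    """
    # Map each option to the set of options it is strictly preferred over.
    graph = {}
    for better, worse in strict_preferences:
        graph.setdefault(better, set()).add(worse)
        graph.setdefault(worse, set())
    try:
        # static_order() lists worse options before the options that beat them.
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None
    return {option: rank for rank, option in enumerate(order)}


if __name__ == "__main__":
    print(utility_from_preferences([("apple", "doritos"), ("doritos", "wafer")]))
    # Preferring each fruit over the other, depending on what you hold, is a cycle.
    print(utility_from_preferences([("apple", "orange"), ("orange", "apple")]))
```

Consistent choices produce some ordering, but cyclic ones produce None: there is no utility function over these options that reproduces the observed choices.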

The scheme you talk about for building up human values seems to have a recursive character to it: you get the bigger, broader human utility function by building it out of smaller, more local human utility functions, and so on, until at some base level of recursion there are utility functions that get directly inferred from facts about the human. But unless there's some level of human action where we act like rational utility maximizers, this base level already contains the problems I'm talking about, and since it's the base level those problems can't be resolved or explained by recourse to a yet-baser level.

Different people have different responses to this problem, and I think it's legitimate to say "well, just get better at inferring utility functions" (though this requires some actual work at specifying a "better"). But I'm going to end up arguing that we should just get better at dealing with models of preferences that aren't utility functions.

That was quite a stimulating post! It pushed me to actually go through the cloud of confusion surrounding these questions in my mind, hopefully with a better picture now.

First, I was confused about your point on True Values, and about what you even meant. If I understand correctly, you're talking about a class of parametrized models of humans: the agent/goal-directed model, parametrized by something like the beliefs and desires of Dennett's intentional stance, with some non-formalized additional subtleties, like the fact that desires/utilities/goals can't just describe exactly what the system does, but must be in some sense compressed and sparse.

Now, there's a pretty trivial sense in which there are no True Values for the parameters: because this model class lacks realizability, no parameter describes exactly and perfectly the human we want to predict. That sounds completely uncontroversial to me, but also boring.

Your claim, in my opinion, is that there are no parameters for which this model is close to good enough at predicting the human. Is that correct?

Assuming for the moment it is, this post doesn't really argue for that point in my opinion; instead it argues for the difficulty in inferring such good parameters if they existed. For example this part:

But here's the problem: humans have no such V (see also Scott A., Stuart 1, 2). Inferring human preferences depends on:

  • what state the environment is in.
  • what physical system to infer the preferences of.
  • how to make inferences from that physical system.
  • how to resolve inconsistencies and conflicting dynamics.
  • how to extrapolate the inferred preferences into new and different contexts.

There is no single privileged way to do all these things, and different choices can give very different results

is really about inference, as none of your points make it impossible for a good parameter to exist -- they just argue for the difficulty of finding/defining one.

Note that I'm not saying what you're doing with this sequence is wrong; looking at Goodhart from a different perspective, especially one which tries to dissolve some of the inferring difficulties, sounds valuable to me.

Another thing I like about this post is that you made me realize why the application of Goodhart's law to AI risk doesn't require the existence of True Values: it's an impossibility result, and when proving an impossibility, the more you assume the better. Goodhart is about the difficulty of using proxies in the best-case scenario, when there are indeed good parameters. It's about showing the risk and danger in just "finding the right values", even in the best world where true values do exist. So if there are no true values, the difficulty doesn't disappear; it gets even worse (or different at the very least).

I'm mostly arguing against the naive framing where humans are assumed to have a utility function, and then we can tell how well the AI is doing by comparing the results to the actual utility (the "True Values"). The big question is: how do you formally talk about misalignment without assuming some such unique standard to judge the results by?

Hum, but I feel like you're claiming that this framing is wrong while arguing that it is too difficult to apply to be useful. Which is confusing.

Still agree that your big question is interesting though.

Thanks, this is useful feedback on how I need to be more clear about what I'm claiming :) In October I'm going to be refining these posts a bit - would you be available to chat sometime?

Glad I could help! I'm going to comment more on your following post in the next few days/next week, and then I'm interested in having a call. We can also talk then about the way I want to present Goodhart as an impossibility result in a textbook project. ;)

You know, I feel like trying to avoid Goodhart divergences may be neglecting the underlying principal/agent alignment problem in pursuit of better results on one specific metric.