
I - Obvious answers to obvious questions

"Why do I think I know what I do about Goodhart's law?"

There's an obvious answer (or at least an obvious outer layer of the answer-onion): we use Goodhart's law because we usually model humans as if we have values. Not necessarily True Values over the whole world, but at least contextual values over some easy-to-think-about chunk of the world. Modeling humans as goal-driven agents is never going to be perfect at predicting us, nor is it what you'd do if you had infinite computing power, but it still works really well for our purposes.

We infer human values by building a simplified model of the world that features human actions as a fundamental part. Such an abstract model doesn't need to be our absolute best model of the world, it just needs to capture useful regularities. If we keep the model fixed, then the human values are determined by the data, but the model doesn't necessarily have to be fixed. If we uncurry this inference process, inferred human values depend both on the data and on the way we model the data.
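To make that uncurrying concrete, here's a minimal sketch in Python; the Model, Data, and Values types and the fit_preferences method are placeholders of mine, not anything defined in this post.

```python
from typing import Callable

# Curried view: fix a way of modeling the world first, and the inferred values
# become a function of the observed data alone.
def infer_values_given(model: "Model") -> Callable[["Data"], "Values"]:
    def infer(data: "Data") -> "Values":
        return model.fit_preferences(data)  # placeholder inference step
    return infer

# Uncurried view: the inferred values explicitly depend on both arguments -
# the data *and* the choice of model.
def infer_values(model: "Model", data: "Data") -> "Values":
    return model.fit_preferences(data)
```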

Consider historical examples of Goodhart's law, like the dolphin trainers who rewarded dolphins for each piece of trash removed from the pool, which taught the dolphins not to clean the pool, but to hoard trash and rip it up before delivering each piece. Upon hearing this story we immediately make a simple model of the situation, using variables like amount of trash in the pool, whether the dolphins are delivering trash, and whether they got a fish. In this model it's super easy to represent that the dolphin trainers want the pool clean and the dolphins want fish.

Once we take this model of the world as given, then everything makes sense as an example of Goodhart's law: the humans wanted the pool to be clean, they created an incentive for the dolphins to take an action that used to be correlated with the pool being clean, and as a result the correlation broke down. We can gather up many such stories that all feature the same pattern of the human creating an incentive that's correlated with their modeled values and then not getting what they want - these stories are our empirical evidence for Goodhart's law.
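As a toy illustration of that pattern (the numbers and strategies are made up by me, not taken from the actual dolphin story), here's what the divergence between proxy and goal looks like when written out:

```python
# Toy model: the trainers reward "pieces of trash delivered" as a proxy for
# "the pool is clean", and optimizing the proxy hard enough breaks the correlation.

def pool_cleanliness(trash_in_pool: float) -> float:
    """What the trainers actually want: less trash left in the pool is better."""
    return -trash_in_pool

def proxy_reward(pieces_delivered: int) -> float:
    """What the trainers actually reward: one fish per piece delivered."""
    return float(pieces_delivered)

items = 10  # pieces of trash that start out in the pool

# Honest strategy: remove every item whole and deliver it.
honest = {"pieces_delivered": items, "trash_in_pool": 0.0}

# Goodharted strategy: keep a stash of trash in the pool and rip each delivered
# item into five fragments, so the reward stream never runs out.
stash = 4
goodhart = {"pieces_delivered": (items - stash) * 5, "trash_in_pool": float(stash)}

for name, strategy in [("honest", honest), ("goodhart", goodhart)]:
    print(name,
          "reward =", proxy_reward(strategy["pieces_delivered"]),
          "cleanliness =", pool_cleanliness(strategy["trash_in_pool"]))
# The proxy-optimizing strategy earns more fish while leaving the pool dirtier.
```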

What's my problem with using Goodhart's law for value learning AI, then? If it lets us predict a regularity about the future of dolphin trainers (and more), can't we trust it to predict what happens when we run a computer program?

In fact, often we can. See examples of applying it to AI. But comparing the outcome to what we actually want only works smoothly when what we "actually want" is sufficiently obvious. When human preferences aren't obvious, i.e. when there's not one obvious way to model the world to extract them, we have to do something else. By necessity, that something else is going to look like considering multiple different ways of modeling humans and the world, and having some standards to judge them by.

The rest of this post introduces a notion of "competence" for inferred human preferences. It's like a souped-up Bayesian version of revealed preferences. This isn't the final destination, but it's a good stepping-stone and guide to the imagination.

II - Competent banana

When I say humans have a "competent" value, that just means that modeling us as an agent with this value is very useful. Like if I competently prefer two bananas over one banana, this means that you can predict my actions reliably across a wide range of contexts (especially contexts I might expect to face in real life) using an agenty model of me in which I prefer two bananas.

On the other hand, suppose I disagreed with other people similar to me about one vs. two bananas, and you could get different answers from me if I'd read pro/anti arguments in the opposite order or if you asked me in different ways, and that treating this as a preference of mine didn't help form a simple explanation of my other preferences, and so on for several more foibles. Then even if I verbally tell you that I prefer two bananas, modeling me as an agent that prefers two bananas isn't so useful - that's an incompetent preference.

A competent banana.

The values described in this way can be small - it's okay to say I prefer two bananas to one banana; you don't have to give a complete model of all my behavior at once. To make a physics analogy, values are like the ideal gas law, not like the Standard Model of particle physics. The ideal gas law doesn't try to explain everything at once, and it makes no claim to being fundamental; it's just a useful regularity that we can use to predict part of certain systems, within its domain of validity[1]. Different inferred values can fit the data to different degrees, can have larger or smaller domains of validity, can be more or less easily explained with a non-agential model, and so on.

This intuitive notion of competent preferences is a useful starting point for imagining how a learning algorithm might end up assigning preferences to humans. But it's not totally specified - keep an eye out for steps that require magic[2] - we're going to try to shrink the amount of magic needed.

One potential way to break down competence is to measure the total competence of a model of a human by bits of predictive usefulness (over some timescale) minus bits of model complexity. There will be some model of me that is most parsimonious at predicting what I'm going to do, so it's tempting to just say that whatever preferences that model says I have are my most competent preferences.
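One crude way to operationalize that score (an assumption of this sketch, not a formula from the post) is to count how many extra bits of the human's behavior the agent-shaped model predicts relative to some baseline predictor, then subtract the model's description length:

```python
def competence_bits(model_log2_likelihood: float,
                    baseline_log2_likelihood: float,
                    model_complexity_bits: float) -> float:
    """Bits of predictive usefulness minus bits of model complexity.

    model_log2_likelihood: log2-probability the agent-shaped model assigns
        to the human's observed behavior over some timescale.
    baseline_log2_likelihood: log2-probability a non-agential baseline
        predictor assigns to the same behavior.
    model_complexity_bits: description length of the agent-shaped model.
    """
    predictive_usefulness = model_log2_likelihood - baseline_log2_likelihood
    return predictive_usefulness - model_complexity_bits
```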

But for a model of me to say I have preferences at all, it has to be part of the set of agent-shaped models. The definition of this set is somewhat magical, so we'll return to it later.

In fact, we don't just want models of ourselves to be agent-shaped; we also want them to have some semblance of human limitations. For example, I don't take the actions that would cure cancer, even though such actions are physically possible; a model that ignored my limitations might read this as a preference for cancer, but in the sense I care about, I have no such preference. For now, just use common sense when imagining what things count as available actions. This common sense ties into meta-preferences, which we'll return to later.

A further difference between competent preferences and "True Values" is that competence is not a substitute for importance. I might hold dearly to some value that I'm still a bit fuzzy on, but be relatively rational in exchanges involving something trivial. The definition of competence cares about the usefulness to a predictor, not the feelings of the predictee. Nevertheless, there's a limit to how much importance we're willing to attribute to an incompetent preference - if a preference is basically worthless for predicting me, how important can it really be?

III - Application

Can we have an AI notice when it's doing bad things by having it check for violations of humans' competent preferences? Not quite, because of preference conflicts. For example, I could have both a desire for bacon and a desire to keep slim; both help you predict my behavior, both are competent, but they conflict[3]. Any future for the universe will violate some preferences.

But often, the consensus is overwhelming. We expect to make much, much better predictions of the dolphin trainers if we suppose that they want the pool to be clean, as opposed to them wanting the pool to be full of small pieces of trash. When there's such a consensus, I'll call this a one-sided competent preference. These roughly correspond to the cases I meant in section I, when I said that Goodhart's law needs "what we actually want" to be sufficiently obvious.
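For concreteness, here's one possible check for that kind of consensus (my own sketch, not a definition from the post); the .competence and .prefers attributes of the candidate models are hypothetical:

```python
def is_one_sided(candidate_models, outcome_a, outcome_b,
                 competence_threshold: float) -> bool:
    """Treat "prefers outcome_a to outcome_b" as a one-sided competent preference
    when every sufficiently competent model of the human agrees on that ranking."""
    competent = [m for m in candidate_models if m.competence >= competence_threshold]
    if not competent:
        return False
    verdicts = {m.prefers(outcome_a, outcome_b) for m in competent}
    return verdicts == {True}
```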

In the next post, when we look at more concrete examples of Goodhart's law related to AI, basically all of the easy cases are easy because we can see why they violate one of humans' one-sided competent preferences. A harder case might look like an AI having some incentive to produce outputs that humans aren't competent at evaluating - by construction we're going to have no competent preference against the outputs, yet it still feels like something has gone wrong. 

What counts as obvious has to be informed by human meta-preferences[4]. If we can't point out what's going wrong on the object level but feel like the AI is using the wrong abstract process, it's probably a meta-preference. We have meta-preferences about how preference conflicts should be resolved that can be fairly involved - think of how we classify some of our patterns of behavior as bias or addiction, not "real preferences."

IV - Conclusion

Goodhart's law makes sense when it's useful to model humans as having preferences. We naturally do this modeling all the time to help us explain both the outer world and the inner world.

Neither humans nor an AI inferring competent values will find something that fits the bill of the "True Values" used in Absolute Goodhart. We have lots of equivalents of the ideal gas law, but no Theory of Everything. It's actually impressive how well we can make common-sense predictions about people and talk about their values, without ever learning a utility function for humans.

Keeping in mind how we know what we know about Goodhart's law, we can take a deeper look at how it manifests in value learning. Next post.

  1. ^

    This is somewhat analogous to hypotheses in infra-Bayesian reasoning.

  2. ^

    A term of art meaning something we think should be possible, but don't actually know how to do.

  3. ^

    We need to assume the nontrivial ability to check whether preferences in two different ontologies are in conflict.

  4. ^

    In this sequence I use "meta-preferences" to mean preferences that use or reference the preference-inference process itself. This is a little broader than the usual usage because it includes preferences about how we want our preferences to be modeled by other people.

Comments (2)

So a highly competent preference helps predict the person's actions. But I'm confused about how "violating one-sided competent preferences" makes sense with Goodhart's law.

As an example, "Prefer 2 bananas over 1" can be very competent if it correctly predicts preference in a wide range of scenarios (e.g. different parts of the day, after anti-banana propaganda, etc.), with incompetent meaning its prediction is wrong (max entropy or opposite of correct?). Assuming it's competent, what does violating this preference mean? That the AI predicted 1 banana over 2, or that the simple rule "Prefers 2 over 1" didn't actually apply?

By "violate a preference," I mean that the preference doesn't get satisfied - so if the human competently prefers 2 bananas but only got 1 banana, their preference has been violated.

But maybe you mean something along the lines of "If competent preferences are really broadly predictive, then wouldn't it be even more predictive to infer the preference 'the human prefers 2 bananas except when the AI gives them 1', since that would more accurately predict how many bananas the humans gets? This would sort of paint us into a corner where it's hard to violate competent preferences as defined."

My response would be that competence is based on how predictive and efficient the model is (just to reiterate, preferences live inside a model of the world), not how often you get what you want. Even if you never get 2 bananas and have only gotten 1 banana your entire life, a model that predicts that you want 2 bananas can still be competent if the hypothesis of you wanting 2 bananas helps explain how you've reacted to your life as a 1-banana-getter.