Goodhart's Law and Genies

thomascolthurst

[Epistemic status: Written in 2010. Possibly of historic interest only.]

Goodhart's Law states that "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."

"Control" here means "policy", such as that conducted by a government or central bank. To give just one classic example, the Federal Reserve might observe a strong correlation between inflation and unemployment rates, which breaks down once the Fed tries to manipulate the inflation rate to reduce unemployment. [1]

The "any" in Goodhart's Law overstates the case -- I regularly use the observed statistical regularity between eating and feeling less hungry for control purposes, and as far as I can tell, this regularity hasn't collapsed yet -- but it isn't hard to see why it might often be the case. If A and B are statistically correlated, there are a number of different causal patterns that could give this correlation: A causes B, B causes A, C causes A and B, etc. Only one of the patterns "works" for the purposes of changing B through A.
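To make the causal point concrete, here is a minimal Python sketch of the "C causes A and B" pattern. Everything in it is an assumption made for illustration: the variable names, the Gaussian noise levels, and the sample size do not come from any real data. It shows a correlation between A and B that looks strong under passive observation and vanishes once A is set directly for control purposes.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, intervene, seed=0):
    """Correlation of A and B when a hidden C drives both (toy model).

    If `intervene` is True, A is set by the controller independently of C,
    which is the "pressure for control purposes" in Goodhart's Law.
    """
    rng = random.Random(seed)
    a_values, b_values = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)  # hidden common cause
        if intervene:
            a = rng.gauss(0, 1)  # A is chosen by policy, ignoring C
        else:
            a = c + rng.gauss(0, 0.3)  # A passively tracks C
        b = c + rng.gauss(0, 0.3)  # B always tracks C, never A
        a_values.append(a)
        b_values.append(b)
    return correlation(a_values, b_values)


observed = simulate(10_000, intervene=False)
controlled = simulate(10_000, intervene=True)
print(f"observed: {observed:.2f}, under control: {controlled:.2f}")
```

Under these assumed settings the observed correlation should come out at roughly 0.9 and the controlled one close to zero, because in this toy model nothing the controller does to A ever reaches B.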

The real-world situations that Goodhart's Law attempts to explain are more complicated still. For the inflation/unemployment example, one plausible explanation is that before the Fed started messing with them, inflation rates carried information about macroeconomic conditions, and businesses used this encoded information when making hiring decisions. Post-Fed manipulation, inflation rates no longer contain or correlate with the macro info, so businesses stop using them, and you end up with 1970s-style stagflation.

A variant of Goodhart's Law makes designing a super-powerful entity harder than it might first appear. (I know, I know, you have this problem all the time.) Specifically:

The actions of any super-powerful constructed entity that identifies important concepts through observed statistical regularities will tend to break the connection between those concepts and the observed statistical regularities.

For example, let's say you want some mangos; this naturally leads you to make a robot to get you some mangos. Now, you could have all sorts of trouble with your robot. It might break down before you get your mangos, it might steal the mangos from a store rather than taking them from your kitchen, or it might bring you an orange instead of a mango because it can't tell the difference.

But, really, you don't know trouble until you make a super-powerful robot to get you some mangos, where by "super-powerful" I mean something like "really really smart and knows how to use nanotech to arbitrarily re-arrange atoms." Such a robot might just decide to rewire your brain so that you think the pen in your hand is a mango. Or rewire everyone's brain so that everyone thinks that all pens are mangos.

Or, it might give you a pseudo-mango, where by "pseudo-mango" I mean "an object which no human would ever call a mango, but which satisfies some short criterion which correctly distinguishes all past mangos from all past non-mangos". For example, let's say that hitherto in history, all mangos have been objects less than one meter in diameter that contain at least one million strands of mango DNA, while no non-mango has had that property. So, you ask for a mango, and your super-powerful robot gives you a small yellow sphere filled with mango DNA. Bon appetit!
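The pseudo-mango can be written down as a toy Python sketch. The `Thing` class, the example objects, and their measurements are all hypothetical; only the rule itself (under one meter, at least one million strands of mango DNA) comes from the example above. The point is that a rule can agree with human judgment on every object that has existed so far and still accept something no human would call a mango.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    name: str
    diameter_m: float
    mango_dna_strands: int
    humans_call_it_mango: bool  # stand-in for the (complicated) human concept


def learnt_is_mango(thing):
    """Short criterion that separates all past mangos from all past non-mangos."""
    return thing.diameter_m < 1.0 and thing.mango_dna_strands >= 1_000_000


# Hypothetical history of naturally occurring objects (made-up numbers).
history = [
    Thing("mango from the kitchen", 0.10, 10**12, True),
    Thing("orange", 0.08, 0, False),
    Thing("mango tree", 5.0, 10**15, False),
    Thing("pen", 0.01, 0, False),
]

# The rule is perfect on everything seen so far...
rule_fits_history = all(
    learnt_is_mango(t) == t.humans_call_it_mango for t in history
)

# ...but a super-powerful robot can construct objects that never occurred.
pseudo_mango = Thing("yellow sphere filled with mango DNA", 0.10, 10**6, False)

print(rule_fits_history, learnt_is_mango(pseudo_mango))
```

The rule and the human label only come apart on the constructed object, which is exactly the kind of object the history could never have tested.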

Some of you might now be asking: isn't this just the new problem of induction? I.e., isn't the distinction between a mango and a pseudo-mango the same as the distinction between green and grue?

Answer: no. I claim that we know, more or less, how to solve the problem of induction through things like Solomonoff induction or other formalizations of Occam's razor which tell us to expect simplicity. Those solutions don't help us here; a pseudo-mango satisfies a short description of the sort Solomonoff induction tells us to look for, but it still isn't a mango.

In fact, the problem is precisely that human concepts like mango are complicated. We all like to pretend otherwise, that our mental categories correspond to "natural kinds" in the world, but with very few exceptions, this simply isn't the case. [2]

Consider, for example, the evidence from prototype theory, the theory of concepts that currently dominates cognitive science. One of the tenets of prototype theory is that there is a "basic level" of categories which share a number of properties:

They are the highest level of category at which category members have similar shapes,
They are the highest level of category for which a single mental image can reflect the entire category,
They are the highest level of category which provides relatively homogeneous sensory-motor affordances (e.g. "chair" is a basic level category and you physically interact with all chairs in more or less the same ways),
They are learnt earlier than similar non-basic level categories,
They are faster to identify than similar non-basic level categories,
They fall in the middle of our "general to specific" hierarchies.

Now, some of these properties are probably causes of some of the others; for example, we probably form non-basic level categories out of short combinations or restrictions of basic level categories, which would explain the last point. (I've also left out several other properties of basic level categories, such as "having short names", which seem to be transparently derivative.) But even if that is the case, consider all of the things that our poor robot would have to know in order to master any concept (basic level or derived from such):

What features of shapes or mental images do humans use when computing their similarity, and how?
What algorithm do humans use for turning shape similarity into clusters?
Similarly, how do humans structure their sensory-motor actions, how do humans turn that structure into a notion of action similarity, and how do humans turn that similarity into homogeneous clusters?
Finally, how do humans manage the trade-offs between these different criteria, along with any other roles that a human concept needs to play? For example, if there is some part of concept space that I could theoretically divide into A's and B's or into C's and D's, where the A/B division results in more compact mental image clusters than the C/D division, but the C/D division results in more compact sensory-motor affordance clusters than A/B, which do I choose?

Basically, I'm making a "poverty of the stimulus" argument that you are never going to get all of this right from naturally occurring training data (even with sophisticated learning algorithms that are otherwise quite good at predicting the world) unless (1) you are a human, in which case it is encoded in your brain or (2) you are taught it really well -- for example, by being handed a human brain and being smart enough to simulate how it works. [3]

In summary, the reason super-powerful entities fall to a Goodhart-type law is that they attempt to learn a concept from statistical regularities, but because the concept is human and therefore complicated, there is only a small overlap between the learnt and the human concept. The "small" overlap is big enough to contain an arbitrary amount of naturally occurring training data, but is still small in the space of all things that the super-powerful entity can construct. Thus, when the super-powerful entity is asked to generate a random instance from its learnt category, or generates an instance according to some criterion that is blind to the human category (such as the instance which takes the least time or negentropy to create), it will with very high probability make something that doesn't satisfy the human concept (i.e., breaks the connection between the statistical regularities and the concept). [4]

Or, to put it yet another way: ask a genie for a mango, get a pseudo-mango. [5]
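The "small overlap" argument in the summary can be illustrated with a deliberately tiny toy model in Python. All of it is assumed for illustration: objects are tuples of 12 binary features, the human concept demands every feature, the learnt rule inspects only the first three, and construction cost is the number of features present. None of these choices come from the post.

```python
import itertools

N_FEATURES = 12  # hypothetical number of binary features an object can have
N_CHECKED = 3    # features the learnt rule ended up relying on


def human_concept(x):
    """Stand-in for the complicated human concept: every feature must be right."""
    return all(x)


def learnt_concept(x):
    """Short rule that only inspects the first few features."""
    return all(x[:N_CHECKED])


def cost(x):
    """Assumed construction cost: one unit per feature that is present."""
    return sum(x)


# Naturally occurring data: real mangos have every feature, and the
# non-mangos that happen to exist each lack a checked feature.
natural_mangos = [(1,) * N_FEATURES]
natural_non_mangos = [
    (0,) * N_FEATURES,
    (1, 0) * (N_FEATURES // 2),
    (0, 1) * (N_FEATURES // 2),
]
training = natural_mangos + natural_non_mangos
fits_training_data = all(learnt_concept(x) == human_concept(x) for x in training)

# Everything the entity could construct, and the part its rule accepts.
constructible = list(itertools.product((0, 1), repeat=N_FEATURES))
learnt_category = [x for x in constructible if learnt_concept(x)]
overlap_fraction = sum(human_concept(x) for x in learnt_category) / len(learnt_category)

# An instance chosen by a criterion that is blind to the human category.
cheapest = min(learnt_category, key=cost)

print(fits_training_data, overlap_fraction, human_concept(cheapest))
```

In this toy model the learnt rule fits all the training data, yet only one of the 512 objects it accepts satisfies the human concept, and the cheapest accepted object is not that one.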

Let me emphasize that I'm not claiming that this pseudo-mango problem is the only thing we have to worry about when it comes to super-powerful entities, or even that the pseudo-mango problem is hard to solve once you know it exists.

In fact, there are two broad categories of solutions. In the first category are the "make the super-powerful entity smarter" solutions. We have already mentioned one of these, in which the super-powerful entity never attempts to learn what a mango is, but rather does something akin to simulating a human brain and asking the simulation whenever it needs to distinguish between mango and pseudo-mango.

Another potential solution along these lines would be to give the super-powerful entity a whole bunch of human concepts and examples of such, along with instructions to build a compact model that explains all of the concepts and examples. The hope here is that if you gave it enough concepts, the entity would figure out the general structure of human concepts and use that to lock in on the right concept of mango given relatively few examples.

One large drawback of these "be smarter" solutions is that they only work for "non-problematic concepts": concepts where

1. all or almost all humans more or less agree about what the concept means, [6] and
2. they are very unlikely to change their minds in important ways about the concept, even if (A) they know a lot more and/or (B) the universe suddenly becomes populated with lots of boundary cases.

2B is the "Pluto clause": humans are sensitive to the frequency of near-class examples when forming decision boundaries, as evidenced by the fact that Pluto would probably still be considered a planet if our solar system didn't have a Kuiper belt.

Anyway, it's not self-evident that there are any non-problematic concepts outside of mathematics. I think that mango is non-problematic, but I admit to not being entirely sure about #2. I mean, I have a hard time imagining a new set of facts about mangos that would cause me to majorly modify my mango mental map -- maybe if some mangos were sentient or somesuch? -- but that's merely an argument from lack of imagination. [7]

On the other hand, there clearly are lots of problematic concepts. "Moral" or "happy" or "justice", for example. What's worse is that any sane person would want to give a super-powerful entity goals chock full of these sorts of problematic concepts.

Still, it seems like there is a "be really really smarter" solution, which would rely on a simulated human brain to make the mango / non-mango decisions, but augment the brain to also know all of the relevant facts that the super-powerful entity itself knows. If inter-human disagreement about the concept is a problem, you could simulate lots of different human brains and pick an aggregation function suitable to your purpose. This leaves only the concern that there might be "super-problematic concepts", for which there are pertinent pieces of truth that cannot be encoded into neural representations.
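As a rough sketch of what "pick an aggregation function suitable to your purpose" could look like, the Python below treats each simulated brain as a function from a candidate object to a yes/no verdict. The three judges, the candidate's attributes, and the two aggregation rules are hypothetical stand-ins chosen for illustration; real simulated brains would obviously not be one-line lambdas.

```python
def majority(votes):
    """Accept if strictly more than half of the judges say yes."""
    return sum(votes) * 2 > len(votes)


def unanimous(votes):
    """Accept only if every judge says yes."""
    return all(votes)


def is_mango(candidate, judges, aggregate):
    """Ask every simulated judge, then combine the verdicts."""
    return aggregate([judge(candidate) for judge in judges])


# Hypothetical stand-ins for simulated human brains that disagree.
judges = [
    lambda c: c["tastes_like_mango"],
    lambda c: c["tastes_like_mango"] and c["edible"],
    lambda c: c["grew_on_tree"],
]

# A made-up boundary case: tastes right, but was never on a tree.
lab_grown = {"tastes_like_mango": True, "edible": True, "grew_on_tree": False}

by_majority = is_mango(lab_grown, judges, majority)
by_unanimity = is_mango(lab_grown, judges, unanimous)
print(by_majority, by_unanimity)
```

Here the same boundary case passes under majority vote and fails under unanimity, so the choice of aggregation function depends on the purpose: mangos for yourself and mangos for resale might reasonably call for different rules.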

The other class of solutions to our pseudo-mango problem involves making the super-powerful entity less powerful. For example, you might try to forbid the super-powerful entity from creating anything, by limiting its action set to just moving objects around, say. Or you might severely restrict the entity's ability to generalize, by requiring any created mango to be an atom-level duplicate of a training example.

These limitations work around the narrow pseudo-mango problem, but are unsatisfactory in general. Part of the reason to make a super-powerful robot in the first place is to be able to create new things that have never been seen before. In particular, some of the things we want to create are things like "happiness", for which atom-level duplicates do not get the job done.

A better idea for a "be weaker" solution might be called "power regularization": have the super-powerful entity emulate the least powerful entity that could accomplish the task. [8] The presumption is that the least powerful robot that can get me a mango would be the one that uses the easy solution of getting it from the kitchen and not the hard solution of making one from scratch using nanotechnology. You have to be careful about how you define power, though; you can do nanotech with remarkably low expenditures of negentropy. Number of computational steps might be able to fill the role of power here, but is itself hard to define formally for a physical system.
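A minimal Python sketch of power regularization, under loud assumptions: the `Plan` class, the three candidate plans, and their step counts are invented for illustration, and "number of computational steps" is used as the power measure even though, as noted above, it is hard to define for a physical system. Following footnote 8, the powerful entity still does the verification, using a learnt check that would also accept a pseudo-mango.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    description: str
    compute_steps: int  # hypothetical proxy for how much power the plan needs
    result: str         # what the plan ends up handing to the human


def verifies_goal(plan):
    """Learnt goal check run by the powerful entity; fooled by pseudo-mangos."""
    return plan.result in {"mango", "pseudo-mango"}


def least_powerful_plan(plans):
    """Among plans that pass verification, pick the one needing the least power."""
    feasible = [p for p in plans if verifies_goal(p)]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.compute_steps)


# Made-up candidate plans with made-up step counts.
plans = [
    Plan("do nothing", 1, "nothing"),
    Plan("fetch a mango from the kitchen", 10**3, "mango"),
    Plan("assemble a DNA-filled sphere with nanotech", 10**12, "pseudo-mango"),
]

chosen = least_powerful_plan(plans)
print(chosen.description)
```

The sketch only works because the assumed step counts make the kitchen plan cheaper than the nanotech plan; if the power measure were negentropy and nanotech happened to be cheap in it, the same selection rule would pick the pseudo-mango.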

Another power reducing solution is to steal a page from prototype theory and have the super-powerful entity associate a confidence level with its concept labellings. Such a confidence level could be generated, for example, by finding a large set of predicates which correlate with natural mangos, and scoring any potential mango based on how many predicates it satisfies (and perhaps further weighted by the simplicity of the predicates). Any created mango would be required to score very highly on this confidence measure, and if for some reason you wanted to create something that was explicitly not a mango, it would be required to have an extremely low score.

It isn't even necessarily the case that all natural mangos would score highly enough under this criterion. It may be the case, for example, that "not silver" is a very good predicate for predicting mangoness, so that the super-powerful entity could never be confident that any silver thing was a mango; this is not incompatible with there being a small number of naturally occurring silver mangos (painted mangos, perhaps). This is also why "confident concepts" is a power reducing solution: if our goal for the super-powerful entity could only be satisfied by a silver mango, the entity would be unable to achieve the goal.
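One way to picture the confidence measure is the following Python sketch. The four predicates, their "description lengths", the weighting by inverse length, and the 0.95 threshold are all assumptions chosen for illustration; the post only says that the score counts satisfied predicates, perhaps weighted by simplicity.

```python
# (name, test, description_length): shorter predicates get more weight.
PREDICATES = [
    ("not silver", lambda t: t["color"] != "silver", 1),
    ("under one meter", lambda t: t["diameter_m"] < 1.0, 1),
    ("contains mango DNA", lambda t: t["mango_dna"], 2),
    ("grew on a tree", lambda t: t["grew_on_tree"], 2),
]

CREATE_THRESHOLD = 0.95  # hypothetical bar a created mango must clear


def confidence(thing):
    """Simplicity-weighted fraction of mango-correlated predicates satisfied."""
    total = sum(1 / length for _, _, length in PREDICATES)
    met = sum(1 / length for _, test, length in PREDICATES if test(thing))
    return met / total


def may_create_as_mango(thing):
    """The entity only creates things it is highly confident are mangos."""
    return confidence(thing) >= CREATE_THRESHOLD


# Made-up examples: an ordinary mango and a naturally occurring painted one.
ordinary = {"color": "yellow", "diameter_m": 0.1, "mango_dna": True, "grew_on_tree": True}
painted = {"color": "silver", "diameter_m": 0.1, "mango_dna": True, "grew_on_tree": True}

print(confidence(ordinary), may_create_as_mango(ordinary))
print(confidence(painted), may_create_as_mango(painted))
```

The painted mango is a real mango that still falls below the threshold, which is the sense in which this approach gives up power: a goal that requires a silver mango cannot be met.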

My concern with this solution is that it captures only one of the characteristics of human concept structure, while ignoring many others that, while currently less understood, might be important.

The take-home is that while there are clearly some "be weaker" solutions that work for some goals, and there may be some that I haven't thought of, in general I'm currently more bullish about the "be smarter" approaches.

Acknowledgements

The ideas in this post are based on discussions with Anna Salamon and Ben Hoskin.

Footnotes

[1] There are several ideas in economics and the social sciences closely related to Goodhart's Law, such as Campbell's law and the Lucas critique.

[2] The exceptions I'm prepared to make are for mathematical concepts like "five" and maybe some select concepts from physics and chemistry.

[3] If anyone is still unconvinced on the "human concepts are complicated" part, I have a bunch more arguments to that point: (A) The 2000+ year long failure of philosophers to give any good explanation of what human concepts are. [I view prototype theory as giving a description of some properties of human concepts, rather than as a definitive explanation. See for example Jerry Fodor's _Concepts_ for discussion on the weaknesses of prototype-like theories.] (B) The 50+ year long failure of AI researchers to implement any non-trivial human concept in software. (C) Humans have all sorts of weird beliefs about their concepts. For example, essences: humans believe that things like mangos have some sort of invisible set of properties that make mangos mangos. The problem isn't just that this is false; the problem is that believing in essences makes humans draw really strange decision boundaries in order to save appearances.

[4] Note that this problem only occurs for "important" concepts: concepts that are used to specify the super-powerful entity's goals. We don't care if the super-powerful robot uses strange, non-human concepts to predict how the world works as long as the results are what we want.

[5] "Genie" is used here as a synonym for "super-powerful entity". For a related result, see http://hyperboleandahalf.blogspot.com/2010/08/this-comic-was-inspired-by-experience-i.html

[6] For some purposes, this clause isn't essential: if I just want some mangos for myself, and there happened to be widespread disagreement among humans about what things are mangos, I could just have my genie simulate my brain's concept of mango and everything would be just fine. On the other hand, if I wanted my genie to make mangos for resale, this wouldn't work.

[7] Any information along these lines is very likely to not only change what I think a mango is, but also what I would want to do with a mango. This post, however, is explicitly only concerned about the first of these two issues.

[8] Technical note: the super-powerful entity is still the one doing the verification that the final state satisfies the goal; the hope is that the weak entity that it is emulating can only achieve the subset of that goal space that agrees with the human conception of the goal.

Davidmanheim (6y)

Well written - I just wish it had been posted when it was first written!

You may be aware of the post by Scott Garrabrant, and the follow-up paper expanding on it, but if you are not, it formalizes some of the different aspects of Goodhart's law you discussed here. The causal case and the correlational case are not the only ones that matter, and our work differs in that the "not a mango" failure is not really considered a Goodhart's law issue, but it certainly is an underspecified goal in a way similar to what I discussed on Ribbonfarm here, which I noted leads to Goodhart's Law issues.

Gordon Seidoh Worley (6y)

I guess many folks just missed this post. This is a great intro to Goodhart and why it's a problem. I think this is finally the go-to intro to Goodhart for AI discussions I've been looking for. Thanks!

Shmi (6y)

If you are a genie who thinks it has created a new mango, check with a sample of humans if they think it is one. Treat humans like you treat the rest of the world: an object of research and non-invasive hypothesis testing. You are not a super-genie until you understand humans better than they understand themselves, including what makes them tick, what would delight or horrify them. So, a genie who ends up tiling the universe with smiley faces is not a super-genie at all. It failed to understand the basics of the most important part of its universe.

Nicholas Conrad (6y)

"if I just want some mangos for myself, and there happened to be widespread disagreement among humans about what things are mangos, I could just have my genie simulate my brain's concept of mango and everything would be just fine. On the other hand, if I wanted my genie to make mangos for resale, this wouldn't work."

This seems like it would only be true if you yourself don't understand what aspects of quasi-mangoness are desirable on the market. Otherwise your conception of mango that was simulated would include the fuzzy "I don't call subset x 'real' mangos, but lots of people do, and they sell well" data, no?
