Goodhart’s Law and Sign Function Collapse

In my last post, where I offered a semiotic critique of the orthogonality thesis, I noted that optimizing for intelligence is distinct from all other possible goals, and that, therefore, the only goal compatible with all levels of intelligence in principle is optimization for intelligence. However, something about this fact seemed strange to me. Earlier, I had concluded that intelligence itself is a second order sign, but I also stated that all coherent, rational optimizers have utility functions that optimize to first order signs. I realized soon after I published that there is an extremely fruitful line of inquiry here, which is elucidated by applying this framework to Goodhart’s law. I have tentatively concluded the following line of logic, which I will explicate over the course of this post:

  1. All coherent, rational optimizers have a utility function which optimizes to a first order sign. 
  2. Intelligence is a higher order sign. 
  3. A coherent, rational optimizer can therefore not optimize for intelligence. 

If you would like to learn more about semiotics, the science of signs and symbols, I would suggest starting with Umberto Eco’s book A Theory of Semiotics, which serves as a useful introduction and also highlights how the field is based on assumptions and ideas from information theory and cybernetics. In particular, the fact that signs have meaning because they are distinct relative to other signs can be thought of as an extension of the cybernetic concept of variety.


Why coherent, rational optimizers can’t optimize for higher order signs

I believe that all these claims are derivable from the semiotic principle that, to have meaning, signs and symbols must be defined relationally and be distinct. However, to save time, I will rely on a second principle, Goodhart’s Law, which states, “When a measure becomes a target, it ceases to be a good measure,” or, in the form of the LessWrong glossary, “when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.”

Both the original formulation of Goodhart’s Law and the LessWrong version gesture at something more fundamental which I believe escapes them both. If we were to place Goodhart’s Law into a semiotic framework, this might be a simple way of phrasing it: “optimization to first order signs is not optimization to higher order signs; attempts to optimize for higher order signs by directly measuring the higher order sign will result in the higher order sign collapsing to a lower order sign.” As per my last post, first order signs refer to signs that stand in for some real world phenomenon, the way quantities of degrees can stand in for temperature, or the words bright and dim can stand in for the amount of light in a room. Second order signs stand in for other signs; in order to be distinct, such signs need to stand in for at least two other signs. These include things like the idea of life, the idea of eating, and even something that seems as concrete as the theory of relativity. It may seem obvious that all these ideas have some real world correlates, but it’s also the case that something important is lost when we stop treating these things as higher order signs and instead measure those correlates as a stand-in for their meaning.

Let’s say we’re optimizing for a sign in the second order code of the thermostat. What is the second order code in the thermostat? It’s the one which outlines the correlated positions between the sign for temperature (the flex of the piece of metal) and the on/off state of the AC. Because the flexing metal acts as the switch for the AC, there are only two states, assuming the AC is in working order: the metal is flexed/the AC is on, and the metal is not flexed/the AC is off. Up until now, there have been two first order codes and one second order code in operation in the machine. How do we optimize for a goal that is a sign in the second order code? To do so, we need to measure that second order operation: we need to create a sign which corresponds with, for example, the metal flexed/AC on state, and we need a set of preferences to correlate that state with. But suddenly something has shifted here - in this new code, the metal flexed/AC on state is no longer a higher level sign if we are measuring it directly. The metal flexing, while correlated to temperature, no longer stands in for temperature, no longer measures it. The AC on state was correlated with the metal flexing and completing the electrical circuit, but the metal can be flexed and the AC on without one being a sign for the other. Suddenly, we see Goodhart’s Law appear: the metal flexed/AC on state can just as well be accomplished by screwing the metal so it is permanently flexed and attaching it to an already running AC, and a sufficiently intelligent, rational, and coherent agent could arrive at this solution.
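The thermostat example can be sketched in code. This is a minimal illustration, not a real control system: the `World` state, the threshold, and the “screw the metal down” hack are all invented here to show how a directly measured second order sign can be satisfied without the first order signs standing in for anything.

```python
from dataclasses import dataclass

@dataclass
class World:
    temperature_c: float
    metal_screwed_flexed: bool = False  # the Goodhart hack: metal fixed in place
    ac_forced_on: bool = False          # the Goodhart hack: AC wired to run anyway

THRESHOLD_C = 25.0  # an assumed switching temperature

# First order code 1: the metal's flex stands in for temperature.
def metal_flexed(w: World) -> bool:
    return w.metal_screwed_flexed or w.temperature_c >= THRESHOLD_C

# First order code 2: the circuit state stands in for the AC being on.
def ac_on(w: World) -> bool:
    return w.ac_forced_on or metal_flexed(w)

# Second order code: the correlated pair of first order signs.
def second_order_sign(w: World) -> tuple:
    return (metal_flexed(w), ac_on(w))

# An optimizer that targets the *measurement* of the second order sign
# cannot tell these two worlds apart:
honest = World(temperature_c=30.0)  # hot room, thermostat working as a sign
hacked = World(temperature_c=10.0,
               metal_screwed_flexed=True, ac_forced_on=True)  # cold room, rigged

assert second_order_sign(honest) == second_order_sign(hacked) == (True, True)
# In `hacked`, the flexed metal no longer stands in for temperature:
# the sign function has collapsed to its material encoding.
```

The measured target is satisfied in both worlds, but only in the first does the flexing metal still measure anything.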

For an agent with a rational, coherent utility function to order all the possible states of the original higher order code in the thermostat, it is necessary for those states to stand in for some material reality. Measurement happens to be a way to create such a sign, and indeed, measurement is required to check that it really is standing in for material reality in a true way. But standing something in for material reality, or measuring, is the process of creating a first order code: it is creating signs directly correlated to some phenomenon, rather than creating signs for signs that stand in for that phenomenon. Measuring a higher order code means that the signs of lower order codes no longer compose it; the original higher order code is now a first order code. We can call this process Sign Function Collapse, and this is what fundamentally underlies Goodhart’s Law. If the thing being measured were the thing being optimized, there would be nothing counterintuitive about it. In order for the original goal to be captured by the first order sign created by measurement, it must be correlated with at least one other sign in addition to the proxy being measured, which is to say, it must be a higher order sign.

Why is sign function collapse a problem? After all, if the first order signs accurately signified underlying reality, shouldn’t a measurement of the second order sign represent the same thing as the two first order signs combined? The problem is that the second order sign is a representation of a representation of reality, and in real life, signification, including first order signification, has to be an active process, which is to say, measurement is an active process. Checking that the second order sign corresponds to reality reduces that sign to the particular material reality which represents those first order signs. To use the second order sign “correctly,” we would instead check that the first order signs that compose it correspond to reality. If we do so, and make the second order sign of metal flexed/AC on our goal, then we would simply place a heat source in the room near the thermostat.

We can see sign function collapse in all the examples Eliezer Yudkowsky uses in his essay “Coherent decisions imply consistent utilities”. At first glance, the coherent utility optimization taking place in the example of the subject ordering pizza is optimizing over a second order sign: some combination of either money and pizza type, or time and pizza type. But this is actually an example of sign function collapse. To show why, let’s take two numeric variables, which we can imagine as two first order codes, and try to rank them into a list. What we would do is list every possible combination of the two variables and rank those combinations. Now we are no longer dealing with two lists, two first order codes, but one list, one first order code. Keep in mind that both of the original codes were also supposed to stand for something; they were the results of measurements of material reality in some way, through some sensor in a machine or biological organism. If we try to optimize for a sign in the second order code and implement that by collapsing the sign function, it means that we’re selecting a world state which isn’t necessarily the same as the one the original first order codes stood for. Which means that, if we wanted to make a second order sign our goal in the “authentic” way described above, by placing the heater next to the thermostat, there’s no guarantee we could do it by ordering world states and selecting the one closest to the second order sign we want.
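The collapse of two codes into one can be made concrete. In this sketch, the variables, values, and trade-off rate are all invented for illustration; the point is only that once every combination is ranked by a single utility function, there is one list where there used to be two.

```python
from itertools import product

# Two first order codes: hypothetical measurements of money and time.
prices = [0, 5, 10]    # dollars
minutes = [10, 30]     # wait times

# Every combination of the two variables becomes a single sign...
combinations = list(product(prices, minutes))

# ...ranked by one utility function with an assumed trade-off rate:
# now there is one list, one first order code, not two.
def utility(pair):
    price, mins = pair
    return -(price + 0.2 * mins)

ranked = sorted(combinations, key=utility, reverse=True)
print(ranked[0])  # the single world state the collapsed code selects: (0, 10)
```

The collapsed ranking can only ever select among these six composite states; whatever the original codes separately stood for is no longer represented in it.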

If we try to rank higher order signs by utility without collapsing the sign function, then we make inconsistencies in the ranking of world-states possible. Indeed, if we create one sign that’s meant to stand in for a given agent’s first order utility, and a sign meant to stand in for that agent’s higher order utility, then to the extent they create distinct utility functions when correlated to world-states, and are thus meaningful signs, we can say that any such distinct higher order utility function is irrational and incoherent according to the definition used in the formal decision theory employed by Yudkowsky (that we get the world-state we value most). I do not know if it is possible to prove definitively that all agents capable of higher order utility have utility functions distinct from their first order utility functions, but we can see empirically that this is generally true for humans.

Yudkowsky comes to similar conclusions: a rational, coherent agent with a consistent utility function wouldn’t take actions that result in real things it wants not being maximized. The realness of the things here is important; these are things we can measure, that we can represent with a first order sign. I’d like to quote him at length, as it is a useful illustration of the distinction being drawn between first and higher order signs:

“Another possible excuse for certainty bias might be to say: "Well, I value the emotional feeling of certainty."

In real life, we do have emotions that are directly about probabilities, and those little flashes of happiness or sadness are worth something if you care about people being happy or sad. If you say that you value the emotional feeling of being certain of getting $1 million, the freedom from the fear of getting $0, for the minute that the dilemma lasts and you are experiencing the emotion—well, that may just be a fact about what you value, even if it exists outside the expected utility formalism.

And this genuinely does not fit into the expected utility formalism. In an expected utility agent, probabilities are just thingies-you-multiply-utilities-by. If those thingies start generating their own utilities once represented inside the mind of the person who is an object of ethical value, you really are going to get results that are incompatible with the formal decision theory. [emphasis mine]

However, not being viewable as an expected utility agent does always correspond to employing dominated strategies. You are giving up something in exchange, if you pursue that feeling of certainty. You are potentially losing all the real value you could have gained from another $4 million, if that realized future actually would have gained you more than one-ninth the value of the first $1 million. Is a fleeting emotional sense of certainty over 1 minute, worth automatically discarding the potential $5-million outcome? Even if the correct answer given your values is that you properly ought to take the $1 million, treasuring 1 minute of emotional gratification doesn't seem like the wise reason to do that. The wise reason would be if the first $1 million really was worth that much more than the next $4 million.

The danger of saying, "Oh, well, I attach a lot of utility to that comfortable feeling of certainty, so my choices are coherent after all" is not that it's mathematically improper to value the emotions we feel while we're deciding. Rather, by saying that the most valuable stakes are the emotions you feel during the minute you make the decision, what you're saying is, "I get a huge amount of value by making decisions however humans instinctively make their decisions, and that's much more important than the thing I'm making a decision about." This could well be true for something like buying a stuffed animal. If millions of dollars or human lives are at stake, maybe not so much.”

Making a second order sign a goal and assigning it utility without sign function collapse results in “dominated strategies” and therefore incoherent, irrational behavior. It is also clear from Yudkowsky’s example that this is something humans do quite often. Where he’s mistaken is just where the limit of this signification lies. Consider the question of human lives: for nearly all people, human life is a higher order sign that has utility. People value human life in the abstract because it stands in for a bunch of things that have independent meaning to them as first order signs, things like empathy and intimacy, what have you. Valuing human life as a first order sign essentially means valuing a global population counter, which people might value a little bit, but not as much as those other things. Even Yudkowsky’s entire goal in writing that essay was in the service of a higher order sign; I can say with near certainty that if he did not value the process of optimizing utility for real things (the higher order sign in question), he would not have written it.

Why intelligence is a higher order sign

Valuing higher order signs does sometimes produce circular, self-defeating behavior, as pretty much any pattern of behavior can be a higher order sign, but it also produces a lot of very novel things. As per Yudkowsky’s examples, a utility function, and optimization itself, are higher order signs, which also means something like evolution is a higher order sign. What would it mean to optimize for evolution? Certainly, you couldn’t do it by just measuring world states/phenomena we correlate with rapid evolution; that would run into the same problems of sign function collapse. You’d have to do it by making that higher level sign of evolution a goal. For experimental evolutionary biologists cultivating things in a lab, this is a practical matter. Each experimental set-up is a first order sign, and over time these first order signs will change the meaning of the higher order sign of evolution through association with it. Just as importantly, the higher order sign guides the creation of these first order signs; it gives the scientists an idea of what to look for.

You may have noticed that there can be lots of different first order signs for the same thing: Celsius, Kelvin, and Fahrenheit, as well as hot and cold, are all first order codes for the phenomenon of temperature. As you can see, some of these signs are more precise than others, and they have different sets of meanings by association with other signs. The movement from hot and cold to precisely measured degrees of temperature required the use of higher order signs. In the case of Fahrenheit, this was through the use of a mercury thermometer, where the expansion and contraction of the mercury in the thermometer was a first order sign, and the Fahrenheit scale broke up that expansion and contraction into a discrete second order code. Most people can take this chain of signification for granted, and collapse the Fahrenheit and mercury thermometer sign function such that it directly stands for temperature, but scientists know that mercury thermometers have limits; for one, the mercury boils at high enough temperatures. It wouldn’t make sense for scientists to collapse the thermometer sign function; they need to get at a theory of temperature as a second order sign. Accordingly, they keep inventing better ways of measuring temperature, because their understanding of reality is informed by more than the first order signs currently available to them: it is also informed by the connections between those signs. These connections which form higher order signs can be extremely sophisticated, and hence powerful.
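The thermometer’s chain of signification can be sketched as two layered functions. The column lengths and calibration points here are invented for illustration; real thermometers are calibrated differently, and the linearity assumption only holds inside the mercury’s working range.

```python
# Assumed column lengths at the two Fahrenheit calibration points (32°F, 212°F).
FREEZING_MM, BOILING_MM = 20.0, 120.0

# First order sign: the mercury column's length stands in for temperature,
# via (assumed) linear expansion within the working range.
def column_length_mm(true_temp_f: float) -> float:
    return FREEZING_MM + (true_temp_f - 32.0) * (BOILING_MM - FREEZING_MM) / 180.0

# Second order code: the Fahrenheit scale breaks that continuous first order
# sign into discrete degrees.
def read_fahrenheit(length_mm: float) -> int:
    return round(32.0 + (length_mm - FREEZING_MM) * 180.0 / (BOILING_MM - FREEZING_MM))

print(read_fahrenheit(column_length_mm(98.6)))  # 99
# Outside the working range the first order sign simply fails (the mercury
# boils), which is why the scale cannot be collapsed into temperature itself.
```

The reading is only a sign of temperature so long as the column length remains a sign of temperature; the scale inherits every limit of the first order code beneath it.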

It may be objected that a concept like temperature cannot be a second order sign because it corresponds to a real thing. In one sense this is true; as I mentioned, sometimes we do use temperature as a first order sign (specific signifiers can be matched with more than one signified, such that a sign in one context might be first order and higher order in another). But in another sense, this attitude conflates realism with first order codes. In what sense is the theory of general relativity real? Well, in a stronger sense than Newton’s law of universal gravitation, but less so than the direct observations we have of the phenomena it is supposed to predict (if it did not answer to those observations, we could have no recourse to reject or accept it). To the extent that a sign can be used to lie, it can be used to tell the truth, which is to say, to the extent higher order signs are meaningful, they can be used to express things which reflect reality, and things which don’t. Things that are purely representational in the human mind, that is, representing only other signs, can refer to real objects, and in fact most of the real things that we can think of aren’t innately intelligible from pure observation, that is, from first order codes by themselves.

As I established in the last post, any feedback mechanism, as a system of codes, is going to be a higher order code, as it necessarily involves correlating at least two first order signs together. This means that any definition of intelligence which makes it some sort of feedback mechanism makes intelligence a second order sign. Additionally, any definition of intelligence which relates it to planning ability relative to a given goal, as Bostrom’s does, will also be a second order sign, as it involves ranking possible plans against the goal. What about compression? Surely, if we optimize for compression in terms of representation of reality, we are optimizing for intelligence. Notice, though, that I had to sneak “of reality” in there, which means we have to measure reality first to maximize any type of compression of it, and since any measure of compression is a measure of how well signs correspond to reality, that’s a second order sign too. Not to mention that trying to find the most efficient algorithm for a given problem is the type of thing subject to the halting problem, and therefore it is impossible to say we know what the most intelligent possible configuration of matter is, much less enumerate all the possible configurations. What sign function collapse says is that, far from merely not knowing what the best possible configuration of matter for intelligence is, we cannot even know what a better configuration of intelligence is without higher order signs.

To understand why, we can look at Yudkowsky’s explanation of reductive materialism:

“So is the 747 made of something other than quarks?  No, you're just modeling it with representational elements that do not have a one-to-one correspondence with the quarks of the 747.  The map is not the territory.”

“But this is just the brain trying to efficiently compress an object that it cannot remotely begin to model on a fundamental level.  The airplane is too large.  Even a hydrogen atom would be too large.  Quark-to-quark interactions are insanely intractable.  You can't handle the truth.

But the way physics really works, as far as we can tell, is that there is only the most basic level—the elementary particle fields and fundamental forces.  You can't handle the raw truth, but reality can handle it without the slightest simplification.  (I wish I knew where Reality got its computing power.)

The laws of physics do not contain distinct additional causal entities that correspond to lift or airplane wings, the way that the mind of an engineer contains distinct additional cognitive entities that correspond to lift or airplane wings.

This, as I see it, is the thesis of reductionism.  Reductionism is not a positive belief, but rather, a disbelief that the higher levels of simplified multilevel models are out there in the territory.  Understanding this on a gut level dissolves the question of "How can you say the airplane doesn't really have wings, when I can see the wings right there?"  The critical words are really and see.”

Everything he says here is true, but the trouble is what remains unsaid; that is, the trouble is that he doesn’t take his reductionism far enough. He does not mention that there is no entity of representation in the laws of physics, no representation of reality; there is no such thing as a “map” in the realm of quarks. A map as an idea is a “higher level of a simplified multilevel model.” And if that’s the case, the question of “how to build a better map” isn’t one that can be answered only with reference to physics or any representation of the base level of reality.

This is the crux of the issue: the process of signification, of creating signs and codes, necessarily means cutting up reality into chunks, or sets, or categories. Without this division, we couldn’t have meaningful units of information. But these divisions do not exist in reality, except in the way the representation is materially encoded onto a smaller part of reality. Even when we speak of a quark, why should we speak of it as a specific entity when we could speak of the quantum wave function? And why speak of one particular wave function when all that really exists is the wave function for the whole universe? If we said that the universe stood for itself as a symbol, it wouldn’t be a meaningful one; there is nothing you could compare it to which would be “not-universe.” In order to tell the truth, you have to be able to lie, and all lies are things which “don’t exist out there.” Even first order codes need signs to represent the non-true states. So if you want to say a code is real or not by measuring it, even a first order code, you’ll always end up in sign function collapse, reducing a sign to its material encoding rather than the thing it represents.

Accordingly, there is no purely physical process which maximizes intelligence. Even evolution does not maximize for it, because evolution is simply the process by which reproducing systems become more correlated to their environment through adaptations. We live in a particular pocket of time where intelligence might increase evolutionary fitness, and therefore there is a pressure, up to a point, for more intelligence. Just like any other measure, evolutionary fitness as expressed through success in reproduction and survival is just a correlate rather than the thing itself, and this correlation will collapse at certain points. If we wanted to pursue intelligence as a goal, the only way would be to value it as a higher order sign. And that’s precisely what we see empirically in the field of AI research.


The ability to pursue higher order signs as goals is what allows us to create richer representations of reality; it is what encompasses the whole project of science, for one. As was mentioned in my critique of the orthogonality thesis, finding value in second order signs is the only way for an agent to be arbitrarily curious or intelligent. It’s for this reason that I reject Yudkowsky’s conclusion that “probabilities, utility functions, and expected utility” are properties a machine intelligence should have if it isn’t engaged in circular, self-destructive behavior. Instead, I offer a different property any sufficiently advanced machine intelligence would have: higher order signification and utility.

This post originally appeared on my blog

Comments

So if I understand your point correctly, you expect something like "give me more compute" to at some point fail to deliver more intelligence, since intelligence isn't just "more compute"?

Yes. And in one sense that is trivial; there are plenty of algorithms you can run on extremely large compute that do not lead to intelligent behavior. But in another sense it is non-trivial, because all the algorithms we have that essentially "create maps," as in representations of some reality, need the domain they are supposed to represent or learn to be specified for them. In order to create arbitrary domains, an agent needs to make second order signs its goal - see my last post.

Then I wonder, at what point does that matter?  Or more specifically, when does that matter in the context of AI risk?

Clearly there is some relationship between something like "more compute" and "more intelligence," since something too simple cannot be intelligent, but I don't know where that relationship breaks down.  Evolution clearly found a path for optimizing intelligence via proxy in our brains, and I think the fear is that you may yet be able to go quite a bit further than human-level intelligence before the extra compute fails to deliver the more meaningful intelligence described in your post.

It seems premature to reject the orthogonality thesis while optimizing for things that "obviously bring more intelligence" has yet to break down.

I think we're seeing where that relationship breaks down presently, specifically between compute and intelligence: while it's difficult to see what's happening inside the top AI companies, it seems like they're developing new systems and techniques, not just scaling up the same stuff anymore. In principle, though, I'm not sure it's possible to know in advance when such a correlation will break down, unless you have a deeper model of the relationship between those correlates (first order signs) and the higher level concept in question, which, in this case, we do not.

As for the orthogonality thesis, my first goal was to dispute its logic, but I think there are also some very practical lessons here. From what I can tell, the limit on intelligence created by an inability to create higher order values kicks in at a pretty basic level, and relates to the limits we see emerge in all current machine learning and LLM-based AI on out-of-distribution tasks. Up till now, we've just found ways to procure more data to train on, but if machine agents can never be arbitrarily curious, the way humans are through making higher order signs our goals, then they'll never be more generally intelligent than us.