**Clarifying Thoughts on Optimizing and Goodhart Effects - Part 2**

Previous Post: Re-introducing Selection vs Control for Optimization. In that post, I reviewed Abram's selection/control distinction and suggested how it relates to actual design. I then argued that there is a bit of a continuum between the two cases, and that we should add an additional extreme case to the typology: direct solution.

Here, I will revisit the question of what optimization means.

NOTE: This is not completely new content; it is split off from the previous version and rewritten to include an *(Added)* discussion of Eliezer's definition for measuring optimization power, from 2008. Hopefully this will make the sequence clearer for future readers.

In the next post, Applying over-Optimization in Selection and Control, I apply these ideas, and concretize the discussion a bit more before moving on to discussing Mesa-Optimizers in Part 4.

## What does Optimization Mean, Again?

This question has been discussed a bit, but I still don't think it's clear. So I want to start by revisiting a post Eliezer wrote in 2008, where he suggested that optimization power was the ability to select states from a preference ordering over different states, and could be measured with entropy. He notes that this is not computable, but gives us insight. I agree, except that I think that the notion of the state space is difficult, for some of the reasons Scott discussed when he mentioned that he was confused about the relationship between gradient descent and Goodhart's law. In doing so, Scott proposed a naive model that looks very similar to Eliezer's:

"... simple proxy of "sample points until I get one with a large U value" or "sample n points, and [select] the one with the largest U value" when I think about what it means to optimize something for U. I might even say something like "n bits of optimization" to refer to sampling 2^n points. I think this is not a very good proxy for what most forms of optimization look like."

I want to start by noting that this is absolutely and completely a "selection" type of optimization, in Abram's terms. As Scott noted, however, it's not a good model for what most optimization looks like, and that's part of why I think Eliezer's model is less helpful than I thought it was when I originally read it.
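To make the naive model concrete, here is a minimal sketch of best-of-n sampling. The utility function, the uniform sampler, and the convention that "k bits" corresponds to 2^k samples are all my own illustrative assumptions, not anything taken from Scott's or Eliezer's posts.

```python
import math
import random

def best_of_n(utility, sample, n, rng):
    """Draw n candidate points and keep the one with the largest utility."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=utility)

def bits_of_selection(n):
    """Assumed convention: choosing the best of n samples counts as log2(n) bits."""
    return math.log2(n)

# Hypothetical utility with a single peak at x = 0.7 (illustration only).
def utility(x):
    return -(x - 0.7) ** 2

# Hypothetical sampler: uniform over the unit interval.
def sample(rng):
    return rng.uniform(0.0, 1.0)

rng = random.Random(0)
for k in (1, 4, 8, 12):
    n = 2 ** k
    x = best_of_n(utility, sample, n, rng)
    print(f"{k:2d} bits ({n:5d} samples): x = {x:.4f}, U = {utility(x):.6f}")
```

Note that nothing here navigates the space. Every candidate is generated blindly and only then compared, which is what makes this pure selection.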

There's a much better model for gradient descent optimization, which is... gradient descent. It is a bit closer to control than direct optimization, since in some sense we're navigating through the space, but for almost all actual applications, it is still selection, not control. To review how it works, points are chosen iteratively, and the gradient is assessed at each point. The gradient is used to *select* the next point (perhaps in a very clever, dynamically chosen way). A stopping criterion is checked, and the process iterates at that new point. This is almost always tons more efficient than generating random points and examining them.
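The loop described above can be sketched in a few lines. This is only a toy: the quadratic loss, the fixed step size, and the gradient-norm stopping rule are assumptions I picked for illustration, and a random-search baseline is included for comparison.

```python
import random

def f(x, y):
    """Hypothetical loss to minimize: a quadratic bowl with its minimum at (3, -2)."""
    return (x - 3.0) ** 2 + (y + 2.0) ** 2

def grad_f(x, y):
    """Analytic gradient of f."""
    return 2.0 * (x - 3.0), 2.0 * (y + 2.0)

def gradient_descent(start, step=0.1, tol=1e-6, max_iters=10_000):
    """Iteratively select the next point using the gradient at the current one.

    Returns the final point and the number of update steps taken.
    """
    x, y = start
    for i in range(max_iters):
        gx, gy = grad_f(x, y)
        # Stopping criterion: the gradient is (nearly) zero.
        if (gx * gx + gy * gy) ** 0.5 < tol:
            return (x, y), i
        # Select the next point by stepping against the gradient.
        x, y = x - step * gx, y - step * gy
    return (x, y), max_iters

def random_search(n, rng, low=-10.0, high=10.0):
    """Baseline: generate n random points and keep the best one."""
    best = None
    for _ in range(n):
        p = (rng.uniform(low, high), rng.uniform(low, high))
        if best is None or f(*p) < f(*best):
            best = p
    return best

point, steps = gradient_descent((-8.0, 9.0))
baseline = random_search(100, random.Random(0))
print("gradient descent:", point, "after", steps, "steps, loss", f(*point))
print("random search   :", baseline, "after 100 samples, loss", f(*baseline))
```

On a landscape this friendly the comparison is lopsided, which is the point: the gradient tells us where to look next, so far fewer evaluations are spent on uninteresting regions.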

*(Added)* It's far better than a grid search, usually, for most landscapes, but also makes it clear why I think it's hard to discuss optimization power in Eliezer's terms on a practical level, at least when dealing with a continuous system. The problem I'm alluding to is that any list of preferences over states depends on the number of states. Gradient descent type optimization is really good at focusing on specific sections of the state space, especially compared to grid search. We might find a state where grid search would require a tremendously high resolution, but we don't ever compute a preference ordering over states. With gradient descent, we instead compute preferences for a local area and (hopefully) zoom in, potentially ignoring other parts of the space. An optimizer that focuses very narrowly can have high resolution but miss the non-adjacent region with far better outcomes, or can have fairly low resolution but perform far better - and the second optimizer is clearly more powerful, but I don't know how to capture this.
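A toy version of that last contrast, under assumptions of my own: a made-up one-dimensional landscape with a modest peak near 0.2 and a much better one near 0.8, a fine-grained hill climber standing in for the narrowly focused optimizer, and an 11-point grid standing in for the low-resolution one.

```python
import math

def outcome(x):
    """Hypothetical landscape on [0, 1]: a modest peak near 0.2, a far better one near 0.8."""
    return (math.exp(-((x - 0.2) / 0.05) ** 2)
            + 3.0 * math.exp(-((x - 0.8) / 0.05) ** 2))

def hill_climb(x, step=1e-4, max_iters=100_000):
    """Narrow, high-resolution local search: move to a strictly better neighbour."""
    for _ in range(max_iters):
        # x is listed first so that ties keep us where we are.
        best = max((x, x - step, x + step), key=outcome)
        if best == x:
            break
        x = best
    return x

def coarse_grid(n_points=11):
    """Low-resolution global search over an evenly spaced grid on [0, 1]."""
    grid = [i / (n_points - 1) for i in range(n_points)]
    return max(grid, key=outcome)

local = hill_climb(0.1)    # starts near the modest peak and never leaves it
coarse = coarse_grid()     # only 11 evaluations, but they span the whole space
print(f"narrow search: x = {local:.4f}, outcome = {outcome(local):.3f}")
print(f"coarse grid  : x = {coarse:.4f}, outcome = {outcome(coarse):.3f}")
```

The hill climber resolves its peak to within 1e-4, a resolution a grid would need about ten thousand points to match, yet it ends up with the worse outcome. Counting states visited or resolution achieved does not tell us which optimizer did better here.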

But to return to the main discussion, the process of gradient descent is also somewhere between selection and control - and that's what I want to explain.

In theory, the evaluation of each point in the test space could involve an actual check of the system. I build each rocket and watch to see whether it fails or succeeds according to my metric. For search, I'd just pick the best performers, and for more clever approaches, I can do something like find a gradient by judging performance of parameters to see if increasing or decreasing those that are amenable to improvement would help. (I can be even more inefficient, but find something more like a gradient, by building many similar rockets, each an epsilon away in several dimensions, and estimating a gradient that way. Shudder.)
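The "build many similar rockets" approach is essentially a finite-difference gradient estimate. Here is a rough sketch, where `rocket_score` is a hypothetical stand-in for building and flying a rocket, and the parameter names and numbers are invented purely for illustration.

```python
def finite_difference_gradient(evaluate, params, eps=1e-3):
    """Estimate the gradient by evaluating one nudged design per dimension.

    Uses forward differences, so it costs len(params) + 1 evaluations.
    """
    base = evaluate(params)
    grad = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += eps  # a design that is epsilon away in one dimension
        grad.append((evaluate(nudged) - base) / eps)
    return grad

class CountingEvaluator:
    """Wraps a metric and counts how often it is called (i.e. rockets built)."""

    def __init__(self, metric):
        self.metric = metric
        self.calls = 0

    def __call__(self, params):
        self.calls += 1
        return self.metric(params)

def rocket_score(params):
    """Hypothetical performance metric; higher is better."""
    thrust, fuel, fin_area = params
    return -(thrust - 2.0) ** 2 - (fuel - 5.0) ** 2 - 0.5 * (fin_area - 1.0) ** 2

evaluate = CountingEvaluator(rocket_score)
grad = finite_difference_gradient(evaluate, [1.0, 4.0, 2.0])
print("estimated gradient:", grad)
print("rockets built for one gradient estimate:", evaluate.calls)
```

Each gradient estimate costs one evaluation per design dimension plus one for the baseline, and that is for a single step. With real rockets, this is where the shudder comes from.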

In practice, we use a proxy model - and this is one place that allows for the types of overoptimization misalignment we are discussing. (But it's not the only one.) The reason this occurs is laid out clearly in the Categorizing Goodhart paper as one of the two classes of extremal failure - either model insufficiency, or regime change. This also allows for causal failures (undetectable during simulation), if the proxy model gets a causal effect wrong.

Even without using a proxy model, we can be led astray by the results if we are not careful. Rockets might look great, even in practice, and only fail in untested scenarios because we optimized something too hard - extremal model insufficiency. (Lower weight is cheaper, and we didn't notice a specific structural weakness induced by ruthlessly eliminating weight on the structure.)

For our purposes, we want to talk about things like "how much optimization pressure is being applied." This is difficult, and I think we're trying to fit incompatible conceptual models together rather than finding a good synthesis, but I have a few ideas on what selection pressure leading to extremal regions means here.

- Extreme proxy values (in comparison to most of the space) seem similar to having lots of selection pressure. If we have an insanely tall and narrow peak, we may be finding something strange rather than simply improving.
- Extreme input values (unboundedly large or small values) may indicate a worrying area vis-a-vis overoptimization failures.
- Lots of search time alone does NOT indicate extremal results - it indicates lots of things about your domain, and perhaps the inefficiency of your search, but not overoptimization. (This is in contrast to the naive grid-search model, where lots of points visited means more optimizing.)
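As a toy illustration of the rocket-weight example and of the extreme-input warning sign, here is a sketch in which the proxy says lighter is always better, while a hypothetical ground truth includes a structural failure below some weight. The threshold, the failure cost, and the tested range are all invented numbers.

```python
def proxy_cost(weight):
    """Proxy model: lighter is always cheaper. This is what we optimize."""
    return weight

def true_cost(weight, min_safe_weight=40.0, failure_cost=1_000.0):
    """Hypothetical ground truth: below some weight, the structure fails."""
    if weight >= min_safe_weight:
        return weight
    return weight + failure_cost

# Assumed range of designs on which the proxy was actually validated.
TESTED_RANGE = (50.0, 100.0)

def minimize_proxy(candidates):
    """Select the candidate with the lowest proxy cost."""
    return min(candidates, key=proxy_cost)

def outside_tested_range(weight, tested=TESTED_RANGE):
    """Warning sign: the chosen input lies outside anything we have checked."""
    low, high = tested
    return not (low <= weight <= high)

mild = minimize_proxy([float(w) for w in range(50, 101, 5)])  # stays in the tested region
hard = minimize_proxy([float(w) for w in range(10, 101, 5)])  # pushes to the extreme

for label, w in (("mild", mild), ("hard", hard)):
    print(f"{label}: weight = {w}, proxy = {proxy_cost(w)}, "
          f"true = {true_cost(w)}, outside tested range = {outside_tested_range(w)}")
```

Both searches are equally cheap. What separates them is how far the selected point sits from the region where the proxy was known to track the real goal, not how long the search ran.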

As an aside, Causal Goodhart is different. It doesn't really seem to rely on extremes, but rather on manipulating new variables, ones that could have an impact on our goal. This can happen because we change the value to a point where it changes the system, similar to extremal Goodhart, but does not need to. For instance, we might optimize filling a cup by getting the water level near the top. Extremal regime change failure might be overfilling the cup and having water spill everywhere. Causal failure might be moving the cup to a different point, say right next to a wall, in order to capture more water, but accidentally breaking the cup against the wall.

Notice that this doesn't require much optimization pressure - Causal Goodhart is about moving to a new region of the distribution of outcomes by (metaphorically or literally) breaking something in the causal structure, rather than by over-optimizing and pushing far from the points that have been explored.

This completes the discussion so far - and note that none of this is about control systems. That's because in a sense, most current examples don't optimize much, they simply execute an adaptive program.

One critical case of a control system optimizing is a mesa-optimizer, but that will be deferred until after the next post, which introduces some examples and intuitions around how Goodhart-failures occur in selection versus control systems.