Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Clarifying Thoughts on Optimizing and Goodhart Effects - Part 3

Previous Posts: Re-introducing Selection vs Control for Optimization, What does Optimization Mean, Again?

Following the previous two posts, I'm going to first lay out how Goodhart's Law applies to the earlier rocket example, then explain why the failures differ between selection and control. (Note: Adversarial Goodhart isn't explored, because we want to keep the setting sufficiently simple.) This sets up the next post, which will discuss mesa-optimizers.

Revisiting Selection vs. Control Systems

Basically everything in the earlier post that used the example process of rocket design and launching is susceptible to some form of overoptimization, though in different ways. Interestingly, there seem to be clear places where different types of overoptimization matter. Before looking at this, I want to revisit the selection-control dichotomy from a new angle.

In a (pure) control system, we cannot sample datapoints without navigating to them. If the agent is an embedded agent with sufficient span of control to cause changes in the environment, we cannot necessarily reset and try again. In a selection system, by contrast, we only sample points in ways that do not affect the larger system. Even when designing a rocket, our very expensive testing has approximately no longer-term effects. (We'll leave space debris from failures aside for now, but get back to it below.)
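To make the distinction concrete, here is a minimal sketch; the simulator, environment, and all numbers are hypothetical stand-ins rather than anything from the earlier posts. Selection evaluates candidates against a side-effect-free simulator and can always sample more; control acts on a stateful environment it cannot reset.

```python
import random

# --- Selection: sample candidate designs, score them in a simulator, keep the best.
# Evaluating a candidate changes nothing about the world; we can always evaluate more.

def simulate_design(design):
    """Hypothetical stand-in for an expensive but side-effect-free simulation."""
    return -(design - 3.7) ** 2  # pretend 3.7 is the ideal parameter value

candidates = [random.uniform(0, 10) for _ in range(1000)]
best_design = max(candidates, key=simulate_design)
print("best design found by selection:", best_design)

# --- Control: act in an environment whose state we actually change.
# There is no "evaluate another candidate"; each action moves us somewhere new,
# and we cannot reset to the initial state.

class RocketEnvironment:
    def __init__(self):
        self.altitude = 0.0
        self.fuel = 100.0

    def step(self, burn):
        """Burning fuel is irreversible: the state after this call is the only state we have."""
        self.fuel -= burn
        self.altitude += burn * 0.5
        return self.altitude, self.fuel

env = RocketEnvironment()
for _ in range(10):
    altitude, fuel = env.step(burn=5.0)  # a real controller would choose burn from observations
print("final state:", altitude, fuel)
```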

This explains why we potentially care about control systems more than selection systems. It also points to why Oracles are supposed to be safer than other AIs: they can't directly impact anything, so they produce their output in a pure selection framework. Of course, if they are sufficiently powerful and are relied on, the changes made on the basis of their outputs become irreversible, which is why Oracles are not a clear solution to AI safety.

Goodhart in Selection vs. Control Systems

Regressional and Extremal Goodhart are particularly pernicious for selection, and potentially less worrying for control. Regressional Goodhart is always present if we are insufficiently aware of our goals, but in general Causal Goodhart failures seem more critical in control, because the control system's goal is often narrower. To keep this concrete, I'll go through the classes of failure and note how they could occur at each stage of rocket design. To do so, we need to clarify the goals at each stage. Our goal in stage 1 is to find a class of designs and paths to optimize. In stage 2, we build, test, and refine a system. In many ways, this stage is intended to circumvent Goodhart failures, but testing does not always address extremal cases, so our design may still fail.

Regressional Goodhart hits us if there is any divergence between our metric and our actual goal. For example, in stages 1 and 2, the optimization might find an ideal but complex or chaotic path that depends on the exact positions of planets in a multibody system, or a path involving excessive G-forces or other dangers might be more fuel efficient than a simpler one; a gravitational slingshot around the sun might be cheap, but fry or crush the astronauts. Alternatively, the optimization might find a design whose shape does not allow people to fit inside. Each of these impacts goals that were potentially not included in the model. Regressional Goodhart is less common in control in this case, since we kept the mesa-optimizer limited to optimizing a very narrow goal already chosen by the design-optimization.
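As a toy illustration of this kind of metric-goal divergence (the functions and numbers below are invented for the sketch, not taken from the post): optimizing a fuel-efficiency proxy over many candidate paths reliably picks a path that scores badly on the true goal once G-forces are accounted for.

```python
import random

# Proxy: fuel efficiency only. True goal: fuel efficiency minus a G-force penalty
# that the proxy leaves out. All numbers are made up for illustration.

def fuel_efficiency(path):
    return -abs(path - 8.0)                        # paths near 8 are most fuel efficient

def true_value(path):
    g_force_penalty = max(0.0, path - 5.0) ** 2    # paths above 5 impose growing G-force costs
    return fuel_efficiency(path) - g_force_penalty

paths = [random.uniform(0, 10) for _ in range(10_000)]
chosen = max(paths, key=fuel_efficiency)            # optimize the proxy...
print("true value of chosen path:", true_value(chosen))
print("best achievable true value:", max(true_value(p) for p in paths))
```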

Extremal Goodhart is always a model failure. It can occur because the model is insufficiently accurate (model insufficiency), or because there is a regime change. Regime changes seem particularly challenging in systems that design mesa-optimizers, since I think the mesa-optimizer is narrower in some way than the global optimizer (if not, it would be more efficient to have an executing system rather than a mesa-optimizer).
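Here is a deliberately crude sketch of a regime change (the payload model and the cutoff are made up): a model that is roughly accurate for ordinary burn times silently fails once the optimizer pushes into the extreme regime.

```python
# Toy extremal Goodhart: the linear payload model holds in the ordinary regime,
# but optimization drives us into a regime where the structure fails.

def modeled_payload(burn_time):
    return 2.0 * burn_time                      # model: more burn, more payload, forever

def actual_payload(burn_time):
    if burn_time <= 100:                         # ordinary regime: model is roughly right
        return 2.0 * burn_time
    return 0.0                                   # new regime: the tank ruptures

candidate_burns = range(0, 500, 10)
best_by_model = max(candidate_burns, key=modeled_payload)   # picks the longest burn
print("model predicts:", modeled_payload(best_by_model))
print("reality delivers:", actual_payload(best_by_model))
```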

Causal Goodhart is by default about an irreversible change. In selection systems, it means that our sampling accidentally broke the distribution. For example, we test many rockets, creating enough space debris to make further tests vulnerable to collisions. We wanted the tests to sample from the space, but we accidentally changed the regime while sampling.
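A toy version of the space-debris example (the failure probabilities are invented): each test launch changes the environment that later tests happen in, so the later "samples" no longer come from the distribution we thought we were sampling from.

```python
import random

# Each test launch adds debris; debris from earlier tests raises the collision risk
# for later tests, so sampling itself shifts the regime.

debris = 0

def test_launch():
    global debris
    collision_risk = min(0.9, 0.01 * debris)    # risk grows with accumulated debris
    failed = random.random() < collision_risk
    if failed:
        debris += 5                             # failures scatter more debris
    else:
        debris += 1                             # even successes leave some behind
    return not failed

results = [test_launch() for _ in range(200)]
print(sum(results[:50]), "successes in the first 50 tests")
print(sum(results[-50:]), "successes in the last 50 tests")
```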

In the current discussion, we care about metric-goal divergence because the cost of the divergence is high: typically, once we get there, some irreversible consequence follows, as explained above. This isn't exclusively true of control systems, as the causal Goodhart example shows, but it's clearly more common in them. Once we're actually navigating and controlling the system, we don't have any way to reset to base conditions, and causal changes create regime changes; if these are unexpected, the control system is suddenly in a position of optimizing using an irrelevant model.

And this is a critical fact, because as I'll argue in the next post, mesa-optimizers are control systems of a specific type, and have some new overoptimization failure modes because of that.

Comments

I tentatively think it makes the most sense to apply Goodhart exclusively to selection processes rather than control, except perhaps for the causal case.

All flavors of Goodhart require a system to be clearly optimizing a proxy. Selection-style optimization gets direct feedback of some kind; this feedback can be "the true function" or "the proxy" - although calling it either requires anthropomorphic analysis of what the system is trying to do - perhaps via observation of how/why the system was set up. So we can talk about Goodhart by comparing what the selection process explicitly optimizes to our anthropomorphic analysis of what it "tries to" optimize.

A control system, on the other hand, is all anthropomorphism. We look at it efficiently steering the world into a narrow space of possibility, and we conclude that's what it is "trying" to do. So where is the true-vs-proxy to compare?

It might be that we have two or more plausible ways to ascribe goals to the controller, and that these conflict. For example, maybe the controller happens to have an explicit specification of a utility function somewhere in its machine code, and on the other hand, we know that it was built for a specific purpose -- and the two differ from each other. This is simple value misalignment (arguably outright misspecification, not Goodhart?).

However, I can't think of a reason why the thing would need an explicit utility function inside of it unless it was internally implementing a selection process, like a planning algorithm or something. So that brings us back to applying the Goodhart concept to selection processes, rather than control.

You mention model errors. If the controller is internally using model-based reasoning, it seems very likely that it is doing selection-style planning. So again, the Goodhart concept seems to apply to the selection part.

There are some situations where we don't need to apply any anthropomorphic analysis to infer a goal for a controller, because it is definitely responsive to a specific form of feedback: namely, reinforcement learning. In the case of reinforcement learning, a "Goodhart-ish" failure which can occur is wireheading. Interestingly, I have never been quite comfortable classifying wireheading as one of the four types of Goodhart; perhaps that's because I was applying the selection-vs-control distinction implicitly.

I mentioned that causal Goodhart might be the exception. It seems to me that causal failure applies to selection processes too, but ONLY to selection which is being used to implement a controller. In effect, it's a type of model error for a model-based controller.

All of this is vague and fuzzy and only weakly endorsed.

Thanks for the feedback. I agree that in a control system, any divergence between intent and outcome is an alignment issue, and I agree that this makes overoptimization different in control versus selection. Despite the conceptual confusion, I definitely think the connections are worth noting - not only "wireheading," but the issues with mesa-optimizers. And I definitely think that causal failures are important particularly in this context.

But I strongly endorse how weak and fuzzy this is - which is a large part of why I wanted to try to de-confuse myself. That's the goal of this mini-sequence, and I hope that doing so publicly in this way at least highlights where the confusion is, even if I can't successfully de-confuse myself, much less others. And if there are places where others are materially less confused than me and/or you, I'd love for them to write responses or their own explainers on this.

I think I already want to back off on my assertion that the categories should not be applied to controllers. However, I see the application to controllers as more complex. It's clearer what it means to (successfully) point a selection-style optimization process at a proxy. In a selection setting, you have the proxy (which the system can access) and the true value (which is not accessible). Wireheading only makes sense when the "true" value is partially accessible, and the agent severs that connection.

I definitely appreciate your posts on this; it hadn't occurred to me to ask whether the four types apply equally well to selection and control.

As always, constructive criticism on whether I'm still confused, or whether the points I'm making are clear, is welcome!