Stuart Armstrong has claimed to beat Goodhart with Bayesian uncertainty -- rather than assuming some particular objective function (which you try to make as correct as possible), you represent some uncertainty. A similar claim was made in The Optimizer's Curse and How to Beat It, the essay which introduced a lot of us to ... well, not Goodhart's Law itself (the post doesn't make mention of Goodhart), but, that kind of failure. I myself claimed that Bayes beats regressional Goodhart, in Robust Delegation:

I now think this isn't true -- Bayes' Law doesn't beat Goodhart fully. It doesn't even beat regressional Goodhart fully. (I'll probably edit Robust Delegation to change the claim at some point.)

(Stuart makes some more detailed claims about AI and the nearest-unblocked-strategy problem which aren't exactly claims about Goodhart, at least according to him. **I don't fully understand Stuart's perspective, and don't claim to directly address it here.** I am mostly only addressing the question of the title of my post: does Bayes beat Goodhart?)

# If approximate solutions are concerning, why would mixtures of them be unconcerning?

My first argument is a loose intuition: Goodhartian phenomena suggest that somewhat-correct-but-not-quite-right proxy functions are not safe to optimize (and in some sense, the more optimization pressure is applied, the less safe we expect it to be). Assigning weights to a bunch of somewhat-but-not-quite-right possibilities just gets us another somewhat-but-not-quite-right possibility. Why would we expect this to fundamentally solve the problem?

- Perhaps the Bayesian mixture across hypotheses is *closer to being correct*, and therefore, gives us an approximation which is able to stand up to more optimization pressure before it breaks down. But this is a quantitative distinction, not a qualitative one. *How big* of a difference do we expect that to make? Wouldn't it still break down about as badly when put under tremendous optimization pressure?
- Perhaps the point of the Bayesian mixture is that, by quantifying uncertainty about the various hypotheses, it encourages strategies which hedge their bets -- satisfying a broad range of possible utility functions, by avoiding doing something terrible for one utility function in order to get a few more points for another. But this incentive to hedge bets is fairly weak; the optimization is still encouraged to do something really terrible for one function if it leads to a moderate increase for many other utility functions.
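To make the second point concrete, here is a minimal sketch with made-up numbers. The five equally weighted candidate utility functions, the two options, and the names `mixture_score`, `hedge`, and `sacrifice` are all hypothetical, chosen only to illustrate how weak the hedging incentive can be.

```python
def mixture_score(utilities, weights):
    """Expected utility of one option under the weighted mixture."""
    return sum(w * u for w, u in zip(weights, utilities))

# Five hypothetical candidate utility functions, equally weighted.
weights = [0.2] * 5

# Each option lists the utility it earns under each candidate (scale 0 to 1).
options = {
    "hedge": [0.6, 0.6, 0.6, 0.6, 0.6],      # decent for every candidate
    "sacrifice": [0.0, 0.8, 0.8, 0.8, 0.8],  # terrible for candidate 0
}

scores = {name: mixture_score(u, weights) for name, u in options.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy setup the mixture prefers `sacrifice`: a moderate gain on four candidates outweighs a total loss on the fifth, even though `hedge` is acceptable to all of them.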

My intuition there doesn't address the gears of the situation adequately, though. Let's get into it.

# Overcoming regressional Goodhart requires calibrated learning.

In *Robust Delegation*, I defined regressional Goodhart through the predictable-disappointment idea. Does Bayesian reasoning eliminate predictable disappointment?

Well, it depends on what is meant by "predictable". You could define it as predictable-by-Bayes, in which case it follows that Bayes solves the problem. However, I think it is reasonable to at least add a calibration requirement: there should be no way to systematically correct estimates up or down as a function of the expected value.

Calibration seems like it does, in fact, significantly address regressional Goodhart. You can't have seen a lot of instances of an estimate being too high, and still accept that too-high estimate. It doesn't address extremal Goodhart, because calibrated learning can only guarantee that you eventually calibrate, or converge at some rate, or something like that -- extreme values that you've rarely encountered would remain a concern.

(Stuart's "one-in-three" example in the Defeating Goodhart post, and his discussion of human overconfidence more generally, is somewhat suggestive of calibration.)

Bayesian methods are not always calibrated. Calibrated learning is not always Bayesian. (For example, logical induction has good calibration properties, and so far, hasn't gotten a really satisfying Bayesian treatment.)

This might be confusing if you're used to thinking in Bayesian terms. If you think in terms of the diagram I copied from *Robust Delegation*, above: you have a prior which stipulates the probability of the true utility given an observation; your expectation is the expected value of the true utility for a particular value of the observation; and that expectation is not predictably correctable with respect to your prior. What's the problem?

The problem is that this line of reasoning assumes that your prior is *objectively correct*. This doesn't generally make sense (especially from a Bayesian perspective). So, it is perfectly consistent for you to collect many observations, and see that your expectation has some systematic bias. This may remain true *even as you update on those observations* (because Bayesian learning doesn't guarantee any calibration property in general!).
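A small simulation can illustrate the dependence on the prior being correct. This is only a sketch under assumptions of my own choosing: true values are drawn from a standard normal, observations add unit-variance Gaussian noise, and the function name `mean_disappointment` and its parameters are hypothetical. It measures how much the posterior-mean estimate overshoots the true value among the options that look best.

```python
import random

def mean_disappointment(assumed_prior_var, n=100_000, top_fraction=0.01, seed=0):
    """Average (estimate - true value) among the options that look best.

    True values V ~ N(0, 1); observations U = V + N(0, 1) noise.
    The estimate is the posterior mean of V given U under a
    N(0, assumed_prior_var) prior, which matches reality only when
    assumed_prior_var == 1.
    """
    rng = random.Random(seed)
    shrink = assumed_prior_var / (assumed_prior_var + 1.0)  # noise variance is 1
    pairs = []
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)
        u = v + rng.gauss(0.0, 1.0)
        pairs.append((shrink * u, v))
    pairs.sort(reverse=True)  # highest estimates first
    top = pairs[: int(n * top_fraction)]
    return sum(est - v for est, v in top) / len(top)

gap_correct_prior = mean_disappointment(assumed_prior_var=1.0)
gap_wrong_prior = mean_disappointment(assumed_prior_var=9.0)
print(gap_correct_prior, gap_wrong_prior)
```

With the correct prior, the gap among the top-ranked options should come out near zero; with the mismatched prior, the same Bayesian formula leaves a clearly positive gap, which is exactly the kind of predictable disappointment a calibration check would flag.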

The faulty assumption that your probability distribution is correct is often replaced with the (weaker, but still problematic) assumption that at least one hypothesis within your distribution is objectively correct -- the realizability assumption.

# Bayesian solutions assume realizability.

As discussed in Embedded World Models, the realizability assumption is the assumption that (at least) one of your hypotheses represents the true state of affairs. Bayesian methods often (though not always) require a realizability assumption in order to get strong guarantees. Frequentist methods rarely require such an assumption (whatever else you may say about frequentist methods). Calibration is an example of that -- a Bayesian can get calibration under the assumption of realizability, but, we might want a stronger guarantee of calibration which holds even in absence of realizability.

## "We quantified our uncertainty as best we could!"

One possible Bayes-beats-Goodhart argument is: "Once we quantify our uncertainty with a probability distribution over possible utility functions, the best we can possibly do is to choose whatever maximizes expected value. Anything else is decision-theoretically sub-optimal."

Do you think that the true utility function is really sampled from the given distribution, in some objective sense? And the probability distribution also quantifies all the things which can count as evidence? If so, fine. Maximizing expectation is the objectively best strategy. This eliminates all types of Goodhart by positing that we've already modeled the possibilities sufficiently well: extremal cases are modeled correctly; adversarial effects are already accounted for; etc.

However, this is unrealistic due to embeddedness: the outside world is much more complicated than any probability distribution which we can explicitly use, since we are ourselves a small part of that world.

Alternatively, do you think the probability distribution really codifies your precise subjective uncertainty? Ok, sure, that would also justify the argument.

Realistically, though, an implementation of this isn't going to be representing your precise subjective beliefs (to the extent you even *have* precise subjective beliefs). It has to hope to have a prior which is "good enough".

In what sense might it be "good enough"?

An obvious problem is that a distribution might be overconfident in a wrong conclusion, which will obviously be bad. The fix for this appears to be: make sure that the distribution is "sufficiently broad", expressing a fairly high amount of uncertainty. But, why would this be good?

Well, one might argue: it can only be worse than our true uncertainty to the extent that it ends up assigning too little weight to the correct option. So, if the probability function isn't too small for any of the possibilities which we intuitively assign non-negligible weight, things should be fine.

## "The True Utility Function Has Enough Weight"

First, even assuming the framing of "true utility function" makes sense, it isn't obvious to me that the argument makes sense.

If there's a true utility function which is assigned some weight, and we apply a whole lot of optimization pressure to the overall mixture distribution, then it is perfectly possible that the true utility function gets compromised for the sake of satisfying a large number of the other candidate utility functions. The weight determines a *ratio at which trade-offs can occur*, not a *ratio of the overall resources which we will get* or anything like that.

A first-pass analysis is that the weight on the true utility function has to be more than 1/2 to guarantee any consideration; any weight less than that, and it's possible that the true utility is *as low as it can go* in the optimized solution, because some outcome was sufficiently good for all other potential utility functions that it made sense to "take the hit" with respect to the true one. We can't formally say "this probably won't happen, because the odds that the best-looking option is specifically terrible for the true utility function are low" without assuming something about the distribution of highly optimized solutions.
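One way to see where the 1/2 comes from, as a sketch under assumptions that are mine rather than part of the argument above: write $u^*$ for the true utility function and $w$ for its weight, let the other candidates $u_i$ have weights $w_i$ summing to $1 - w$, normalize every utility function to take values in $[0, 1]$, and suppose some option $x_0$ achieves $u^*(x_0) = 1$. If $\hat{x}$ is the option that maximizes the mixture, then:

$$
\begin{aligned}
w\,u^*(\hat{x}) + (1 - w) \;&\ge\; w\,u^*(\hat{x}) + \sum_i w_i\,u_i(\hat{x}) \\
&\ge\; w\,u^*(x_0) + \sum_i w_i\,u_i(x_0) \;\ge\; w \\
\Longrightarrow \quad u^*(\hat{x}) \;&\ge\; 2 - \frac{1}{w}
\end{aligned}
$$

The first step uses $u_i \le 1$, the second uses the fact that $\hat{x}$ maximizes the mixture, and the third uses $u^*(x_0) = 1$ and $u_i \ge 0$. The resulting bound is positive only when $w > 1/2$. At or below that, it says nothing, and an option scoring 0 on $u^*$ and 1 on every other candidate can win.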

(Such an analysis might be interesting; I don't know if anyone has investigated from that angle. But, it seems somewhat unlikely to do us good, since it doesn't seem like we can make very nice assumptions about what highly-optimized solutions look like.)

In reality, the worst-case analysis is better than this, because many of the more-plausible candidate utility functions should have a lot of "overlap" with the true one; after all, they were given high weight because they *appeared plausible* somehow (they agreed with human intuitions, or predicted human behavior, etc). We could try to formally define "overlap" and see what assumptions we need to guarantee better-than-worst-case outcomes. (This might have some interesting learning-theoretic implications for value learning, even.)

However, this whole framing, where we assume that there's a true utility function and think about its weight, is suspect. Why should we think that there's a "true" utility function which captures our preferences? And, if there is, why should we assume that it has an explicit representation in the hypothesis space?

If we drop this assumption, we get the classical problems associated with non-realizability in Bayesian learning. Beliefs may not converge at all, as evidence accumulates; they could keep oscillating due to inconsistent evidence. Under the interpretation where we still assume a "true" utility function but we don't assume that it is explicitly representable within the hypothesis space, there isn't a clear guarantee we can get (although perhaps the "overlap" analysis can help here). If we don't assume a true utility function at all, then it isn't clear how to even ask questions about how well we do (although I'm not saying there isn't a useful analysis -- I'm just saying that it is unclear to me right now).
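The oscillation point can be illustrated with a toy case outside of value learning. The following is a minimal sketch with hypothetical names and numbers: two coin hypotheses (heads with probability 0.75 or 0.25) are updated on a stream that neither of them describes, namely long alternating runs of heads and tails.

```python
import math

def posterior_trace(observations, p_a=0.75, p_b=0.25, prior_a=0.5):
    """Posterior probability of hypothesis A after each coin flip.

    Hypothesis A says heads has probability p_a; hypothesis B says p_b.
    observations is an iterable of booleans (True = heads).
    """
    log_odds = math.log(prior_a / (1.0 - prior_a))
    trace = []
    for heads in observations:
        if heads:
            log_odds += math.log(p_a / p_b)
        else:
            log_odds += math.log((1.0 - p_a) / (1.0 - p_b))
        trace.append(1.0 / (1.0 + math.exp(-log_odds)))
    return trace

# A stream that neither hypothesis describes: long runs of heads and tails.
stream = [True] * 5 + ([False] * 10 + [True] * 10) * 3
trace = posterior_trace(stream)
print(max(trace), min(trace))
```

On this stream the posterior swings from near-certainty in one hypothesis to near-certainty in the other and back, rather than settling, because the hypothesis space contains nothing that fits the data.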

Stuart does address this question, in the end:

> I've argued that an indescribable hellworld cannot exist. There's a similar question as to whether there exists human uncertainty about U that cannot be included in the AI's model of Δ. By definition, this uncertainty would be something that is currently unknown and unimaginable to us. However, I feel that it's far more likely to exist, than the indescribable hellworld.
>
> Still despite that issue, it seems to me that there are methods of dealing with the Goodhart problem/nearest unblocked strategy problem. And this involves properly accounting for all our uncertainty, directly or indirectly. If we do this well, there no longer remains a Goodhart problem at all.

Perhaps I agree, if "properly accounting for all our uncertainty" includes robustness properties such as calibrated learning, *and* if we restrict our attention to regressional Goodhart, ignoring the other three.

Well... what about the others, then?

# Overcoming adversarial Goodhart seems to require randomization.

The argument here is pretty simple: adversarial Goodhart enters into the domain of game theory, in which mixed strategies tend to be very useful. Quantilization is one such mixed strategy, which seems to usefully address Goodhart to a certain extent. I'm not saying that quantilization is the ultimate solution here. But, it does seem to me like quantilization is significant enough that a solution to Goodhart should say something about the class of problems which quantilization solves.

In particular, a property of quantilization which I find appealing is the way more certainty about the utility function implies that more optimization power can be safely applied to making decisions. This informs my intuition that applying arbitrarily high optimization power does not become safe simply because you've explicitly represented uncertainty about utility functions -- no matter how accurately, short of "perfectly accurately" (which isn't even a meaningful concept), it only seems to justify a limited amount of optimization pressure. This story may be an incorrect one, but if so, I'd like to really understand why it is incorrect.
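For readers who haven't seen it, quantilization can be sketched in a few lines. This is a simplified version under assumptions of my own: the base distribution is taken to be uniform over a finite list of actions, and the names `quantilize`, `proxy_utility`, and `q` are illustrative rather than taken from any particular library.

```python
import math
import random

def quantilize(actions, proxy_utility, q, rng):
    """Pick uniformly at random from the top q fraction of actions.

    actions: finite list, treated as a uniform base distribution.
    proxy_utility: function scoring each action (the imperfect proxy).
    q: fraction in (0, 1]; smaller q means more optimization pressure.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])

rng = random.Random(0)
actions = list(range(100))
# With q = 0.25, every pick comes from the 25 highest-scoring actions.
picks = [quantilize(actions, lambda a: a, 0.25, rng) for _ in range(200)]
print(min(picks), max(picks))
```

The parameter `q` is the knob: `q = 1` just samples from the base distribution, and as `q` shrinks toward zero the procedure approaches plain maximization of the proxy. The more the proxy is trusted, the smaller `q` can reasonably be set.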

Unlike the previous sections, this doesn't necessarily step outside of typical Bayesian thought, since this kind of game-theoretic thinking is more or less within the purview of Bayesianism. However, the simple "Bayes solves Goodhart" story doesn't explicitly address this.

*(I haven't addressed causal Goodhart anywhere in this essay, since it opens up the whole decision-theoretic can of worms, which seems somewhat beside the main point. (I suppose, arguably, game-theoretic concerns could be beside the point as well -- but, they feel more directly relevant to me, since quantilization is fairly directly about solving Goodhart.))*

# In summary:

- If optimizing an arbitrary somewhat-but-not-perfectly-right utility function gives rise to serious Goodhart-related concerns, then why does a mixture distribution over such functions alleviate such concerns? Aren't they just averaging together to yield yet another somewhat-but-not-quite-right function?
- Regressional Goodhart seems better-addressed by calibrated learning than it does by Bayesian learning.
- Bayesian learning tends to require a realizability assumption in order to have good properties (including calibration).
- Even assuming realizability, heavily optimizing a mixture distribution over possible utility functions seems dicey -- it can end up throwing away all the real value if it finds a way to jointly satisfy a lot of the wrong ones. (It is possible that we can find reasonable assumptions under which this doesn't happen, however.)
- Overcoming adversarial Goodhart seems to require mixed strategies, which the simple "Bayesian uncertainty" story doesn't explicitly address.