Key argument: if we use a non-Bayesian conservative approach, such as a minimum over different utility functions, then we had better have a good reason why that would work. But if we have that reason, we can use it to turn the whole thing into a Bayesian mix, which also allows us to trade off that advantage against other possible gains.

I've defended using Bayesian averaging of possible utility functions, even in situations where it seems that Goodhart problems should rear their heads.

Thanks to a more recent discussion with Scott, I can now clarify all these old posts, and look more at the issues of Goodhart's law, Bayesianism, and conservatism.

# Where Goodhart will always remain a problem

The Goodhart problem will always remain if we miss some of the key variables. As Stuart Russell said:

> A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.

If the AI's target is phrased in terms of world GDP and human survival, then no function of these variables will make it care about human happiness and flourishing. So the Goodhart problem will always be a problem if some key variables are missed.
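As a toy illustration of Russell's point, here is a minimal sketch in which a fixed budget is split between three variables but the objective only mentions two of them. The variable names, the budget, and the objective weights are all assumptions made up for the example.

```python
from itertools import product

BUDGET = 10  # hypothetical resource units to allocate


def objective(gdp, survival):
    # The target only mentions two of the three variables.
    return gdp + 2 * survival


def best_allocation():
    # Brute-force search over all integer splits of the budget;
    # whatever is not spent on gdp or survival goes to happiness.
    candidates = [
        (gdp, survival, BUDGET - gdp - survival)
        for gdp, survival in product(range(BUDGET + 1), repeat=2)
        if gdp + survival <= BUDGET
    ]
    return max(candidates, key=lambda a: objective(a[0], a[1]))


gdp, survival, happiness = best_allocation()
print(gdp, survival, happiness)
```

The unmentioned variable ends up at its extreme value (zero) simply because every unit spent on it is a unit not spent on the objective.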

This is the reason that I'm looking at model-splintering, so that the AI can add variables as its power and knowledge develop.

# Bayes versus Goodhart

So, let's say we know all the variables, but not necessarily whether they are good or bad or neutral, or to what extent.

Split them into different sets: $V_1$, the set of variables that are potentially valuable (roughly linearly), $V_2$, the set of potentially necessary variables (e.g. the existence of conscious minds), and $V_3$, the set of variables that might be potentially disastrous if they go negative (e.g. suffering). There can be $V_4$ or $V_5$ or so on with other properties, but we'll ignore that for the moment.

There may be variables that are in more than one category - for example, if happiness/sadness are on the same scale (which is not automatically true), then maybe this would be linearly important in standard situations, but things would go very badly if it got too negative. So if $v$ is this variable, we might want to include $v$ in $V_1$ and $v-\omega$ in $V_3$ (for some critical value $\omega$). Similarly, if we wanted $v$ to stay in the range $[a,b]$, we could add both $v-a$ and $b-v$ to $V_2$ (or to $V_3$).

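Anyway, each set $V_i$ has $N_i$ elements. To encode the fact we're adding extra variables whose values we're uncertain about, we'll assume that $n_i$ of them are "really" in the set, but the AI can't tell which ones. Define $\sigma = (\sigma_1, \sigma_2, \sigma_3)$ as selecting $n_i$ variables from each $V_i$, and

$$U_\sigma = \left(\sum_{v_1 \in \sigma_1} v_1\right) \cdot \left(\prod_{v_2 \in \sigma_2} H(v_2)\right) - \sum_{v_3 \in \sigma_3} e^{-v_3}.$$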

Here $H$ is the Heaviside step function, where $H(x) = 0$ if $x \le 0$ and $H(x) = 1$ otherwise^{[1]}.
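A minimal Python sketch of a single such utility, assuming it takes the form "sum over the selected $V_1$ variables, gated by the product of Heaviside steps on the selected $V_2$ variables, minus the sum of negative exponentials of the selected $V_3$ variables". The function names are illustrative only.

```python
import math


def heaviside(x):
    """H(x) = 0 if x <= 0, and 1 otherwise."""
    return 0.0 if x <= 0 else 1.0


def u_sigma(sigma1, sigma2, sigma3):
    """Utility for one selection sigma = (sigma1, sigma2, sigma3)."""
    linear = sum(sigma1)                               # roughly-linear goods
    gate = math.prod(heaviside(v) for v in sigma2)     # necessary conditions
    penalty = sum(math.exp(-v) for v in sigma3)        # disastrous if negative
    return linear * gate - penalty


print(u_sigma([1.0, 2.0], [0.5], [0.0]))    # gate open
print(u_sigma([1.0, 2.0], [-0.5], [0.0]))   # gate closed: linear part is lost
```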

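Then if we are completely ignorant about which variables are truly in the sets, the Bayesian utility is the uniform average over all selections $\sigma$:

$$U = \frac{\sum_\sigma U_\sigma}{\binom{N_1}{n_1}\binom{N_2}{n_2}\binom{N_3}{n_3}}.$$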

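What can we say about $U$? Well, if we fix a $v_3 \in V_3$, then it will appear in $\binom{N_1}{n_1}\binom{N_2}{n_2}\binom{N_3-1}{n_3-1}$ of the $U_\sigma$ terms (i.e. all choices of the $V_1$ and $V_2$ variables, and all possible choices of the other $n_3-1$ variables among the other $N_3-1$ variables in $V_3$). Thus $v_3$ appears in $U$ as an additional term which is

$$-e^{-v_3}\,\frac{\binom{N_3-1}{n_3-1}}{\binom{N_3}{n_3}} = -\frac{n_3}{N_3}\,e^{-v_3}.$$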

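How about a fixed $v_2 \in V_2$? Well, if $v_2$ goes below $0$, that will kill off the $V_1$ term in $\binom{N_1}{n_1}\binom{N_2-1}{n_2-1}\binom{N_3}{n_3}$ of the $U_\sigma$. So if all other $v_2' \in V_2$ are positive, sending $v_2$ from positive to negative will multiply the expected $V_1$ value by $1 - n_2/N_2$.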

Similarly, for fixed $v_1 \in V_1$, if all $v_2 \in V_2$ are positive, then it will appear in $U$ as $+\frac{n_1}{N_1} v_1$.

So the Bayesian mix is roughly linear in the $V_1$, treats any $v_2 \in V_2$ as losing $n_2/N_2$ of the utility if it goes negative, and is linear in the negative exponential of the $V_3$.
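These three claims can be checked by brute force on a small case. The sketch below assumes the utility for each selection is "gated sum minus exponential penalties" and averages it uniformly over all selections; the particular numbers are arbitrary.

```python
import math
from itertools import combinations, product


def u_sigma(s1, s2, s3):
    gate = 1.0 if all(v > 0 for v in s2) else 0.0  # product of Heaviside steps
    return sum(s1) * gate - sum(math.exp(-v) for v in s3)


def bayes_utility(V1, V2, V3, n1, n2, n3):
    """Uniform average of u_sigma over every possible selection."""
    sigmas = list(product(combinations(V1, n1),
                          combinations(V2, n2),
                          combinations(V3, n3)))
    return sum(u_sigma(*s) for s in sigmas) / len(sigmas)


V1, V3 = [1.0, 2.0, 4.0], [0.0, 1.0]
n1, n2, n3 = 2, 1, 1

linear = n1 / len(V1) * sum(V1)                         # (n1/N1) * sum of v1
penalty = n3 / len(V3) * sum(math.exp(-v) for v in V3)  # (n3/N3) * sum of exp(-v3)

# All four v2 positive: the average should equal linear - penalty.
all_ok = bayes_utility(V1, [1.0, 1.0, 1.0, 1.0], V3, n1, n2, n3)
# One v2 negative: the linear part should be scaled by 1 - n2/N2 = 3/4.
one_bad = bayes_utility(V1, [-1.0, 1.0, 1.0, 1.0], V3, n1, n2, n3)

print(math.isclose(all_ok, linear - penalty))
print(math.isclose(one_bad, 0.75 * linear - penalty))
```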

## Being somewhat sloppy

In some ways we cannot afford to be sloppy: assume that $v$ should be in $V_2$ but isn't; so the true utility is $U' = U \cdot H(v)$, but a $U$-maximiser might sacrifice $v$ to increase $U$; thus (ignoring the $V_3$) maximising $U$ may set $U'$ to $0$.

Conversely, suppose $v$ should be in $V_3$ but isn't, so that the true utility is $U' = U - e^{-v}$. Then a $U$-maximiser might sacrifice $v$ to maximise $U$, at a cost of some multiple of $e^{-v}$, which might be enormous.

So, just as forgetting a variable triggers a standard Goodhart effect (see the section above), forgetting to assign it to the right category also triggers a Goodhart effect.
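A toy sketch of the first failure mode, where a variable $v$ that should have gated the utility was left out. The two outcomes and their numbers are hypothetical.

```python
def heaviside(x):
    return 0.0 if x <= 0 else 1.0


# Hypothetical outcomes: (utility as specified, value of the omitted variable v)
outcomes = {
    "sacrifice_v": (10.0, -1.0),
    "protect_v": (9.0, 1.0),
}


def specified_utility(name):
    # What the maximiser actually optimises: v is not mentioned.
    return outcomes[name][0]


def true_utility(name):
    # What we wanted: v should have been a necessary variable.
    u, v = outcomes[name]
    return u * heaviside(v)


chosen = max(outcomes, key=specified_utility)
print(chosen, true_utility(chosen))
```

The maximiser gives up $v$ for one extra unit of specified utility, and the true utility drops to zero.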

On the other hand, if we get the numbers $n_1$, $n_2$, or $n_3$ wrong, then we scale various pieces of the utility by a constant factor. This tends to have a small effect, as compared with the exponential penalty of the $V_3$ or the "kill off $V_1$" effects of the $V_2$.

In the general case, we wouldn't have $V_i$ with $n_i$ of them that matter, but we'd have a probability distribution $p_i$ over the possible variables in $V_i$ that matter (this would be a probability distribution over $2^{V_i}$, the set of subsets of $V_i$). In that case, we can be a bit sloppy with $p_i$ as long as we don't miss out variables we should include.
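As a sketch of what this generalisation might look like, assume (this independence is an extra assumption, not something argued for here) that the three subsets are drawn independently, and write $p_i(\sigma_i)$ for the probability that $\sigma_i$ is the true subset of $V_i$. Then the Bayesian mix becomes:

$$
\begin{aligned}
U &= \sum_{\sigma_1,\sigma_2,\sigma_3} p_1(\sigma_1)\,p_2(\sigma_2)\,p_3(\sigma_3)\,U_\sigma \\
  &= \Big(\sum_{v_1 \in V_1} P_1(v_1 \in \sigma_1)\, v_1\Big)\cdot P_2\big(\text{every } v_2 \in \sigma_2 \text{ is positive}\big) \;-\; \sum_{v_3 \in V_3} P_3(v_3 \in \sigma_3)\, e^{-v_3}.
\end{aligned}
$$

Each variable is weighted by its marginal probability of mattering. In the uniform case that marginal is $n_i/N_i$, so getting $p_i$ slightly wrong only perturbs these weights, whereas giving a variable probability zero removes it entirely.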

So, along certain axes, we can be sloppy and imprecise without the system breaking. This is a change from the standard Goodhart/AI alignment situations, where sloppiness almost always guarantees disaster.

# Conservatism

If we were being more conservative, instead of maximising $U$, a weighted sum of all $U_\sigma$, we might want to maximise the minimum value of these, and hence the utility function

$$\mu = \min_\sigma U_\sigma.$$

Let's compare what maximising $\mu$ and maximising $U$ might get us.

For $V_1$, maximising $\mu$ involves maximising the average of the minimal $n_1$ values of the $V_1$, while $U$ maximises the whole average. Here, $U$ is superior to $\mu$, since $V_1$ variables are explicitly designed to be traded off linearly against each other - if we had reasons to think they behaved differently when low-valued, they should instead/also be in $V_2$ or $V_3$.

If we have a single $v_3 \in V_3$ that gets very negative, $U$ will get a penalty of $-\frac{n_3}{N_3} e^{-v_3}$ while $\mu$ gets a penalty of $-e^{-v_3}$. Up to that constant factor, they behave similarly.

If any $v_2 \in V_2$ is negative, then $\mu$ ignores all of $V_1$, while $U$ scales the $V_1$ part of the utility function by $1 - n_2/N_2$. Thus $U$ can take the risk of letting some values in $V_2$ go negative (in exchange for enough $V_1$ bonus), while $\mu$ won't take the risk at all. We could see $\mu$ as behaving like $U$ if $n_2 = N_2$.
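A small brute-force comparison of the two approaches, assuming the conservative utility is a hard minimum over all selections and the Bayesian one is their uniform average. With one negative necessary variable and non-negative linear variables, the minimum does not respond to the linear variables at all, while the average does. The numbers are arbitrary.

```python
import math
from itertools import combinations, product


def u_sigma(s1, s2, s3):
    gate = 1.0 if all(v > 0 for v in s2) else 0.0
    return sum(s1) * gate - sum(math.exp(-v) for v in s3)


def all_values(V1, V2, V3, n):
    return [u_sigma(*s) for s in product(combinations(V1, n[0]),
                                         combinations(V2, n[1]),
                                         combinations(V3, n[2]))]


def bayes(V1, V2, V3, n):
    vals = all_values(V1, V2, V3, n)
    return sum(vals) / len(vals)   # uniform average over selections


def conservative(V1, V2, V3, n):
    return min(all_values(V1, V2, V3, n))   # hard minimum over selections


V2 = [-1.0, 1.0, 1.0, 1.0]   # one necessary variable is negative
V3 = [0.0, 1.0]
n = (2, 1, 1)
low, high = [1.0, 2.0, 4.0], [100.0, 200.0, 400.0]

print(conservative(low, V2, V3, n), conservative(high, V2, V3, n))
print(bayes(low, V2, V3, n), bayes(high, V2, V3, n))
```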

Which is correct there? Most likely, $U$. Imagine if $n_2 = 1$ while $N_2 = 10^{24}$. Then $\mu$ is saying that "there are a trillion trillion of these variables in $V_2$. It's bad for *one* of them to be negative. However, you should behave as if it's bad for *any* of them to be negative, and not take that risk."

That's an extreme example; but the point is that $U$ incorporates our knowledge or estimate of tradeoffs, but $\mu$ does not. Since it does not know the tradeoffs, we can set up situations where $\mu$-maximising is not just sub-optimal, but quite dire. For example, if one of the $v_2 \in V_2$ is already negative, then the $\mu$-maximiser will accept reducing $\sum_{v_1 \in V_1} v_1$ to just above zero (no matter how high it was initially) in exchange for an infinitesimal chance of increasing $v_2$ above $0$. So the amount of $U$-value that the $\mu$-maximiser is willing to throw away is potentially unbounded.
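A hedged numerical sketch of that last scenario, assuming both maximisers compare options by expected value. The "keep" and "gamble" worlds, the probability `P`, and all magnitudes are hypothetical.

```python
import math
from itertools import combinations, product


def u_sigma(s1, s2, s3):
    gate = 1.0 if all(v > 0 for v in s2) else 0.0
    return sum(s1) * gate - sum(math.exp(-v) for v in s3)


def utilities(V1, V2, V3, n=(1, 1, 1)):
    """Return (uniform average, hard minimum) of u_sigma over all selections."""
    vals = [u_sigma(*s) for s in product(combinations(V1, n[0]),
                                         combinations(V2, n[1]),
                                         combinations(V3, n[2]))]
    return sum(vals) / len(vals), min(vals)


def expected(lottery):
    """Expected (average, minimum) over a list of (probability, world) pairs."""
    e_avg = sum(p * utilities(*world)[0] for p, world in lottery)
    e_min = sum(p * utilities(*world)[1] for p, world in lottery)
    return e_avg, e_min


P = 1e-9      # tiny chance that the gamble repairs the negative variable
V3 = [5.0]

# Keep the large linear values; one necessary variable stays negative.
keep = [(1.0, ([1e6, 1e6], [-1.0, 1.0], V3))]
# Burn almost all linear value for a tiny chance of fixing that variable.
gamble = [(P, ([1e-3, 1e-3], [1.0, 1.0], V3)),
          (1 - P, ([1e-3, 1e-3], [-1.0, 1.0], V3))]

keep_avg, keep_min = expected(keep)
gamble_avg, gamble_min = expected(gamble)

print("minimum prefers gamble:", gamble_min > keep_min)
print("average prefers keep:  ", keep_avg > gamble_avg)
```

Raising the initial linear values in `keep` makes the loss under the average arbitrarily large without changing the minimum's preference.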

# Bayesianism versus conservatism

## Bayes beats conservatism

Both $\mu$-maximising and $U$-maximising are forms of conservatism; neither of them is just saying "maximise the $v_i$ linearly". When designing our conservative utility, we'd like to incorporate our knowledge about tradeoffs. The $\mu$-style approach does this very badly in many circumstances, and so can lose a lot of value.

Why might we nevertheless use a $\mu$-style approach rather than a $U$-style one? Perhaps because we feel that there is some circumstance where $U$ could also lose a lot of value, and that we haven't fully accounted for. After all, there may be situations where we don't know the tradeoffs; maybe there are some tradeoffs where $U$-maximising is terrible, and we just haven't thought of them yet.

But why would $\mu$-maximising avoid these problems? Maybe there are theorems about how badly $\mu$-maximising can mess up; since it is based around a minimum, it is plausible that we can bound the downside risk in some circumstances. But if we knew these theorems, we could use them to design a better $U$.

There might be a practical argument for $\mu$ or $U$. Maybe humans are too clever when we try to account for all our genuine preferences, so, in practice, $\mu$-maximising is safer than $U$-maximising. We'd like an AI to take this into account, and automatically create a good $U$, but maybe that's also too clever - maybe we can't program an AI to successfully take into account the biases we have when estimating our own biases.

Still, seeing the definite failures of $\mu$ and the ability to apply most of our $\mu$-insights to $U$-design, there's a strong argument that $U$-maximising would likely be safer than $\mu$-maximising.

## Bayes *is* conservatism

I've been a bit unfair there. I constructed $U$ with some care and effort; I could have made it more realistic, but it was a reasonably complex and involved effort.

When constructing $\mu$, on the other hand, I simply took the previously defined $U_\sigma$ and took a hard minimum across them. If I wanted to construct a better $\mu$, I could have put in more effort to define functions over which to take a minimum, taking into account all the likely tradeoffs...

We can see the convergence here: this sounds a lot like building $U$. My argument here, not fully proven, is that if we are constructing any objective function that incorporates all our knowledge, we can do as well or better with a Bayesian mix of utility functions.

If we have an argument for why a non-Bayesian approach (such as a minimum or a smooth minimum) might work better, then that very argument becomes a tool to improve our Bayesian mix. Saying "if we use the Bayesian mix, we have problem $P$" becomes "how bad is $P$, and what would we pay to avoid it?", which modifies and improves the Bayesian mix.

[1] If we wanted a more gradual transition, we could use $S_{0,1}$, the smoothstep/clamping function defined by $S_{0,1}(x) = 0$ if $x < 0$, $S_{0,1}(x) = 1$ if $x > 1$, and $S_{0,1}(x) = x$ otherwise.
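For concreteness, a minimal sketch contrasting the hard step with this clamping alternative; the function names are illustrative.

```python
def heaviside(x):
    """Hard step: 0 if x <= 0, and 1 otherwise."""
    return 0.0 if x <= 0 else 1.0


def clamp01(x):
    """Gradual alternative: 0 below 0, 1 above 1, and x in between."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return float(x)


for x in (-3, 0.25, 7):
    print(x, heaviside(x), clamp01(x))
```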

These might be some typos:

From the following, the second V3 should be V2:

as in

In the following, the second term ($\prod_{v_1 \in \sigma_3} H(v_2)$) should instead be multiplied over $v_2 \in \sigma_2$.

I believe the following should be "from positive to negative" rather than "from negative to positive".

And here, it seems "(ignoring the V3)" might be "(ignoring the v)".

Thanks to Rupert McCallum for help in identifying some of these typos.