Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Bayesianism versus conservatism versus Goodhart


It seems like there might be a problem with this argument if the true variables are not just unknown, but adversarially chosen. For example, suppose the true variables are the actual locations of a bunch of landmines, drawn from a full set of possible landmine positions. We are trying to get a vehicle from A to B, and all possible paths go over some of the possible positions. We may know that the opponent placing the landmines only has a limited number of landmines to place. Furthermore, suppose each landmine only goes off with some probability even if the vehicle drives over it. If we can mechanistically predict where the opponent placed the landmines, or even mechanistically derive a probability distribution over the landmine placements, this is no problem: we can just use that to minimize the expected probability of driving over a landmine that goes off.

However, suppose we can't predict the opponent that way, but we do know the opponent is trying to maximize the probability that the vehicle drives over a landmine that isn't a dud. It seems like we need to use game theory here, not just probability theory, to figure out what mixed strategy the opponent would be using to maximize the probability that we drive over a landmine, and then use that game-theoretic strategy to choose a mixed strategy for which path to take. It seems like the game theory here involves a step where we look for the worst (according to our utility function) probability distribution over where the landmines are placed, because this is how the opponent will have actually chosen where to put the landmines. Doesn't this look a lot like using μ rather than U as our utility function?

These might be some typos:

From the following, the second V3 should be V2:

What can we say about U? Well, if we fix a v3∈V3, then it will appear in $\binom{N_1}{n_1}\binom{N_2}{n_2}\binom{N_3-1}{n_3-1}$ of the Uσ terms (i.e. all choices of the V1 and V3 variables, and all possible choices of the other n3−1 variables in the other N3−1 variables in V3).

as in

(i.e. all choices of the V1 and V2 variables, and all possible choices of the other n3−1 variables in the other N3−1 variables in V3).

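In the following, the second term should instead be multiplied over v2∈σ2.

Define σ=(σ1,σ2,σ3) as selecting ni variables from Vi, and

$$U_\sigma = \left(\sum_{v_1 \in \sigma_1} v_1\right) \cdot \left(\prod_{v_1 \in \sigma_3} H(v_2)\right) + \sum_{v_3 \in \sigma_3} -e^{-v_3}.$$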

I believe the following should be "from positive to negative" rather than "from negative to positive".

How about a fixed v2∈V2? Well, if v2 goes below 0, that will kill off $\binom{N_1}{n_1}\binom{N_2-1}{n_2-1}\binom{N_3}{n_3}$ of the Uσ. So if all other v′2∈V2 are positive, sending v2 from negative to positive will multiply the expected value by 1−n2/N2.

And here, it seems "(ignoring the V3)" might be "(ignoring the )".

In some ways we cannot afford to be sloppy: assume that v should be in V2 but isn't; so the true utility is $U = U_{V_1, V_2 \cup \{v\}, V_3}$, but a $U' = U_{V_1, V_2, V_3}$-maximiser might sacrifice v to increase U′; thus (ignoring the V3) maximising U′ may set U to 0.

Thanks to Rupert McCallum for help in identifying some of these typos.

Key argument: if we use a non-Bayesian conservative approach, such as a minimum over different utility functions, then we better have a good reason as to why that would work. But if we have that reason, we can use it to make the whole thing into a Bayesian mix, which can also allow us to trade off that advantage against other possible gains.

I've defended using Bayesian averaging of possible utility functions, even in situations where it seems that Goodhart problems should rear their heads.

Thanks to a more recent discussion with Scott, I can now clarify all these old posts, and look more at the issues of Goodhart's law, Bayesianism, and conservatism.

## Where Goodhart will always remain a problem

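The Goodhart problem will always remain if we miss some of the key variables. As Stuart Russell has pointed out, a system optimising an objective that depends on only a subset of the variables will often set the remaining, unconstrained variables to extreme values.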

If the AI's target is phrased in terms of world GDP and human survival, then no function of these variables will make it care about human happiness and flourishing. So the Goodhart problem will always be a problem if some key variables are missed.

This is the reason that I'm looking at model-splintering, so that the AI can add variables as its power and knowledge develop.

## Bayes versus Goodhart

So, let's say we know all the variables, but not necessarily whether they are good or bad or neutral, or to what extent.

Split them into different sets: V1, the set of variables that are potentially valuable (roughly linearly), V2, the set of potentially necessary variables (eg the existence of conscious minds), and V3, the set of variables that might be potentially disastrous if they go negative (eg suffering). There can be V4 or V5 or so on with other properties, but we'll ignore that for the moment.

There may be variables that are in more than one category - for example, if happiness/sadness are on the same scale (which is not automatically true), then maybe this would be linearly important in standard situations, but things would go very badly if it got too negative. So if v is this variable, we might want to include v in V1 and v−ω in V3 (for some critical value ω). Similarly, if we wanted v to stay in the range [a,b], we could add both v−a and b−v to V3 (or to V2).

Anyway, each set Vi has Ni elements. To encode the fact we're adding extra variables whose values we're uncertain about, we'll assume that ni of them are "really" in the set, but the AI can't tell which ones. Define σ=(σ1,σ2,σ3) as selecting ni variables from Vi, and

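$$U_\sigma = \left(\sum_{v_1 \in \sigma_1} v_1\right) \cdot \left(\prod_{v_2 \in \sigma_2} H(v_2)\right) + \sum_{v_3 \in \sigma_3} -e^{-v_3}.$$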

Here H is the Heaviside step function, where H(x)=0 if x≤0 and H(x)=1 otherwise^{[1]}. Then if we are completely ignorant about which variables are truly in the sets, the Bayesian utility is:

$$U = U_{V_1,V_2,V_3} = \frac{\sum_\sigma U_\sigma}{\binom{N_1}{n_1}\binom{N_2}{n_2}\binom{N_3}{n_3}}.$$
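As a minimal sketch of these two definitions, the following brute-force Python enumerates every selection σ and averages the Uσ. The function names and the numerical values are hypothetical, chosen only for illustration, and the enumeration is only feasible for very small sets.

```python
from itertools import combinations, product
from math import comb, exp

def heaviside(x):
    """H(x) = 0 if x <= 0 and 1 otherwise, as defined in the text."""
    return 0.0 if x <= 0 else 1.0

def u_sigma(s1, s2, s3):
    """U_sigma for one selection sigma = (s1, s2, s3) of variable values."""
    gate = 1.0
    for v2 in s2:
        gate *= heaviside(v2)          # any non-positive v2 zeroes the V1 part
    return sum(s1) * gate + sum(-exp(-v3) for v3 in s3)

def bayesian_u(V1, V2, V3, n1, n2, n3):
    """Uniform average of U_sigma over all ways of picking n_i items from V_i."""
    sigmas = product(combinations(V1, n1),
                     combinations(V2, n2),
                     combinations(V3, n3))
    total = sum(u_sigma(s1, s2, s3) for s1, s2, s3 in sigmas)
    n_sigmas = comb(len(V1), n1) * comb(len(V2), n2) * comb(len(V3), n3)
    return total / n_sigmas

# Hypothetical variable values (not from the post).
V1 = [3.0, 1.0, 2.0]
V2 = [0.5, 2.0]
V3 = [1.0, 0.0]
u = bayesian_u(V1, V2, V3, n1=2, n2=1, n3=1)
print(u)
```

The uniform average here encodes the "completely ignorant" assumption; a non-uniform prior over σ would replace the division by the number of selections with a weighted sum.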

What can we say about U? Well, if we fix a v3∈V3, then it will appear in $\binom{N_1}{n_1}\binom{N_2}{n_2}\binom{N_3-1}{n_3-1}$ of the Uσ terms (i.e. all choices of the V1 and V2 variables, and all possible choices of the other n3−1 variables from the other N3−1 variables in V3). Thus v3 appears in U as an additional term which is

$$-e^{-v_3}\,\frac{\binom{N_3-1}{n_3-1}}{\binom{N_3}{n_3}} = -e^{-v_3}\,\frac{n_3}{N_3}.$$
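The simplification of the binomial ratio is a standard identity; spelled out with factorials, it reads:

$$\frac{\binom{N_3-1}{n_3-1}}{\binom{N_3}{n_3}} = \frac{(N_3-1)!}{(n_3-1)!\,(N_3-n_3)!} \cdot \frac{n_3!\,(N_3-n_3)!}{N_3!} = \frac{n_3}{N_3}.$$

In words: under a uniform choice of n3 variables out of N3, any fixed variable is selected with probability n3/N3.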

How about a fixed v2∈V2? Well, if v2 goes below 0, that will kill off the V1 part of $\binom{N_1}{n_1}\binom{N_2-1}{n_2-1}\binom{N_3}{n_3}$ of the Uσ. So if all other v′2∈V2 are positive, sending v2 from positive to negative will multiply the expected value of the V1 part by 1−n2/N2.

Similarly, for fixed v1∈V1, if all v2∈V2 are positive, then it will appear in U as +v1(n1/N1).

So the Bayesian mix is roughly linear in the V1, treats any v∈V2 as losing n2/N2 of the utility if it goes negative, and is linear in the negative exponential of the V3.
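These per-variable coefficients can be checked numerically. The sketch below compares a brute-force average against the closed form implied by the text, in the case where at most one V2 variable is non-positive. The helper names and test values are hypothetical.

```python
from itertools import combinations, product
from math import exp

def bayes_u(V1, V2, V3, n1, n2, n3):
    """Brute-force average of U_sigma over all selections sigma."""
    values = []
    for s1, s2, s3 in product(combinations(V1, n1),
                              combinations(V2, n2),
                              combinations(V3, n3)):
        gate = 1.0 if all(v > 0 for v in s2) else 0.0   # product of H(v2)
        values.append(sum(s1) * gate + sum(-exp(-v) for v in s3))
    return sum(values) / len(values)

def closed_form(V1, V2, V3, n1, n2, n3):
    """Coefficients from the text; only valid if at most one v2 is non-positive."""
    n_bad = sum(1 for v in V2 if v <= 0)
    assert n_bad <= 1, "closed form below only covers zero or one bad v2"
    scale = 1.0 if n_bad == 0 else 1.0 - n2 / len(V2)
    linear = scale * (n1 / len(V1)) * sum(V1)
    penalty = (n3 / len(V3)) * sum(-exp(-v) for v in V3)
    return linear + penalty

# Hypothetical values: one case with all v2 positive, one with a single bad v2.
V1 = [3.0, 1.0, 2.0, 5.0]
V3 = [1.0, 0.0, -1.0]
good = (V1, [0.5, 2.0, 1.0], V3, 2, 1, 2)
bad = (V1, [-0.5, 2.0, 1.0], V3, 2, 1, 2)

for case in (good, bad):
    print(bayes_u(*case), closed_form(*case))
```

With several V2 variables non-positive at once, the scaling factor is no longer 1−n2/N2, which is why the sketch restricts itself to the single-variable case discussed above.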

## Being somewhat sloppy

In some ways we cannot afford to be sloppy: assume that v should be in V2 but isn't; so the true utility is $U = U_{V_1, V_2 \cup \{v\}, V_3}$, but a $U' = U_{V_1, V_2, V_3}$-maximiser might sacrifice v to increase U′; thus (ignoring the V3) maximising U′ may set U to 0.

Conversely, suppose v should be in V3 but isn't, so that the true utility is $U = U_{V_1, V_2, V_3 \cup \{v\}}$. Then a U′-maximiser might sacrifice v to maximise U′, at a cost to U of some multiple of $-e^{-v}$, which might be enormous.

So, just as forgetting a variable triggers a standard Goodhart effect (see the section above), forgetting to assign it to the right category also triggers a Goodhart effect.

On the other hand, if we get the numbers n1, n2, or n3 wrong, then we scale various pieces of the utility by a constant factor. This tends to have a small effect, as compared with the exponential penalty of the V3 or the "kill off V1" effects of the V2.
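To make the comparison concrete, here is a small sketch using the closed-form coefficients derived earlier (n1/N1 on the V1 sum, n3/N3 on the V3 penalties), assuming every V2 variable is positive. The function name and all values are hypothetical.

```python
from math import exp

def bayes_mix(V1, V3, n1, n3):
    """Closed-form pieces of the Bayesian mix, assuming all V2 variables are positive."""
    linear = (n1 / len(V1)) * sum(V1)
    penalty = (n3 / len(V3)) * sum(-exp(-v) for v in V3)
    return linear, penalty

# Hypothetical values (not from the post).
V1 = [4.0, 2.0, 6.0, 8.0]
V3_safe = [1.0, 2.0, 0.5, 1.5]
V3_risky = [1.0, 2.0, 0.5, -10.0]   # one V3 variable has gone very negative

lin_true, pen_true = bayes_mix(V1, V3_safe, n1=2, n3=2)
lin_wrong, pen_wrong = bayes_mix(V1, V3_safe, n1=1, n3=1)   # both counts off by a factor of 2
_, pen_risky = bayes_mix(V1, V3_risky, n1=2, n3=2)

print("rescaling from wrong counts:", lin_wrong / lin_true, pen_wrong / pen_true)
print("penalty with one very negative V3 variable:", pen_risky)
```

Misjudging the counts only rescales each piece by a constant, whereas a single badly negative V3 variable changes the penalty by orders of magnitude.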

In the general case, we wouldn't have Vi with ni of them that matter, but we'd have a probability distribution pi over the possible variables in Vi that matter (this would be a probability distribution over $2^{V_i}$, the set of subsets of Vi). In that case, we can be a bit sloppy with pi as long as we don't miss out variables we should include.

So, along certain axes, we can be sloppy and imprecise without the system breaking. This is a change from the standard Goodhart/AI alignment situations, where sloppiness almost always guarantees disaster.

## Conservatism

If we were being more conservative, instead of maximising $U = U_{V_1,V_2,V_3}$, a weighted sum of all the Uσ, we might want to maximise the minimum value of these, and hence the utility function

$$\mu = \mu_{V_1,V_2,V_3} = \min_\sigma (U_\sigma).$$

Let's compare what maximising μ and maximising U might get us.

For V1, maximising μ involves maximising the average of the minimal n1 values of the V1, while U maximises the whole average. Here, U is superior to the μ, since V1 variables are explicitly designed to be traded off linearly against each other - if we had reasons to think they behaved differently when low-valued, they should instead/also be in V2 or V3.

If we have a single v3∈V3 that gets very negative, U will get a penalty of $-e^{-v_3}(n_3/N_3)$ while μ gets a penalty of $-e^{-v_3}$. Up to that constant factor, they behave similarly.

If any v2∈V2 is negative, then μ ignores all of V1, while U scales the V1 utility function by 1−(n2/N2). Thus U can take the risk of letting some values in V2 go negative (in exchange for enough V1 bonus), while μ won't take the risk at all. We could see μ as behaving like U if n2=N2.

Which is correct there? Most likely, U. Imagine if n2=1 while N2=10^24. Then μ is saying that "there are a trillion trillion of these variables in V2. It's bad for *one* of them to be negative. However, you should behave as if it's bad for *any* of them to be negative, and not take that risk."

That's an extreme example; but the point is that U incorporates our knowledge or estimate of tradeoffs, but μ does not. Since it does not know the tradeoffs, we can set up situations where μ-maximising is not just sub-optimal, but quite dire. For example, if one of the v2∈V2 is already negative, then the μ-maximiser will agree to reduce $\sum_{v_1 \in V_1} v_1$ to just above zero (no matter how high it was initially) in exchange for an infinitesimal chance of increasing v2 above 0. So the amount of U-value that the μ-maximiser is willing to throw away is potentially unbounded.
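The difference can be seen in a toy computation. The sketch below, with hypothetical names and values, evaluates both μ (the minimum over all Uσ) and U (their uniform average) when one V2 variable is already negative, for a low and a high setting of the V1 variables.

```python
from itertools import combinations, product
from math import exp

def u_sigmas(V1, V2, V3, n1, n2, n3):
    """List U_sigma for every selection sigma, following the definition in the text."""
    values = []
    for s1, s2, s3 in product(combinations(V1, n1),
                              combinations(V2, n2),
                              combinations(V3, n3)):
        gate = 1.0 if all(v > 0 for v in s2) else 0.0   # product of H(v2)
        values.append(sum(s1) * gate + sum(-exp(-v) for v in s3))
    return values

def bayes_u(V1, V2, V3, n1, n2, n3):
    values = u_sigmas(V1, V2, V3, n1, n2, n3)
    return sum(values) / len(values)

def mu(V1, V2, V3, n1, n2, n3):
    return min(u_sigmas(V1, V2, V3, n1, n2, n3))

# Hypothetical setup: one of five V2 variables is negative, and n2 = 1.
V2 = [-1.0, 1.0, 1.0, 1.0, 1.0]
V3 = [2.0]
low_V1 = [1.0, 1.0, 1.0]
high_V1 = [100.0, 100.0, 100.0]

mu_low, mu_high = mu(low_V1, V2, V3, 2, 1, 1), mu(high_V1, V2, V3, 2, 1, 1)
u_low, u_high = bayes_u(low_V1, V2, V3, 2, 1, 1), bayes_u(high_V1, V2, V3, 2, 1, 1)

print("mu:", mu_low, mu_high)   # mu does not register the V1 improvement
print("U: ", u_low, u_high)     # U registers it, scaled by 1 - n2/N2
```

In this setup μ is indifferent between the low and high V1 worlds, so a μ-maximiser has no reason to protect V1 value, while U still rewards it at 4/5 of its face value.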

## Bayesianism versus conservatism

## Bayes beats conservatism

Both μ-maximising and U-maximising are forms of conservatism; neither of them is just saying "maximise the vi linearly". When designing our conservative utility, we'd like to incorporate our knowledge about tradeoffs. The μ-style approach does this very badly in many circumstances, and so can lose a lot of value.

Why might we nevertheless use a μ-style approach rather than a U-style one? We might do so if we feel that there is some circumstance, one we haven't fully accounted for, where U could also lose a lot of value. After all, there may be situations where we don't know the tradeoffs; maybe there are some tradeoffs where U-maximising is terrible, and we just haven't thought of them yet.

But why would μ-maximising avoid these problems? Maybe there are theorems about how badly μ-maximising can mess up; since it is based around a minimum, it is plausible that we can bound the downside risk in some circumstances. But if we knew these theorems, we could use them to design a better U.

There might be a practical argument for μ or U. Maybe humans are too clever when we try to account for all our genuine preferences, so, in practice, μ-maximising is safer than U-maximising. We'd like an AI to take this into account, and automatically create a good U, but maybe that's also too clever - maybe we can't program an AI to successfully take into account the biases we have when estimating our own biases.

Still, seeing the definite failures of μ and the ability to apply most of our μ-insights to U-design, there's a strong argument that U-maximising would likely be safer than μ-maximising.

## Bayes *is* conservatism

I've been a bit unfair there. I constructed U with some care and effort; I could have made it more realistic, but it was a reasonably complex and involved effort.

When constructing μ, on the other hand, I simply took the previously defined Uσ and took a hard minimum across them. If I wanted to construct a better μ, I could have put in more effort to define functions over which to take a minimum, taking into account all the likely tradeoffs...

We can see the convergence here: this sounds a lot like building U. My argument here, not fully proven, is that if we are constructing any objective function that incorporates all our knowledge, we can do as well or better with a Bayesian mix of utility functions.

If we have an argument for why a non-Bayesian approach (such as a minimum or a smooth minimum) might work better, then that very argument becomes a tool to improve our Bayesian mix. Saying "if we use the Bayesian mix, we have problem P" becomes "how bad is P, and what would we pay to avoid it?", which modifies and improves the Bayesian mix.

^{[1]} If we wanted a more gradual transition, we could use S0, the smoothstep/clamping function defined by S0(x)=0 if x<0, S0(x)=1 if x>1, and S0(x)=x otherwise.