Bayesianism versus conservatism versus Goodhart

Stuart_Armstrong

Key argument: if we use a non-Bayesian conservative approach, such as a minimum over different utility functions, then we had better have a good reason as to why that would work. But if we have that reason, we can use it to make the whole thing into a Bayesian mix, which can also allow us to trade off that advantage against other possible gains.

I've defended using Bayesian averaging of possible utility functions, even in situations where it seems that Goodhart problems should rear their heads.

Thanks to a more recent discussion with Scott, I can now clarify all these old posts, and look more at the issues of Goodhart's law, Bayesianism, and conservatism.

Where Goodhart will always remain a problem

The Goodhart problem will always remain if we miss some of the key variables. As Stuart Russell said:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.

If the AI's target is phrased in terms of world GDP and human survival, then no function of these variables will make it care about human happiness and flourishing. So the Goodhart problem will always be a problem if some key variables are missed.

This is the reason that I'm looking at model-splintering, so that the AI can add variables as its power and knowledge develops.

Bayes versus Goodhart

So, let's say we know all the variables, but not necessarily whether they are good or bad or neutral, or to what extent.

Split them into different sets: $V_{1}$, the set of variables that are potentially valuable (roughly linearly), $V_{2}$, the set of potentially necessary variables (eg the existence of conscious minds), and $V_{3}$, the set of variables that might be potentially disastrous if they go negative (eg suffering). There can be $V_{4}$ or $V_{5}$ or so on with other properties, but we'll ignore that for the moment.

There may be variables that are in more than one category - for example, if happiness/sadness are on the same scale (which is not automatically true), then maybe this would be linearly important in standard situations, but things would go very badly if it got too negative. So if $v$ is this variable, we might want to include $v$ in $V_{1}$ and $v - ω$ in $V_{3}$ (for some critical value $ω$). Similarly, if we wanted $v$ to stay in the range $[a, b]$, we could add both $v - a$ and $b - v$ to $V_{3}$ (or to $V_{2}$).

Anyway, each set $V_{i}$ has $N_{i}$ elements. To encode the fact we're adding extra variables whose values we're uncertain about, we'll assume that $n_{i}$ of them are "really" in the set, but the AI can't tell which ones. Define $σ = (σ_{1}, σ_{2}, σ_{3})$ as selecting $n_{i}$ variables from $V_{i}$ , and

$U_{σ} = \left( \sum_{v_{1} \in σ_{1}} v_{1} \right) \cdot \left( \prod_{v_{2} \in σ_{2}} H (v_{2}) \right) + \sum_{v_{3} \in σ_{3}} - e^{- v_{3}} .$

Here $H$ is the Heaviside step function, where $H (x) = 0$ if $x \leq 0$ and $H (x) = 1$ otherwise^[1].
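As a minimal sketch of this definition (Python; the function names and the numeric values are illustrative assumptions, not part of the original post), one selection $σ$ can be evaluated as a linear sum over the chosen $V_{1}$ variables, gated by $H$ over the chosen $V_{2}$ variables, plus an exponential penalty for each chosen $V_{3}$ variable:

```python
import math

def heaviside(x: float) -> float:
    """H(x) = 0 if x <= 0 and 1 otherwise, as defined in the text."""
    return 0.0 if x <= 0 else 1.0

def u_sigma(s1, s2, s3) -> float:
    """Utility of one selection sigma = (s1, s2, s3) of variable values."""
    linear = sum(s1)                                   # sum over the chosen V1 values
    gate = math.prod(heaviside(v) for v in s2)         # 0 if any chosen V2 value is <= 0
    penalty = sum(-math.exp(-v) for v in s3)           # -e^{-v} for each chosen V3 value
    return linear * gate + penalty

# Hypothetical values, chosen only for illustration.
all_good = u_sigma([2.0, 3.0], [1.0, 0.5], [0.0])          # (2 + 3) * 1 - e^0 = 4.0
one_gate_failed = u_sigma([2.0, 3.0], [1.0, -0.5], [0.0])  # (2 + 3) * 0 - e^0 = -1.0
print(all_good, one_gate_failed)
```

A single non-positive $V_{2}$ value wipes out the whole linear term, while the $V_{3}$ penalty is paid regardless.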

Then if we are completely ignorant about which variables are truly in the sets, the Bayesian utility is:

$U = U_{V_{1}, V_{2}, V_{3}} = \frac{\sum_{σ} U_{σ}}{\binom{N_{1}}{n_{1}} \binom{N_{2}}{n_{2}} \binom{N_{3}}{n_{3}}} .$

What can we say about $U$? Well, if we fix a $v_{3} \in V_{3}$, then it will appear in $\binom{N_{1}}{n_{1}} \binom{N_{2}}{n_{2}} \binom{N_{3} - 1}{n_{3} - 1}$ of the $U_{σ}$ terms (i.e. all choices of the $V_{1}$ and $V_{2}$ variables, and all possible choices of the other $n_{3} - 1$ variables in the other $N_{3} - 1$ variables in $V_{3}$). Thus $v_{3}$ appears in $U$ as an additional term which is

$- e^{- v_{3}} \frac{\binom{N_{3} - 1}{n_{3} - 1}}{\binom{N_{3}}{n_{3}}} = - e^{- v_{3}} \frac{n_{3}}{N_{3}} .$
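For readers who want the intermediate step, the ratio of binomial coefficients simplifies by expanding the factorials (a standard identity, spelled out here for convenience rather than taken from the original):

$\frac{\binom{N_{3} - 1}{n_{3} - 1}}{\binom{N_{3}}{n_{3}}} = \frac{(N_{3} - 1)!}{(n_{3} - 1)! \, (N_{3} - n_{3})!} \cdot \frac{n_{3}! \, (N_{3} - n_{3})!}{N_{3}!} = \frac{n_{3}}{N_{3}} .$

Equivalently, $n_{3} / N_{3}$ is the probability that a uniformly random selection of $n_{3}$ variables from $V_{3}$ contains the fixed $v_{3}$.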

How about a fixed $v_{2} \in V_{2}$? Well, if $v_{2}$ goes below $0$, that will kill off the $V_{1}$ term in $\binom{N_{1}}{n_{1}} \binom{N_{2} - 1}{n_{2} - 1} \binom{N_{3}}{n_{3}}$ of the $U_{σ}$. So if all other $v_{2}^{'} \in V_{2}$ are positive, sending $v_{2}$ from positive to negative will multiply the $V_{1}$ part of the expected value by $1 - n_{2} / N_{2}$.

Similarly, for fixed $v_{1} \in V_{1}$ , if all $v_{2} \in V_{2}$ are positive, then it will appear in $U$ as $+ v_{1} (n_{1} / N_{1})$ .

So the Bayesian mix is roughly linear in the $V_{1}$, treats any $v \in V_{2}$ as losing $n_{2} / N_{2}$ of the utility if it goes negative, and is linear in the negative exponential of the $V_{3}$.
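This summary can be checked by brute force on a small case. The sketch below (Python; the helper names `bayes_mix` and `closed_form` and all numeric values are illustrative assumptions, not from the original post) averages $U_{σ}$ over every selection $σ$ and compares the result with the closed form: $(n_{1} / N_{1})$ times the $V_{1}$ sum, scaled by $1 - n_{2} / N_{2}$ if exactly one $V_{2}$ variable is non-positive, plus $(n_{3} / N_{3})$ times the $V_{3}$ penalties.

```python
import math
from itertools import combinations, product

def u_sigma(s1, s2, s3):
    """U_sigma: linear V1 sum, gated by H over the V2 picks, plus V3 penalties."""
    gate = 1.0 if all(v > 0 for v in s2) else 0.0
    return sum(s1) * gate + sum(-math.exp(-v) for v in s3)

def bayes_mix(V1, V2, V3, n1, n2, n3):
    """Uniform average of U_sigma over every selection sigma (brute force)."""
    sigmas = list(product(combinations(V1, n1),
                          combinations(V2, n2),
                          combinations(V3, n3)))
    return sum(u_sigma(*s) for s in sigmas) / len(sigmas)

def closed_form(V1, V2, V3, n1, n2, n3):
    """Closed form from the text; only covers zero or one non-positive V2 value."""
    failed = sum(1 for v in V2 if v <= 0)
    if failed > 1:
        raise ValueError("closed form only covers zero or one failed V2 variable")
    gate = 1.0 - failed * n2 / len(V2)
    linear = (n1 / len(V1)) * sum(V1)
    penalty = (n3 / len(V3)) * sum(-math.exp(-v) for v in V3)
    return linear * gate + penalty

# Hypothetical values, chosen only for illustration.
V1, V3 = [1.0, 2.0, 4.0], [0.0, 1.0]
healthy = [1.0, 1.0, 1.0, 1.0]   # every V2 variable positive
damaged = [-1.0, 1.0, 1.0, 1.0]  # one V2 variable has gone negative

for V2 in (healthy, damaged):
    brute = bayes_mix(V1, V2, V3, n1=2, n2=1, n3=1)
    exact = closed_form(V1, V2, V3, n1=2, n2=1, n3=1)
    print(f"brute force = {brute:.6f}, closed form = {exact:.6f}")
```

The enumeration grows combinatorially, so this is only practical for toy sizes; the point of the closed form is that no enumeration is needed.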

Being somewhat sloppy

In some ways we cannot afford to be sloppy: assume that $v$ should be in $V_{2}$ but isn't; so the true utility is $U = U_{V_{1}, V_{2} \cup \{v\}, V_{3}}$, but a $U^{'} = U_{V_{1}, V_{2}, V_{3}}$-maximiser might sacrifice $v$ to increase $U^{'}$; thus (ignoring the $V_{3}$) maximising $U^{'}$ may set $U$ to $0$.

Conversely, suppose $v$ should be in $V_{3}$ but isn't, so the true utility is $U = U_{V_{1}, V_{2}, V_{3} \cup \{v\}}$. Then a $U^{'}$-maximiser might sacrifice $v$ to maximise $U^{'}$, at a cost to $U$ of some multiple of $- e^{- v}$, which might be enormous.

So, just as forgetting a variable triggers a standard Goodhart effect (see the section above), forgetting to assign it to the right category also triggers a Goodhart effect.

On the other hand, if we get the numbers $n_{1}$ , $n_{2}$ , or $n_{3}$ wrong, then we scale various pieces of the utility by a constant factor. This tends to have a small effect, as compared with the exponential penalty of the $V_{3}$ or the "kill off $V_{1}$ " effects of the $V_{2}$ .

In the general case, we wouldn't have $V_{i}$ with $n_{i}$ of them that matter, but we'd have a probability distribution $p_{i}$ over the possible variables in $V_{i}$ that matter (this would be a probability distribution over $2^{V_{i}}$ , the set of subsets of $V_{i}$ ). In that case, we can be a bit sloppy with $p_{i}$ as long as we don't miss out variables we should include.
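A minimal sketch of this general case (Python; the dictionary representation, the independence of the three distributions, and all numbers are assumptions made for illustration): each $p_{i}$ maps a candidate subset, written as a tuple of variable values, to its probability, and the expected utility is the probability-weighted sum of the corresponding $U_{σ}$.

```python
import math
from itertools import product

def u_sigma(s1, s2, s3):
    """U_sigma: linear V1 sum, gated by H over the V2 picks, plus V3 penalties."""
    gate = 1.0 if all(v > 0 for v in s2) else 0.0
    return sum(s1) * gate + sum(-math.exp(-v) for v in s3)

def expected_utility(p1, p2, p3):
    """Expected U_sigma when sigma_i is drawn from p_i, independently for each i.

    Each p_i is a dict {subset (tuple of values): probability}.
    Independence across i is an assumption of this sketch.
    """
    total = 0.0
    for (s1, w1), (s2, w2), (s3, w3) in product(p1.items(), p2.items(), p3.items()):
        total += w1 * w2 * w3 * u_sigma(s1, s2, s3)
    return total

# Hypothetical beliefs about which variables "really" matter.
p1 = {(1.0,): 0.5, (1.0, 2.0): 0.5}   # the second V1 variable matters half the time
p2 = {(): 0.9, (-1.0,): 0.1}          # a currently negative variable is necessary w.p. 0.1
p3 = {(0.0,): 1.0}                    # one V3 variable, certainly relevant

careful = expected_utility(p1, p2, p3)

# A "sloppy" estimate that doubles the probability mass on the negative V2 variable.
p2_sloppy = {(): 0.8, (-1.0,): 0.2}
sloppy = expected_utility(p1, p2_sloppy, p3)

print(careful, sloppy)
```

In this toy case the misjudged probability shifts the result from about 0.8 to about 0.6: a bounded rescaling, in contrast to the unbounded losses that come from leaving a variable out entirely.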

So, along certain axes, we can be sloppy and imprecise without the system breaking. This is a change from the standard Goodhart/AI alignment situations, where sloppiness almost always guarantees disaster.

Conservatism

If we were being more conservative, instead of maximising $U = U_{V_{1}, V_{2}, V_{3}}$, a weighted sum of all $U_{σ}$, we might want to maximise the minimum value of these, and hence the utility function

$μ = μ_{V_{1}, V_{2}, V_{3}} = \min_{σ} U_{σ} .$

Let's compare what maximising $μ$ and maximising $U$ might get us.

For $V_{1}$ , maximising $μ$ involves maximising the average of the minimal $n_{1}$ values of the $V_{1}$ , while $U$ maximises the whole average. Here, $U$ is superior to the $μ$ , since $V_{1}$ variables are explicitly designed to be traded off linearly against each other - if we had reasons to think they behaved differently when low-valued, they should instead/also be in $V_{2}$ or $V_{3}$ .

If we have a single $v_{3} \in V_{3}$ that gets very negative, $U$ will get a penalty of $- e^{- v_{3}} (n_{3} / N_{3})$ while $μ$ gets a penalty of $- e^{- v_{3}}$ . Up to that constant factor, they behave similarly.

If one $v_{2} \in V_{2}$ is negative, then $μ$ ignores all of $V_{1}$, while $U$ scales the $V_{1}$ utility function by $1 - (n_{2} / N_{2})$. Thus $U$ can take the risk of letting some values in $V_{2}$ go negative (in exchange for enough $V_{1}$ bonus), while $μ$ won't take the risk at all. We could see $μ$ as behaving like $U$ if $n_{2} = N_{2}$.

Which is correct there? Most likely, $U$ . Imagine if $n_{2} = 1$ while $N_{2} = 10^{24}$ . Then $μ$ is saying that "there are a trillion trillion of these variables in $V_{2}$ . It's bad for one of them to be negative. However, you should behave as if it's bad for any of them to be negative, and not take that risk."
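To put rough numbers on this example (a sketch, assuming all other $V_{2}$ variables stay positive and the $V_{3}$ terms are unchanged; the symbols $A$ and $A^{'}$ are introduced here for illustration): write $A > 0$ for the $V_{1}$ term of $U$ before a trade that sends one $v_{2}$ negative, and $A^{'}$ for the unscaled $V_{1}$ term after it. Then $U$ prefers the trade exactly when

$A^{'} \left( 1 - \frac{n_{2}}{N_{2}} \right) > A \iff \frac{A^{'}}{A} > \frac{1}{1 - 10^{- 24}} \approx 1 + 10^{- 24} .$

So $U$ accepts the trade for almost any genuine gain in the $V_{1}$, while for $μ$ the minimising $σ$ contains the negative $v_{2}$, so the $V_{1}$ contribution to $μ$ is $0$ however large $A^{'}$ is.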

That's an extreme example; but the point is that $U$ incorporates our knowledge or estimate of tradeoffs, but $μ$ does not. Since it does not know the tradeoffs, we can set up situations where $μ$-maximising is not just sub-optimal, but quite dire. For example, if one of the $v_{2} \in V_{2}$ is already negative, then the $μ$-maximiser will accept reducing $\sum_{v_{1} \in V_{1}} v_{1}$ to just above zero (no matter how high it was initially) in exchange for an infinitesimal chance of increasing $v_{2}$ above $0$. So the amount of $U$-value that the $μ$-maximiser is willing to throw away is potentially unbounded.
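The indifference of $μ$ to the $V_{1}$ in this situation can be seen in a toy case. The sketch below (Python; the helper `mix_and_min` and all values are illustrative assumptions) computes both $U$ and $μ$ for two hypothetical worlds that differ only in how large the $V_{1}$ variables are, with one $V_{2}$ variable already negative in both.

```python
import math
from itertools import combinations, product

def u_sigma(s1, s2, s3):
    """U_sigma: linear V1 sum, gated by H over the V2 picks, plus V3 penalties."""
    gate = 1.0 if all(v > 0 for v in s2) else 0.0
    return sum(s1) * gate + sum(-math.exp(-v) for v in s3)

def mix_and_min(V1, V2, V3, n1, n2, n3):
    """Return (U, mu): the uniform average and the hard minimum of the U_sigma."""
    values = [u_sigma(*s) for s in product(combinations(V1, n1),
                                           combinations(V2, n2),
                                           combinations(V3, n3))]
    return sum(values) / len(values), min(values)

# Hypothetical worlds: one V2 variable is already negative in both.
V2 = [-1.0, 1.0, 1.0]
V3 = [0.0]
u_rich, mu_rich = mix_and_min([10.0, 20.0, 30.0], V2, V3, n1=2, n2=1, n3=1)
u_poor, mu_poor = mix_and_min([0.1, 0.2, 0.3], V2, V3, n1=2, n2=1, n3=1)

print(f"rich world: U = {u_rich:.3f}, mu = {mu_rich:.3f}")
print(f"poor world: U = {u_poor:.3f}, mu = {mu_poor:.3f}")
```

In this example $μ$ is identical in the two worlds, so a $μ$-maximiser has no reason to prefer the rich one, whereas $U$ still registers the difference in $V_{1}$ value.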

Bayesianism versus conservatism

Bayes beats conservatism

Both $μ$ -maximising and $U$ -maximising are forms of conservatism; neither of them is just saying "maximise the $v_{i}$ linearly". When designing our conservative utility, we'd like to incorporate our knowledge about tradeoffs. The $μ$ -style approach does this very badly in many circumstances, and so can lose a lot of value.

Why might we nevertheless use a $μ$-style approach rather than a $U$-style one? We might if we feel that there is some circumstance, one we haven't fully accounted for, where $U$ could also lose a lot of value. After all, there may be situations where we don't know the tradeoffs; maybe there are some tradeoffs where $U$-maximising is terrible, and we just haven't thought of them yet.

But why would $μ$ -maximising avoid these problems? Maybe there are theorems about how badly $μ$ -maximising can mess up; since it is based around a minimum, it is plausible that we can bound the downside risk in some circumstances. But if we knew these theorems, we could use them to design a better $U$ .

There might be a practical argument for $μ$ or $U$ . Maybe humans are too clever when we try to account for all our genuine preferences, so, in practice, $μ$ -maximising is safer than $U$ -maximising. We'd like an AI to take this into account, and automatically create a good $U$ , but maybe that's also too clever - maybe we can't program an AI to successfully take into account the biases we have when estimating our own biases.

Still, seeing the definite failures of $μ$ and the ability to apply most of our $μ$ -insights to $U$ -design, there's a strong argument that $U$ -maximising would likely be safer than $μ$ -maximising.

Bayes is conservatism

I've been a bit unfair there. I constructed $U$ with some care and effort; I could have made it more realistic, but it was a reasonably complex and involved effort.

When constructing $μ$ , on the other hand, I simply took the previously defined $U_{σ}$ and took a hard minimum across them. If I wanted to construct a better $μ$ , I could have put in more effort to define functions over which to take a minimum, taking into account all the likely tradeoffs...

We can see the convergence here: this sounds a lot like building $U$ . My argument here, not fully proven, is that if we are constructing any objective function that incorporates all our knowledge, we can do as well or better with a Bayesian mix of utility functions.

If we have an argument for why a non-Bayesian approach (such as a minimum or a smooth minimum) might work better, then that very argument becomes a tool to improve our Bayesian mix. Saying "if we use the Bayesian mix, we have problem $P$" becomes "how bad is $P$, and what would we pay to avoid it?", which modifies and improves the Bayesian mix.

[1] If we wanted a more gradual transition, we could use $S_{0}$, the smoothstep/clamping function defined by $S_{0} (x) = 0$ if $x < 0$, $S_{0} (x) = 1$ if $x > 1$, and $S_{0} (x) = x$ otherwise.

Kenoubi (3y)

It seems like there might be a problem with this argument if the true are not just unknown, but adversarially chosen. For example, suppose the true $v_{2}$ are the actual locations of a bunch of landmines, from a full set of possible landmine positions $V_{2}$ . We are trying to get a vehicle from A to B, and all possible paths go over some of the $V_{2}$ . We may know that the opponent placing the landmines only has $n_{2}$ landmines to place. Furthermore, suppose each landmine only goes off with some probability $p$ even if the vehicle drives over it. If we can mechanistically predict where the opponent placed the landmines, or even mechanistically derive a probability distribution over the landmine placements, this is no problem, we can just use that to minimize the expected probability of driving over a landmine that goes off. However, suppose we can't predict the opponent that way, but we do know the opponent is trying to maximize the probability that the vehicle drives over a landmine that isn't a dud. It seems like we need to use game theory here, not just probability theory, to figure out what mixed strategy the opponent would be using to maximize the probability that we drive over a landmine, and then use that game-theoretic strategy to choose a mixed strategy for which path to take. It seems like the game theory here involves a step where we look for the worst (according to our utility function) probability distribution over where the landmines are placed, because this is how the opponent will have actually chosen where to put the landmines. Doesn't this look a lot like using $μ$ rather than $U$ as our utility function?

sj9999 (4y)

These might be some typos:

From the following, the second $V_{3}$ should be $V_{2}$:

What can we say about $U$? Well, if we fix a $v_{3} \in V_{3}$, then it will appear in $\binom{N_{1}}{n_{1}} \binom{N_{2}}{n_{2}} \binom{N_{3} - 1}{n_{3} - 1}$ of the $U_{σ}$ terms (i.e. all choices of the $V_{1}$ and $V_{3}$ variables, and all possible choices of the other $n_{3} - 1$ variables in the other $N_{3} - 1$ variables in $V_{3}$).

as in

(i.e. all choices of the $V_{1}$ and $V_{2}$ variables, and all possible choices of the other $n_{3} - 1$ variables in the other $N_{3} - 1$ variables in $V_{3}$ ).

In the following, the second term $(\prod_{v_{1} \in σ_{3}} H (v_{2}))$ should instead be multiplied over $v_{2} \in σ_{2}$ .

Define $σ = (σ_{1}, σ_{2}, σ_{3})$ as selecting $n_{i}$ variables from $V_{i}$ , and

$U_{σ} = (\sum_{v_{1} \in σ_{1}} v_{1}) \cdot (\prod_{v_{1} \in σ_{3}} H (v_{2})) + \sum_{v_{3} \in σ_{3}} - e^{- v_{3}} .$

I believe the following should be "from positive to negative" rather than "from negative to positive".

And here, it seems "(ignoring the $V_{3}$ )" might be "(ignoring the $v$ )".

Thanks to Rupert McCallum for help in identifying some of these typos.
