Proxy misspecification and the capabilities vs. value learning race

[-]VojtaKovarik3yΩ110

I might be interpreting things wrong, but it seems to me that the paper does things the wrong way around. That is, the paper seems to set out to prove that Goodhart's law is an issue and then picks a setting where this will be the case, as opposed to picking a setting and then investigating whether and when Goodhart's law is an issue.

By this, I don't mean to say that the paper is bad; it is good. I merely mean to say that we should view it as a nice metaphor that formalises some intuitions about Goodhart's law, rather than as a model that is "causally related to how Goodhart's law works (or doesn't) in reality".^[1]

Why do I think this? Well, if you look at the assumptions, they say that both the utility function and the costs (constraint function) are strictly increasing in all attributes. First, this is not always how the world works. Second, this means that, by assumption, there will always be tradeoffs, and there will always be issues with Goodhart's law.
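A minimal sketch of why these assumptions force the result, using a toy instance that is not taken from the paper: the utility `sum(sqrt(s_i))`, the linear cost `sum(s_i) <= budget`, and the names `best_allocation` and `true_utility` are all assumptions chosen for illustration. Because both functions are strictly increasing in every attribute, an optimizer whose proxy omits an attribute gains nothing by spending budget on it and leaves it at its minimum.

```python
import math

def best_allocation(budget, n_attrs, counted):
    """Spend the whole budget evenly on the attributes in `counted`.

    With utility sum(sqrt(s_i)) over the counted attributes and cost
    sum(s_i) <= budget, the even split is optimal (concavity plus symmetry),
    and every attribute that is not counted is left at its minimum, 0.
    """
    share = budget / len(counted)
    return [share if i in counted else 0.0 for i in range(n_attrs)]

def true_utility(s):
    # Toy "true" utility: strictly increasing in every attribute.
    return sum(math.sqrt(x) for x in s)

BUDGET, N = 12.0, 4
s_true = best_allocation(BUDGET, N, counted={0, 1, 2, 3})  # optimizes all attributes
s_proxy = best_allocation(BUDGET, N, counted={0, 1, 2})    # proxy omits attribute 3

print(s_proxy)                # attribute 3 is driven to 0
print(true_utility(s_true))   # 4 * sqrt(3)
print(true_utility(s_proxy))  # 3 * sqrt(4) = 6.0
```

In this toy setting the loss appears for any proxy that omits an attribute, which is the sense in which the tradeoff is built into the assumptions rather than discovered.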

[1] To be clear, I think the analysis of "if we assume that tradeoffs are unavoidable, what happens?" is informative. I would just prefer to be very clear that the premise is just a hypothetical assumption, and actually one that is false more often than not.

[-]VojtaKovarik3yΩ110

Incidentally, I am trying to come up with a "better" model for "this stuff", one that would have predictive power over reality. (As opposed to starting out with a clear bottom line.) No solutions yet, but I do have some thoughts. If other people are also actively working on this, I would be happy to talk.

[-]adamShimi4yΩ1-40

Thanks for trying to make the issue more concrete and provide a way to discuss it!

One thing I want to point out is that you don't really need to put the non-constrained variables at the worst possible state; you just have the degree of freedom to put them to whatever helps you and is not too hard to reach.

Using sets, you have a set of worlds you want, and a proxy that is a superset of this (because you're not able to aim exactly at what you want). The problem is that the AI is optimizing to get into the superset with high guarantees and stay there, and so it's probably aiming for the part of the set that is easiest to reach and stay in (subject to the accessibility constraints that you mention). This is what should lead to instrumental convergence and the real issue with the proxies IMO.
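The set framing can be sketched in a few lines. Everything here is hypothetical: the world names, the made-up `difficulty` scores, and the assumption that the optimizer simply picks the easiest member of the proxy set.

```python
# Hypothetical worlds, each with a made-up difficulty of reaching and staying in it.
difficulty = {"w1": 5, "w2": 4, "w3": 1, "w4": 3}

wanted = {"w1", "w2"}             # the worlds we actually want
proxy = {"w1", "w2", "w3", "w4"}  # what we managed to specify: a superset
assert wanted <= proxy

# An optimizer that only needs to land in the proxy set aims for its easiest member.
target = min(proxy, key=lambda w: difficulty[w])

print(target)            # w3
print(target in wanted)  # False
```

Nothing forces the easiest world in the superset to lie outside the wanted set; the sketch only shows that the proxy alone gives no reason for it to lie inside.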

It doesn't seem obvious to me how this race will go by default; in fact, the likely trajectories seem to depend on lots of empirical facts about the world that I don't have strong views on.

Let me propose another framing: there are fewer possible worlds in which the curves are "nice". The good case is more specific, more constrained, and thus there are more ways things can go wrong. This doesn't mean things will definitely go wrong or that there's no argument that could convince us that the situation will be good by default. Just that the burden of proof is on showing that the good but less numerous worlds are somehow privileged by Reality.

^{^}

This isn't the recording from the talk I attended, but it seems to be identical in content.

^{^}

More precisely, we shouldn't model agents as freely choosing states. Rather, we should model them as choosing actions (from a much lower-dimensional space!) which influence states. Each action might influence many of the features $s_i$ that we care about, hence the apparent correlation between $s_i$'s. Thanks to Ben Edelman for suggesting this model (which is present in these papers that study proxies which work well despite Goodharting).
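A minimal sketch of the action-based model, under loudly stated assumptions: a single scalar action, a linear map from action to features, and the made-up `WEIGHTS` below. None of this comes from the papers mentioned; it only illustrates how a low-dimensional action couples the features.

```python
# Hypothetical coupling: one scalar action moves all four features at once.
WEIGHTS = [1.0, 0.5, 0.8, 0.3]

def state(action):
    # Feature i is WEIGHTS[i] * action, so features cannot be set independently.
    return [w * action for w in WEIGHTS]

def proxy(s):
    # The proxy only mentions features 0 and 1.
    return s[0] + s[1]

# Grid search over the one-dimensional action space [0, 1].
actions = [i / 10 for i in range(11)]
best = max(actions, key=lambda a: proxy(state(a)))

print(best)         # 1.0
print(state(best))  # [1.0, 0.5, 0.8, 0.3]
```

Here the features the proxy ignores (2 and 3) rise together with the ones it counts, because the agent has no action that pushes them down separately. With negative weights on the ignored features, the same coupling would hurt instead.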

^{^}

I'm going to keep using the word "error" informally to mean a measure of dissimilarity between $\tilde{U}$ and $U$, but I don't actually have in mind a particular metric that we could use to quantify how different a proxy utility function is from a given utility function. Sorry if this is confusing.

^{^}

Note also that depending on what you think about a whole bunch of other safety-relevant problems (mesa-optimizers, robustness to distributional shift, etc.), you might think that by default the safe margin of error for high capabilities levels is negative, i.e. that we won't be safe even with a proxy utility function that perfectly matches our true utility function. To be clear, since this is a post about proxy misspecification, I'm implicitly screening off these concerns and imagining that proxy misspecification is the only obstruction to alignment.

Our model of proxy misspecification

My complaint: it might be hard to decouple the $s_i$'s

The capabilities vs. value learning race