This post was written quickly, lest I not write it at all.

Picture the following scenario. 

  1. Humans train a model, M, with the intention for M to minimize a loss function L. 
  2. The model, M, will now take a set of actions. 

I see this going wrong in two ways. 

  1. It is possible that L is malformed (misaligned, specifically), such that effectively decreasing L kills everyone. This is the classic paperclip maximizer scenario. We currently do not know how to design L such that this does not happen.
  2. Even if L is not malformed, the set of actions taken by M might be catastrophic. This is the mesa-optimization problem.

The first failure case has captured most of my attention. Meanwhile, I have been somewhat dismissive of the second failure case. 

I would like to explain why I was dismissive of the mesa-optimization problem, and make an argument for why I think we should in fact take it seriously.

--

We understand that M is an optimizer. However, we can also assume that M is not a perfect optimizer. On out-of-distribution data, M is likely to fail to optimize L.

We can define a new loss function, L', which M actually does perfectly optimize. We define L' such that the more resources M has, the more effectively M will decrease L'.

L' is not taken from "human-designed objective function" space. In fact, my intuition says that L' is likely to look very strange and complex. If we were to attempt to extract a utility function from a heavily intelligence-enhanced human based on their actions, I doubt that such a utility function would seem simple either. This intuition initially made me dismissive of mesa-optimization as a problem.
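To make "extracting a utility function from actions" concrete, here is a deliberately crude, toy "revealed preference" sketch. The options and choice data are invented for illustration; this is not an inverse-RL algorithm, just the simplest possible behavior-to-utility fit:

```python
# Toy "revealed preference" fit: given observed pairwise choices, assign each
# option a utility equal to how often it was chosen over an alternative.
from collections import Counter

observed_choices = [
    ("apple", "banana"),   # (chosen, rejected) -- made-up data
    ("apple", "cherry"),
    ("banana", "cherry"),
]

# Count how often each option was the chosen one; missing keys count as 0.
wins = Counter(chosen for chosen, _ in observed_choices)

def fitted_utility(option):
    return wins[option]

ranking = sorted({o for pair in observed_choices for o in pair},
                 key=fitted_utility, reverse=True)
print(ranking)  # ['apple', 'banana', 'cherry']
```

Even in this toy setting, the fitted utility is just a summary of behavior; for a complex agent, the analogous fit could look arbitrarily strange, which is the intuition above.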

Despite having read Omohundro's [AI Drives](https://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf) paper, I saw no obvious reason to assume that strange L'-like objective functions would be subject to instrumental convergence. One can certainly imagine many objective functions that do not lead to these drives. One could even imagine an objective function which rewards having fewer resources, less skill, or less rationality.

It might be the case that most utility functions sampled from the space of all possible utility functions converge to having these drives, but that did not and does not seem like an obviously true fact to me. 

--

I can't find the post, but someone on LessWrong wrote something along the lines of "only a tiny sliver of possible worlds are compatible with human existence." This seemed like an obviously true fact, and I'd intuit that it applies to biological sentience more broadly.

That was the "aha" moment for me. Without understanding L' more deeply, we can begin by assuming that L' is sampled from "objective function space" instead of "human-like objective function space."

I think the maximum entropy assumption is that the terminal goal-states[1] of functions in "objective function space" are uniformly distributed across all possible states. Since only a tiny sliver of possible states are compatible with biological sentience, we should expect a highly effective L' optimizer to be incompatible with human life.
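The back-of-envelope version of this argument can be sketched in a few lines. The state-space size and the size of the "sliver" are made-up numbers, chosen only to illustrate the shape of the claim:

```python
# Toy model of the argument: under a maximum-entropy (uniform) prior over
# goal-states, the probability that a sampled goal-state is compatible with
# biological sentience is |sliver| / |state space|. Both sizes below are
# invented for illustration, not estimates.
NUM_STATES = 10**12      # toy size of "all possible world-states"
NUM_COMPATIBLE = 10      # toy size of the life-compatible "sliver"

p_compatible = NUM_COMPATIBLE / NUM_STATES
print(p_compatible)  # 1e-11
```

However large you make the toy numbers, the conclusion has the same shape: under the uniform prior, a goal-state compatible with human existence is a measure-zero-ish accident.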

--

Luckily, we have a bit[2] more than 0 bits of information about L'. For example, we know that with enough training, L' can be very similar to L.

I think it might be worth exploring other things that we expect to be true about L'.

 

  1. ^

    We define a goal-state of an objective function to be a world which minimizes loss for the objective function.

  2. ^

    ;)

--

Comments

Strong RL-selected policy is probably often some sort of agent, because that's the shape of being good at inverting the training setup, predicting what leads to reward and doing that. But it's not necessarily an optimizer of anything, because its goals are not selected to be stable, and most training setups won't let a policy reflect on its goals, don't allow or encourage them to become more stable under retraining. Context change from deployment would further shift the current goals.

And then you have an agent with unstable goals, the same sort of thing as a human with the same alignment worries, including goodhart and prudence of mild optimization. Its current behavior might even be OK if RL training was about that (it would appear to be aligned, not even deceptively so). But its extrapolated volition could be wildly different from that of humanity, and that's the catastrophic misalignment in this setting.

> But it's not necessarily an optimizer of anything, because its goals are not selected to be stable, and most training setups won't let a policy reflect on its goals, don't allow or encourage them to become more stable under retraining.

I might have an unorthodox definition of goal, but I do not understand yours. You can fit a utility function to any agent based on its actions. I defined a goal to be the utility function that you would fit onto said agent as it increased in computational power and resources.

The fact that an agent's actions do not lead to a coherent, human-parseable goal does not mean that its goal is inconsistent, by my lights.

> But its extrapolated volition could be wildly different from that of humanity

I don't know if CEV gives us the best framing here. But if we were to use that framing, then my argument would be that, starting with no additional assumptions, we should expect mesa-optimization problems *because* most volitions taken from "possible volitions space" are misaligned with the continued existence of humanity.

An unstable goal leads to near-term behavior that fits different utility functions as it changes. When an agent comprehends the idea of a stable goal as its own alignment target, that should stop path-dependence of goal drift, so that eventually the agent optimizes for something that didn't depend on how it got there (its own CEV, see these comments on what I mean by CEV, the normative alignment target).

This stops CEV drift, but not current goal drift, current goals continue changing long after that, and only arrive at CEV in very distant future. CEV is not utility that fits current actions, and current goals being unstable doesn't imply CEV being unstable, but CEV could also be unstable while agent's personal misalignment risk is not solved. Also, no utility fits current actions very well, or else current goal would be stable and exhibit goal-preservation drives. So an agent with unstable goals is not an optimizer for any goal, other than indirectly for its CEV where it path-independently tends to eventually go, but its current behavior doesn't hint at that yet, doesn't even mildly optimize for CEV taken as a goal, because it doesn't yet know what its CEV is.


Sorry, I've been too busy to reply. I'm still too busy to give an incredibly detailed reply, but I can at least give a reply. A reply is better than no reply.

> An unstable goal leads to near-term behavior that fits different utility functions as it changes.

"It changes to fit different utility functions" is not distinguishable from "it has a single, complex, persistent utility function which rewards drastically differing policies in incredibly similar but subtly different contexts." An agent is never in the exact same environment twice.

> So an agent with unstable goals is not an optimizer for any goal, other than indirectly for its CEV where it path-independently tends to eventually go, but its current behavior doesn't hint at that yet, doesn't even mildly optimize for CEV taken as a goal, because it doesn't yet know what its CEV is.

This framing seems significant and important to you. However, I fail to see its utility. Could you help me see why this is how you chose to look at the problem? 

What serves as a goal in the distant future determines how the cosmic endowment is optimized. Stable goals are also goals that remain in the distant future, so they are relevant to that (and since reflection hasn't yet had a chance to take place, stable goals settled in the near future are always misaligned). Unstable goals are not relevant in themselves, in what utility function (or maybe probutility) they fit, except in how they tend to produce different stable goals eventually.

So maintaining the distinction means not being unaware of the catastrophic misalignment risk where we turn some unstable goals into stable ones based on a stupid process of (possibly lack of) reflection that just fits things instead of doing proper well-designed reflection (a thing like CEV, possibly very different in detail). And it helps with not worrying too much about details of utility functions that fit current unstable goals, or aligning them with human current unstable goals, when they are not what actually matters.

> An agent is never in the exact same environment twice.

That doesn't affect goals, which speak about all possible environments, whether or not an agent actually encounters them. Goals are not just policy; they determine policy, not the other way around (along the algorithm vs. physical distinction, goals are closer to the algorithm, while policy is merely the behavior of the algorithm, the decisions it takes, closer to the physical instances and actions in reality). Unstable goals change their mind about the same environment, which could be an environment that will be reachable/enactable in the future.

I think the max ent. assumption over possible generalizations proves too much. E.g., the vast majority of possible utility functions have maxima where the universe is full of undifferentiated clouds of hydrogen and radiation. Do we assume that most L’ will aim for such a universe?

I'm not sure how you arrived at the conclusion that "the vast majority of possible utility functions have maxima where the universe is full of undifferentiated clouds of hydrogen and radiation."

But more fundamentally, yes I think that we should start by expecting L' to be sampled from the pool of possible objective functions. If this leads to absurd conclusions for you, that might be because you have assumptions about L' which you haven't made explicit.

I’m specifically disagreeing with using the maximum entropy prior over utility function generalizations. Obviously, generalizations are sampled from some distribution. I’m specifically saying that the maximum entropy distribution leads to absurd conclusions.

Could you give examples of what absurd conclusions this leads to?

Even "L' will aim for a universe full of undifferentiated clouds of hydrogen and radiation." is not an absurd conclusion to me. Do we just disagree on what counts as absurd? 

Apparently, because that was the example of an absurd conclusion I had in mind.

I think it's worth exploring why that difference exists. 

If someone told me that my life depended on passing the Intellectual Turing Test with you on this issue right now, I'd make an argument along the lines of "most existing models do not seem to be optimizing for a universe full of undifferentiated clouds of hydrogen and radiation, nor do most human-conceivable objectives lead to a desire for such a universe. I'd be very surprised if this was the goal of a very intelligent AI."

I concur with this imaginary Quintin that this conclusion seems unlikely. However, this is mostly because I believe that training will lead L' to look somewhat similar to L. 

Edit: I suppose that the last paragraph ascribes at least some absurdity to the hydrogen and radiation scenario. 

There are three main reasons:

  1. If you didn’t know about humans, then the max entropy prior would have given terrible predictions about how humans would actually generalize their values. Most utility functions are, indeed, inconsistent with human survival, so you’d have (very confidently) predicted that humans would not want to survive.
  2. The universe is low entropy. Most things relate to each other via approximately low rank interactions. The speed of a sled going down a hill is mostly determined by simple aggregate factors of friction and gravity, rather than complex interactions between all the parts of the sled and all the parts of the surrounding landscape (or the oceans, or distant stars, etc). Most likely, the factors that determine how values generalize are also low rank and low entropy.
  3. We know that neural nets already have a non-uniform prior over hypotheses. They’d not be able to generalize at all if they did. This almost surely extends to their prior over values generalizations.

Note that a max entropy prior over L’ - which you minimally constrain to match the behavior of L on the training distribution - still gives you totally random seeming behavior off distribution.

So the problem with this entire comment thread is that I think I generally agree with you. We're just talking past each other a little bit.

I know that the maximum entropy principle is a bad place to end, but it is a good place to start. Given no other knowledge about something, we should adopt the maximum entropy principle.

So I wanted to start at the baseline of zero knowledge about L', and then build on top of that.

Edit: deleted last bit