Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann

by Daniel Kokotajlo 1mo7th Oct 20196 min read39 comments

49

Ω 19


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

[Epistemic Status: My inside view feels confident, but I’ve only discussed this with one other person so far, so I won't be surprised if it turns out to be confused.]

Armstrong and Mindermann (A&M) argue "that even with a reasonable simplicity prior/Occam’s razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple ‘normative’ assumptions, which cannot be deduced exclusively from observations."

I explain why I think their argument is faulty, concluding that maybe Occam's Razor is sufficient to do the job after all.

In what follows I assume the reader is familiar with the paper already or at least with the concepts within it.

Brief summary of A&M's argument:

(This is merely a brief sketch of A&M’s argument; I’ll engage with it in more detail below. For the full story, read their paper.)

Take a human policy pi = P(R) that we are trying to represent in the planner-reward formalism. R is the human’s reward function, which encodes their desires/preferences/values/goals. P() is the human’s planner function, which encodes how they take their experiences as input and try to choose outputs that achieve their reward. Pi, then, encodes the overall behavior of the human in question.

Step 1: In any reasonable language, for any plausible policy, you can construct “degenerate” planner-reward pairs that are almost as simple as the simplest possible way to generate the policy, yet yield high regret (i.e. have a reward component which is very different from the "true"/"Intended" one.)

  • Example: The planner deontologically follows the policy, despite a buddha-like empty utility function
  • Example: The planner greedily maximizes the reward function "obedience-to-the-policy."
  • Example: Double-negated version of example 2.

It’s easy to see that these examples, being constructed from the policy, are at most slightly more complex than the simplest possible way to generate the policy, since they could make use of that way.

Step 2: The "intended" planner-reward pair--the one that humans would judge to be a reasonable decomposition of the human policy in question--is likely to be significantly more complex than the simplest possible planner-reward pair.

  • Argument: It's really complicated.
  • Argument: The pair contains more information than the policy, so it should be more complicated.
  • Argument: Philosophers and economists have been trying for years and haven't succeeded yet.

Conclusion: If we use Occam’s Razor alone to find planner-reward pairs that fit a particular human’s behavior, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are building an AI to maximize the reward.

Methinks the argument proves too much:

My first point is that A&M’s argument probably works just as well for other uses of Occam’s Razor. In particular it works just as well for the canonical use: finding the Laws and Initial Conditions that describe our universe!

Take a sequence of events we are trying to predict/represent with the lawlike-universe formalism, which posits C (the initial conditions) and then L() the dynamical laws, a function that takes initial conditions and extrapolates everything else from them. L(C) = E, the sequence of events/conditions/world-states we are trying to predict/represent.

Step 1: In any reasonable language, for any plausible sequence of events, we can construct "degenerate" initial condition + laws pairs that are almost as simple as the simplest pair.

  • Example: The initial conditions are an empty void, but the laws say "And then the sequence of events that happens is E"
  • Example: The initial conditions are simply E, and L() doesn’t do anything.

It’s easy to see that these examples, being constructed from E, are at most slightly more complex than the simplest possible pair, since they could use the simplest pair to generate E.

Step 2: The "intended" initial condition+law pair is likely to be significantly more complex than the simplest pair.

  • Argument: It's really complicated.
  • Argument: The pair contains more information than the sequence of events, so it should be more complicated.
  • Argument: Physicists have been trying for years and haven't succeeded yet.

Conclusion: If we use Occam’s Razor alone to find law-condition pairs that fit all the world’s events, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are e.g. building an AI to do science for us and answer counterfactual questions like “If we had posted the nuclear launch codes on the Internet, would any nukes have been launched?”

This conclusion may actually be true, but it’s a pretty controversial claim and I predict most philosophers of science wouldn’t be impressed by this argument for it--even the ones who agree with the conclusion.

Objecting to the three arguments for Step 2

Consider the following hypothesis, which is basically equivalent to the claim A&M are trying to disprove:

Occam Sufficiency Hypothesis: The “Intended” pair happens to be the simplest way to generate the policy.

Notice that everything in Step 1 is consistent with this hypothesis. The first degenerate pairs are constructed from the policy, so they are more complicated than the simplest way to generate it, so if that way is via the intended pair, they are more complicated (albeit only slightly) than the intended pair.

Next, notice that the three arguments in support of Step 2 don’t really hurt this hypothesis:

Re: first argument: The intended pair can be both very complex and the simplest way to generate the policy; no contradiction there. Indeed that’s not even surprising: since the policy is generated by a massive messy neural net in an extremely diverse environment, we should expect it to be complex. What matters for our purposes is not how complex the intended pair is, but rather how complex it is relative to the simplest possible way to generate the policy. A&M need to argue that the simplest possible way to generate the policy is simpler than the intended pair; arguing that the intended pair is complex is at best only half the argument.

Compare to the case of physics: Sure, the laws of physics are complex. They probably take at least a page of code to write up. And that’s aspirational; we haven’t even got to that point yet. But that doesn’t mean Occam’s Razor is insufficient to find the laws of physics.

Re: second argument: The inference from “This pair contains more information than the policy” to “this pair is more complex than the policy” is fallacious. Of course the intended pair contains more information than the policy! All ways of generating the policy contain more information than it. This is because there are many ways (e.g. planner-reward pairs) to get any given policy, and thus specifying any particular way is giving you strictly more information than simply specifying the policy.

Compare to the case of physics: Even once we’ve been given the complete history of the world (or a complete history of some arbitrarily large set of experiment-events) there will still be additional things left to specify about what the laws and initial conditions truly are. Do the laws contain a double negation in them, for example? Do they have some weird clause that creates infinite energy but only when a certain extremely rare interaction occurs that never in fact occurs? What language are the laws written in, anyway? And what about the initial conditions? Lots of things left to specify that aren’t determined by the complete history of the world. Yet this does not mean that the Laws + Initial Conditions are more complex than the complete history of the world, and it certainly doesn’t mean we’ll be led astray if we believe in the Laws+Conditions pair that is simplest.

Re: third argument: Yes, people have been trying to find planner-reward pairs to explain human behavior for many years, and yes, no one has managed to build a simple algorithm to do it yet. Instead we rely on all sorts of implicit and intuitive heuristics, and we still don’t succeed fully. But all of this can be said about Physics too. It’s not like physicists are literally following the Occam’s Razor algorithm--iterating through all possible Law+Condition pairs in order from simplest to most complex and checking each one to see if it outputs a universe consistent with all our observations. And moreover, physicists haven’t succeeded fully either. Nevertheless, many of us are still confident that Occam’s Razor is in principle sufficient: If we were to follow the algorithm exactly, with enough data and compute, we would eventually settle on a Law+Condition pair that accurately describes reality, and it would be the true pair. Again, maybe we are wrong about that, but the arguments A&M have given so far aren’t convincing.

Conclusion

Perhaps Occam’s Razor is insufficient after all. (Indeed I suspect as much, for reasons I’ll sketch in the appendix) But as far as I can tell, A&M’s arguments are at best very weak evidence against the sufficiency of Occam’s Razor for inferring human preferences, and moreover they work pretty much just as well against the canonical use of Occam’s Razor too.

This is a bold claim, so I won’t be surprised if it turns out I was confused. I look forward to hearing people’s feedback. Thanks in advance! And thanks especially to Armstrong and Mindermann if they take the time to reply.


Many thanks to Ramana Kumar for hearing me out about this a while ago when we read the paper together.


Appendix: So, is Occam’s Razor sufficient or not?

--A priori, we should expect something more like a speed prior to be appropriate for identifying the mechanisms of a finite mind, rather than a pure complexity prior.

--Sure enough, we can think of scenarios in which e.g. a deterministic universe with somewhat simple laws develops consequentialists who run massive simulations including of our universe and then write down Daniel’s policy in flaming letters somewhere, such that the algorithm “Run this deterministic universe until you find big flaming letters, then read out that policy” becomes a very simple way to generate Daniel’s policy. (This is basically just the “Universal Prior is Malign” idea applied in a new way.)

--So yeah, pure complexity prior is probably not good. But maybe a speed prior would work, or something like it. Or maybe not. I don’t know.

--One case that seems useful to me: Suppose we are considering two explanations of someone’s behavior: (A) They desire the well-being of the poor, but [insert epicycles here to explain why they aren’t donating much, are donating conspicuously, are donating ineffectively] and (B) They desire their peers (and their selves) to believe that they desire the well-being of the poor. Thanks to the epicycles in (A), both theories fit the data equally well. But theory B is much more simple. Do we conclude that this person really does desire the well-being of the poor, or not? If we think that even though (A) is more complex it is also more accurate, then yeah it seems like Occam’s Razor is insufficient to infer human preferences. But if we instead think “Yeah, this person just really doesn’t care, and the proof is how much simpler B is than A” then it seems we really are using something like Occam’s Razor to infer human preferences. Of course, this is just one case, so the only way it could prove anything is as a counterexample. To me it doesn’t seem like a counterexample to Occam’s sufficiency, but I could perhaps be convinced to change my mind about that.

--Also, I'm pretty sure that once we have better theories of the brain and mind, we’ll have new concepts and theoretical posits to explain human behavior. (e.g. something something Karl Friston something something free energy?) Thus, the simplest generator of a given human’s behavior will probably not divide automatically into a planner and a reward; it’ll probably have many components and there will be debates about which components the AI should be faithful to (dub these components the reward) and which components the AI should seek to surpass (dub these components the planner.) These debates may be intractable, turning on subjective and/or philosophical considerations. So this is another sense in which I think yeah, definitely Occam’s Razor isn’t sufficient--for we will also need to have a philosophical debate about what rationality is.

49

Ω 19