Strong implication of preference uncertainty

by Stuart_Armstrong, 12th Aug 2020



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here is a theory that is just as good as general relativity:

AGR (Angel General Relativity): Tiny invisible angels push around all the particles in the universe in a way that is indistinguishable from the equations of general relativity.

This theory is falsifiable, just as general relativity (GR) itself is. Indeed, since it gives exactly the same predictions as GR, a Bayesian will never find evidence that favours one of the two theories over the other.

Therefore, I obviously deserve a Nobel prize for suggesting it.

Enter Occam's shaving equipment

Obviously the angel theory is not a revolutionary new theory. Partially because I've not done any of the hard work, just constructed a pointer to Einstein's theory. But, philosophically, the main justification is Occam's razor - the simplest theory is to be preferred.

From a Bayesian perspective, you could see violations of Occam's razor as cheating, using your posterior as priors. There is a whole class of "angels are pushing particles" theories, and AGR is just a small portion of that space. By considering AGR and GR on equal footing, we're privileging AGR above what it deserves[1].

In physics, Occam's razor doesn't matter for strictly identical theories

Occam's razor has two roles: the first is to distinguish between strictly identical theories; the second is to distinguish between theories that give the same prediction on the data so far, but may differ in the future.

Here, we focus on the first case: GR and AGR are strictly identical; no data will ever distinguish them. In essence, the theory that one is right and the other wrong is not falsifiable.

What that means is that, though AGR may be a priori less likely than GR, the relative probability between the two theories will never change: they make the same predictions. And also because they make the same predictions, that relative probability is irrelevant in practice: we could use AGR just as well as GR for predictions.
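To make the point about relative probabilities concrete, here is a minimal sketch of Bayesian updating on odds. All the numbers are made up for illustration (the prior, the likelihoods, and the "rival" theory are assumptions, not anything from physics); the only thing that matters is that AGR assigns the same likelihood as GR to every observation.

```python
from fractions import Fraction


def posterior_odds(prior_odds, likelihoods_a, likelihoods_b):
    """Odds of theory A versus theory B after a sequence of observations.

    Each observation multiplies the odds by the likelihood ratio
    P(observation | A) / P(observation | B).
    """
    odds = prior_odds
    for p_a, p_b in zip(likelihoods_a, likelihoods_b):
        odds *= Fraction(p_a) / Fraction(p_b)
    return odds


# Hypothetical prior: AGR starts out a million times less likely than GR.
prior_agr_vs_gr = Fraction(1, 1_000_000)

# Hypothetical likelihoods that GR assigns to three observations.
gr_likelihoods = [Fraction(9, 10), Fraction(1, 2), Fraction(99, 100)]

# AGR is prediction-identical to GR, so it assigns the same likelihoods.
agr_likelihoods = list(gr_likelihoods)

# A hypothetical rival theory that does make different predictions.
rival_likelihoods = [Fraction(1, 10), Fraction(1, 2), Fraction(1, 100)]

agr_odds = posterior_odds(prior_agr_vs_gr, agr_likelihoods, gr_likelihoods)
rival_odds = posterior_odds(Fraction(1), rival_likelihoods, gr_likelihoods)

print(agr_odds == prior_agr_vs_gr)  # every likelihood ratio is 1, so the odds never move
print(rival_odds)                   # the rival's odds against GR do move
```

Every likelihood ratio between AGR and GR is exactly 1, so the odds stay at whatever the prior said, however many observations come in. Only a theory that predicts something different, like the rival above, can gain or lose ground.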

How preferences differ

Now let's turn to preferences, as described in our paper "Occam's razor is insufficient to infer the preferences of irrational agents".

Here two sets of preferences are "prediction-identical", in the sense of the physics theories above, if they predict the same behaviour for the agent. That means that the relative probabilities of two different preference-based explanations for the same behaviour will never change.

Worse than that, Occam's razor doesn't solve the issue. The simplest explanation of, say, human behaviour is that humans are fully rational at all times. This isn't the explanation that we want.

Even worse than that, prediction-identical preferences will lead to vastly different consequences if we program an AI to maximise them.
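A toy sketch of how this can happen, under assumptions that are mine rather than spelled out in this post: behaviour is modelled as a planner applied to a reward function, and the action names, the reward values, and the two planners are all hypothetical. A fully rational agent with reward R and a fully anti-rational agent with reward -R behave identically, yet an AI told to maximise the inferred reward does opposite things in the two cases.

```python
# Hypothetical action set for a toy agent.
ACTIONS = ["help", "ignore", "harm"]


def reward(action):
    """A made-up reward function R over the toy actions."""
    return {"help": 1.0, "ignore": 0.0, "harm": -1.0}[action]


def negated_reward(action):
    """The opposite preferences, -R."""
    return -reward(action)


def rational_planner(r):
    """Picks the action that maximises the given reward."""
    return max(ACTIONS, key=r)


def anti_rational_planner(r):
    """Picks the action that minimises the given reward."""
    return min(ACTIONS, key=r)


# Two explanations of the same agent: (rational, R) and (anti-rational, -R).
behaviour_a = rational_planner(reward)
behaviour_b = anti_rational_planner(negated_reward)

# An AI that maximises whichever reward we attributed to the agent.
ai_choice_a = max(ACTIONS, key=reward)
ai_choice_b = max(ACTIONS, key=negated_reward)

print(behaviour_a, behaviour_b)  # same observed behaviour under both explanations
print(ai_choice_a, ai_choice_b)  # different outcomes once an AI maximises each reward
```

No observation of the agent can separate the two explanations, since both predict the same action every time. The difference only shows up when something other than the agent's own planner starts optimising the attributed reward.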

So, in summary:

  1. Prediction-identical preferences never change relative probability.
  2. The simplest prediction-identical preferences are known to be wrong for humans.
  3. It could be very important for the future to get the right preference for humans.

  1. GR would make up a larger portion of the class of "geometric theories of space-time" than AGR makes up of the class of "angels are pushing particles" theories, and the geometric class would be more likely than the angel class anyway, especially after updating on the non-observation of angels.


