Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here is a theory that is just as good as general relativity:

AGR (Angel General Relativity): Tiny invisible angels push around all the particles in the universe in a way that is indistinguishable from the equations of general relativity.

This theory is falsifiable, just as general relativity (GR) itself is. Indeed, since it gives exactly the same predictions as GR, a Bayesian will never find evidence that prefers it over Einstein's theory.

Therefore, I obviously deserve a Nobel prize for suggesting it.

Enter Occam's shaving equipment

Obviously the angel theory is not a revolutionary new theory. Partly that's because I've done none of the hard work: I've merely constructed a pointer to Einstein's theory. But, philosophically, the main justification is Occam's razor: the simplest theory is to be preferred.

From a Bayesian perspective, you could see violations of Occam's razor as cheating, using your posterior as priors. There is a whole class of "angels are pushing particles" theories, and AGR is just a small portion of that space. By considering AGR and GR on equal footing, we're privileging AGR above what it deserves[1].

In physics, Occam's razor doesn't matter for strictly identical theories

Occam's razor has two roles: the first is to distinguish between strictly identical theories; the second is to distinguish between theories that give the same prediction on the data so far, but may differ in the future.

Here, we focus on the first case: GR and AGR are strictly identical; no data will ever distinguish them. In essence, the theory that one is right and the other wrong is not falsifiable.

What that means is that, though AGR may be a priori less likely than GR, the relative probability between the two theories will never change: they make the same predictions. And also because they make the same predictions, that relative probability is irrelevant in practice: we could use AGR just as well as GR for predictions.
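This can be seen directly in Bayes' theorem: if two hypotheses assign the same likelihood to every possible observation, each update multiplies both posteriors by the same factor, so their ratio stays pinned at whatever the priors set it to. A minimal sketch (the prior and likelihood numbers are invented for illustration):

```python
# Two theories with different priors but identical likelihoods
# for every observation (i.e. prediction-identical theories).
prior = {"GR": 0.99, "AGR": 0.01}

def update(posterior, likelihood):
    """One Bayesian update: multiply by the likelihood, then renormalise."""
    unnorm = {h: p * likelihood[h] for h, p in posterior.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

posterior = dict(prior)
for _ in range(1000):  # observe 1000 data points
    # Both theories assign the same probability to each observation.
    posterior = update(posterior, {"GR": 0.7, "AGR": 0.7})

ratio_before = prior["GR"] / prior["AGR"]
ratio_after = posterior["GR"] / posterior["AGR"]
# The ratio never moves: it is 99.0 both before and after the updates.
```

However much data comes in, the only thing separating the two theories is the prior, which is exactly why the relative probability is irrelevant for prediction.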

How preferences differ

Now let's turn to preferences, as described in our paper "Occam's razor is insufficient to infer the preferences of irrational agents".

Here, two sets of preferences are "prediction-identical", in the sense of the physics theories above, if they predict the same behaviour for the agent. As before, this means that two prediction-identical preference-based explanations of the same behaviour will never change their relative probabilities.

Worse than that, Occam's razor doesn't solve the issue. The simplest explanation of, say, human behaviour is that humans are fully rational at all times. This isn't the explanation that we want.

Even worse than that, prediction-identical preferences will lead to vastly different consequences if we program an AI to maximise them.

So, in summary:

  1. Prediction-identical preferences never change relative probability.
  2. The simplest prediction-identical preferences are known to be wrong for humans.
  3. It could be very important for the future to get the right preference for humans.
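One way to see how points 1 and 3 combine: the same observed behaviour can be decomposed as a rational planner with one reward, or an anti-rational planner with the opposite reward. Both decompositions predict identical behaviour, but an AI maximising the inferred reward acts very differently under each. A toy sketch (the two-action coffee setting is invented for illustration):

```python
# Observed behaviour: the human always picks "coffee" over "no_coffee".
actions = ["coffee", "no_coffee"]
observed_action = "coffee"

def policy(reward, planner):
    """A rational planner maximises its reward; an anti-rational one minimises it."""
    if planner == "rational":
        return max(actions, key=lambda a: reward[a])
    else:  # "anti-rational"
        return min(actions, key=lambda a: reward[a])

# Decomposition A: the human is rational and likes coffee.
reward_a = {"coffee": 1.0, "no_coffee": 0.0}
# Decomposition B: the human is anti-rational and dislikes coffee.
reward_b = {"coffee": -1.0, "no_coffee": 0.0}

# Both decompositions are prediction-identical: they match the behaviour...
assert policy(reward_a, "rational") == observed_action
assert policy(reward_b, "anti-rational") == observed_action

# ...but an AI maximising the inferred reward acts oppositely in each case.
best_for_a = max(actions, key=lambda a: reward_a[a])  # "coffee"
best_for_b = max(actions, key=lambda a: reward_b[a])  # "no_coffee"
```

No amount of behavioural data distinguishes A from B, yet the AI's optimal action flips between them.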

  1. GR would make up a larger portion of the class of "geometric theories of space-time" than AGR makes up of the class of "angels are pushing particles" theories, and would be more likely anyway, especially after updating on the non-observation of angels. ↩︎

Comments
> And also because they make the same predictions, that relative probability is irrelevant in practice: we could use AGR just as well as GR for predictions.

There is a subtle sense in which the difference between AGR and GR is relevant. While the difference doesn't change the predictions, it may change the utility function. An agent that cares about angels (if they exist) might do different things if it believes itself to be in the AGR world rather than the GR world. As the theories make identical predictions, the agent's belief only depends on its priors (and any irrationality), not on which world it is in. Nonetheless, this means that the agent will pay to avoid having its priors modified, even though the modification doesn't change the agent's predictions in the slightest.

Can you give an example of two sets of preferences which are prediction-identical, but which will lead to "vastly different consequences if [you] program an AI to maximi[z]e them"?

The most basic examples compare derived preferences that assume the human is always rational (i.e. every action they take, no matter how mistaken it may appear, is in the service of some complicated plan for how the universe's history should go; my friend getting drunk and knocking over his friend's dresser was all planned and totally in accordance with his preferences) against derived preferences that assume the human is irrational in some way (e.g. maybe they would prefer not to drink so much coffee, but can't wake up without it, so the action that best fulfils their preferences is to help them drink less coffee).

But more intuitive examples might involve comparison between two different sorts of human irrationality.

For example, in the case of coffee, the AI is supposed to learn that the human has some pattern of thoughts and inclinations that mean it actually doesn't want coffee, and its actions of drinking coffee are due to some sort of limitation or mistake.

But consider a different mistake: not doing heroin. After all, upon trying heroin, the human would be happy and would exhibit behavior consistent with wanting heroin. So we might imagine an AI that infers that humans want heroin, and that their current actions of not trying heroin are due to some sort of mistake.

Both theories can be prediction-identical - the two different sets of "real preferences" just need to be filtered through two different models of human irrationality. Depending on what you classify as "irrational," this degree of freedom translates into a change in what you consider "the real preferences."