
Would you mind tabooing the word "preference" and re-writing this post? It's not clear to me that the research cited in your "crash course" post actually supports what you seem to be claiming here.


Suppose we want to use the convergence of humanity's preferences as the utility function of a seed AI that is about to determine the future of its light cone.

We figured out how to get an AI to extract preferences from human behavior and brain activity. The AI figured out how to extrapolate those values. But my values and your values and Sarah Palin's values aren't fully converging in the simulation running the extrapolation algorithm. Our simulated beliefs are converging because on the path to reflective equilibrium our partially simulated selves have become true Bayesians and Aumann's Agreement Theorem holds. But our preferences aren't converging quite so well.

What to do? We'd like the final utility function in the FOOMed AI to adhere to some common-sense criteria:

1. Non-dictatorship: No single person's preferences should dictate what the AI does. Its utility function must take multiple people's (extrapolated) preferences into account.
2. Determinism: Given the same choices, and the same utility function, the AI should always make the same decisions.
3. Pareto efficiency: If every (extrapolated) person prefers action A to action B, the AI should prefer A to B.
4. Independence of irrelevant alternatives: If we (a group of extrapolated preference-sets) prefer A to B, and a new option C is introduced, then we should still prefer A to B regardless of what we think about C.
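To make criterion 4 concrete, here is a minimal sketch of one ordinal aggregation rule that violates it. The rule chosen (the Borda count) and the five-voter profile are illustrative assumptions, not something the setup above specifies; the helper names `borda_scores` and `restrict` are hypothetical.

```python
def borda_scores(rankings):
    """Borda count: with n options, a voter's top choice earns n-1 points, the last earns 0."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] = scores.get(option, 0) + (n - 1 - position)
    return scores


def restrict(rankings, options):
    """Drop every option not in `options`, keeping each voter's relative order intact."""
    return [[o for o in ranking if o in options] for ranking in rankings]


# Hypothetical profile: three voters rank A > B > C, two voters rank B > C > A.
voters = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2

without_c = borda_scores(restrict(voters, {"A", "B"}))
with_c = borda_scores(voters)

print(without_c)  # {'A': 3, 'B': 2}: the group prefers A to B
print(with_c)     # {'A': 6, 'B': 7, 'C': 2}: with C on the menu, the group prefers B to A
```

No voter changed their mind about A versus B, yet adding C flipped the group's ranking of the two. That flip is exactly what independence of irrelevant alternatives rules out.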

Now, Arrow's impossibility theorem says that when there are three or more options, no rule for aggregating purely ordinal preferences ("A is better than B, and that's all I can say") can satisfy all of these criteria at once. We can only get the FOOMed AI's utility function to adhere to them if the extrapolated preferences of each partially simulated agent are related to each other cardinally ("A is 2.3x better than B!").
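The classic illustration of the trouble with ordinal aggregation is a Condorcet cycle. The sketch below assumes pairwise majority voting as the aggregation rule and a made-up three-agent profile; it shows one rule failing, not a proof of the theorem. The function name `majority_prefers` is hypothetical.

```python
from itertools import permutations


def majority_prefers(rankings, x, y):
    """True if a strict majority of agents rank x above y."""
    wins = sum(ranking.index(x) < ranking.index(y) for ranking in rankings)
    return wins > len(rankings) / 2


# Hypothetical ordinal profile for three agents over three options.
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

pairwise = {
    (x, y): majority_prefers(rankings, x, y)
    for x, y in permutations("ABC", 2)
}

print(pairwise[("A", "B")])  # True: a majority prefers A to B
print(pairwise[("B", "C")])  # True: a majority prefers B to C
print(pairwise[("C", "A")])  # True: a majority prefers C to A, closing the cycle
```

Each agent's ranking is perfectly consistent, but the group "preference" is cyclic, so it cannot be represented by any utility function.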

Now, if you're an old-school ordinalist about preferences, you might be worried. Ever since Vilfredo Pareto pointed out that cardinal models of a person's preferences go far beyond our behavioral data and that as far as we can tell utility has "no natural units," some economists have tended to assume that, in our models of human preferences, preference must be represented ordinally and not cardinally.

But if you're keeping up with the latest cognitive neuroscience, you might not be quite so worried. It turns out that preferences are encoded cardinally after all, and they do have a natural unit: action potentials per second. With cardinally encoded preferences, we can develop a utility function that represents our preferences and adheres to the common-sense criteria listed above.
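As a rough sketch of what cardinal information buys us, suppose each agent's preferences arrive as numbers on a shared scale and the AI simply sums them. Both the summation rule and the numbers are assumptions made for illustration: the values loosely stand in for firing rates, and treating them as comparable across people is a further assumption that nothing above establishes. The names `total_utility` and `group_ranking` are hypothetical.

```python
def total_utility(agents):
    """Sum each option's utility across all agents."""
    totals = {}
    for utilities in agents:
        for option, value in utilities.items():
            totals[option] = totals.get(option, 0) + value
    return totals


def group_ranking(agents):
    """Options sorted from highest to lowest total utility."""
    totals = total_utility(agents)
    return sorted(totals, key=totals.get, reverse=True)


# Made-up cardinal values. Ordinally, the three agents rank the options
# A > B > C, B > C > A, and C > A > B respectively.
agents = [
    {"A": 12, "B": 8, "C": 1},
    {"A": 2, "B": 9, "C": 7},
    {"A": 6, "B": 1, "C": 9},
]

full = group_ranking(agents)
without_c = group_ranking(
    [{o: v for o, v in a.items() if o != "C"} for a in agents]
)

print(total_utility(agents))  # {'A': 20, 'B': 18, 'C': 17}
print(full)                   # ['A', 'B', 'C']
print(without_c)              # ['A', 'B']
```

Under these assumptions, the agents' ordinal rankings form a cycle under pairwise majority voting, but the summed utilities give a transitive group ordering. Because each option's total depends only on that option's own values, adding or removing C cannot change how A compares with B.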

Whaddya know? The last decade of cognitive neuroscience has produced a somewhat interesting result concerning the plausibility of CEV.