**Prerequisite reading:** Cognitive Neuroscience, Arrow's Impossibility Theorem, and Coherent Extrapolated Volition.

**Abstract:** Arrow's impossibility theorem poses a challenge to the viability of coherent extrapolated volition (CEV) as a model for safe-AI architecture: per the theorem, no algorithm for aggregating ordinal preferences over three or more alternatives can satisfy Arrow's four fairness criteria while always producing a transitive preference ordering. One approach to exempting CEV from these consequences is to claim that human preferences are cardinal rather than ordinal, and that Arrow's theorem therefore does not apply. This approach is shown to fail, and other options are briefly discussed.

A problem arises when examining CEV from the perspective of welfare economics: according to Arrow's impossibility theorem, no algorithm for aggregating preferences is guaranteed to meet four common-sense fairness criteria while also producing a transitive result. Luke has previously discussed this challenge. (See the post linked above.)
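As a small illustration of the kind of intransitivity at stake, the sketch below applies pairwise majority rule to a hypothetical three-voter profile (the classic Condorcet cycle). The ballots and function names are my own assumptions for illustration; this is not a proof of Arrow's theorem, only one instance of a fair-looking rule producing a cyclic group preference.

```python
# Hypothetical three-voter profile; each ballot ranks candidates best to worst.
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x, y, ballots):
    """Return True if a strict majority of ballots rank x above y."""
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return 2 * wins > len(ballots)

def majority_relation(ballots):
    """Collect every ordered pair (x, y) where the group prefers x to y."""
    candidates = sorted(ballots[0])
    return {
        (x, y)
        for x in candidates
        for y in candidates
        if x != y and majority_prefers(x, y, ballots)
    }

def is_transitive(relation):
    """Check that x > y and y > z always imply x > z."""
    return all(
        (x, z) in relation
        for (x, y) in relation
        for (y2, z) in relation
        if y == y2 and x != z
    )

relation = majority_relation(ballots)
print(sorted(relation))         # group prefers A to B, B to C, and C to A
print(is_transitive(relation))  # False
```

Every individual ballot here is transitive, yet the aggregate is not.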

Arrow's impossibility theorem assumes that human preferences are ordinal, but (as Luke pointed out) recent neuroscientific findings suggest that human preferences are cardinally encoded. This would seem to imply that human preferences - and consequently CEV - are not bound by the consequences of the theorem.

However, Arrow's impossibility theorem extends to cardinal utilities with the addition of a continuity axiom. This result - termed Samuelson's conjecture - was proven by Ehud Kalai and David Schmeidler in their 1977 paper "Aggregation Procedure for Cardinal Preferences." If an AI attempts to model human preferences using a utility theory that relies on the continuity axiom, then the consequences of Arrow's theorem will still apply. This includes, for example, an AI that relies on the von Neumann-Morgenstern utility theorem.

The proof of Samuelson's conjecture narrows the space of viable CEV aggregation procedures. In order to escape the consequences of Arrow's impossibility theorem, a CEV algorithm must accurately model human preferences without using a continuity axiom. It may be the case that we are living in a second-best world where such models are impossible. In that case, we would have to make a trade-off between employing a fair aggregation procedure and producing a transitive result.

Supposing this is the case, what kind of trade-off would be optimal? I am hesitant about weakening the transitivity criterion because an agent with non-transitive preferences is vulnerable to Dutch books. This scenario poses a clear existential risk. On the other hand, weakening the independence of irrelevant alternatives criterion may be feasible. My cursory reading of the literature suggests that this is a popular alternative among welfare economists, but there are other choices.
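To make the Dutch-book worry concrete, here is a minimal money-pump sketch. It assumes a hypothetical agent that will pay a fixed fee to swap its current item for any item it strictly prefers; the preference sets, fee, and function name are illustrative assumptions, not part of any CEV proposal.

```python
# Each pair (x, y) means "x is strictly preferred to y".
cyclic_prefs = {("A", "B"), ("B", "C"), ("C", "A")}      # non-transitive
transitive_prefs = {("A", "B"), ("B", "C"), ("A", "C")}  # transitive

def run_money_pump(prefers, start_item, fee, rounds):
    """Offer the agent up to `rounds` paid swaps to an item it prefers.

    Returns the final item held and the agent's net wealth change.
    """
    holding = start_item
    wealth = 0
    for _ in range(rounds):
        # Find some item the agent strictly prefers to what it holds.
        better = next((x for (x, y) in sorted(prefers) if y == holding), None)
        if better is None:
            break  # nothing preferred: the agent stops trading
        holding = better
        wealth -= fee
    return holding, wealth

print(run_money_pump(cyclic_prefs, "C", fee=1, rounds=3))      # ('C', -3)
print(run_money_pump(transitive_prefs, "C", fee=1, rounds=3))  # ('A', -1)
```

With cyclic preferences the agent ends up holding its original item while having paid three fees, and the loop can be repeated indefinitely. With transitive preferences it pays once and then declines all further trades.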

Going forward, Arrow's impossibility theorem may serve as one of the strongest objections against CEV. Further consideration of how to reconcile CEV with the theorem is warranted.

Arrow's Theorem doesn't say anything about strategic voting. The only reasonable non-strategic voting system I know of is random ballot (pick a random voter; they decide who wins). I'm currently trying to figure out a voting system that is based on finding the Nash equilibrium (which may be mixed) of approval voting, and this system might also be strategy-free.
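For reference, random ballot as described above can be sketched in a few lines. The ballots and helper names are hypothetical; the point is only that the outcome depends on nothing but the drawn voter's top choice, so each candidate's chance of winning equals their share of first-place votes.

```python
import random
from collections import Counter
from fractions import Fraction

def random_ballot_winner(ballots, rng):
    """Pick one ballot uniformly at random; its top choice wins."""
    chosen = rng.choice(ballots)
    return chosen[0]

def win_probabilities(ballots):
    """Each candidate's winning chance is their share of first-place votes."""
    firsts = Counter(b[0] for b in ballots)
    return {c: Fraction(n, len(ballots)) for c, n in firsts.items()}

# Hypothetical ballots, each ranking candidates best to worst.
ballots = [
    ("A", "B", "C"),
    ("A", "C", "B"),
    ("B", "C", "A"),
    ("C", "B", "A"),
]

probs = win_probabilities(ballots)
winner = random_ballot_winner(ballots, random.Random(0))
print(probs)
print(winner)
```

Because a voter's ballot matters only when it is the one drawn, and in that case its top choice wins outright, listing anything other than one's true favorite first cannot help.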

When I said linear combination of utility functions, I meant that you fix the scaling factors initially and don't change them. You could make all of them 1, for example. Your voting system (described in the last paragraph) is a combination of range voting and IRV. If everyone range votes so that their favorite gets 1 and everyone else gets -1, then it's identical to IRV, and shares the same problems such as non-monotonicity. I suspect that you will also get non-monotonicity when votes aren't "favorite gets 1 and everyone else gets -1".

EDIT: I should clarify: it's not 1 for your favorite and -1 for everyone else. It's 1 for your favorite and close to -1 for everyone else, such that when your favorite is eliminated, it's 1 for your next favorite and close to -1 for everyone else after rescaling.
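The non-monotonicity of IRV mentioned above can be checked on a standard textbook-style profile. The sketch below implements plain IRV only (not the range/IRV hybrid under discussion), ignores ties, and uses made-up vote counts.

```python
from collections import Counter

def irv_winner(profile):
    """Instant-runoff winner.

    profile: list of (count, ranking) pairs, ranking ordered best to worst.
    Ties are not handled; the example profiles below avoid them.
    """
    remaining = {c for _, ranking in profile for c in ranking}
    while True:
        tally = Counter({c: 0 for c in remaining})
        for count, ranking in profile:
            # Each ballot counts for its highest-ranked surviving candidate.
            top = next(c for c in ranking if c in remaining)
            tally[top] += count
        total = sum(tally.values())
        leader, leader_votes = max(tally.items(), key=lambda kv: kv[1])
        if 2 * leader_votes > total:
            return leader
        loser = min(tally.items(), key=lambda kv: kv[1])[0]
        remaining.remove(loser)

before = [
    (39, ("A", "B", "C")),
    (35, ("B", "C", "A")),
    (26, ("C", "A", "B")),
]
# Ten of the B > C > A voters move A from last place to first.
after = [
    (49, ("A", "B", "C")),
    (25, ("B", "C", "A")),
    (26, ("C", "A", "B")),
]

print(irv_winner(before))  # A
print(irv_winner(after))   # C
```

In the first profile C is eliminated and A beats B 65 to 35. After ten voters rank A higher, B is eliminated instead and C beats A 51 to 49, so extra support for A costs A the election.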