A continuation of Only humans can have human values. Revised late in the evening on April 30.

Summary: I will present a model of value systems, and show that under it, the "averaged value system" found by averaging the values of all the agents,

- (RESTATED:) has more internal inconsistency than you would on average get by picking one agent's values at random
- is a less stable value system than you would get by picking one agent's values at random

ADDED: The reason for doing this is that numerous people have suggested implementing CEV by averaging different value systems together. My intuition is that value systems are not random; they are optimized in some way. This optimization is undone if you mix together different value systems simply by averaging them. I demonstrate this in the case where we suppose they are optimized to minimize internal conflict.

To someone working with the assumptions needed for CEV, the second bullet point is probably more important. Stability is central to CEV, while internal inconsistency may be a mere computational inconvenience.

## ADDED: Inconsistencies in value systems

We find consistent correlations in value systems. The US has two political parties, Republican and Democratic, and many people who find one or the other obviously, intuitively correct. Most countries have a conservative/liberal dimension that many values line up along. It's hard to know whether this is because people try to make their values consistent; or because game theory tends to produce two parties; or even because parties form along the first principal component of the scatterplot of the values of members of society, so that some essentially artifactual vector is guaranteed to be found to be the main dimension along which opinions vary. However, it's at least suggestive. You seldom find a country where the conservatives favor peace and the liberals favor war, or where the liberals value religious rules more than the conservatives. I seldom find vegetarians who are against welfare, or loggers or oilmen who are animal-rights activists.

If it's a general principle that some process causes people to form value systems with fewer inconsistencies than they would have by gathering different pieces from different value systems at random, it's not a great leap of faith to say that value systems with fewer inconsistencies are better in some way than ones with more. We can at the very least say that a cobbled-together value system lacks this property of naturally-occurring human value systems, and therefore is not itself a good example of a human value system.

You might study the space of possible environments in which an agent must act, and ask where in that space values are in conflict, and what the shape of the decision boundary surfaces between actions are in that space. My intuition is that value systems with many internal conflicts have complex boundary surfaces in that space.

More complex decision boundaries enable an agent to have a decision function that makes finer discriminations, and therefore can make more use of the information in the environment. However, overly-complex decision boundaries may be adding noise.

If we take the value systems held by a set of agents "in the wild", we can suppose their decision boundary surfaces are adapted to their environment and to their capabilities; that is, each agent does a good job of balancing the complexity of its decision surface against its computational power and the complexity of the life it faces.

If you construct a value system from those value systems in a way that does not use the combined information that went into constructing all of them, and you end up with a decision surface that is more complex than a typical "wild-type" value system's despite being built from the same amount of underlying information, you could conclude that this decision surface is overly complex, and that the extra complexity is noise/overfitting.

I have other reasons I think that the degree of inconsistency within a value system could be a metric used to evaluate it. The comments below explore some different aspects of this. The topic needs at least a post of its own. The idea that higher internal consistency is always better is too simple. However, if we have a population of wild-type value systems that we think are adapted by some self-organizing process, then if we combine them in a way that produces an artificial value system that is consistently biased in the same direction - either lower or higher internal consistency than wild-type - I think that is cause for concern.

(I don't know if there are any results showing that an associative network with a higher IC, as defined below, has a more complex decision surface. I would expect this to be the case. A Hopfield network with no internal conflict would have a plane for its decision surface, and be able to store only 2 patterns.)

## A model of value systems

Model any value system as a fully-connected network, where the nodes are values, and the connection from one value to another gives the correlation (from -1 to 1) between the recommendations for behavior given by the two values. Each node is assigned a real number from 0 to 1 indicating how strongly the agent holds the value associated with that node. Connection weights are fixed by the environment; node values vary according to the value system.

The internal conflict (IC) in a value system is the negative of the sum, over all pairs of nodes, of the product of the node values and the connection weight between them. This is an energy measure that we want to minimize. Averaging value systems together is a reasonable thing to do, for an expected-utility-maximizer, *only* if the average of a set of value systems is expected to give a lower IC than the average IC of all of the value systems. (Utility = - (internal conflict).)
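A minimal sketch of this model in code; the weights and node values below are hypothetical, chosen only to illustrate the IC computation:

```python
import numpy as np

def internal_conflict(values, weights):
    """IC = -(sum over all pairs {i, j} of v_i * w_ij * v_j).

    values:  length-N array of node values in [0, 1]
    weights: symmetric N x N correlation matrix with entries in [-1, 1]
             and a zero diagonal (a node has no conflict with itself)
    """
    # values @ weights @ values counts each pair twice, so halve it
    return -0.5 * values @ weights @ values

# A toy 3-node value system (hypothetical numbers)
w = np.array([[ 0.0, 0.8, -0.2],
              [ 0.8, 0.0,  0.5],
              [-0.2, 0.5,  0.0]])
v = np.array([1.0, 1.0, 1.0])
print(internal_conflict(v, w))  # approximately -1.1
```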

## IC(averaged values) > average(IC) if agents are better than random

Let there be *N* nodes. Let *a* be an agent from the set *A* of all agents. Let *v_{ai}* be the value agent *a* places on node *i*. Let *w_{ij}* be the weight between nodes *i* and *j*. Let the "averaged agent" *b* mean a constructed agent *b* (not in A) for which v_{bi} = average over all a of v_{ai}. Write "the sum over all i and j of S" as sum_{i, j}(S).

Average IC = ICa = - sum_{i, j} [w_{ij} x sum_a (v_{ai} x v_{aj})] / |A|

Expected IC from average agent *b* = ICb = - sum_{i, j} [w_{ij} x (sum_a(v_{ai}) / |A|) x (sum_a(v_{aj}) / |A|)]

Now I will introduce the concept of a "random agent": an agent *r* constructed by choosing some other agent *a* at random for every node *i*, and setting v_{ri} = v_{ai}. Hopefully you will agree that a random agent will have, on average, a higher IC than one of our original agents, because existing agents are at least a little bit optimized, by evolution or by introspection.

(You could argue that values are things that an agent never, by definition, willingly changes, or is even capable of changing. Rather than get into a tricky philosophical argument, I will point out that, if that is so, then values have little to do with what we call "values" in English; and what follows applies more certainly to something more like the latter, and to what we think of when people say "values". But if you also claim that evolution does not reduce value conflicts, you must have a simple, statically-coded priority-value model of cognition, eg Brooks' subsumption architecture; and you must also believe that the landscape of optimal action as a function of environment is everywhere discontinuous, or else you would expect agents in which a slight change in stimuli results in a different value achieving dominance to suffer a penalty for taking uncorrelated actions in situations that differ only slightly.)

We find the average IC of a random agent, which we agreed (I hope) is higher than the average IC of a real agent, by averaging the contribution from each pair of nodes {i, j} over all possible choices of agents used to set *v_{ri}* and *v_{rj}*. The average IC of a random agent is then

ICr = Average IC of a random agent = - sum_{i, j} [w_{ij} x sum_a(v_{ai}) x sum_a(v_{aj})] / (|A| x |A|)

We see that ICr = ICb. In other words, using this model, constructing a value system by averaging together other value systems gives you the same result that you would get, on average, by picking one agent's value for one node, and another agent's value for another node, and so on, at random. If we assume that the value system held by any real agent is, on average, better than such a randomly-thrown-together value system, this means that picking the value system of any real agent will give a lower expected IC than picking the value system of the averaged agent.
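The equality ICr = ICb can be checked numerically. The sketch below (all parameters hypothetical) locally optimizes a population of agents by greedy coordinate descent, then compares the averaged agent's IC with a Monte Carlo estimate of the random agent's IC:

```python
import numpy as np

rng = np.random.default_rng(0)
N, num_agents = 6, 100

# Hypothetical environment: symmetric weight matrix with zero diagonal
w = rng.uniform(-1, 1, (N, N))
w = np.triu(w, 1) + np.triu(w, 1).T

def ic(v):
    # IC = -(sum over pairs {i,j} of v_i * w_ij * v_j); halve the double-count
    return -0.5 * v @ w @ v

def locally_optimize(v, sweeps=10):
    # IC is linear in each v[i] with the others held fixed, so greedy
    # coordinate descent pushes each node value to 0 or 1
    v = v.copy()
    for _ in range(sweeps):
        for i in range(N):
            v[i] = 1.0 if w[i] @ v > 0 else 0.0
    return v

agents = np.array([locally_optimize(rng.uniform(0, 1, N))
                   for _ in range(num_agents)])

avg_ic = np.mean([ic(v) for v in agents])   # ICa: average(IC)
ic_avg = ic(agents.mean(axis=0))            # ICb: IC of the averaged agent

# Monte Carlo estimate of ICr: pick a random agent independently per node
ic_rand = np.mean([ic(agents[rng.integers(0, num_agents, N), np.arange(N)])
                   for _ in range(20000)])

print(avg_ic, ic_avg, ic_rand)
```

In runs of this sketch, ICb and ICr coincide up to sampling error, which is the equality derived above; the locally-optimized agents' average IC typically comes out below both.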

I didn't design this model to get that result; I designed just one model, which seemed reasonable to me, and found the proof afterward.

## Value systems are stable; an averaged value system is not

Suppose that agents have already evolved to have value systems that are consistent; and that agents often actively work to reduce conflicts in their value systems, by changing values that their other values disagree with. (But see comments below on deep values vs. surface values. A separate post justifying this supposition, and discussing whether humans have top-level goals, is needed.) If changing one or two node values would reduce the IC, either evolution or the agent would probably have already done so. This means we expect that each existing value system is already a local optimum in the space of possible node values.

If a value system is not at a local optimum, it's unstable. If you give that value system to an agent, or a society, it's likely to change to something else - possibly something far from its original setting. (Also, the fact that a value system is not a local optimum is a strong indicator that it has higher-than-typical IC, because the average IC of systems that are a little ways *d* away from a local minimum is greater than the average IC of systems at a local minimum, by an amount proportional to *d*.)

Averaging value systems together is therefore a reasonable thing to do *only* if the average of a set of value systems that are all local minima is guaranteed to give a value system that is also a local minimum.

This is not the case. Consider value systems of 3 nodes, A, B, and C, with the weights AB=1, BC=1, AC=-1. Here are two locally-optimal value systems. Terms in conflict measures are written as node x connection x node:

A = 0, B = 1, C = 1: Conflict = -(0 x 1 x 1 + 1 x 1 x 1 + 1 x -1 x 0) = -1

A = 1, B = 1, C = 0: Conflict = -(1 x 1 x 1 + 1 x 1 x 0 + 0 x -1 x 1) = -1

The average of these two systems is

A = 1/2, B = 1, C = 1/2: Conflict = -(.5 x 1 x 1 + 1 x 1 x .5 + .5 x -1 x .5) = -.75

We can improve on this by setting A = 1:

A = 1, B = 1, C = 1/2: Conflict = -(1 x 1 x 1 + 1 x 1 x .5 + .5 x -1 x 1) = -1 < -.75
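The arithmetic above can be checked directly with a minimal script (same three nodes and weights, expressed as a symmetric matrix):

```python
import numpy as np

# Weights AB=1, BC=1, AC=-1 over nodes (A, B, C)
w = np.array([[ 0.0, 1.0, -1.0],
              [ 1.0, 0.0,  1.0],
              [-1.0, 1.0,  0.0]])

def conflict(v):
    # negative sum over pairs of v_i * w_ij * v_j (matrix form double-counts)
    return -0.5 * v @ w @ v

s1 = np.array([0.0, 1.0, 1.0])        # A=0, B=1, C=1
s2 = np.array([1.0, 1.0, 0.0])        # A=1, B=1, C=0
avg = (s1 + s2) / 2                   # A=1/2, B=1, C=1/2
improved = np.array([1.0, 1.0, 0.5])  # raise A from 1/2 to 1

print(conflict(s1), conflict(s2))  # -1.0 -1.0  (both local optima)
print(conflict(avg))               # -0.75     (worse than either)
print(conflict(improved))          # -1.0      (so the average was not locally optimal)
```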

It would only be by random chance that the average of value systems would be locally optimal. Averaging together existing values is thus practically guaranteed to give an unstable value system.

Let me point out again that I defined my model first, and the first example of two locally-optimal value systems that I tried out, worked.

## You can escape these proofs by not being rational

If we suppose that higher-than-wild-type IC is bad, under what circumstance is it still justified to choose the averaged agent rather than one of the original agents? It would be justified if you give an extremely high penalty for choosing a system with high IC, and do not give a correspondingly high reward for choosing a system with a wild-type IC. An example would be if you chose a value system so as to minimize the chance of having an IC greater than that given by averaging all value systems together. (In this context, I would regard that particular goal as cheating, as it is constructed to give the averaged value system a perfect score. It suffers zero-risk bias.)

Such risk-avoidant goals would, I think, be more likely to be achieved by averaging (although I haven't done the math). But they do not maximize expected utility. They suffer risk-avoidance bias, by construction.

## ... or by doing very thorough factor analysis

If, as I mentioned in Only humans can have human values, you can perform factor analysis and identify truly independent, uncorrelated latent "values", then the above arguments do not apply. You must take into account multiple hypothesis testing; using mathematics that guaranteed finding such a result would not impress me. If, for instance, you were to simply perform PCA and say that the resulting eigenvectors are your true latent values, I would respond that the first dozen eigenvectors might be meaningful, but the next thousand are overfitted to the data. You might achieve a great simplification of the problem, and greatly reduce the difference between ICa and ICb; but would still have ICa < ICb.
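As an illustration only, with PCA standing in for whatever factor analysis would actually be used, the sketch below generates value data from two hypothetical latent values plus noise; the first few components capture nearly all the variance, and components beyond the real latent dimensions are fit to noise:

```python
import numpy as np

rng = np.random.default_rng(1)
num_agents, N = 500, 20

# Hypothetical data: agents' stated values driven by 2 true latent values,
# plus measurement noise
latents = rng.normal(size=(num_agents, 2))
loadings = rng.normal(size=(2, N))
values = latents @ loadings + 0.1 * rng.normal(size=(num_agents, N))

# PCA via SVD of the centered data matrix
centered = values - values.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
variance = (s ** 2) / (s ** 2).sum()   # explained-variance ratio per component

print(variance[:2].sum())  # near 1: only the first 2 components carry signal
```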

## ADDED: Ensemble methods

In machine learning, "ensemble methods" mean methods that combine (often by averaging together) the predictions of different classifiers. It is a robust result that ensemble methods have better performance than any of the individual methods comprising them. This seems to contradict the claim that an averaged value system would be worse than any of the individual value systems comprising it.

I think there is a crucial difference, however: In ensemble methods, each of the different methods has exactly the same goals (they are trained by a process that agrees on what are good and bad decisions). An ensemble method is isomorphic to asking a large number of people who have the *same* value system to vote on a course of action.