Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Huge thanks to Veronika Žolnerčíková for drawing the pictures for this post. I am also grateful to the people who provided feedback on various versions of the text.

Comment on the epistemic status of this text: The model described here is oversimplified, but I think it nevertheless captures an important dynamic that will reemerge in more rigorous descriptions of the topic.

The Shifting Landscape of Values

In this post, I would like to talk about the relations between the value systems of different people, cultures, aliens, AIs, etc. By a “value system”, I will mean the collection of all preferences (stated and revealed), the collection of decision rules, the utility function, or whatever else the given entity uses to make decisions. Rather than focusing on what exactly value systems are and how they are implemented, I would like to discuss how the different value systems view each other, what this implies for how preferences change over time, and what we can do to have better conversations on the topic.

We can view the set of all possible value-systems as some kind of an abstract space that contains points like “Alice’s value-system”, “Bob’s value-system”, “value-system of a particular alien or AI”, with “more similar” value-systems being closer together. With some eye-squinting, we can include points like “the system of values advocated for by the Catholic Church”, “the system of values that hunter-gatherers might have had”, and other hypothetical value-systems that might not necessarily correspond to any specific entity (Figure 1).


Figure 1: The space of different value-systems (including hypothetical value-systems).

Rather than being an impartial observer, I currently have some system of values and I have an opinion on the other systems. As a simple model, let’s say that I assign a real number to each value-system based on how desirable it seems to me. The interpretation is that if I had to choose which of two value-systems to adopt and stick with (via a magic button), I would pick the one with the higher corresponding number (Figure 2). As we can see in Figure 2, others might have a different opinion on what this "value landscape" looks like.


Figure 2: Desirability of different value-systems, from the point of view of different entities. On the "x-axis", we see the space of different value systems (it should be multi-dimensional rather than a line, but that would be impractical to draw). The y-axis indicates desirability. Each curve corresponds to the preferences of the value-system to which it is connected by the cross and vertical line: it denotes how much an entity with that value system would like to have different values instead.
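To make the "magic button" reading concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of the model itself: value-systems are squashed onto a single number (like the x-axis of Figure 2), the names `desirability` and `magic_button` are hypothetical, and desirability is assumed to simply fall off with distance from the evaluator's own values.

```python
# Toy assumption: a value-system is a single number on a line, mirroring the
# one-dimensional x-axis of Figure 2. Real value-systems are not 1-D.
ValueSystem = float


def desirability(evaluator: ValueSystem, candidate: ValueSystem) -> float:
    """How desirable `candidate` looks to an entity holding `evaluator`.

    Assumed shape: desirability drops with the squared distance from the
    evaluator's own position, so each entity likes its own values best.
    """
    return -((candidate - evaluator) ** 2)


def magic_button(evaluator: ValueSystem, a: ValueSystem, b: ValueSystem) -> ValueSystem:
    """Return whichever of `a` and `b` the evaluator would choose to adopt."""
    return a if desirability(evaluator, a) >= desirability(evaluator, b) else b


alice, bob = 0.0, 3.0
print(magic_button(alice, 1.0, 2.5))  # 1.0: Alice picks the option closer to her
print(magic_button(bob, 1.0, 2.5))    # 2.5: Bob, looking at the same options, disagrees
```

The same two options get ranked differently depending on who is looking, which is all that the different curves in Figure 2 are meant to express.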


It might not always be the case that the most preferable values are the ones a person currently possesses. For example, I might wish to be vegan, more empathetic, or whatnot, but lack the means to make the change.



Figure 3: We might value other values higher than our own (without automatically adopting them). The axes are as in Figure 2.

Apart from the “bias against dissimilar values”, the landscape might also be shaped by other factors (an anecdotal example being that of Catholics hating Protestants more than atheists), particularly for value-systems that work differently from those of present-day humans. As a consequence, I propose that we can imagine the space of value-systems as a “shifting landscape” that changes shape as we move through it.


Figure 4: The space of value systems as a shifting landscape (the vertical line & cross connects each value-system to its “opinion” on other systems). The axes are as in Figure 2.
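One way to write the "shifting landscape" down, as a sketch under the assumption that desirability really is a single real number: let $$V$$ stand for the space of value-systems and let

$$f : V \times V \to \mathbb{R}, \qquad f_v(w) := f(v, w).$$

Here $$f(v, w)$$ is how desirable the value-system $$w$$ looks to an entity that currently holds $$v$$. Each curve in Figure 4 is then one slice $$f_v$$, and the landscape "shifts" in the sense that $$f_v$$ changes as $$v$$ moves. The slice notation $$f_v$$ is introduced here only for convenience.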

I assume quite a few people already have a somewhat similar image in their head. However, I believe that if more people had explicit access to this mental model, we might start referring to it in conversations, and thus increase the efficiency of discussions that involve the change of values over time. In the remainder of the post, I give several examples of why this might be the case.

Applications of the Model

Some of the things we can do with the model described above are:

1) We can talk about “landscape features” in the value landscape and about different ways in which values can shift. For example, for any shift between values A and B, the first thing we can ask is whether A and B “agree” on the valence of the shift. We can also talk about “value drift”: a series of changes that each seem neutral locally but add up to a very significant change when viewed as a whole. Finally, important landscape features that we can focus on are stable points (where all local changes seem like a downgrade) and attractors (stable points to which all points from a larger area are drawn).



Figure 5: Examples of value shift (from "green" to "blue" values) where the old value system and the new one agree (left) or disagree (right) on the valence of the shift. Left: carnivore Leela would prefer to switch to being a vegan, and would retrospectively endorse the change. Right: carnivore Morty would also prefer being a vegan, but would regret the change afterwards. The axes are as in Figure 2.



Figure 6: Value drift, where many local changes to which we are indifferent compound into a major change. (Depicted is a single person across different parts of their life. However, similar dynamics could arise on long-term civilizational scales.) The axes are as in Figure 2.



Figure 7: A stable point in the value landscape, to which values from some neighbourhood converge. The axes are as in Figure 2.

2) We can be explicit about whether a given issue can be discussed in this simple model, or whether it requires further refinement. For example, if we want to reason about how abolishing slavery differs from brainwashing (or from the transition from the honor-based pre-agricultural society to the present one), we need to go beyond the model. But the model suffices for describing their similarity.

3) The model can provide a shared language for expressing the core of a potential disagreement. For example, we can ask “Are there even any stable points that seem like improvements from our current point of view, or is any change necessarily a moral dilemma?”. Two people might agree that both of these options might be true about some part of the value landscape, yet disagree about the actual shape of the landscape around us.

4) Finally, the model can help us communicate desiderata that different people might have. For example, we might wish to find a stable point that is as valuable from our current point of view as possible, or say that our current values are irrelevant and instead look for a stable point that values itself the highest, or find some compromise between these two.


Figure 8: Using the model to illustrate desiderata about the future of our values. (The particular example illustrates the difference between what seems optimal now vs what will seem optimal and be stable once we get there.)
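The landscape features and desiderata from the list above can be illustrated on a tiny made-up landscape. The following Python sketch is only a toy: the five value-systems, the numbers in `F`, and the helper names `shift_valence` and `is_stable` are all hypothetical, and "local change" is assumed to mean moving to an immediately neighbouring system on a line.

```python
# Hypothetical toy landscape: five value-systems on a line (indices 0..4).
# F[i][j] is how desirable system j looks to an entity holding system i.
# The numbers are invented purely for illustration.
F = [
    [5, 3, 2, 1, 0],
    [4, 3, 4, 2, 1],
    [1, 2, 3, 6, 4],
    [0, 1, 2, 6, 5],
    [0, 0, 1, 3, 9],
]


def shift_valence(old: int, new: int) -> tuple[bool, bool]:
    """Return (old view, new view): does each side rate `new` above `old`?"""
    return F[old][new] > F[old][old], F[new][new] > F[new][old]


def is_stable(i: int) -> bool:
    """A stable point rates each neighbouring system strictly below itself."""
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(F)]
    return all(F[i][j] < F[i][i] for j in neighbours)


stable = [i for i in range(len(F)) if is_stable(i)]

current = 2
# Desideratum A: the stable point that looks best from where we stand now.
best_from_here = max(stable, key=lambda s: F[current][s])
# Desideratum B: the stable point that values itself the highest.
most_self_endorsed = max(stable, key=lambda s: F[s][s])

print(shift_valence(2, 3))   # (True, True): both sides endorse the shift
print(shift_valence(3, 4))   # (False, True): the two sides disagree
print(stable)                # [0, 3, 4]
print(best_from_here, most_self_endorsed)  # 3 4
```

In this toy landscape the two desiderata pick different stable points, which is the kind of gap Figure 8 is pointing at: what seems optimal now need not be what would rate itself highest once we got there.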


_________


Agreed! For me, this perspective follows from Radical Probabilism, but I didn't emphasize consequences for values there.

I like that this post is fairly accessible, although I found the charts confusing, largely because it's not always clear to me what's being measured on each axis. I basically get what's going on, but I find myself disliking something about the way the charts are presented.

(In some cases I think of them as more like being multidimensional spaces you've put on a line, but that still makes the visuals kind of confusing.)

None of this is really meant to be a big complaint, though. Graphics are hard; I probably wouldn't have even tried to illustrate it, so kudos to you for trying. Just felt it was also useful to register my feedback that they didn't quite land for me even though I got the gist of them.

Thank you for the comment. As for the axes, the y-axis always denotes the desirability of the given value-system (except for Figure 1). And you are exactly right about the x-axis: that is a multidimensional space of value-systems that we put on a line, because drawing this in 3D (well, (multi+1)-D :-) ) would be a mess. I will see if I can make it somewhat clearer in the post.

This reminds me of the "Converse Lawvere Problem" at https://www.alignmentforum.org/posts/5bd75cc58225bf06703753b9/the-ubiquitous-converse-lawvere-problem a little bit, except that the different functions in the codomain have domain which also has other parts to it aside from the main space  . 

As in, it looks like here, we have a space $$V$$ of values, which includes things such as "likes to eat meat" or "values industriousness" or whatever, where this part can just be handled as some generic nice space $$B$$, as one part of a product, and as the other part of the product has functions from $$V$$ to $$\mathbb{R}$$.
That is, it seems like this would be like, $$V \cong B \times (V \to \mathbb{R})$$.

Which isn't quite the same thing as is described in the converse Lawvere problem posts, but it seems similar to me? (For one thing, the converse Lawvere problem wasn't looking for homeomorphisms from X to the space of functions from X to [0,1], just a surjective continuous function.)

Of course, it is only like that if we are supposing that the space we are considering,  , has to have all combinations of "other parts of values" with "opinions on the relative merit of different possible values". Of course if we just want some space of possible values, and where each value has an opinion of each value, then that's just a continuous function from a product of the space with itself, which isn't any problem.
I guess this is maybe more what you meant? Or at least, something that you determined was sufficient to begin with when looking at the topic? (and I guess most more complicated versions would be a special case of it?)


Oh, if you require that the "opinion on other values" decomposes nicely in ways that make sense (like, if it depends separately on the desirability of the base level values, and the values about values, and the values about values about values, etc., and just has a score for each which is then combined in some way, rather than evaluating specifically the combinations of those), then maybe that would make the space nicer than the first thing I described (I don't know whether such a thing exists), in a way that might make it more likely to exist.
Actually, yeah, I'm confident that it would exist that way.
Let 
And let  
And then let  ,
and for  define 

which seems like it would be well defined to me. Though whether it can capture all that you want to capture about how values can be is another question, and quite possibly it can't.

Of course if we just want some space of possible values, and where each value has an opinion of each value, then that's just a continuous function from a product of the space with itself, which isn't any problem.

Yeah, I just meant this simple thing that you can mathematically model as $$f : V \times V \to \mathbb R$$. I suppose it makes sense to consider special cases of this that would have better mathematical properties. But I don't have high-confidence intuitions on which special cases are the right ones to consider.

I mostly meant this as a tool that would allow people with different opinions to move their disagreements from "your model doesn't make sense" to "both of our models make sense in theory; the disagreement is an empirical one". (E.g., the value-drift situation from Figure 6 is definitely possible, but that doesn't necessarily mean that this is what is happening to us.)

At first I was dubious about the framing of a "shifting" n-dimensional landscape, because in a sense the landscape is fixed in 2n dimensions (I think?), but you've convinced me this is a useful tool to think about/discuss these issues. Thanks for writing this!