Crossposted from the EA forum

I've been asked variations of this question a few times recently, as I'm studying value drift for my undergraduate thesis, so I thought I would seek out others' thoughts on this.

I suppose part of this depends on how we define value drift. I've seen value drift defined as broadly as changes in values (from the Global Optimum podcast) and as narrowly as becoming less motivated to do altruistic things over time (from Joey Savoie's forum post). While the latter seems almost certainly net-negative, how the former plays out is a little less clear to me.

This leads me to wonder whether there might be different kinds of value drift, each good or bad to varying degrees.




I'll open the discussion for the broadest definition of 'value drift' - 'changes in values over time'.

'Good' and 'bad' are only defined relative to some set of values.

A simplistic (but technically correct) answer: if you had values A, and then changed to have different values B, from the viewpoint of A this is bad *by definition*, no matter what A and B actually are. And from the viewpoint of B it's good by definition. Values are always optimal according to themselves. (If they're in conflict with one another, there should be a set of optimal balances defined by some meta-values you also need to hold.)

A more complex and human-like scenario: you're not perfectly rational. Knowing this, and wanting to achieve a certain goal, it might be useful to "choose" a set of values other than the trivial set "this goal is good", to influence your own future behavior. Just as it can be instrumentally rational to choose some false beliefs (or to omit some true ones), so it can be instrumentally rational to choose a set of values in order to achieve something those values don't actually claim to promote.

A contrived example: you value donating to a certain charity. If you join a local church and become influential, you could convince others to donate to it. You don't actually value the church. If you were perfectly rational, you could perfectly pretend to value it and act to optimize your real values (the charity). But humans tend to be bad at publicly espousing values (or beliefs) without coming to really believe them to some degree. So you'll get value drift towards really caring about the church. But the charity will get more donations than if you hadn't joined. So from the point of view of your original values (charity above all), the expected value drift to (charity + church) is an instrumentally good choice.
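The expected-value reasoning in the church/charity example can be sketched as a toy calculation. All numbers here are hypothetical, chosen only to illustrate how drift can be instrumentally good by the lights of the original values:

```python
# Toy model (hypothetical numbers) of the church/charity example:
# compare total charity donations if you keep your original values
# versus joining the church and accepting some value drift.

def charity_donations(join_church: bool, drift: float) -> float:
    """Total donations to the charity, in arbitrary units.

    drift: fraction of your own giving redirected to the church (0..1).
    """
    own_effort = 10.0  # your baseline donation capacity
    if not join_church:
        return own_effort  # all of your effort goes to the charity
    influenced = 25.0  # donations you persuade churchgoers to give
    # You lose some of your own giving to the church, but gain influence.
    return own_effort * (1.0 - drift) + influenced

# Staying out: you donate everything yourself.
baseline = charity_donations(join_church=False, drift=0.0)

# Joining with substantial drift: 40% of your own effort leaks to the
# church, but the donations you influence more than compensate.
with_drift = charity_donations(join_church=True, drift=0.4)

# Judged by the ORIGINAL values (charity above all), accepting the
# drift is still the better choice in this toy setup.
assert with_drift > baseline
```

The point isn't the specific numbers but the structure: the original values evaluate the *total* outcome, and a future self with partially drifted values can still produce a better total than a faithful self with less influence.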

Agreed. In the abstract, value drift is just a change over time, within an agent or across a line of replicated agents. And from the point of view of any given set of values, any other set of values is "bad". If the other set were good, the agent would seek to adopt them, right?

So the question really is whether value diversity has meta-value. Diversity over time isn't that much different from diversity across agents.



This question requires distinguishing current values from idealized values, and values (in charge) of the world from values of a person. Idealized values are an unchanging and general way of judging situations (the world), including the choices that take place there. Current values are an aspect of an actual agent (a person) involved in current decisions; they are more limited in scope and can't accurately judge many things. By idealizing current values, we obtain idealized values that specify how the current values should function.

Most changes in current values change their idealization, but some changes, those that follow the same path as idealization, don't: they only improve the ability to judge things in the same idealized way. Value drift is a change in current values that changes their idealization. When current values disagree with their idealized counterpart, development without value drift eventually makes them agree, fixing the error. But value drift can instead change the idealized values to better fit the current values, calcifying the error.

Values in charge of the world (the values of a singleton AI, or of an agentic idealization of humanity) in particular direct what happens to the people who live there. From the point of view of any idealized values, including the idealized values of particular people (who can't significantly affect the world), it's the current values of the world that matter most, because they determine what actually happens, and idealized values judge what actually happens.

Unless all people have the same idealized values, the values of the world differ from the values of individual people, so value drift in the values of the world can change what happens either positively or negatively according to the idealized values of individual people. On the other hand, the values of the world could approve of value drift in individual people (conflict between people, diversity of personal values over time, disruption of reflective equilibrium in people's reasoning), and so could those individual people, since their personal value drift won't disrupt the course of the world, which is what their idealized values judge. Note that idealized personal values approving of value drift doesn't imply that current personal values do. Finally, the idealized values of the world disapprove of value drift in the values of the world, since that actually would disrupt the course of the world.