I find odd the idea of value drift, let alone the idea that value drift is bad. My intuition is that value drift is good if anything since it represents an update of one's values based on new evidence and greater time to compute reflective equilibrium. But rather than arguing intuition, let's explore value drift a bit before we come to any stronger conclusions.
(Fair warning: this is going to get into some deep philosophical territory, be pretty unapologetic about it, and assume you are reading carefully enough to notice what I say rather than what you think I said. I'm still working some of these ideas out myself, so I don't yet have the fluency to provide a more accessible explanation. I also take some big inferential jumps at times that you may not yet be on board with, so the later parts might feel like unjustified reasoning. I don't think that's the case, but you'll have to poke at me to help me figure out how to fill in those gaps.
In spite of all those apologies, there are some key insights here, and I'm unlikely to get clearer unless I am first more opaque, so please bear with me, especially if you are interested in value as it relates to AI alignment.)
Whence drifting values?
The metaphor of drifting values is that your values are initially one place and then gradually relocate to another, like flotsam. The waves of fortune, chance, and intention combine to determine where they end up on the seas of change. In this metaphor, values are discrete, identifiable things. Linguistically, they are nouns.
When we talk of values as nouns, we are talking about the values that people have, express, find, embrace, and so on. For example, a person might say that altruism is one of their values. But what would it mean to "have" altruism as a value, or for it to be one of one's values? What is the thing being possessed in this case? Can you grab altruism and hold onto it, or find it in the mind cleanly separated from other thoughts? As best I can tell, no, unless (contrary to evidence and parsimony) something like Platonic idealism proves consistent with reality. It therefore seems a type error to say you possess altruism or any other value, since values are not things but habituations or patterns of action (more on this in the next section). It's only because we use the metaphor of possession to mean something like habitual valuing that it can seem as if these patterns over our actions are things in their own right.
So what, you may think: it's just a linguistic convention and doesn't change what's really going on. That's both wrong and right. Yes, it's a linguistic convention, and yes, you get on with valuing all the same no matter how you talk about it, but linguistic conventions shape our thoughts and limit our ability to express ourselves to the frames they provide. In the worst case, as I suspect often happens when people reason about value drift, we focus so much on the convention that we forget what's really going on and reason only about the abstraction, viz. mistake the map for the territory. And since we've just seen that the value-as-thing abstraction is leaky, implying the ability to possess that which cannot be possessed, it can lead us astray by letting us operate from a false assumption about how the world works, expecting it to function one way when it actually operates another.
To my ear, most talk about value drift is at least partially if not wholly confused by this mistaking of values for things, and specifically for essences. But let's suppose you don't make this mistake; is value drift still sensible?
I think we can rehabilitate it, but to do that we'll need a clearer understanding of "habitual valuing" and "patterns of action".
If we tear away the idea that we might possess values, we are left with the act of valuing, and to value something is ultimately to judge it or assess its worth. While I can't hope to fit all my philosophy into this paragraph, I consider valuing, judging, or assessing to be one of the fundamental operations of "conscious" things: it is the key input powering the feedback loops that differentiate the "living" from the "dead". For historical reasons we might call this feeling or sensation, and if you like control theory, "sensing" seems appropriate, since in a control system it is the sensor that senses the system and sends the resulting signal to the controller. Promising modern theories model the human mind as a hierarchy of control systems that minimize prediction error while maintaining homeostasis, and this matches one of the most detailed and longest-used theories of human psychology, so I feel justified in saying that the key, primitive action happening when we value something is that we sense or judge it to be good, neutral, or bad (or, if you prefer, more, same, or less).
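To make the control-theory framing concrete, here is a minimal sketch in Python. Everything in it (the `sensor` function, the set point of 20, the example states) is an illustrative assumption of mine, not a claim about how minds are actually implemented; the only point is that a sensor's good/neutral/bad signal is relative, defined by movement toward or away from a set point.

```python
# Toy sketch of "sensing" as a control-system operation: a sensor compares
# the system's state to a set point and emits a purely relative signal.

def sensor(state, set_point, previous_error):
    """Emit "good" / "neutral" / "bad" based on movement relative to a set point."""
    error = abs(set_point - state)
    if error < previous_error:
        return "good", error     # moving toward the set point (better prediction)
    if error > previous_error:
        return "bad", error      # moving away from the set point (worse prediction)
    return "neutral", error      # no change

# A thermostat-like run: successive states approach a set point of 20,
# then overshoot it on the last step.
signals = []
error = float("inf")
for state in [10, 14, 17, 19, 22]:
    signal, error = sensor(state, 20, error)
    signals.append(signal)
print(signals)  # prints ['good', 'good', 'good', 'good', 'bad']
```

Note that the same state can be sensed as good or bad depending on where the system just was, which is the sense in which "good" and "bad" here are relative rather than intrinsic labels.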
We could get hung up on good, neutral, and bad, but for now let's understand them as relative terms in the sense of the brain as a control system, where "good" signals better prediction or movement towards a set point and "bad" signals worse prediction or movement away from one. In this model, to value something is to sense it and send a signal to the rest of the brain that it is good. Thus to "have a value" is to exhibit a pattern of action that the brain senses to be good. To return to the example of valuing altruism: when a person who values altruism acts in a way that pattern-matches to altruism (maybe "benefits others" or something similar), the brain senses this pattern to be good and feeds that signal back into itself, further habituating actions that match the altruism pattern. It is this habituation that we are pointing to when we say we "have" a value.
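The habituation loop just described can be sketched as a toy simulation. To be clear, every name and number here (the action labels, the update rule, the learning rate) is a hypothetical illustration, not a model of real neural feedback; it shows only the shape of the loop: act, sense the act against a valued pattern, and feed the "good" signal back so matching actions become more likely.

```python
import random

random.seed(0)  # make the toy run reproducible

def sense(action, valued_pattern):
    """Return +1 ("good") if the action matches the valued pattern, else 0 ("neutral")."""
    return 1 if valued_pattern in action else 0

def habituate(weights, actions, valued_pattern, rate=0.1, steps=100):
    """Each taken action is sensed; "good" signals strengthen the habit of taking it."""
    for _ in range(steps):
        # choose an action in proportion to current habituation weights
        action = random.choices(actions, weights=[weights[a] for a in actions])[0]
        signal = sense(action, valued_pattern)
        # feedback: a "good" signal makes this action more likely next time
        # (the floor just keeps weights positive if signals were ever negative)
        weights[action] = max(0.01, weights[action] + rate * signal)
    return weights

actions = ["donate (benefits others)", "watch tv"]
weights = {a: 1.0 for a in actions}       # start with no habituation either way
weights = habituate(weights, actions, "benefits others")
```

After enough iterations the weight on the pattern-matching action dominates, and that accumulated bias toward one pattern of action is what the framing above would call "having" the value of altruism.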
Aside: How any individual comes to sense any particular pattern, like altruism, to be good, neutral, or bad is an interesting topic in and of itself, but we don't need that particular gear to continue discussing value drift, so this is where the model bottoms out for this post.
We can now understand value drift to mean changes in habituations or patterns of action over time. I realize some of my readers will throw their hands up at this point and say "why did we have to go through all that just to get back to where we started?!?", but the point was to unpack value drift so we can understand it as it is, not as we think it is. And as will become clear in the following analysis, that unpacking is key to understanding why value drift seems an odd thing to worry about to me.
My explanation of valuing implies that values-as-things are after-the-fact reifications drawn from observing the accumulated effects of individual actions, and as such values cannot themselves directly drift, because they are downstream of where change happens. The changes that befall these reifications we call "values" happen moment to moment, action to action; each particular action is only later aggregated into a pattern that can be expressed as a value, and even then that value exists only by virtue of ontology, because it is an inference from observation. Thus saying values "drift" is about as meaningful as saying the continents drawn on a map "drift" over geological time: it's sort of true, but only meaningful so long as understanding remains firmly grounded in the phenomena being pointed to, and unlike maps of geography, maps of mind are more easily confused for the mind itself.
What instead drifts or changes are actions, although saying they drift or change is fraught because it supposes some stable viewpoint from which to observe the change. Yet actions, via the preferences that cause us to choose any particular action over all others, are continuously dependent on the conditions in which they arise, because what we sense (value, judge, assess) is conditional on the entire context in which we do the sensing. So it is only outside the moment, whether before or after, that we judge change, and change is thus also ontologically bound: we can find no change if we look without ontology. In this sense change and drift in actions and patterns of action exist but are not real: they are in the map, but not the base territory.
Does that matter? I think it does: we can be confused about ontology, confusion can only arise via ontology, and sensing/valuing sits very near the root of ontology generation, so our understanding of what it means to value is thoroughly contaminated by valuing itself! Certainly by the time we put words to our thoughts we have already sensed and passed judgement on many phenomena, which means that when we talk about value drift we are talking from a motivated stance in which valuation has heavily shaped our perspective. So I find it not at all odd that valuing would find a way to make itself and its products stable points within concept space, such that it feels natural to worry they might drift, and that drift and change in values would evaporate without sensing feedback loops to prop them up!
This is not to anthropomorphize valuing, but to point out the way it is prior to our concepts and self-incentivized to magnify its own existence; it's like a subagent carrying out its own goals regardless of yours, and it's so good at this that it shaped your goals before you even knew you had them. And when we strip away everything posterior to valuing we find no mechanism by which value can change, because we can't even conceptualize change at that point; we are left with valuing as a pure, momentary act that cannot drift or change because it has no frame to drift or change within. So when I say value drift seems odd to me, this is what I mean: it exists as a function of valuing, not of valuing itself, and we can find no place where value change occurs that is not tainted by the evaluations of sensing.
Yikes! So what do we do?
The questions that motivate this investigation are ones like "how do we protect effective altruists (EAs) from value drift so that they remain altruistic later in life and don't revert to the mean?" and "how do we align superintelligent AI with human values such that they stay aligned with human values even as they think longer and more deeply than any human could?". Even if I lost you in the previous section—and I'm a little bit lost in my own reasoning if I'm totally honest—how can we cash out all this philosophy into information relevant to these questions?
In the case of drifting EAs, I say let them drift. They value EA because conditions in their lives caused them to value it, and if those conditions change, so be it. Most people lack the agency to stay firm in the face of changing conditions; I think this is mostly a safety mechanism that protects them from overcommitting before they are epistemically mature enough to know what they're doing, and for every EA lost this way another will likely be gained, so we needn't worry much beyond dealing with churn among the least committed members of the movement. To do otherwise is to be inconsistent about respecting meta-preferences (assuming you think we should respect people's meta-preferences), in this case specifically the meta-preference for autonomy of beliefs and actions. Just as you would probably find it troubling to discover racists or fascists or some other outgroup working on incentives to keep people racist or fascist in the face of evidence that they should change, you should find it troubling that we would seek to manipulate incentives so that people are more likely to continue holding EA beliefs in the face of contrary evidence.
Most of this argument is beside my main point, which is that value drift is a subtly motivated framing for keeping values stable, propagated by the very feedback processes that use sense signals as input and have no prior manifestation to fall back on, but you might be able to see the deep veins of it running through. More directly relevant to this question are probably things like "Yes Requires the Possibility of No", "Fundamental Value Differences are not that Fundamental", "Archipelago", and much about meta-consistency in ethics that isn't salient to me at this time.
On the question of AI alignment, this suggests concerns about value drift are at least partially about confusion over values, and partially fear born of a desire for value self-preservation. That is, a preference to avoid value drift in superintelligent AIs may not be a principled stance, or may be principled but grounded in nothing more than fear of change. This is not to say we humans would be happy with any sense experiences whatsoever, only that we are biased and anchored on our current sensing (valuing) when we imagine how we might sense things differently under other conditions. I realize this makes the alignment problem harder if you were hoping to train against current human values and then stick near them. Maybe that's still a good plan: although it's conservative and risks astronomical waste by denying us full optimization of valuing, that's probably better than attempting and failing at a more direct approach that is less wasteful but maybe also ends up tiling the universe with smiley faces. My concern is that if we take the more conservative approach, we might fail anyway because the value abstraction is leaky, and we may end up building agents that optimize for the wrong things, leaving gaps through which x-risks develop regardless.
(Unless it wasn't clear, AI alignment is hard.)
If any of that left you more confused than when you started reading, then good: mission accomplished. I continue to be confused about values myself, and this is part of a program of trying to see through them and become deconfused about them, similar to the way I had to deconfuse myself about morality many years ago. Unfortunately not many people are deconfused about values (relatively more are deconfused about morals), so not much has been written to guide me along. Look for the next post whenever I'm deconfused enough to have more to say.