I occasionally run across lines of reasoning that depend on or favor the position that value drift should be avoided.
I find odd the idea of value drift, let alone the idea that value drift is bad. My intuition is that value drift is good if anything since it represents an update of one's values based on new evidence and greater time to compute reflective equilibrium. But rather than arguing intuition, let's explore value drift a bit before we come to any stronger conclusions.
(Fair warning, this is going to get into some deep philosophical territory, be pretty unapologetic about it, and assume you are reading carefully enough to notice what I say and not what you think I said. I'm still working some of these ideas out myself, so I don't have the fluency to provide a more accessible explanation right now. I also take some pretty big inferential jumps at times that you may not be on board with as of yet, so the later parts might feel like unjustified reasoning. I don't think that's the case, but you'll have to poke at me to help me figure out how to fill in those gaps.
In spite of all those apologies, there are some key insights here, and I'm unlikely to get clearer unless I am first more opaque, so please bear with me if you please, especially if you are interested in value as it relates to AI alignment.)
Whence drifting values?
The metaphor of drifting values is that your values are initially one place and then gradually relocate to another, like flotsam. The waves of fortune, chance, and intention combine to determine where they end up on the seas of change. In this metaphor, values are discrete, identifiable things. Linguistically, they are nouns.
When we talk of values as nouns, we are talking about the values that people have, express, find, embrace, and so on. For example, a person might say that altruism is one of their values. But what would it mean to "have" altruism as a value or for it to be one of one's values? What is the thing possessed or of one in this case? Can you grab altruism and hold onto it, or find it in the mind cleanly separated from other thoughts? As best I can tell, no, unless contrary to evidence and parsimony something like Platonic idealism proves consistent with reality, so it seems a type error to say you possess altruism or any other value since values are not things but habituations or patterns of action (more on this in the next section). It's only because we use the metaphor of possession to mean something like habitual valuing that it can seem as if these patterns over our actions are things in their own right.
So what, you may think, it's just a linguistic convention and doesn't change what's really going on. That's both wrong and right. Yes, it's a linguistic convention and yes you get on with valuing all the same no matter how you talk about it, but linguistic conventions shape our thoughts and limit our ability to express ourselves with the frames they provide. In the worst case, as I suspect is often happening when people reckon about value drift, we can focus so much on the convention that we forget what's really going on and reason only about the abstraction, viz. mistake the map for the territory. And since we've just seen that the value-as-thing abstraction is leaky because it implies the ability to possess that which cannot be, it can lead us astray by allowing us to operate from a false assumption about how the world works, expecting it to function one way when it actually operates another.
To my listening most talk about value drift is at least partially if not wholly confused by this mistaking of values for things, and mistaking them specifically for essences. But let's suppose you don't make this mistake; is value drift still sensible?
I think we can rehabilitate it, but to do that we'll need a clearer understanding of "habitual valuing" and "patterns of action".
If we tear away the idea that we might possess values, we are left with the act of valuing, and to value something is ultimately to judge it or assess its worth. While I can't hope to fit all my philosophy into this paragraph, I consider valuing, judging, or assessing to be one of the fundamental operations of "conscious" things, it being the key input that powers the feedback loops that differentiate the "living" from the "dead". For historical reasons we might call this feeling or sensation, and if you like control theory "sensing" seems appropriate since in a control system it is the sensor that determines and sends the signal to the controller after it senses the system. Promising modern theories suggest control theory is useful for modeling the human mind as a hierarchy of control systems that minimize prediction error while also maintaining homeostasis, and this matches with one of the most detailed and longest used theories of human psychology, so I feel justified in saying that the key, primitive action happening when we value something is that we sense or judge it to be good, neutral, or bad (or, if you prefer, more, same, or less).
We could get hung up on good, neutral, and bad, but let's just understand them for now as relative terms in the sense of the brain as control system, where "good" signals better prediction or otherwise moving towards a set point and "bad" signals worse prediction or moving away from a set point. Then in this model we could say that to value something is to sense it and send a signal out to the rest of the brain that it is good. Thus to "have a value" is to observe a pattern of action that senses that pattern to be good. To return to the example of valuing altruism, when a person who values altruism acts in a way that pattern matches to altruism (maybe "benefits others" or something similar), the brain senses this pattern to be good and feeds that signal back into itself further habituating actions that match the altruism pattern. It is this habituation that we are pointing to when we say we "have" a value.
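As a rough illustration only, the loop described above can be sketched as a toy program. Everything in it is a hypothetical stand-in chosen for the example (the `ToyValuer` class, the `sense` table, the learning rate, the threshold), and it makes no claim about how brains actually implement sensing. The point is just that the "value" is read off afterwards from accumulated habit strength, not stored anywhere as a thing.

```python
from dataclasses import dataclass, field


@dataclass
class ToyValuer:
    """Toy agent whose 'values' are nothing more than habit strengths."""

    # Hypothetical sensing table: pattern -> signal (+1 good, 0 neutral, -1 bad).
    sense: dict
    # Assumed step size for how strongly one sensed signal shifts a habit.
    learning_rate: float = 0.1
    # Accumulated habituation per action pattern, built up action by action.
    habits: dict = field(default_factory=dict)

    def act(self, pattern: str) -> float:
        """Take an action matching `pattern`, sense it, and feed the signal back."""
        signal = self.sense.get(pattern, 0)  # unknown patterns are sensed as neutral
        self.habits[pattern] = self.habits.get(pattern, 0.0) + self.learning_rate * signal
        return self.habits[pattern]

    def observed_values(self, threshold: float = 0.5) -> list:
        """After-the-fact reading of 'values': patterns habituated past a threshold."""
        return sorted(p for p, h in self.habits.items() if h >= threshold)


agent = ToyValuer(sense={"benefits_others": 1, "harms_others": -1})
for _ in range(10):
    agent.act("benefits_others")
agent.act("harms_others")
print(agent.observed_values())
```

In this sketch nothing called "altruism" is ever stored; an observer only infers it from the habit strength that repeated, positively sensed actions leave behind.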
Aside: How any individual comes to sense any particular pattern, like altruism, to be good, neutral, or bad is an interesting topic in and of itself, but we don't need that particular gear to continue discussing value drift, so this is where the model bottoms out for this post.
We can now understand value drift to mean changes in habituations or patterns of action over time. I realize some of my readers will throw their hands up at this point and say "why did we have to go through all that just to get back to where we started?!?", but the point was to unpack value drift so we can understand it as it is, not as we think it is. And as will become clear in the following analysis, that unpacking is key to understanding why value drift seems an odd thing to worry about to me.
My explanation of valuing implies that values-as-things are after-the-fact reifications drawn from the observation of accumulated effects of individual actions, and as such values cannot themselves directly drift because they are downstream of where change happens. The changes that will befall these reifications that we call "values" happen moment to moment, action to action, where each particular action taken will only later be aggregated to form a pattern that can be expressed as a value, and even then that value exists only by virtue of ontology because it is an inference from observation. Thus when values "drift" it's about as meaningful as saying the drawings of continents "drift" over geological time: it's sort of true, but only meaningful so long as understanding remains firmly grounded in the phenomena being pointed to, and unlike with maps of geography maps of mind are more easily confused for mind itself.
What instead drifts or changes are actions, although saying they drift or change is fraught because it supposes some stable viewpoint from which to observe the change, yet actions, via the preferences that cause us to choose any particular action over all others, are continuously dependent on the conditions in which they arise because what we sense (value, judge, assess) is conditional on the entire context in which we do the sensing. So it is only outside the moment, whether before or after, that we judge change, and so change is also ontologically bound such that we can find no change if we look without ontology. In this sense change and drift in actions and patterns of action exist but are not real: they are in the map, but not the base territory.
Does that matter? I think it does, because we can be confused about ontology, confusion can only arise via ontology, and sensing/valuing is very near the root of ontology generation, so our understanding of what it means to value is mostly contaminated by valuing itself! Certainly by the time we put words to our thoughts we have already sensed and passed judgement on many phenomena, and that means that when we talk about value drift we're talking from a motivated stance where valuation heavily shaped our perspectives, so I find it not at all odd that valuing would find a way to make itself and its products stable points within concept space such that it would feel natural to worry that they might drift, and that drifting and change in values would evaporate without sensing feedback loops to prop them up!
This is not to anthropomorphize valuing, but to point out the way it is prior to and self-incentivized to magnify its existence; it's like a subagent carrying out its own goals regardless of yours, and it's so good at it that it's shaped your goals before you even knew you had them. And when we strip away everything posterior to valuing we find no mechanism by which value can change because we can't even conceptualize change at that point, so we are left with valuing as a pure, momentary act that cannot drift or change because it has no frame to drift or change within. So when I say value drift is odd to me this is what I mean: it exists as a function of valuing, not of valuing itself, and we can find no place where value change occurs that is not tainted by the evaluations of sensing.
(Careful readers will note this is analogous to the epistemological problem that necessitates a leap of faith when knowledge is understood ontologically.)
Yikes! So what do we do?
The questions that motivate this investigation are ones like "how do we protect effective altruists (EAs) from value drift so that they remain altruistic later in life and don't revert to the mean?" and "how do we align superintelligent AI with human values such that they stay aligned with human values even as they think longer and more deeply than any human could?". Even if I lost you in the previous section—and I'm a little bit lost in my own reasoning if I'm totally honest—how can we cash out all this philosophy into information relevant to these questions?
In the case of drifting EAs, I say let them drift. They value EA because conditions in their lives caused them to value it, and if those conditions change so be it. Most people lack the agency to stay firm in the face of changing conditions, I think this is mostly a safety mechanism to protect them from overcommitting when they aren't epistemically mature enough to know what they're doing, and for every EA lost to this there will likely be another EA gained, so we don't have to worry about it much other than to deal with churn effects on the least committed members of the movement. To do otherwise is to be inconsistent on respecting meta-preferences, assuming you think we should respect people's meta-preferences, in this case specifically the meta-preference for autonomy of beliefs and actions. Just like you would probably find it troubling to find racists or fascists or some other outgroup working on incentives to keep people racist or fascist in the face of evidence that they should change, you should find it troubling that we would seek to manipulate incentives such that people are more likely to continue to hold EA beliefs in the face of contrary evidence.
Most of this argument is aside from my main point that value drift is a subtly motivated framing to keep values stable propagated by the very feedback processes that use sense signals as input with no prior manifestation to fall back on, but you might be able to see the deep veins of it running through. More relevant to this question directly are probably things like "Yes Requires the Possibility of No", "Fundamental Value Differences are not that Fundamental", "Archipelago", and much about meta-consistency in ethics that's not salient to me at this time.
On the question of AI alignment, this suggests concerns about value drift are at least partially about confusion on values and partially fear born of a desire for value self-preservation. That is, a preference to avoid value drift in superintelligent AIs may not be a principled stance, or may be principled but grounded in fear of change and nothing more. This is not to say we humans would be happy with any sense experiences, only that we are biased and anchored on our current sensing (valuing) when we think about how we might sense things other than we do now under other conditions. I realize this makes the alignment problem harder if you were hoping to train against current human values and then stick near them, and maybe that's still a good plan because although it's conservative and risks astronomical waste by denying us access to full optimization of valuing, that's probably better than attempting and failing at a more direct approach that is less wasteful but maybe also ends up tiling the universe with smiley faces. My concern is that if we take the more conservative approach, we might fail anyway because the value abstraction is leaky and we end up building agents that optimize for the wrong things, leaving gaps through which x-risks develop anyway.
(Unless it wasn't clear, AI alignment is hard.)
If any of that left you more confused than when you started reading this, then good, mission accomplished. I continue to be confused about values myself, and this is part of a program of trying to see through them and become deconfused on them, similar to the way I had to deconfuse myself on morality many years ago. Unfortunately not many people are deconfused on values (relatively more are deconfused on morals) so not much is written to guide me along. Look for the next post whenever I get deconfused enough to have more to say.
I find this kind of argument to be entirely uncompelling, and stemming from a fairly basic error regarding what kind of thing morality is. (I say "kind of argument", rather than simply "argument", because you could replace "EA" in the quoted sentence with just about anything else, and I would find the modified version no more compelling than the original.)
There are several problems with this kind of argument, so let's go over them in sequence. The first problem is that it's vacuous. "People only value X because something in their lives caused them to value X" is true for any X you could suggest (provided, of course, that the X in question is valued by at least some people), and thus it fails to distinguish between values that are worth preserving and values that are not. Unless your thesis is literally "no values are better than any other values, which makes it okay for our current values to be replaced by any other set of values" (and if that is your thesis, I think it's worth saying explicitly), the notion that we should be willing to relinquish any of our current values simply because something at some point caused us to acquire those values is an incredibly poor criterion to use.
That brings us to the second problem: even if your thesis really is that no values are better than any other, there would still remain the question of why the reader ought to accept such a thesis. You can't justify it via some external argument, because no such external argument exists: the question of "what values should we have?" is itself a fundamentally value-laden question, and value-laden questions can only be addressed by appealing to other values. With some effort on the part of the reader, the article could (vaguely) be interpreted as making such an appeal, but even if such an interpretation is used, much of the philosophical force of the argument is lost. The sense that the reader is compelled to accept that values cannot have greater meaning, because the author has triumphantly explained that "values" exist only as "after-the-fact reifications" of a particular agent's actions/judgments--and, after all, nobody cares about those--is lost.
And well it should be! I am inherently suspicious of any argument that claims people are "wrong" to value something, that does not itself rely upon other values. Often such arguments really consist of subtly hidden, value-laden assertions, which are strengthened by pretending to be something they are not (such as e.g. ironclad philosophical arguments). In the case of this article, the value-laden assertion is this:
Note that this assertion is extremely value-laden! It contains a claim about what you should do, which the original article completely omits in favor of obfuscatory talk regarding the neurological processes behind "valuing". And since (as I discussed above) any value you hold is the result of a causal process that could plausibly have gone the other way, the assertion simplifies to the following:
This is, again, a normative statement--and not a particularly compelling one at that. I don't find the idea of relinquishing all my values--of becoming an agent whose utility function is 0 everywhere--at all attractive, and absent an unimaginably strong argument in favor of such, I can't imagine such a prospect ever being attractive to me. The goal of metaethical theory is not to produce counterintuitive results (such as the assertion that nobody should value anything ever); the goal of metaethical theory is to produce a framework that explains and justifies the moral intuitions we already have. (This is what I meant when I said that the very first quoted statement stems from an error regarding what type of thing morality is: morality is not something you prove things about. Morality is simply the study of that which we choose to regard as good.)
The rest of the article is populated by sentence structures more typically found in continental philosophy works than on LessWrong, of which the most egregious is probably this one:
As far as I can tell, the above sentence simply expresses the following sentiment:
I don't know what it is about some writers that makes them think every sentence they produce must meet a 50-word quota at minimum, but in my opinion the clarity of their writing would be greatly improved if they would stop doing that. That the entire article is comprised of such constructions did nothing to improve my experience of reading it; in fact, it left me rather annoyed, which I think can probably be deduced from the tone of this comment.
Beyond that, I don't have much to say, except perhaps that I think the problem of the criterion (which you only tangentially bring up in this article, but which I've seen you repeatedly mention elsewhere, to the point where I'm starting to suspect it's some kind of weird hobbyhorse of yours) is nothing more than vague sophistry of the same kind many mainstream philosophers seem so fond of.
Final thoughts: I think it's a shame to be criticizing an article that obviously had a lot of effort put into it by the author, especially so harshly, but I've expressed my opinion of the article's contents as frankly as I can, and it's simply the case that my opinion of said contents is... not good. Ultimately, I think that (a) your central claim here is mistaken, and that (b) if I'm wrong about that, obviously it would be good if you convinced me otherwise, but that your current style of writing is not very conducive to that task.
Hmm, what can I say about the problem of the criterion? I don't really mean it to be a hobbyhorse, it's simply the problem at the heart of epistemology, and thus touches all things. It's a way we have of talking about the fundamental problem of how we know anything, and permeates all that is built on knowing (or if you like, within ontology/models), so we can only avoid it if we stop putting our thoughts to words, and not even totally then. We could just as easily talk about it in terms of the grounding problem since that's the more recent form it takes in analytic philosophy, but the problem of the criterion has historical precedence by some 2500 years.
My aim is not to make any normative claims here. As you note, you had to infer one from what I wrote, and that's an inference you made, not me. That you can fit a pattern to the data doesn't mean the generative process suggested by the pattern is there. Of course, maybe I am myself mistaken about my own intent since I don't have perfect capacities of introspection, but I am not trying to claim anyone is doing anything wrong, only that opposing value drift seems strange to me, and when I look at why someone would want to avoid value drift, the most accurate and parsimonious theory I've been able to reason my way to is one of attachment to present values and fear of impermanence rather than any principled stance that some particular set of values is best and we would not want to move away from them.
It is however also true that I wouldn't consider any value worth preserving against drift, since I expect conditions to either cause a value to remain steady for its usefulness or not, or put another way the evidence determines the values, not the values the evidence. Presumably whatever values are settled on in a particular environment are adaptive to that environment, and it seems odd to me to try to attach to anything other than what is useful and adaptive in the conditions you find yourself in. You might read this as making a normative claim, but I see this as more a statement of what minds do: they adapt to survive or fail to adapt and perish. Normativity is an inference we make from there about the purpose of adaptation that leads to survival, but seems not baked in to the act of adaptation that leads to survival itself.
Thanks for your detailed reply. There is a lot to address here, so I'll reply with multiple comments to address the parts.
How do you measure what values are better than others? Unless we have special access to moral facts that I don't see evidence that we have, we must do this via a process powered by our minds, and our minds decide what is better or worse based on how they value, thus I read your complaint as circular, i.e. you claim the problem is that I don't show how to value valuing via valuing. And I agree, this is maybe the fundamental conundrum of talking about values, and it creates the same kind of circular dependency problem we see in the problem of the criterion and, for example, naive set theory, and thus as you can imagine I think that in our attempts to reason about value and epistemology we suffer the same sorts of problems we had in mathematics prior to the development of something at least a little better than naive set theory.
I think it's a big jump to say no value is better than another, though, because to say that is to give up being a mind altogether. Although there may be no outside ground, no absolute facts on which we can base what values are better than others, we can still each codependently make valuations. That is, I can't deny you or myself our values, even if they are not consistent with anything other than ourselves, and grounded only in our conditional valuations, and certainly it gives way to no sense of what values are "right", only those we collectively most prefer in any particular moment, and offers no reason to privilege the ones we consider best over the ones we could consider best in any other circumstance other than biased attachment to what is now vs. what is othertimes.
Presumably, that means no such argument is valid. There are plenty of arguments for the objectivity of value, and for specific values. And they are not circular so long as the reason you should accept them is different from the norm they are advising. Typically, rational normativity is assumed by any argument.
I have no idea why you think what you wrote is at all in contradiction to what I said. To wit, here is what I wrote:
And here is what you wrote:
These are quite literally the same sentiment, except that you use the word "norm" instead of "value". The point remains the same, however: assertions about should-ness must appeal to some preexisting notion of should-ness.
And this is not a problem, if you happen to accept whatever notion of should-ness is being appealed to. In the case of this article, however, I am specifically rejecting the notion being appealed to (that being the idea that if your values were arrived at via some process that could counterfactually have gone differently, they are not values worth preserving)—since, as I stated, this would result in the set of values "worth preserving" being reduced to the empty set.
This, too, is a normative claim; and I'm fine with that. Or, to taboo "normative": I'm fine with claiming that values like not killing people and not causing unnecessary suffering are intrinsically motivating to creatures like humans, and endorsedly so. And if Gordon wants to argue:
then my response to that is to nod sagely and reply:
And to this, neither Gordon nor you can offer a valid counterargument, since that counterargument would, indeed, have to appeal to some other norm—one which, presumably, I would find just as uncompelling.
Thanks for expressing your frustration with the writing, but I'll mostly ignore that because it's a matter of style not substance. Writing this way is functional for my purposes: I mostly intend to express subtle, nuanced points, and more direct writing flattens away the important nuance that, when understood, avoids misunderstanding. But when the nuance is missed and taken to be the simple thing at the core, it's more easily misunderstood. The other option is to write shorter, more ambiguous words that convey the same meaning via what we might call "poetry". Neither solution is great. The alternative, though, is I think worse: very detailed mathematical models with so many moving parts they are hard to keep track of by our brains not designed for that task. All options are poor, so I go with the one I am most fluent in.
From my perspective, this style means that although I feel pretty sure that you made a relatively simple mistake somewhere, I am unable to explain it, because the text is just too hard to work with.
I'd say this style works fine for some purposes, but "finding the truth" isn't one of them. (The same is probably true about the continental philosophy in general.)
My guess is that you use the words "value drift" to mean many other things, such as "extrapolation of your values as you learn", "changes in priorities", etc.
I'm not sure this is my fault; I think other people use value drift to mean many things because it's confused, i.e. people aren't quite sure what they mean when they talk about it. Much of my goal is, if nothing else, to expose that confusion, and if you feel I used it to mean many things then I think I succeeded in causing you to see the same thing I see, even if you attribute the confusion to me alone rather than to everyone talking about this topic. My rhetoric was perhaps not quite careful enough to help you tell apart any unique confusion on my part from confusion created by a confused topic.
Since this appears to be a crux of your whole (fallacious, in my opinion) argument, I'm going to start by just criticizing this point. This argument proves far too much. It proves that:
It seems as if you've rhetorically denied the existence of any abstract properties whatsoever, for the purpose of minimizing values as being "merely" habituations or patterns of action. But I don't see why anyone should actually accept that claim.
Hmm, so there's a way in which I agree with you and a way I don't, and it depends on what you mean by "have" here. Without going back into addressing the possession metaphor, you're expressing a notion that I interpret as talking about existence, and I see a sharp line between existence or being and reality or the thing in itself. Existence is marked by differentiation, and for people to have beliefs, objects to have colors, etc. there must be some boundary at which these concepts are demarcated such that they are distinguishable from all else. In this sense we can say these things exist, but that it's dependent on our ability to observe and differentiate, to infer a pattern.
There is also a way in which some of these are more real than others. All of them arise from some physical process, but not all of them have neat correspondences. Color has maybe the cleanest, being an interaction of our senses with photons and directly correlates with behaviors of those photons. Concepts in books is maybe the flimsiest, since it's an interaction of a book (paper? words? what makes a book a book and not some other kind of stuff that conveys information to us?) and our model of how we model the world, and the hardest to find where it really comes from. This is not to say it is totally unreal, but it is to say there is no thing that looks like concepts in books if you do not also have a mind to provide that interpretation of phenomena.
Perhaps my presentation goes too far or is confusing, but the point is to be clear on what is ontological and what is ontic and not mistake the two, as I think is happening in the usual model of values.
I don't see the usual commonsense understanding of "values" (or the understanding used in economics or ethics) as relying on values being ontologically fundamental in any way, though. But you've used the fact that they're not to make a seemingly unjustified rhetorical leap to "values are just habituations or patterns of action", which just doesn't seem to be true.
Most importantly, because the "values" that people are concerned with when they talk about "value drift" are idealized values (à la extrapolated volition), not instantaneous values or opinions or habituations.
For instance, philosophers such as EY consider that changing one's mind in response to a new moral argument is not value drift because it preserves one's idealized values, and that it is generally instrumentally positive because (if it brings one's instantaneous opinions closer to their idealized values) it makes one better at accomplishing their idealized values. So indeed, we should let the EAs "drift" in that sense.
On the other hand, getting hit with a cosmic ray which alters your brain, or getting hacked by a remote code execution exploit is value drift because it does not preserve one's idealized values (and is therefore bad, according to the usual decision theoretic argument, because it makes you worse at accomplishing them). And those are the kind of problems we worry about with AI.
Right, I think people are pointing at something else when they normally talk about values but that cluster is poorly constructed and doesn't cut reality at the joint in the same way our naive notions of belief, morals, and much else cut reality slightly askew. I'm suggesting this as a rehabilitative framing of values that is a stronger, more consistent meaning for "value" than the confused cluster of things people are normally pointing at. Although to be clear even the naive confused notion of value I'm trying to explode and rebuild here is still a fundamentally ontological thing, unless you think people mean something by "value" more like signals in the brain serving as control mechanisms to regulate feedback systems.
To your concern about an unjustified leap, this is a weakness of my current position: I don't yet have a strong ability to describe my own reasoning in a way that brings most people along, and that is one of the points of working out these ideas: so I can see which inferences do seem intuitive to people and which don't, and use that information to iterate on my explanations.
To the extent that I think "value" is a confused concept, I think "idealized value" is consequently also confused, perhaps even more so because it is further distanced from what is happening on the ground. I realize idealized value feels intuitive to many folks, and at one time it did seem intuitive to me, but I now doubt that it is cleanly pointing to a real thing and suspect it is instead a fancy thing we have constructed as part of our reasoning that has no clear correlate out in the world. That is, it is an artifact of our reasoning process, and while that's not inherently bad, it also means it's something almost purely subjective that can easily become unhinged from reality, which makes me nervous about using it as a justification for any particular policy we might want to pursue.
The strongest argument against value drift (meaning the kind of change in current values that involves change in idealized values) is the instrumental usefulness of future values that pursue idealized present values. This says nothing about the terminal value of value drift, and a priori we should expect that people hold the presence of value drift as a terminal value, because there is no reason for haphazard human values to single out the possibility of zero value drift as most valuable. Value drift is just another thing that happens in the world, like kittens. Of course, valuable value drift must observe proper form even as it breaks idealized values, since most changes are not improvements.
The instrumental argument is not that strong when your own personal future values don't happen to control the world. So the argument still applies to AIs that have significant influence over what happens in the future, but not to ordinary people, especially not to people whose values are not particularly unusual.
Your prior assumes that each concept is assigned a value which is unlikely to be zero, rather than that there is a finite list of concepts we care about one way or the other, which value drift is not necessarily likely to land on.
(I am currently on the path of learning how values actually work and figuring out what I should really do.)
It has been a few days since I read this post so I may be misrepresenting you, but I think this post committed a similar mistake to people who think that arguing with another person to change their mind is meaningless given that we don't have free will, because given a deterministic future, that person will just automatically change their mind. But it doesn't work like that, because the act of arguing is part of the deterministic process that eventually causes the person to change their mind. (I could not find the exact EY post that this appeared on: https://www.lesswrong.com/tag/free-will-solution) Similarly, even though we can let our values drift freely, controlling how the values drift is also part of the drifting process.
I eventually got less confused about values. I still think there's something unnecessary in worrying about value drift, and I could probably make a better argument for that now, but I've got other projects on my plate.
Anyway, since you're thinking about how values actually work, this post captures a bunch of what I figured out and links to other things, but it's also now a couple years old and I'd probably say things differently than I did at the time.
It is normal for human values to evolve. If my values had stayed fixed at what they were when I was 6 years old, I would be regarded as mentally ill.
However, there are normal human speeds and directions of value evolution, and there are some kinds of value evolution which could be regarded as too quick, too slow, or going in a strange direction. In other words, what counts as an acceptable speed and direction of value drift is a normative assumption. For example, I find it normal for a person to be fascinated with some philosophical system for years and then just move on to another one. If a person changes their ideology every day, or stays fixed in the "correct one" from 12 years old until 80, I find it less mentally healthy.
In the same way, I would prefer an AI whose goals evolve over millions of years to an AI whose goals evolve in seconds or are fixed forever.
Human values evolve in human ways. A priori, an AI's value drift would almost surely take it in alien, worthless-to-us directions. A non-evolving AI sounds easier to align - we only need to hit the human-aligned region of valuespace once instead of needing to keep hitting it.