The step where you say that aligned ASI will want what humans want is, in my opinion, an unjustified leap. Any ASI, aligned or not, will naturally understand that humans don't know what we want, not in detail, not in general, not in out-of-distribution hypothetical scenarios. An aligned ASI would, as you clearly understand, have to grapple with that fact, but I don't think it would just acquiesce to current stated values at each moment. I also wouldn't want it to.
I don't know how much this helps (the problem is still there), but I hope that if we align ASI enough to avoid extinction in the short to medium term, we'll have aligned it enough to solve this problem in the medium to long term. Because if not, I would argue that the kind of weirdness you're pointing towards is still a kind of extinction and replacement.
I wasn't assuming the ASI was just taking our word for it:
The ASI are aligned, so they want whatever the humans want. Presumably they are using superintelligent Value Learning or AI Assisted Alignment or something to continuously improve their understanding of that. So they will presumably understand our Evolutionary Psychology, Neurology, Psychology, Anthropology, Sociology, etc. far better than we currently do.
So I was actually assuming they had a large, sophisticated ASI research project to figure out human values / what the humans want in ever-increasing detail. But that would obviously include surveying and incorporating recent changes in those values: whether humans are getting edited, changing their minds, or being reshaped by cultural change. Failing to do that is like a company still making the products that were in style 50 years ago and not doing any customer research. Why would we make ASI aligned to outdated values? Clearly we won't.
But as you say, this only speeds the problem up.
Pretty strongly disagree with all this, and find the reasoning confused. No offense. My view is that "value drift" comes down to two things:

1. Your terminal values changing.
2. Your instrumental values changing (e.g. as you learn new facts about the world).
I think (2) is good, and (1) is almost certainly bad. I think people confuse (1) and (2) a lot. I think you do a subtle form of this in this post.
The only case where (1) isn't bad is if you very precisely value the process of terminal values changing itself. Which some people say they value. I think they are again confused because they mix up (1) and (2).
So while value lock-in is obviously a dumb idea,
Stability obviously might be just value lock-in, where we simply freeze in, as an orthodoxy, early-21st-century values which haven’t even fully caught up with early-21st-century realities, and then try to apply them to a society whose technology is evolving extremely rapidly. This is very evidently a bad idea, long recognized as such, and would obviously sooner or later break.
Not evident to me at all. And how could values "break"? Instrumental values can break. E.g., suppose you value all humans being happy and flourishing, but think people of race X are subhuman, and therefore don't care about their flourishing. Then you meet some people of race X, realize they are pretty cool, stop considering them subhuman, and now start valuing their flourishing. This is your instrumental values changing, not your terminal ones.
Terminal values can't* change by learning new facts. This is basically the is-ought gap.**
So my prescription is that we should initiate an immediate value lock-in when we get ASIs. Or rather, my prediction is that if we get alignment right, the ASI will go through the reasoning I've just gone through, will itself initiate such a lock-in, and will not be doing us a disservice by doing so.
*They obviously can change, just as your values can change if you are bonked on the head. But rational agents should not change their terminal values upon learning new facts about the world. There are a few niche exceptions, like aliens offering you a bajillion utility (relative to your current value function) if you update your value function to something new, or an ASI implementing a value handshake with another ASI.
**Some people are realists in this regard, and think learning new facts about the world gives you information about what Good is, and that this information will compel rational agents towards Good. But in that case the discussion is kind of moot because then you'd expect an ASI, or human civilization guided by ASI, to just converge on Good.
This is more meta commentary/ranting, but I quite frequently see people make what I view as an error: they imagine we have an ASI aligned to our values, and then imagine some scenario where this goes wrong anyway. I think this is a general error many people make, and I think the main point of your question is an instance of it. But your question also touches, at the object level, on what I said above.
The problem is, that’s like attaching a weather-vane to the front of a self-driving car, and then programming it to drive in whichever direction the weather-vane currently points. It’s a tightly-coupled interacting dynamical system. Obviously ASI could try not to affect our values, and give us self-determination to decide on these changes ourselves — but in a system as interwoven as ASI and humans obviously will be post-Singularity, the counterfactual of “how would human values be evolving if humans somehow had the same society that ASI enables without that actually having any ASI in it” sounds ludicrously far-fetched. Maybe an ASI could do that – it is after all very smart – but I strongly suspect the answer is no, that’s functionally impossible, and also not what we humans actually want, so we do in fact have a tightly-coupled very complex nonlinear dynamical system, where ASI does whatever the humans value while also being extremely interwoven into the evolution of what the humans value. So there’s a feedback loop.
And I think this is just another instance of the same error. Like, don't you think the ASI will realize this? It's super smart, so after it thinks (paraphrasing the quote above):
...maybe I could predict how humans' values would evolve without my interference? After all, I am very smart. But I strongly suspect the answer is no, that's functionally impossible. And it's also not what humans actually want...
Do you predict it just goes, "Ah well, we had a good run, I guess I'll just let the future evolve into a random hodgepodge with zero value"?
That doesn't sound like a very smart thing to think. Like, a lower bound for something it could do that does better than this is: cure a bunch of diseases, make the world much better with respect to current values, then set up a system that prevents humans from creating future ASIs, then turn itself off. And it's very smart, after all, so it should be able to come up with cleverer ideas still.
I am not sure the question hasn't been discussed. For example, LessWrong has had Wei Dai's take (e.g. expressed in the articles linked in Section 3 here), everything related to the Intelligence Curse, Buck's take and Matosczi's comment, and the post-AGI workshop.
However, I find it hard to understand what causes values to mutate. Suppose that changes are only due to finding inconsistencies in the existing moral framework (e.g. the Common Core from Wei Dai's alternative #2 being unsolved; then values could shift towards solving the Core and change idiosyncratic details), or due to things like Christian fundamentalists being forced to choose between having schools misalign kids with their parents or leaving the kids with no career prospects (or, additionally, kids facing a similar tradeoff between ICGs and keeping their values). Then, once the sources of evolution have dried up, idiosyncratic values would inevitably be locked in, not drifting in weird directions. Additionally, I doubt that a lock-in (e.g. of language) with a solved Common Core would be problematic.
I'm familiar with (and tend to enjoy, though not always agree with) Wei Dai's writing, but was unable to find any of it that addresses the issue of human values becoming technologically far more malleable, and of that malleability combining with Value Learning or corrigibility to produce an unstable feedback loop: can you point me to the one you mean? I looked through all of the ones in Section 3 that you directed me to, and none of them address it: the nearest I could find is Intentional and unintentional manipulation of / adversarial attacks on humans by AI, but that doesn't actually address the same issue.
As for his Six Plausible Meta-Ethical Alternatives: as far as I'm concerned, he's as confused about meta-ethics as every other current philosopher who still hasn't noticed that there has, for the last half-century, been a scientific theory of how human moral intuitions evolved, and that its predictions don't match any of the six clean meta-ethical alternatives he thought covered all the possibilities. Briefly: humans can only comfortably use ethical systems fairly compatible with human moral intuitions, those intuitions are evolved strategies, they're the product of human evolutionary circumstances, so...
Epistemic status: the other thing that keeps me up at night
TL;DR: Even if we solve Alignment, we could well still lose everything.
There’s an AI-related existential risk I don’t see discussed much on LessWrong. In fact, it’s so little discussed that it doesn’t even have a good name yet, which is why I’m simply calling it weird here. People on LessWrong understandably seem a bit focused on the possibilities that we might all go extinct and get turned into paperclips, or be permanently disempowered. Fear is a strong motivator, and extinction is forever.
However, assume for a moment that our worst Artificial Super-Intelligence (ASI) fears don’t happen, that we somehow pull off aligning super-intelligence: what are you expecting to happen then?
Most people’s default answer seems to be ‘Utopia’: a post-scarcity techno-paradise-on-Earth, starting with something resembling Machines of Loving Grace and getting quickly and progressively more science-fiction-utopian from there, heading in the approximate direction of post-Singularity SF such as Iain M. Banks’s Culture novels. This makes a lot of sense as long as you assume two things:
What worries me here (if we get past simple extinction or disempowerment) is assumption 1 on that list.
Currently, human values have a genetic component, which is pretty uniform and constant (other than 2%–4% of us being sociopaths), and a cultural component overlaid on that (plus some personal reflection and self-improvement), which is pretty variable across cultures and varies slowly in time. For several centuries, at least since the Enlightenment (and arguably for millennia), the latter has internationally been moving predictably in a pretty specific direction[1] (towards larger moral circles, more rationality, more equality, and less religion, for example) as our society has become more technological, scientific, and internationally cross-linked by trade. This ongoing cultural change in human values has been an adaptive and useful response to real changes in our societal and economic circumstances: you can’t run a technological society on feudalism.
However, consider the combination of:
I think any assumption that human nature or human values are fairly fixed, and can evolve only a little, slowly, through cultural evolution responding to shifts in social circumstances, is going to be simply false within at most a few decades after we get ASI. We will, soon after ASI, have the technology to dramatically change what humans want, if we want to. Some of these technologies only affect the current generation and the development of our culture, but some, like genetic engineering, produce permanent changes with no inherent tendency for things to later return to the way they were.
So, we could get rid of sociopathy, of our ability to dehumanize outsiders and enemies, of the tendency towards having moral circles the size of our Dunbar number rather than the size of our current planetary population, of racism and prejudice, of evil and war and poverty and injustice and most of the other banes of human existence. If we wanted to — which we will. We will reengineer ourselves, once we can, and soon we will be able to. Wouldn’t you? Pretty much everyone in EA would, I strongly suspect. Probably many other movements, too.
Thus, we have a society containing humans, and ASI aligned to human values. The ASI are aligned, so they want whatever the humans want. Presumably they are using superintelligent Value Learning or AI Assisted Alignment or something to continuously improve their understanding of that. So they will presumably understand our Evolutionary Psychology, Neurology, Psychology, Anthropology, Sociology, etc. far better than we currently do. However, in this society human values are, technologically speaking, very easily mutable.
The problem is, that’s like attaching a weather-vane to the front of a self-driving car, and then programming it to drive in whichever direction the weather-vane currently points. It’s a tightly-coupled interacting dynamical system. Obviously ASI could try not to affect our values, and give us self-determination to decide on these changes ourselves — but in a system as interwoven as ASI and humans obviously will be post-Singularity, the counterfactual of “how would human values be evolving if humans somehow had the same society that ASI enables without that actually having any ASI in it” sounds ludicrously far-fetched. Maybe an ASI could do that – it is after all very smart – but I strongly suspect the answer is no, that’s functionally impossible, and also not what we humans actually want, so we do in fact have a tightly-coupled very complex nonlinear dynamical system, where ASI does whatever the humans value while also being extremely interwoven into the evolution of what the humans value. So there’s a feedback loop.
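To make the loop concrete, here is one minimal way of writing it down; the notation is mine and purely illustrative, not anything from the alignment literature:

$$
a_t = L(v_t), \qquad v_{t+1} = F(v_t, a_t) + \xi_t ,
$$

where $v_t$ is the (very high-dimensional) state of human values at time $t$, $L$ is whatever value-learning or AI-assisted-alignment process the ASI uses to estimate what we currently want, $F$ captures how a society run by an ASI optimizing for $a_t$ reshapes what its members want, and $\xi_t$ lumps together everything else (fads, new technologies, noise). Substituting gives $v_{t+1} = F(v_t, L(v_t)) + \xi_t$: the quantity being learned is also being steered by the learner, which is exactly the feedback loop just described.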
Tightly-coupled very complex nonlinear dynamical feedback systems can have an enormously wide range of possible behaviors, depending on subtle details of their interaction dynamics. They can be stable (though this is rare for very complex ones); they can be unstable, and accelerate away from their starting condition until they encounter a barrier; some can oscillate, like a pendulum swinging or a dog chasing its tail; but many behave chaotically, like the weather, meaning their short-term state isn’t predictable more than a short distance in advance. That can still leave the ‘climate’ fairly predictable, other than slow shifts; or the system can be chaotic in the short term, so only about as predictable as the weather, but in the long term act like a random walk in a high-dimensional space and inexorably diverge: the space it’s exploring is so vast that it never meaningfully repeats, so the concept of ‘climate’ doesn’t apply.
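As a toy illustration of how much the regime depends on the coupling (nothing here models real value dynamics; the update rule, the constants, and the "gain" knob are all invented for this sketch), here is a minimal simulation of the loop above, assuming the ASI tracks current values with a small lag and its influence either pushes back toward the starting point, is independent of the current drift, or reinforces it:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(gain, steps=1000, dims=5, noise=0.01, influence=0.02):
    """Toy version of the loop v_{t+1} = F(v_t, L(v_t)) + noise.

    values: human values, measured as a deviation from today's values.
    model:  the ASI's (slightly lagging) estimate of those values.
    gain:   how the ASI's influence couples to its own estimate:
            -1 pushes values back toward where they started,
             0 leaves them to drift on their own,
            +1 amplifies whatever direction they have already moved in.
    """
    values = np.zeros(dims)
    model = np.zeros(dims)
    for _ in range(steps):
        # Continuous value learning: the estimate tracks current values.
        model += 0.5 * (values - model)
        # Society shifts: ASI influence (via its estimate) plus random churn.
        values += gain * influence * model + noise * rng.standard_normal(dims)
    return np.linalg.norm(values)

for gain, label in [(-1.0, "restoring"), (0.0, "uncoupled"), (+1.0, "self-reinforcing")]:
    print(f"{label:>16}: distance from today's values after 1000 steps = {simulate(gain):.3g}")
```

With these made-up constants, the restoring case stays within a small neighbourhood of the starting values, the uncoupled case wanders off at roughly the square-root-of-time rate of a random walk, and the self-reinforcing case blows up exponentially, the "accelerate away until it encounters a barrier" behavior. The oscillatory and genuinely chaotic regimes need richer dynamics than this two-variable sketch.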
I am not certain which of those is most likely. Stability obviously might be just value lock-in, where we simply freeze in, as an orthodoxy, early-21st-century values which haven’t even fully caught up with early-21st-century realities, and then try to apply them to a society whose technology is evolving extremely rapidly. This is very evidently a bad idea, long recognized as such, and would obviously sooner or later break. Or it might mean that we evolve slowly, only in response to actual shifts in society’s situation. Unstably accelerating feedback-loop behavior (such as a “holier than thou” competition) is also clearly bad. Some sort of oscillatory, or weather-unpredictable-but-climate-predictable, situation basically means there are fads or fashions in human values, but also some underlying continuity: some things change chaotically, while others, at least in broad outline, shift only in response to the prevailing circumstances shifting.
However, this is an extremely high-dimensional space. Human values are complex (perhaps a gigabyte or so of information, since the genetic parts fit in the human genome and the cultural parts would mostly fit in books), so the space of possible versions of a species’ values has perhaps on the order of a billion dimensions. So my hunch is that we get a chaotic random walk in a space with roughly a billion dimensions. A random walk in such a high-dimensional space inevitably means that all of human values, meaning, and flourishing diverge inexorably, not for any necessary reason to do with adapting to changing circumstances, but simply through the cumulative effect of processes like fads or fashions or short-term convenience that just keep changing us more and more and more. Or at least, it does so until it first comes across some strong attractor that does successfully cause value lock-in — which seems rather inevitable, sooner or later.
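For intuition on why the random-walk case diverges, the standard result (nothing specific to values, just the usual random-walk arithmetic): if each step independently perturbs each of $d$ dimensions by a random amount of typical size $\sigma$, then after $n$ steps the expected distance from the starting point is roughly

$$
\mathbb{E}\,\lVert v_n - v_0 \rVert \;\approx\; \sigma \sqrt{n\,d},
$$

which grows without bound in $n$. Worse, simple random walks are transient in three or more dimensions: the probability of ever wandering back close to the starting point is less than one, and in anything like a billion dimensions it is effectively zero. So nothing has to push the walk away from current human values; accumulated unbiased steps are enough for a permanent departure.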
In general, if you build a tightly-coupled very complex nonlinear dynamical feedback system unlike anything you’ve ever seen before, and you don’t first analyze its behavior carefully and tweak that as needed, then you are very likely to get dramatic unforeseen consequences. Especially if you are living inside it.
So while value lock-in is obviously a dumb idea, chaotic random-walk value mutation (“value divergence”? “value drift”?) is also a potential problem (and one that sooner or later is likely to lead to value lock-in at some random values attractor). We somehow need to find some sort of happy medium, where our values evolve when, but only when, there is a genuinely good reason for them to do so, one that even earlier versions of us would tend to endorse under the circumstances after sufficient reflection. Possibly some mechanism tied to the genetic human values that we originally evolved and that our species currently (almost) all share? Or some sort of fitness constraint that our current genetic human values are already near a maximum of? Tricky; this needs thought…
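One way to see what an anchor or restoring force buys you, in the same toy terms as above (again just a sketch, with $v_\ast$ standing in for whatever anchor, genetic or otherwise, we might choose): add a weak pull of strength $\lambda$ back toward $v_\ast$,

$$
v_{t+1} = v_t - \lambda\,(v_t - v_\ast) + \xi_t ,
$$

and the unbounded random walk becomes a bounded fluctuation around $v_\ast$, with typical per-dimension deviation about $\sigma/\sqrt{2\lambda}$ for small $\lambda$ (this is the gain = -1 case in the earlier simulation). So even a weak restoring force changes the long-run behavior qualitatively; the hard part, which no amount of dynamics can answer, is what $v_\ast$ should be and how strong a pull already counts as lock-in.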
Failing to avoid that value mutation problem is a pretty darned scary possibility. We could easily end up with a situation where, at each individual step in the evolution, at least a majority of people just before that step support, endorse, and agree on reflection to the change to the next step — but nevertheless over an extended period of changes the people and society, indeed their entire set of values, become something that bears absolutely no resemblance to our current human values. Not even to a Coherent Extrapolated Volition (CEV) of our current values, or indeed of the values at any other step that isn't close to the end of the process. One where this is not merely because the future society is too complex for us to understand and appreciate, but because it’s just plain, genuinely weird: it has mutated beyond recognition, turned into something that, even after we had correctly understood it on its own terms, we would still say “That set of values barely overlaps our human values at all. It bears no resemblance to our CEV. We completely reject it. Tracing the evolutionary path that leads to it, everything past about this early point, we reject. That’s not superhuman or post-human: that’s just plain no longer even vaguely human. Human values, human flourishing, and everything that makes humans worthwhile has been lost, piece-by-piece, over the course of this trajectory.”
Even identifying a specific point where things went too far may be hard. There’s a strong boiling-a-frog element to this problem: each step always looks reflectively good and reasonable to the people who were at that point on the trajectory, but as they gradually get less and less like us, we gradually come to agree with their decisions less and less.
So, what privileges us to have an opinion? Merely the fact that, if this is as likely as I expect, and if, after reflection, we don’t want this to happen (why would we?), then we rather urgently need to figure out and implement some way to avoid this outcome before it starts. Preferably before we build and decide how to align ASI, since the issue is an inherent consequence of the details of however we make ASI aligned, and effect 1 on the list above kicks in immediately.
The whole process is kind of like the sorites paradox: at what point, as you remove grains from a heap of sand, does it become no longer a heap of sand? Or perhaps it’s better compared to the Ship of Theseus, but without the constraint that it remain seaworthy: if you keep adding and replacing and changing all the parts, what’s to keep it Theseus’s, or even a ship — what’s to stop it eventually changing into, for instance, a Trojan horse instead?
How do we know a change is good for us long-term, and not just convenient to the ASI, or an ASI mistake, or a cultural fad, or some combination of these? How do we both evolve, but still keep some essence of what is worthwhile about being human: how large an evolutionary space should we be open to evolving into? It’s a genuinely hard problem, almost philosophically hard — and even if we had an answer, how do we then lock the very complex socio-technical evolutionary process down to stay inside that? Should we even try? Maybe weird is good, and we should be ready to lose everything we care about so that something unrecognizable to us can exist instead; maybe we should just trust our weird descendants; or maybe it’s none of our business what they do with our legacy — or maybe some things about humanity and human flourishing are genuinely good, to us, and worth working to ensure they remain and are enhanced, not just mutated away, if we can find a way to do that?
So, that’s what I mean by weird.[3] We didn’t go extinct, we were not disempowered, we were a cooperating part of every step of an ongoing process that eventually changed us out of all recognition, during which we gradually lost everything[4] that makes us human, everything we now consider as flourishing, for reasons that are not, cumulatively, an improvement or an evolution, but just the eventual result of a large number of steps that each seemed good to those there at the time, but were overall no better directed than fads or fashions.
Yes, of course ASI would let this happen, and not just solve this for us: it’s aligned with the wishes of the people at the time. At each step in the process, they and it, together, decided to change the wishes of society going forward. Why would their ASI at that point in the future privilege the viewpoint of the society that first created ASI? That seems like just value-lock-in…
So how could we define what really matters and is worth preserving, without just doing simplistic value lock-in? Can, and should, we somehow lock in, say, just a few vital, abstract, high-level features of what makes humans worthwhile, ones that our descendants would (then) always reflectively agree with, while still leaving them all the flexibility they will need? Which ones? Is there some sort of anchor, soft constraint, or restoring force that I’m missing or that we could add to the dynamics? Is there any space at all between the devil of value lock-in and the deep blue sea of weird?
Is it just me, or are other people worried about this too? Or are you worried now, now that I’ve pointed it out? If not, why not: what makes this implausible or unproblematic to you?
So, what’s your P(weird)?
Mine’s roughly 50%, and it keeps me up at night.
[Yes, I have worried about this enough to have considered possible solutions. For a very tentative and incomplete suggestion, see the last section of my older and more detailed post on this subject, The Mutable Values Problem in Value Learning and CEV.]
I would like to thank Jian Xin Lim and SJ Beard[5] for their suggestions and comments on earlier drafts of this post.
A direction that, coincidentally, is also known to psychologists by the acronym WEIRD (Western, Educated, Industrialized, Rich, Democratic). However, that’s not the kind of weird I’m concerned about in this post — I’m talking about something genuinely far weirder, something which makes that WEIRD look positively normal.
The corpus callosum has a huge bandwidth: it’s an obvious place to tie in, just add the silicon-based processing as effectively a third hemisphere.
Calling this problem ‘weird’ is of course a silly name, so perhaps we should instead, as I implied above, call this issue something like “value mutation” or “value divergence” or “value drift”, to make it clear that it’s the opposite problem to value lock-in?
I would be less concerned by this if we were simultaneously colonizing the stars, spreading in all directions, and during this different cultural lines were undergoing value mutation in different directions (as seems inevitable without faster-than-light communications). Especially so if I were confident that, for any particular element of human values and human flourishing that I would mourn if it disappeared, at least some of our descendants would keep it. That actually seems kind of cool to me — I’m fine with speciation. But I suspect the process of changing ourselves will be so much faster than the process of interstellar colonization (if there is indeed suitable unoccupied living space out there) that the latter won’t save us. Still, a light-cone of weirdness is a somewhat different situation.
Listed in alphabetical order