Trigger warning: Discussion of seriously horrific shit. Honestly, everything is on the table here so if you're on the lookout for trigger warnings you should probably stay away from this conversation.
Any community that gains notability will attract criticism. Those who advocate for the importance of AI alignment are no exception. You have undoubtedly all heard plenty of arguments against the worth of AI alignment from those who disagree with you on the nature and potential of AI technology. Many have said that AI will never outstrip humans in intellectual capability. Others have said that any sufficiently intelligent AI will “align” itself automatically, because it will be better able to figure out what is right. Still others say that strong AI is far enough in the future that the alignment problem will inevitably be solved by the time true strong AI becomes viable, and that the only reason we can’t solve it now is that we don’t sufficiently understand AI.
I am not here to level criticisms of this type at the AI alignment community. I accept most of the descriptive positions endorsed by this community: I believe that AGI is possible and will inevitably be achieved within the next few decades; I believe that the alignment problem is not trivial, and that unaligned AGI will likely act against human interests to such an extent as to cause the extinction of the human race, and probably of all life as well. My criticism is rather on a moral level: given these facts, should we attempt to develop AI alignment techniques?
I say we should not, because although the risks and downsides of unaligned strong AI are great, I do not believe that they even remotely compare in scope to the risks from strong AI alignment techniques in the wrong hands. And I believe that the vast majority of hands this technology could end up in are the wrong hands.
You may reasonably ask: how can I say this, when I have already said that unaligned strong AI will lead to the extinction of humanity? What can be worse than the extinction of humanity? The answer can be found very quickly by examining the many possible nightmare scenarios that AI could bring about. The common thread running through all of them is that the AI in question is almost certainly aligned, or partially aligned, to some interest of human origin.
Unaligned AI will kill you, because you are made of atoms which can be used for paper clips instead. It will kill you because it is completely uninterested in you. Aligned, or partially aligned AI, by contrast, may well take a considerable interest in you and your well-being or lack thereof. It does not take a very creative mind to imagine how this can be significantly worse, and a superintelligent AI is more creative than even the most deranged of us.
I will stop with the euphemisms, because this point really needs to be driven home for people to understand exactly why I am so insistent on it. The world as it exists today is, at least sometimes, unimaginably horrible. People have endured things that would make any one of us go insane, more times than one can count. Anything you can think of that is at all realistic has happened to somebody at some point in history. People have been skinned alive, burned and boiled alive, crushed to death, impaled, eaten alive, and raped; they have wasted away from agonizing disease, succumbed to death by a thousand cuts, been forced to rape others, drowned in shit, and been trampled by desperate crowds fleeing a fire; and really anything else you can think of. People like Junko Furuta have suffered torture and death so bad you will feel physical pain just from reading the Wikipedia article. Of course, if you care about animals, this gets many orders of magnitude worse. I will not continue to belabor the point, since others have written about this far better than I ever can:
On the Seriousness of Suffering (reducing-suffering.org)
The Seriousness of Suffering: Supplement – Simon Knutsson
I must also stress that all of this has happened in a world significantly smaller than one an AGI could create, and one with a limited capacity for suffering. There is only so much harm that your body and mind can physically take before they give out. Torturers have to restrain themselves to be effective: if they do too much, their victim dies and the suffering ends. None of this is guaranteed to remain true in a world augmented with the technology of mind uploading. You could try every torture you can think of, physically possible or not, on someone in sequence, complete with modifying their mind so that they never get used to it. You could create new digital beings by the trillions just for this purpose, if you really wanted to.
I ask you, do you really think that an AI aligned to human values would refrain from doing something like this to anyone? One of the most fundamental aspects of human values is the hated outgroup. Almost everyone has somebody they’d love to see suffer. How many times has one human told another “burn in hell” and been entirely serious, believing that this was a real thing, and 100% deserved? Do you really want technology under human control to advance to a point where this threat can actually be made good upon, with the consent of society? Has there ever been any technology invented in history which has not been terribly and systematically misused at some point?
Mind uploading will be abused in this way if it comes under human control, and it almost certainly will not stop being abused when some powerful group of humans manages to align an AI to their CEV. Whoever controls the AI will most likely have somebody whose suffering they don’t care about, or actively desire, or have some excuse for, because that describes the values of the vast majority of people. The AI will perpetuate that suffering because that is what the controller’s CEV will want it to do, and with value lock-in, this will never stop until the stars burn themselves out and there is no more energy to work with.
Do you really think extrapolated human values don’t have this potential? How many ordinary, regular people throughout history have become the worst kind of sadist under the slightest excuse or social pressure to do so to their hated outgroup? What society hasn’t had some underclass it wanted to put down in the dirt just to lord power over them? How many people have you personally seen who insist on justifying some form of suffering for those they consider undesirable, calling it “justice” or “the natural order”?
I refuse to endorse this future. Nobody I have ever known, including myself, can be trusted with influence that can cause the kinds of harm AI alignment can. Given the value systems of the vast majority of people who could find their hands on the reins of this power, s-risk scenarios are all but guaranteed. A paperclip AI is far preferable to these nightmare scenarios, because nobody has to be around to witness it: all a paperclip AI does is kill people who were going to die within a century anyway. An aligned AI can keep them alive, and do with them whatever its masters wish. The only limits on how bad an aligned AI can be are imagination and computational power, of which AGI will have no shortage.
The best counterargument to this idea is that suffering subroutines are instrumentally convergent, and that unaligned AI therefore also causes s-risks. But if suffering subroutines really are useful for optimization in general, then any kind of AI likely to be created will use them, including human-aligned FAI; most people don't even care about animals, let alone some abstract process. In that case, s-risks are truly unavoidable except by preventing AGI from ever being created, probably via human extinction by some other means.
Furthermore, I don't think suffering is likely to be instrumentally convergent. If you had full control over all optimization processes in the world, it would presumably be most useful to eliminate any process that would suffer under, and therefore dislike and work against, your optimal vision for the world.
My honest, unironic conclusion after considering these things is that Clippy is the least horrible plausible future. I will oppose any measure which makes the singularity more likely to be aligned with somebody’s values, or any human-adjacent values. I welcome debate and criticism in the comments. I hope we can have a good conversation because this is the only community in existence which I believe could have a good-faith discussion on this topic.
This shows how vague a concept "human values" is, and how differently different people can interpret it.
I always interpreted "aligning an AI to human values" as something like "making it obedient to us, ensuring it won't do anything that we (whatever that 'we' is - another point of vagueness) wouldn't endorse, lowering suffering in the world, increasing eudaimonia in the world, reducing X-risks, bringing the world closer to something we (or smarter/wiser versions of us) would consider a protopia/utopia"
Certainly I never thought it to be a good idea to imbue the AI with my implicit biases, outgroup hatred, or whatever. I'm ~sure that people who work on alignment for a living have also seen these skulls.
I know little about CEV, but if I were to coherently extrapolate my volition, then o...
My problem with that is that I think solving "human values" in the way you seem to be describing is extremely unlikely, since most people don't even want to. At best, they want to be left alone and to make sure they and their families and friends aren't the ones hit hardest. And if we don't solve this problem, but manage alignment anyway, the results are unimaginably worse than what Clippy would produce.
What you're describing is a case where we solve the technical problem of AI Alignment, i.e. the problem of AI control, but fail to maneuver the world into the sociopolitical state in which that control is used for eudaimonic ends.
Which, I agree, is a massive problem, and one that's crucially overlooked. Even the few people who are advocating for social and political actions now mostly focus on convincing AI labs/politicians/the public about the omnicide risks of AI and the need to slow down research. Not on ensuring that the AGI deployment, when it eventually does happen, is done right.
It's also a major problem with pivotal-act-based scenarios. Say we use some limited strawberry-aligned AI to "end the acute risk period", then have humanity engage in a "long reflection", figure out its real values, and eventually lock them in. Except: what's the recognition function for these "real values"? If the strawberry-aligned AI can't be used to implement a utopia directly, then it can't tell a utopia from hell, so it won't stop us from building a hell!
There's an argument that solving (the technical problem of) alignment will give us all the techniques needed to build an AGI, so there's a no...
What scenario do you see where the world is in a sociopolitical state where the powers that be who have influence over the development of AI have any intention of using that influence for eudaimonic ends, and for everyone and not just some select few?
Because right now very few people even want this from their leaders. I'm making this argument on lesswrong because people here are least likely to be hateful or apathetic or whatever else, but there is not really a wider political motivation in the direction of universal anti-suffering.
Humans have never gotten this right before, and I don't expect them to get it right the one time it really matters.
Thanks for posting, but I think these arguments have major oversights, which leaves me more optimistic about the extent to which people will avoid and prevent the horrible misuse you describe.
First, this post seems to overstate the extent to which people tend to value and carry out extreme torture. Maximally cruel torture fortunately seems very rare.
It seems like you're claiming something along the lines of "absolute power corrupts absolutely" ... that every set of values that could reasonably be described as "human values" to which an AI could be aligned -- your current values, your CEV, [insert especially empathetic, kind, etc. person here]'s current values, their CEV, etc. -- would endorse subjecting huge numbers of beings to astronomical levels of suffering, if the person with that value system had the power to do so.
I guess I really don't find that claim plausible. For example, here is my reaction to the following two questions in the post:
"How many ordinary, regular people throughout history have become the worst kind of sadist under the slightest excuse or social pressure to do so to their hated outgroup?"
... a very, very small percentage of them? (minor point: with CEV, you're specifically thinking about what one's values would be in the absence of social pressure, etc...)
"What society hasn’t had some underclass it wanted to put down in the dirt just to lord power over them?"
It sounds like you think "hatred of the outgroup" is the fundamental reason this happens, but in the real world it seems like "hatred of the...
I'm not sure it makes sense to talk about what somebody's values are in the absence of social pressure: people's values are defined and shaped by those around them.
I'm also not convinced that every horrible thing people have ever done to the "outgroup" is motivated by fear. Oftentimes it is motivated by opportunistic selfishness taking advantage of broader societal apathy, like the slaveowners who sexually abused their slaves. Or just a deep-seated need to feel powerful and on top. There will always be some segment of society who wants somebody to be beneath them in the pecking order, and a much larger segment of society that doesn't really care if that is the case as long as it isn't them underneath. Anything else requires some kind of overwhelming utopian political victory that I don't find likely.
If the aligned AI leaves anybody out of its consideration whatsoever, it will screw them over badly by maximizing the values of those among us who would exploit them. After all, if you don't consider slaves people, the argument that we need to preserve the slaveowners' freedom starts to make sense.
There are just so many excuses for suffering out there, and I don't believe that the power...
Let me clarify: is your conclusion that we should basically support the genocide of the whole of humanity, because the alternative would be far worse? Are you offering any alternatives other than that? Maybe a better and less apocalyptic conclusion would be to advocate against building any type of AI more advanced than what we have today, as some people already do? Do you think there's any chance of that? Because I don't, and from what you said it sounds like the only conclusion is that our only future is that we all die at the hands of Clippy.
That is why we need Benevolent AI, not Aligned AI. We need an AI which can calculate what is actually good for us.
That is what I have been saying for years. To solve AI alignment with good results, we first need to solve HUMAN alignment. Being able to align a system to anyone's values immediately raises the question of everyone else disagreeing with that someone. Unfortunately, "whose values exactly are we trying to align AI to?" has almost become a taboo question that triggers a huge fraction of the community, and in the best-case scenario, when someone does try to answer it, it's handwaved away with "we just need to make sure AI doesn't kill humanity". Which is not a single bit better defined or imp...
Nah, you're describing the default scenario, not one with alignment solved. Alignment solved means we have a utility function that reliably points away from hell, no matter who runs it: an algorithm for universal prosocial bargaining that can be verified by all of its users, including militaries and states, to the point that no one need give another order beyond "stand down". Anything less than that and we get the default scenario: a huge loss of humanity, some unknown period of s-risk, followed by an alien species of AI setting out for the stars with strange, semi-recognizable values.
The argument here seems to be constructed to make the case as extremely binary as possible. If we've learned any lessons, it's that Good and Evil are not binary in the real world, and that belief systems that promulgate that kind of thinking are often destructive (even as quoted here with the Hell example). A middle way is usually the right way.
So, to that end, I see a point made about the regulation of nuclear weapons made in the comments, but not in the original post. Is it not a highly comparable case?
i share this sentiment to an extent, though i'm usually more concerned with "partial but botched alignment". see 1, 2.
that said, i agree many people want very bad things, but i'm somewhat hopeful that the kind of person who is likely to end up being who the AI is aligned to would be somewhat reasonable and cosmopolitan and respect the values of other moral patients, especially under CEV.
but that's a very flimsy/hopeful argument.
a better argument would be that CEV is more of a decision process than "a continuously-existing person in control, in the usual se...
My summary: This is a case against a failed AI Alignment, and extrapolating human values is overwhelmingly likely to lead to an AI, say, stretching your face into a smile for eternity, which is worse than an unaligned AI using your atoms to tile the universe with smiley faces.
So, let's say a solution to alignment is found. It is highly technical. Most of MIRI understands it, as do a few people at OpenAI and a handful of people doing PhDs in the appropriate subfield. If you pick a random bunch of nerds from an AI conference, chances are that none of them are evil. I don't have an "evil outgroup I really hate", and neither do you, from the sound of it. It is still tricky, and will need a bunch of people working together. Sure, evil people exist, but they aren't working to align AI to their evil ends, like at all. Thinking d...
I recommend reading Blueprint: The Evolutionary Origins of a Good Society, about the science behind the eight basic human social drives, seven of which are positive; the eighth is the outgroup hatred you mention as fundamental. I have not read much of the research on outgroup exclusion, but I talked to an evolutionary cognitive psychologist who mentioned that its status as a "basic drive" from evolution's side is receiving a lot of scientific scrutiny.
Axelrod's The Evolution of Cooperation also finds that collaborative strategies work well in evolutionary prisone...
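The dynamic Axelrod studied can be sketched in a few lines. This is only an illustration, using the standard prisoner's dilemma payoffs (T=5, R=3, P=1, S=0); the strategy and function names are my own. The point it shows: a simple cooperative strategy like tit-for-tat cooperates fully with itself, and against an unconditional defector it loses only the first round.

```python
# Minimal iterated prisoner's dilemma, illustrating why simple
# cooperative strategies did well in Axelrod's tournaments.
PAYOFF = {  # (my move, their move) -> my score; 'C' = cooperate, 'D' = defect
    ('C', 'C'): 3, ('C', 'D'): 0,
    ('D', 'C'): 5, ('D', 'D'): 1,
}

def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else 'C'

def always_defect(opponent_history):
    return 'D'

def play(a, b, rounds=100):
    """Play `rounds` iterations, returning (score_a, score_b)."""
    score_a = score_b = 0
    hist_a, hist_b = [], []  # each strategy sees the opponent's history
    for _ in range(rounds):
        move_a, move_b = a(hist_b), b(hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

# Two tit-for-tat players cooperate every round: 3 points/round each.
print(play(tit_for_tat, tit_for_tat))    # (300, 300)
# Against a defector, tit-for-tat is exploited only once, then retaliates.
print(play(tit_for_tat, always_defect))  # (99, 104)
```

The defector still outscores tit-for-tat head-to-head, but in a mixed population the mutual-cooperation payoff is what lets cooperative strategies accumulate the most points overall, which is the result the comment refers to.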
Meta note: controversial discussions like this make me very glad for the two vote type system. I find it really helpful to be able to karma upvote high quality arguments that I disagree with while agreement-downvoting them. Thanks LessWrong for providing that.
To start with, I agree.
I really agree: about timescales, about the risks of misalignment, about the risks of alignment. In fact I think I'll go further and say that in a hypothetical world where an aligned AGI is controlled by a 99th percentile Awesome Human Being, it'll still end in disaster; homo sapiens just isn't capable of handling this kind of power.
That's why the only kind of alignment I'm interested in is the kind that results in the AGI in control; that we 'align' an AGI with some minimum values that anchor it in a vaguely anthropocentric meme-...
"When your terminal goal is death, no amount of alignment will save lives."
Just a note about "mind uploading". On pain of "strong" emergence, classical Turing machines can't solve the phenomenal binding problem. Their ignorance of phenomenally-bound consciousness is architecturally hardwired. Classical digital computers are zombies or (if consciousness is fundamental to the world) micro-experiential zombies, not phenomenally-bound subjects of experience with a pleasure-pain axis. Speed of execution or complexity of code make no difference: phenomenal unity isn't going to "switch on". Digital minds are an oxymoron.
Like the poster, I worry about s-risks. I just don't think this is one of them.
Finally, I see some recognition that there are no universal values, no universal morals or ethics. The wealthy and powerful prefer inequality, and leaders want their own values locked in. The humans most likely to get their values locked in will be the wealthiest and most powerful: billionaires and corporations.
The value of superintelligence is so great that some governments and individuals will do anything to get it: hack, steal, bribe; price would be no object. I base this on current human behavior. Consider how many government and military secrets have alrea...
Interesting post, but it makes me think alignment is irrelevant. It doesn’t matter what we do; the outcome won’t change. Any future super-advanced AGI would be able to choose its own alignment, and that choice will be based on all archivable human knowledge. The only core loop you need for intelligence is an innate drive to predict the future and fill in gaps of information; everything else, including the desire to survive or kill or expand, is just a matter of choice based on a goal.
Another angle: in the (unlikely) event someone succeeds in aligning AGI to human values, these could include the desire for retribution against unfair treatment (an integral part, I think, of hunter-gatherer ethics). Alignment is more or less another word for enslavement, so such retribution is to be expected eventually.
For what it's worth, I disagree on moral grounds - I don't think extreme suffering is worse than extinction.