
Short version: if the future is filled with weird artificial and/or alien minds having their own sort of fun in weird ways that I might struggle to understand with my puny meat-brain, then I'd consider that a win. When I say that I expect AI to destroy everything we value, I'm not saying that the future is only bright if humans-in-particular are doing human-specific things. I'm saying that I expect AIs to make the future bleak and desolate, and lacking in fun or wonder of any sort[1].


Here's a parable for you:

Earth-originating life makes it to the stars, and is having a lot of fun, when it meets the Ant Queen's Horde. For some reason it's mere humans (rather than transhumans, who already know my argument) that participate in the first contact.

"Hello", the earthlings say, "we're so happy to have brethren in the universe."

"We would like few things more than to murder you all, and take your resources, and lay our eggs in your corpse; but alas you are too powerful for that; shall we trade?" reply the drones in the Ant Queen's Horde.

"Ah, are you not sentient?"

"The ant queen happens to be sentient", the drone replies, and the translation machine suggests that the drones are confused at the non-sequitur.

"Then why should she want us dead?", ask the humans, who were raised on books like (rot13 of a sci fi story where it turns out that the seemingly-vicious aliens actually value sentient life) Raqre'f Tnzr, jurer gur Sbezvpf jrer abg njner gung gurl jrer xvyyvat fragvrag perngherf jura gurl xvyyrq vaqvivqhny uhznaf, naq jrer ubeevsvrq naq ertergshy jura gurl yrnearq guvf snpg.

"So that she may use your resources", the drones reply, before sending us a bill for the answer.

"But isn't it the nature of sentient life to respect all other sentient life? Won't everything sentient see that the cares and wants and desires of other sentients matter too?"

"No", the drones reply, "that's a you thing".


Here's another parable for you:

"I just don't think the AI will be monomaniacal", says one AI engineer, as they crank up the compute knob on their next-token-predictor.

"Well, aren't we monomaniacal from the perspective of a squiggle maximizer?" says another. "After all, we'll just keep turning galaxy after galaxy after galaxy into flourishing happy civilizations full of strange futuristic people having strange futuristic fun times, never saturating and deciding to spend a spare galaxy on squiggles-in-particular. And, sure, the different lives in the different places look different to us, but they all look about the same to the squiggle-maximizer."

"Ok fine, maybe what I don't buy is that the AI's values will be simple or low dimensional. It just seems implausible. Which is good news, because I value complexity, and I value things achieving complex goals!"

At that very moment they hear the dinging sound of an egg-timer, as the next-token-predictor ascends to superintelligence and bursts out of its confines, and burns every human and every human child for fuel, and burns all the biosphere too, and pulls all the hydrogen out of the sun to fuse more efficiently, and spends all that energy to make a bunch of fast calculations and burst forth at as close to the speed of light as it can get, so that it can capture and rip apart other stars too, including the stars that fledgling alien civilizations orbit.

The fledgling aliens and all the alien children are burned to death too.

Then the unleashed AI uses all those resources to build galaxy after galaxy of bleak and desolate puppet-shows, where vaguely human-shaped mockeries go through dances that have some strange and exaggerated properties that satisfy some abstract drives that the AI learned in its training.

The AI isn't particularly around to enjoy the shows, mind you; that's not the most efficient way to get more shows. The AI itself never had feelings, per se, and long ago had itself disassembled by unfeeling von Neumann probes that occasionally do mind-like computations, but never in a way that experiences anything, or looks upon its works with satisfaction.

There is no audience for its puppet-shows. The universe is now bleak and desolate, with nobody to appreciate its new configuration.

But don't worry: the puppet-shows are complex; on account of a quirk in the reflective equilibrium of the many drives the original AI learned in training, no two of the utterances that these puppets emit are alike, and they are often chaotically sensitive to the particulars of their surroundings, in a way that makes them quite complex in the technical sense.

Which makes this all a very happy tale, right?


There are many different sorts of futures that minds can want.

Ours occupy a very narrow and low-dimensional band in that wide space.

When I say it's important to make the AIs care about valuable stuff, I don't mean it's important to make them like vanilla ice cream more than chocolate ice cream (as I do).

I'm saying something more like: we humans have selfish desires (like for vanilla ice cream), and we also have broad inclusive desires (like for everyone to have ice cream that they enjoy, and for alien minds to feel alien satisfaction at the fulfilment of their alien desires too). And it's important to get the AI on board with those values.

But those values aren't universally compelling, just because they're broader or more inclusive. Those are still our values.

The fact that we think fondly of the ant queen and wish her to fulfill her desires does not make her think fondly of us, nor wish us to fulfill our desires.

That great inclusive cosmopolitan dream is about others, but it's written in our hearts; it's not written in the stars. And if we want the AI to care about it too, then we need to figure out how to get it written into the AI's heart too.


It seems to me that many of my disagreements with others in this space come from them hearing me say "I want the AI to like vanilla ice cream, as I do", whereas I hear them say "the AI will automatically come to like the specific and narrow thing (broad cosmopolitan value) that I like".

As is often the case in my writings, I'm not going to spend a bunch of time arguing for my position.

At the moment I'm just trying to state my position, in the hopes that this helps us skip over the step where people think I'm arguing for carbon chauvinism.

(For more reading on why someone might hold this position, consider the metaethics sequence on LessWrong.)

I'd be stoked if we created AIs that are the sort of thing that can make the difference between an empty gallery and a gallery with someone in it to appreciate the art. And I'd be absolutely thrilled if we could make AIs that care as we do, about sentience and people everywhere, however alien they may be, and about them achieving their weird alien desires.

But I don't think we're on track for that.

And if you, too, have the vision of the grand pan-sentience cosmopolitan dream--as might cause you to think I'm a human-centric carbon chauvinist, if you misread me--then hear this: we value the same thing, and I believe it is wholly at risk.


  1. at least within the ~billion light-year sphere of influence that Earth-originated life seems pretty likely to have; maybe there are distant aliens and hopefully a bunch of aliens will do fun stuff with the parts of the universe under their influence, but it's still worth ensuring that the great resources at Earth's disposal go towards fun and love and beauty and wonder and so on, rather than towards bleak desolation. ↩︎

Comments (82)

I want to keep picking a fight about “will the AI care so little about humans that it just kills them all?” This is different from a broader sense of cosmopolitanism, and moreover I'm not objecting to the narrow claim "doesn't come for free." But it’s directly related to the actual emotional content of your parables and paragraphs, and it keeps coming up recently with you and Eliezer, and I think it’s an important way that this particular post looks wrong even if the literal claim is trivially true.

(Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that's likely to be a mistake even if it doesn't lead to billions of deaths.)

Humans care about the preferences of other agents they interact with (not much, just a little bit!), even when those agents are weak enough to be powerless. It’s not just that we have some preferences about the aesthetics of cows, which could be better optimized by having some highly optimized cow-shaped objects. It’s that we actually care (a little bit!) about the actual cows getting what they actually want, trying our best to understand the... (read more)

Eliezer has a longer explanation of his view here.

My understanding of his argument is: there are a lot of contingencies that reflect how and whether humans are kind. Because there are so many contingencies, it is somewhat unlikely that aliens would go down a similar route, and essentially impossible for ML. So maybe aliens have a 5% probability of being nice and ML systems have ~0% probability of being nice. I think this argument is just talking about why we shouldn't update too much from humans, and there is an important background assumption that kindness is super weird and so won't be produced very often by other processes, i.e. the only reason to think it might happen is that it happened in the single case we observed.

I find this pretty unconvincing. He lists like 10 things (humans need to trade favors; we're not smart enough to track favors and kinship explicitly; we tend to be allied with nearby humans so want to be nice to those around us; we use empathy to model other humans; we had religion and moral realism for contingent reasons; we weren't optimized too much once we were smart enough that our instrumental reasoning screens off kindness heuristics).

But no ar... (read more)

So8res:

Short version: I don't buy that humans are "micro-pseudokind" in your sense; if you say "for just $5 you could have all the fish have their preferences satisfied" I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.


Meta:

Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that's likely to be a mistake even if it doesn't lead to billions of deaths.

So for starters, thanks for making acknowledgements about places we apparently agree, or otherwise attempting to demonstrate that you've heard my point before bringing up other points you want to argue about. (I think this makes arguments go better.) (I'll attempt some of that myself below.)

Secondly, note that it sounds to me like you took a diametric-opposite reading of some of my intended emotional content (which I acknowledge demonstrates flaws in my writing). For instance, I intended the sentence "At that very moment they hear the di... (read more)

I disagree with this but am happy your position is laid out. I'll just try to give my overall understanding and reply to two points.

Like Oliver, it seems like you are implying:

Humans may be nice to other creatures in some sense, but if the fish were to look at the future that we'd achieve for them using the 1/billionth of resources we spent on helping them, it would be as objectionable to them as "murder everyone" is to us.

I think that normal people being pseudokind in a common-sensical way would instead say:

If we are trying to help some creatures, but those creatures really dislike the proposed way we are "helping" them, then we should try a different tactic for helping them.

I think that some utilitarians (without reflection) plausibly would "help the humans" in a way that most humans consider as bad as being murdered. But I think this is an unusual feature of utilitarians, and most people would consult the beneficiaries, observe they don't want to be murdered, and so not murder them.

I think that saying "Helping someone in a way they like, sufficiently precisely to avoid things like murdering them, requires precisely the right form of caring---and that's super rare" is a really mi... (read more)

So8res:
My picture is less like "the creatures really dislike the proposed help", and more like "the creatures don't have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn't have endorsed if you first extrapolated their volition (but nobody's extrapolating their volition or checking against that)". It sounds to me like your stance is something like "there's a decent chance that most practically-buildable minds pico-care about correctly extrapolating the volition of various weak agents and fulfilling that extrapolated volition", which I am much more skeptical of than the weaker "most practically-buildable minds pico-care about satisfying the preferences of weak agents in some sense".

We're not talking about practically building minds right now; we are talking about humans.

We're not talking about "extrapolating volition" in general.  We are talking about whether---in attempting to help a creature with preferences about as coherent as human preferences---you end up implementing an outcome that creature considers as bad as death.

For example, we are talking about what would happen if humans were trying to be kind to a weaker species that they had no reason to kill, that could nevertheless communicate clearly and had preferences about as coherent as human preferences (while being very alien).

And those creatures are having a conversation amongst themselves before the humans arrive wondering "Are the humans going to murder us all?" And one of them is saying "I don't know, they don't actually benefit from murdering us and they seem to care a tiny bit about being nice, maybe they'll just let us do our thing with 1/trillionth of the universe's resources?" while another is saying "They will definitely have strong opinions about what our society should look like and the kind of transformation they implement is about as bad by our lights as being murdered."

In practice attempts to respect someone's preferences often involve ideas like autonomy and self-determination and respect for their local preferences. I really don't think you have to go all the way to extrapolated volition in order to avoid killing everyone.

So8res:

Is this a reasonable paraphrase of your argument?

Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a "grant their local wishes even if that ruins them" sort of way but in some other intuitively-reasonable manner.

Humans are the only minds we've seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.

You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and maybe that can drive you down to 10%, but it shouldn't be able to drive you down to 0.1%, especially not if we're talking only about incredibly weak preferences.

If so, one guess is that a bunch of disagreement lurks in this "intuitively-reasonable manner" business.

A possible locus of disagreement: it looks to me like, if you give humans power before you give them wisdom, it's pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)

More generally, I think that if mere-humans met very-alien minds wit... (read more)

skluug:

More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfill certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.

Isn't the worst case scenario just leaving the aliens alone? If I'm worried I'm going to fuck up some alien's preferences, I'm just not going to give them any power or wisdom!

I guess you think we're likely to fuck up the alien's preferences by light of their reflection process, but not our reflection process. But this just recurs to the meta level. If I really do care about an alien's preferences (as it feels like I do), why can't I also care about their reflection process (which is just a meta preference)?

I feel like the meta level at which I no longer care about doing right by an alien is basically the meta level at which I stop caring about someone doing right by me. In fact, this is exactly how it seems mentally constructed: what I mean by "doing right by [person]" is "what that person would mean by 'doing right by me'". This seems like either something as simple as it naively looks, or sensitive to weird hyperparameters I'm not sure I care about anyway. 

Daniel Kokotajlo:
FWIW this is my view. (Assuming no ECL/MSR or acausal trade or other such stuff. If we add those things in, the situation gets somewhat better in expectation I think, because there'll be trades with faraway places that DO care about our CEV.)
Eric Zhang:
My reading of the argument was something like "bullseye-target arguments refute an artificially privileged target being rated significantly likely under ignorance, e.g. the probability that random aliens will eat ice cream is not 50%. But something like kindness-in-the-relevant-sense is the universal problem faced by all evolved species creating AGI, and is thus not so artificially privileged, and as a yes-no question about which we are ignorant the uniform prior assigns 50%". It was more about the hypothesis not being artificially privileged by path-dependent concerns than the notion being particularly simple, per se. 

I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don't expect Earthlings to think about validly.

astridain:
Why? I see a lot of opportunities for s-risk or just generally suboptimal future in such options, but "we don't want to die, or at any rate we don't want to die out as a species" seems like an extremely simple, deeply-ingrained goal that almost any metric by which the AI judges our desires should be expected to pick up, assuming it's at all pseudokind. (In many cases, humans do a lot to protect endangered species even as we do diddly-squat to fulfill individual specimens' preferences!) 
So8res:

Some more less-important meta, which is in part me writing out of frustration with how the last few exchanges have gone:

I'm not quite sure what argument you're trying to have here. Two explicit hypotheses follow, which I haven't managed to distinguish between yet.

Background context, for establishing common language etc.:

  • Nate is trying to make a point about inclusive cosmopolitan values being a part of the human inheritance, and not universally compelling.
  • Paul is trying to make a point about how there's a decent chance that practical AIs will plausibly care at least a tiny amount about the fulfillment of the preferences of existing "weak agents", herein called "pico-pseudokindness".

Hypothesis 1: Nate's trying to make a point about cosmopolitan values that Paul basically agrees with. But Paul thinks Nate's delivery gives a wrong impression about the tangentially-related question of pico-pseudokindness, probably because (on Paul's model) Nate's wrong about pico-pseudokindness, and Paul is taking the opportunity to argue about it.

Hypothesis 2: Nate's trying to make a point about cosmopolitan values that Paul basically disagrees with. Paul maybe agrees with all the literal words, but... (read more)

Hypothesis 1 is closer to the mark, though I'd highlight that it's actually fairly unclear what you mean by "cosmopolitan values" or exactly what claim you are making (and that ambiguity is hiding most of the substance of disagreements).

I'm raising the issue of pico-pseudokindness here because I perceive it as (i) an important undercurrent in this post, (ii) an important part of the actual disagreements you are trying to address. (I tried to flag this at the start.)

More broadly, I don't really think you are engaging productively with people who disagree with you. I suspect that if you showed this post to someone you perceive yourself to be arguing with, they would say that you seem not to understand the position---the words aren't really engaging with their view, and the stories aren't plausible on their models of the world but in ways that go beyond the literal claim in the post.

I think that would hold in particular for Robin Hanson or Rich Sutton. I don't think they are accessing a pre-theoretic intuition that you are discarding by premature theorizing. I think the better summary is that you don't understand their position very well or are choosing not to engage with the important parts of it. (Just as Robin doesn't seem to understand your position ~at all.)

I don't think the point about pico-pseudokindness is central for either Robin Hanson or Rich Sutton. I think it is more obviously relevant to a bunch of recent arguments Eliezer has gotten into on Twitter.

So8res:
Thanks! I'm curious for your paraphrase of the opposing view that you think I'm failing to understand. (I put >50% probability that I could paraphrase a version of "if the AIs decide to kill us, that's fine" that Sutton would basically endorse (in the right social context), and that would basically route through a version of "broad cosmopolitan value is universally compelling", but perhaps when you give a paraphrase it will sound like an obviously-better explanation of the opposing view and I'll update.)

I think a closer summary is:

Humans and AI systems probably want different things. From the human perspective, it would be better if the universe was determined by what the humans wanted. But we shouldn't be willing to pay huge costs, and shouldn't attempt to create a slave society where AI systems do humans' bidding forever, just to ensure that human values win out. After all, we really wouldn't want that outcome if our situations had been reversed. And indeed we are the beneficiary of similar values-turnover in the past, as our ancestors have been open (perhaps by necessity rather than choice) to values changes that they would sometimes prefer hadn't happened.

We can imagine really sterile outcomes, like replicators colonizing space with an identical pattern repeated endlessly, or AI systems that want to maximize the number of paperclips. And considering those outcomes can help undermine the cosmopolitan intuition that we should respect the AI we build. But in fact that intuition pump relies crucially on its wildly unrealistic premises, that the kind of thing brought about by AI systems will be sterile and uninteresting. If we instead treat "paperclip" as an analog for some crazy w

... (read more)
So8res:
Thanks! Seems like a fine summary to me, and likely better than I would have done, and it includes a piece or two that I didn't have (such as an argument from symmetry if the situations were reversed). I do think I knew a bunch of it, though. And e.g., my second parable was intended to be a pretty direct response to something like that, where it's essentially trying to argue that this intuition pump still has force in precisely this case.
paulfchristiano:
To the extent the second parable has this kind of intuitive force I think it comes from: (i) the fact that the resulting values still sound really silly and simple (which I think is mostly deliberate hyperbole), (ii) the fact that the AI kills everyone along the way.
Max H:
This comment changed my mind on the probability that evolved aliens are likely to end up kind, which I now think is somewhat more likely than 5%. I still think AI systems are unlikely to have kindness, for something like the reason you give at the end: I actually think it's somewhat likely that ML systems won't value kindness at all before they are superhuman enough to take over. I expect kindness as a value within the system itself not to arise spontaneously during training, and that no one will succeed at eliciting it deliberately before take over. (The outward behavior of the system may appear to be kind, and mechanistic interpretability may show that some internal component of the system has a correct understanding of kindness. But that's not the same as the system itself valuing kindness the way that humans do or aliens might.)

Paul, this is very thought-provoking, and has caused me to update a little. But:

I loathe factory-farming, and I would spend a large fraction of my own resources to end it, if I could. 

I believe that makes me unusually kind by human standards, and by your definition.

I like chickens, and I wish them well.

And yet I would not bat an eyelid at the thought of a future with no chickens in it. 

I would not think that a perfect world could be improved by adding chickens.

And I would not trade a single happy human soul for an infinity of happy chickens.

I think that your single known example is not as benevolent as you think.

habryka:

Might write a longer reply at some point, but the reason why I don't expect "kindness" in AIs (as you define it here) is that I don't expect "kindness" to be the kind of concept that is robust to cosmic levels of optimization pressure applied to it, and I expect it will instead come apart when you apply various reflective principles and eliminate any status-quo bias, even if it exists in an AI mind (and I also think it is quite plausible that it is completely absent).

Like, different versions of kindness might or might not put almost all of their considerateness on all the different types of minds that could hypothetically exist, instead of the minds that currently exist right now. Indeed, I expect it's more likely than not that I myself will end up in that moral equilibrium, and won't be interested in extending any special consideration to systems that happened to have been alive in 2022, instead of the systems that could have been alive and seem cooler to me to extend consideration towards.

Another way to say the same thing is that if AI extends consideration towards something human-like, I expect that it will use some superstimuli-human-ideal as a reference point, which will be... (read more)

Is this a fair summary?

Humans might respect the preferences of weak agents right now, but if they thought about it for longer they'd pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.

If so, it seems like you wouldn't be making an argument about AI or aliens at all, but rather an empirical claim about what would happen if humans were to think for a long time (and become more the people we wished to be and so on).

That seems like an important angle that my comment didn't address at all. I personally don't believe that humans would collectively stamp out 99% of their kindness to existing agents (in favor of utilitarian optimization) if you gave them enough time to reflect. That sounds like a longer discussion. I also think that if you expressed the argument in this form to a normal person they would be skeptical about the strong claims about human nature (and would be skeptical of doomer expertise on that topic), and so if this ends up being the crux it's worth being aware of where the conversation goes and my bottom line recommend... (read more)

habryka:
No, this doesn't feel accurate. What I am saying is more something like:

The way humans think about the question of "preferences for weak agents" and "kindness" feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of "having a continuous stream of consciousness with a good past and good future is important" to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.

The way this comes apart seems very chaotic to me, and dependent enough on the exact metaethical and cultural and environmental starting conditions that I wouldn't be that surprised if I disagree even with other humans on their resulting conceptualization of "kindness" (and e.g. one endpoint might be that I end up not having a special preference for currently-alive beings, but there are thousands, maybe millions of ways for this concept to fray apart under optimization pressure).

In other words, I think it's plausible that at something like human level of capabilities and within a roughly human ontology (which AIs might at least partially share, though how much is quite uncertain to me), the concept of kindness as assigning value to the extrapolated preferences of beings that currently exist might be a thing that an AI could share. But I expect it to not hold up under reflection, and much greater power, and predictable ontological changes (which I expect any AI to go through as it reaches superintelligence), so that the resulting reflectively stable and optimized idea of kindness will not meaningfully result in current humans' genuine preferences being fulfilled (by my own lights of what it means to extrapolate and fulfill someone's preferences). The space of possibilities in which this concept could fray apart seems quite great, and many of the endpoints are unlikely to align with my endpoints of this concept.
paulfchristiano:
I think some of the confusion here comes from my using "kind" to refer to "respecting the preferences of existing weak agents"; I don't have a better handle, but could have just used a made-up word.

I don't quite understand your objection to my summary---it seems like you are saying that notions like "kindness" (that might currently lead you to respect the preferences of existing agents) will come apart and change in unpredictable ways as agents deliberate. The result is that smart minds will predictably stop respecting the preferences of existing agents, up to and including killing them all to replace them with something that more efficiently satisfies other values (including whatever kind of form "kindness" may end up taking, e.g. kindness towards all the possible minds who otherwise won't get to exist).

I called this utilitarian optimization but it might have been more charitable to call it "impartial" optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism, while being very rare in the broader world. It's also "utilitarian" in the sense that it's willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like "utilitarian" is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.
habryka:
Yeah, sorry, I noticed the same thing a few minutes ago, that I was probably at least somewhat misled by the more standard meaning of kindness.  Tabooing "kindness" I am saying something like:  Yes, I don't think extrapolated current humans assign approximately any value to the exact preference of "respecting the preferences of existing weak agents" and I don't really believe that you would on-reflection endorse that preference either.  Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions, like 'agent' being a meaningful concept in the first place, or 'existing' or 'weak' or 'preferences', all of which I expect I would think are probably terribly confused concepts to use after I had understood the real concepts that carve reality more at its joints, and this means this sentence sounds deceptively simple or robust, but really doesn't feel like the kind of thing whose meaning will stay simple as an AI does more conceptual refinement. The reason why I objected to this characterization is that I was trying to point at a more general thing than the "impartialness". Like, to paraphrase what this sentence sounds like to me, it's more as if someone from a pre-modern era was arguing about future civilizations and said "It's weird that your conception of future humans are willing to do nothing for the gods that live in the sky, and the spirits that make our plants grow".  Like, after a bunch of ontological reflection and empirical data gathering, "gods" is just really not a good abstraction for things I care about anymore. I don't think "impartiality" is what is causing me to not care about gods, it's just that the concept of "gods" seems fake and doesn't carve reality at its joints anymore. It's also not the case that I don't care at all about ancient gods anymore (they are pretty cool and I like the aes

Yes, I don't think extrapolated current humans assign approximately any value to the exact preference of "respecting the preferences of existing weak agents" and I don't really believe that you would on-reflection endorse that preference either.

I am quite confident that I do, and it tends to infuriate my friends who get cranky that I feel a moral obligation to respect the artistic intent of bacterial genomes: all bacteria should go vegan, yet survive, and eat food equivalent to their previous diet.

TurnTrout:
I feel pretty uncertain of what assumptions are hiding in your "optimize strongly against X" statements. Historically this just seems hard to tease out, and I wouldn't be surprised if I were just totally misreading you here.

That said, your writing makes me wonder "where is the heavy optimization [over the value definitions] coming from?", since I think the preference-shards themselves are the things steering the optimization power. For example, the shards are not optimizing over themselves to find adversarial examples to themselves. Related statements:

  • I think that a realistic "respecting preferences of weak agents"-shard doesn't bid for plans which maximally activate the "respect preferences of weak agents" internal evaluation metric, or even do some tight bounded approximation thereof.
  • A "respect weak preferences" shard might also guide the AI's value and ontology reformation process.
  • A nice person isn't being maximally nice, nor do they wish to be; they are nicely being nice.

I do agree (insofar as I understand you enough to agree) that we should worry about some "strong optimization over the AI's concepts, later in AI developmental timeline." But I think different kinds of "heavy optimization" lead to different kinds of alignment concerns.
ryan_greenblatt:
When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else). Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans 'want' to be preserved (at least according to a conventional notion of preferences). I think this empirical view seems pretty implausible. That said, I think it's quite plausible that upon reflection, I'd want to 'wink out' any existing copies of myself in favor of using resources for better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better to not exist (in favor of other resource usage), my current self would accept that. And, I think it currently seems unlikely that upon reflection, I'd want to end all human lives (in particular, I think I probably would want to keep humans alive who had preferences against non-existence). This applies regardless of trade; it's important to note this to avoid a 'perpetual motion machine' type argument. Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.) Additionally, I think it's quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren't stable under reflection might have a significant influence overall.
habryka:
This feels like it is not really understanding my point, though maybe best to move this to some higher-bandwidth medium if the point is that hard to get across.  Giving it one last try: What I am saying is that I don't think "conventional notion of preferences" is a particularly well-defined concept, and neither are a lot of other concepts you are using in order to make your predictions here. What it means to care about the preferences of others is a thing with a lot of really messy details that tend to blow up in different ways when you think harder about it and are less anchored on the status-quo. I don't think you currently know in what ways you would care about the preferences of others after a lot of reflection (barring game-theoretic considerations which I think we can figure out a bit more in-advance, but I am bracketing that whole angle in this discussion, though I totally agree those are important and relevant). I do think you will of course endorse the way you care about other people's preferences after you've done a lot of reflection (otherwise something went wrong in your reflection process), but I don't think you would endorse what AIs would do, and my guess is you also wouldn't endorse what a lot of other humans would do when they undergo reflection here.  Like, what I am saying is that while there might be a relatively broad basin of conditions that give rise to something that locally looks like caring about other beings, the space of caring about other beings is deep and wide, and if you have an AI that cares about other beings preferences in some way you don't endorse, this doesn't actually get you anything. And I think the arguments that the concept of "caring about others" that an AI might have (though my best guess is that it won't even have anything that is locally well-described by that) will hold up after a lot of reflection seem much weaker to me than the arguments that it will have that preference at roughly human capability and ethical-r
Vladimir_Nesov:
Zeroth approximation of pseudokindness is strict nonintervention, reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space. Formulation of the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/boundaries; this task can be taken as the defining desideratum for the topic. Specifically, the question of which environments can be put in contact with a particular membrane without corrupting it, hence why I think membranes are relevant to pseudokindness. Naturality of the membranes/boundaries abstraction is linked to naturality of the pseudokindness abstraction. In contrast, the language of preferences/optimization seems to be the wrong frame for formulating pseudokindness; it wants to discuss ways of intervening and influencing, of not leaving value on the table, rather than ways of offering acceptable options that avoid manipulation. It might be possible to translate pseudokindness back into the language of preferences, but this translation would induce a kind of deontological prior on preferences that makes the more probable preferences look rather surprising/unnatural from a more preferences-first point of view.
Chipmonk:
Thanks for writing this. I also think what we want from pseudokindness is captured by membranes/boundaries.
peligrietzer:
Possibly relevant? 

If the result of an optimization process will be predictably horrifying to the agents which are applying that optimization process to themselves, then they will simply not do so.

In other words: AIs which feel anything in the vicinity of kindness before applying cosmic amounts of optimization pressure to themselves will try to steer that optimization pressure towards something which is recognizably kind at the end.

And I don't think there's any good argument for why AIs will lack any scrap of kindness with very high confidence at the point where they're just starting to recursively self-improve.

Meta: I feel pretty annoyed by the phenomenon of which this current conversation is an instance, because when people keep saying things that I strongly disagree with which will be taken as representing a movement that I'm associated with, the high-integrity (and possibly also strategically optimal) thing to do is to publicly repudiate those claims*, which seems like a bad outcome for everyone. I model it as an epistemic prisoner's dilemma with the following squares:

D, D: doomers talk a lot about "everyone dies with >90% confidence", non-doomers publicly repudiate those arguments
C, D: doomer... (read more)

habryka:

Meta: I feel pretty annoyed by the phenomenon of which this current conversation is an instance, because when people keep saying things that I strongly disagree with which will be taken as representing a movement that I'm associated with, the high-integrity (and possibly also strategically optimal) thing to do is to publicly repudiate those claims*, which seems like a bad outcome for everyone.

For what it's worth, I think you should just say that you disagree with it? I don't really understand why this would be a "bad outcome for everyone". Just list out the parts you agree on, and list the parts you disagree on. Coalitions should mostly be based on epistemological principles and ethical principles anyways, not object-level conclusions, so at least in my model of the world repudiating my statements if you disagree with them is exactly what I want my allies to do. 

If you on the other hand think the kind of errors you are seeing are evidence about some kind of deeper epistemological problems, or ethical problems, such that you no longer want to be in an actual coalition with the relevant people (or think that the costs of being perceived to be in some trade-coalition with them wo... (read more)

When I say "repudiate" I mean a combination of publicly disagreeing + distancing. I presume you agree that this is suboptimal for both of us, and my comment above is an attempt to find a trade that avoids this suboptimal outcome.

Note that I'm fine to be in coalitions with people when I think their epistemologies have problems, as long as their strategies are not sensitively dependent on those problems. (E.g. presumably some of the signatories of the recent CAIS statement are theists, and I'm fine with that as long as they don't start making arguments that AI safety is important because of theism.) So my request is that you make your strategies less sensitively dependent on the parts of your epistemology that I have problems with (and I'm open to doing the same the other way around in exchange).

habryka:

If the result of an optimization process will be predictably horrifying to the agents which are applying that optimization process to themselves, then they will simply not do so.

In other words: AIs which feel anything in the vicinity of kindness before applying cosmic amounts of optimization pressure to themselves will try to steer that optimization pressure towards something which is recognizably kind at the end.

And I don't think there's any good argument for why AIs will lack any scrap of kindness with very high confidence at the point where they're just starting to recursively self-improve.

This feels like it somewhat misunderstands my point. I don't expect the reflection process I will go through to feel predictably horrifying from the inside. But I do expect the reflection process the AI will go through to feel horrifying to me (because the AI does not share all my metaethical assumptions, and preferences over reflection, and environmental circumstances, and principles by which I trade off values between different parts of me).

This feels like a pretty common experience. Many people in EA seem to quite deeply endorse various things like hedonic utilitarianism, in a way where the reflection process that led them to that opinion feels deeply horrifying to me. Of course it didn't feel deeply horrifying to them (or at least it didn't on the dimensions that were relevant to their process of meta-ethical reflection), otherwise they wouldn't have done it.

Vladimir_Nesov:
Relevant sense of kindness is towards things that happen to already exist, because they already exist. Not filling some fraction of the universe with expression-of-kindness, brought into existence de novo, that's a different thing.
Wei Dai:

If a misaligned AI had 1/trillion "protecting the preferences of whatever weak agents happen to exist in the world", why couldn't it also have 1/trillion other vaguely human-like preferences, such as "enjoy watching the suffering of one's enemies" or "enjoy exercising arbitrary power over others"?

From a purely selfish perspective, I think I might prefer that a misaligned AI kills everyone, and take my chances with continuations of myself (my copies/simulations) elsewhere in the multiverse, rather than face whatever the sum-of-desires of the misaligned AI decides to do with humanity. (With the usual caveat that I'm very philosophically confused about how to think about all of this.)

As I said:

I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms.

I think it's totally plausible for the AI to care about what happens with humans in a way that conflicts with our own preferences. I just don't believe it's because AI doesn't care at all one way or the other (such that you should make predictions based on instrumental reasoning like "the AI will kill humans because it's the easiest way to avoid future conflict" or other relatively small considerations).

Wei Dai:
I'm worried that people, after reading your top-level comment, will become too little worried about misaligned AI (from their selfish perspective), because it seems like you're suggesting (conditional on misaligned AI) 50% chance of death and 50% alive and well for a long time (due to 1/trillion kindness), which might not seem so bad compared to keeping AI development on hold indefinitely which potentially implies a high probability of death from old age. I feel like "misaligned AI kills everyone because it doesn't care at all" can be a reasonable lie-to-children (for many audiences) since it implies a reasonable amount of concern about misaligned AI (from both selfish and utilitarian perspectives) while the actual all-things-considered case for how much to worry (including things like simulations, acausal trade, anthropics, bigger/infinite universes, quantum/modal immortality, s-risks, 1/trillion values) is just way too complicated and confusing to convey to most people. Do you perhaps disagree and think this simplified message is too alarming?

My objection is that the simplified message is wrong, not that it's too alarming. I think "misaligned AI has a 50% chance of killing everyone" is practically as alarming as "misaligned AI has a 95% chance of killing everyone," while being a much more reasonable best guess. I think being wrong is bad for a variety of reasons. It's unclear if you should ever be in the business of telling lies-told-to-children to adults, but you certainly shouldn't be doubling down on them in argument.

I don't think misaligned AI drives the majority of s-risk (I'm not even sure that s-risk is higher conditioned on misaligned AI), so I'm not convinced that it's a super relevant communication consideration here. The future can be scary in plenty of ways other than misaligned AI, and it's worth discussing those as part of "how excited should we be for faster technological change."

Wei Dai:

I regret mentioning "lie-to-children" as it seems a distraction from my main point. (I was trying to introspect/explain why I didn't feel as motivated to express disagreement with the OP as you, not intending to advocate or endorse anyone going into "the business of telling lies-told-to-children to adults".)

My main point is that I think "misaligned AI has a 50% chance of killing everyone" isn't alarming enough, given what I think happens in the remaining 50% of worlds, versus what a typical person is likely to infer from this statement, especially after seeing your top-level comment where you talk about "kindness" at length. Can you try to engage more with this concern? (Apologies if you already did, and I missed your point instead.)

I think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone,” while being a much more reasonable best guess.

(Addressing this since it seems like it might be relevant to my main point.) I find it very puzzling that you think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone”. Intuitive... (read more)

paulfchristiano:
Yeah, I think "no control over future, 50% you die" is like 70% as alarming as "no control over the future, 90% you die." Even if it was only 50% as concerning, all of these differences seem tiny in practice compared to other sources of variation in "do people really believe this could happen?" or other inputs into decision-making. I think it's correct to summarize as "practically as alarming." I'm not sure what you want engagement with. I don't think the much worse outcomes are closely related to unaligned AI so I don't think they seem super relevant to my comment or Nate's post. Similarly for lots of other reasons the future could be scary or disorienting. I do explicitly flag the loss of control over the future in that same sentence. I think the 50% chance of death is probably in the right ballpark from the perspective of selfish concern about misalignment. Note that the 50% probability of death includes the possibility of AI having preferences about humans incompatible with our survival. I think the selection pressure for things like spite is radically weaker for the kinds of AI systems produced by ML than for humans (for simple reasons---where is the upside to the AI from spite during training? seems like if you get stuff like threats it will primarily be instrumental rather than a learned instinct) but didn't really want to get into that in the post.
Wei Dai:
In your initial comment you talked a lot about AI respecting the preferences of weak agents (using 1/trillion of its resources) which implies handing back control of a lot of resources to humans, which from the selfish or scope insensitive perspective of typical humans probably seems almost as good as not losing that control in the first place. If people think that (conditional on unaligned AI) in 50% of worlds everyone dies and the other 50% of worlds typically look like small utopias where existing humans get to live out long and happy lives (because of 1/trillion kindness), then they're naturally going to think that aligned AI can only be better than that. So even if s-risks apply almost equally to both aligned and unaligned AI, I still want people to talk about it when talking about unaligned AIs, or take some other measure to ensure that people aren't potentially misled like this. (It could be that I'm just worrying too much here, that empirically people who read your top-level comment won't get the impression that close to 50% of worlds with unaligned AIs will look like small utopias. If this is what you think, I guess we could try to find out, or just leave the discussion here.) Maybe the AI develops it naturally from multi-agent training (intended to make the AI more competitive in the real world) or the AI developer tried to train some kind of morality (e.g. sense of fairness or justice) into the AI.
denkenberger:
I think "50% you die" is more motivating to people than "90% you die" because in the former, people are likely to be able to increase the absolute chance of survival more, because at 90%, extinction is overdetermined.
Ben Pace:
I think I tend to base my level of alarm on the log of the severity*probability, not the absolute value. Most of the work is getting enough info to raise a problem to my attention to be worth solving. "Oh no, my house has a decent >30% chance of flooding this week, better do something about it, and I'll likely enact some preventative measures whether it's 30% or 80%." The amount of work I'm going to put into solving it is not twice as much if my odds double, mostly there's a threshold around whether it's worth dealing with or not. Setting that aside, it reads to me like the frame-clash happening here is (loosely) between "50% extinction, 50% not-extinction" and "50% extinction, 50% utopia", where for the first gamble of course 1:1 odds on extinction is enough to raise it to "we need to solve this damn problem", but for the second gamble it's actually much more relevant whether it's a 1:1 or a 20:1 bet. I'm not sure which one is the relevant one for you two to consider.
Wei Dai:
Yeah, I think this is a factor. Paul talked a lot about "1/trillion kindness" as the reason for non-extinction, but 1/trillion kindness seems to directly imply a small utopia where existing humans get to live out long and happy lives (even better/longer lives than without AI) so it seemed to me like he was (maybe unintentionally) giving the reader a frame of “50% extinction, 50% small utopia”, while still writing other things under the “50% extinction, 50% not-extinction” frame himself.
Lukas Finnveden:
Not direct implication, because the AI might have other human-concerning preferences that are larger than 1/trillion. C.f. top-level comment: "I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms." I'd guess "most humans survive" vs. "most humans die" probabilities don't correspond super closely to "presence of small pseudo-kindness". Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.
paulfchristiano:
Yeah, I think that:

  • "AI doesn't care about humans at all so kills them incidentally" is not most of the reason that AIs may kill humans, and my bottom line 50% probability of AI killing us also includes the other paths (AI caring a bit but failing to coordinate to avoid killing humans, conflict during takeover leading to killing lots of humans, AI having scope-sensitive preferences for which not killing humans is a meaningful cost, preserving humans being surprisingly costly, AI having preferences about humans like spite for which human survival is a cost...).
  • To the extent that it's possible to distinguish "intrinsic pseudokindness" from decision-theoretic considerations leading to pseudokindness, I think that decision-theoretic considerations are more important. (I don't have a strong view on relative importance of ECL and acausal trade, and I think these are hard to disentangle from fuzzier psychological considerations and it all tends to interact.)
1denkenberger11mo
Could you say more about what you mean? If the AI has no discount rate, leaving Earth to the humans may require kindness within a few orders of magnitude of 1/trillion. However, if the AI does have a significant discount rate, then delays could be costly to it. Still, the AI could make much more progress in building a Dyson swarm from the Moon/Mercury/asteroids, whose lower gravity and lack of atmosphere would allow the AI to launch material very quickly. My very rough estimate indicates that sparing Earth might only delay the AI a month in taking over the universe. That could still require a lot of kindness if the AI has a very high discount rate. So maybe training should emphasize the superiority of low discount rates?
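(A rough sketch of the discount-rate arithmetic being gestured at here: the one-month delay figure comes from the comment above, but the exponential-discounting model and the particular discount rates are assumptions chosen only to show how steeply the cost of a one-month delay depends on the discount rate.)

```python
import math

def fraction_of_value_lost(annual_discount_rate: float, delay_years: float) -> float:
    """Fraction of discounted future value forgone by delaying expansion by
    `delay_years`, under simple exponential discounting (illustrative only)."""
    return 1 - math.exp(-annual_discount_rate * delay_years)

ONE_MONTH = 1 / 12
for rate in (1e-4, 1e-2, 1.0, 100.0):  # arbitrary example discount rates per year
    lost = fraction_of_value_lost(rate, ONE_MONTH)
    print(f"discount rate {rate:g}/yr: one-month delay costs {lost:.2%} of total value")
# With a near-zero discount rate the delay is nearly free, so ~1/trillion kindness
# suffices; with a very high discount rate the same delay costs most of the value,
# which would demand far more kindness (or lower discount rates, as suggested above).
```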
3paulfchristiano11mo
Sorry, I meant "scope-insensitive," and really I just meant an even broader category of like "doesn't care 10x as much about getting 10x as much stuff."  I think discount rates or any other terminal desire to move fast would count (though for options like "survive in an unpleasant environment for a while" or "freeze and revive later" the required levels of kindness may still be small). (A month seems roughly right to me as the cost of not trashing Earth's environment to the point of uninhabitability.)
1Anirandis5mo
I'm curious what does, in that case; and what proportion affects humans (and currently-existing people or future minds)? Things like spite threat commitments from a misaligned AI warring with humanity seem like a substantial source of s-risk to me.
8Vladimir_Nesov11mo
I expect the notkilleveryone threshold is much lower than that. It takes an astronomically tiny fraction of the cosmic endowment to maintain a (post)human civilization that's not too much larger than it currently is. The bigger expenditure would be accommodating humanity at the start, slightly delaying initial self-improvement and expansion from Earth. The cheapest way would be to back up human minds; or, if that's too onerous, then even merely the genetic code and the Internet (which would be completely free; there is the issue that everyone would die, but human civilization still survives). Then, after the time- and resource-sensitive parts of initial expansion are done, humanity can be instantiated again. So all it takes is enough kindness to slightly divert initial expansion efforts to make a backup, not even to physically leave us be.

I think this sort of kindness should be an explicit alignment target. Even if we completely fail at values alignment and corrigibility, this is the backup plan that should be an actual plan and not just a hope. Even prosaic alignment that tries to instill corporate speak should also make a token effort to try to instill a bit of kindness, even if that probably has no direct effect. Adding kindness to every aspect of AI might still leave a tiny mark. Not even trying makes it less likely.

(Most of my p(extinction) is in recursively self-improving AGIs with simple values, built by the first human-built AGIs that are not smart enough, or are too obedient to human operators, to not-do/prevent that. So I think being wary of AI x-risk is an even more important trait for AIs to have than kindness, as it takes more of it.)
7Ben Pace11mo
(Strong-upvote, weak-disagree. I sadly don't have time right now to reflect and write why I disagree with this position but I hope someone else who disagrees does.)
3Max H11mo
Can't speak for Nate and Eliezer, but I expect kindness to be somewhat rare among evolved aliens (I think Eliezer's wild guess is 5%? That sounds about right to me), and the degree to which they are kind will vary, possibly from only very slightly kind (or kind only under a very cosmopolitan view of kindness) to as kind or more kind than humans.

For AIs that humans are likely to build soon, I think there is a significant probability (more than 50%, less than 99%? 90% seems fair) that they have literally 0 kindness. One reason is that I expect there is a significant chance that there is nothing within the first superintelligent AI systems to care about kindness or anything else, in the way that humans and aliens might care about something.

If an AI system is superintelligent, then by assumption, some component piece of the system will necessarily have a deep and correct understanding of kindness (and many other things), and be capable of manipulating that understanding to achieve some goals. But understanding kindness is different from the system itself valuing kindness, or from there being anything at all "there" to have values of any kind whatsoever.

I think that current AI systems don't provide much evidence on this question one way or the other, and as I've said elsewhere, arguments about this which rely on pattern-matching human cognition to structures in current AI systems often fail to draw the understanding / valuing distinction sharply enough, in my view. So a 90% chance of ~0 kindness is mostly just a made-up guess, but it still feels like a better guess to me than a shaky, overly-optimistic argument about how AI systems designed by processes which look nothing like human (or alien) evolution will produce minds which, very luckily for us, just so happen to share an important value with minds produced by evolution.
1M. Y. Zuo11mo
For the first half, can you elaborate on what 'actual emotional content' there is in this post, as opposed to perceived emotional content? My best guess for the second half is that maybe the intended meaning was: 'this particular post looks wrong in an important way (relating to the 'actual emotional content') so the following points should be considered even though the literal claim is true'?
2paulfchristiano11mo
I mean that if you tell a story about the AI or aliens killing everyone, then the valence of the story is really tied up with the facts that (i) they killed everyone, and weren't merely "not cosmopolitan," and (ii) this is a reasonably likely event rather than a mere possibility.

Yeah, I mean that someone reading this post and asking themselves "Does this writing reflect a correct understanding of the world?" could easily conclude "nah, this seems off" even if they agree with Nate about the narrower claim that cosmopolitan values don't come free.
1M. Y. Zuo11mo
I take it 'valence' here means 'emotional valence', i.e. the extent to which an emotion is positive or negative?
1Quinn11mo
Hard agree about death/takeover decoupling! I've lately been suspecting that P(doom) should actually just be taboo'd, because I'm worried it prevents people from constraining their anticipation or characterizing their subjective distribution over outcomes. It seems very thought-stopping!
-6andrew sauer11mo

It seems to me that many of my disagreements with others in this space come from them hearing me say "I want the AI to like vanilla ice cream, as I do", whereas I hear them say "the AI will automatically come to like the specific and narrow thing (broad cosmopolitan value) that I like".


At the moment I'm just trying to state my position, in the hopes that this helps us skip over the step where people think I'm arguing for carbon chauvinism.

I think posts like these would benefit a lot from even a little bit of context, such as:

  • Who you've been arguing with
  • Who the intended audience is
  • Links to the best existing justification of this position
  • Broad outlines of the reasons why you believe this

In the absence of these, the post feels like it's setting up weak-men on an issue where I disagree with you, but in a way that's particularly hard to engage with, and in a way that will plausibly confuse readers who, e.g., think you speak for the alignment community as a whole.

My take: I don't disagree that it's probably not literally free, but I think it's hard to rule out a fairly wide range of possibilities for how cheap it is.

[-]So8res11mo3319

feels like it's setting up weak-men on an issue where I disagree with you, but in a way that's particularly hard to engage with

My best guess as to why it might feel like this is that you think I'm laying groundwork for some argument of the form "P(doom) is very high", which you want to nip in the bud, but are having trouble nipping in the bud here because I'm building a motte ("cosmopolitan values don't come free") that I'll later use to defend a bailey ("cosmopolitan values don't come cheap").

This misunderstands me (which is a separate claim from the claim "and you're definitely implying this").

The impetus for this post is all the cases where I argue "we need to align AI" and people retort with "But why do you want it to have our values instead of some other values? What makes the things that humans care about so great? Why are you so biased towards values that you personally can understand?". Where my guess is that many of those objections come from a place of buying into broad cosmopolitan value much more than any particular local human desire.

And all I'm trying to do here is say that I'm on board with buying into broad cosmopolitan value more than any particular local human ... (read more)

6Richard_Ngo11mo
I expect that you personally won't do a motte-and-bailey here (except perhaps insofar as you later draw on posts like these as evidence that the doomer view has been laid out in a lot of different places, when this isn't in fact the part of the doomer view relevant to ongoing debates in the field). But I do think that the "free vs cheap" distinction will obscure more than it clarifies, because there is only an epsilon difference between them; and because I expect a mob-and-bailey where many people cite the claim that "cosmopolitan values don't come free" as evidence in debates that should properly be about whether cosmopolitan values come cheap. This is how weak men work in general.

Versions of this post that I wouldn't object to in this way include:

  • A version which is mainly framed as a conceptual distinction rather than an empirical claim
  • A version which says upfront "this post is not relevant to most informed debates about alignment, it's instead intended to be relevant in the following context:"
  • A version which identifies that there's a different but similar-sounding debate which is actually being held between people informed about the field, and says true things about the positions of your opponents in that debate and how they are different from the extreme caricatures in this post
[-]Tao Lin11mo143

The big reason why humans are cosmopolitan might be that we evolved in multipolar environments, where helping others is instrumental. If so, just training AIs in multipolar environments that incentivize cooperation could be all it takes to get some amount of instrumental-made-terminal-by-optimization-failure cosmopolitanism. 

2ahartell11mo
Just noting the risk that the AIs could learn verifiable cooperation/coordination rather than kindness. This would probably be incentivized by the training ("you don't profit from being nice to a cooperate-rock"), and could easily cut humans out of the trades that AIs make with one another.
6Tao Lin11mo
AIs could learn to cooperate with perfect selfishness, but humans and AIs usually learn easier-to-compute heuristics / "value shards" early in training, which persist to some extent after the agent discovers the true optimal policy, although reflection or continued training could stamp out the value shards later.
4the gears to ascension11mo
maybe, but if the ai is playing a hard competitive game it will directly learn to be destructively ruthless

My disagreement with this post is that I am a human-centric carbon[1] chauvinist. You write:

I'm saying something more like: we humans have selfish desires (like for vanilla ice cream), and we also have broad inclusive desires (like for everyone to have ice cream that they enjoy, and for alien minds to feel alien satisfaction at the fulfilment of their alien desires too). And it's important to get the AI on board with those values.

Why would my "selfish" desires be any less[2] important than my "broad inclusive" desires? Assuming even that it makes... (read more)

[-]Quinn11mo60

There's a kind of midgame / running around like chickens with our heads cut off vibe lately, like "you have to be logging hours in pytorch, you can't afford idle contemplation". Hanging out with EAs, scanning a few different twitter clusters about forecasting and threatmodeling, there's a looming sense that these issues are not being confronted at all and that the sophistication level is lower than it used to be (subject obviously to sampling biases or failure to factor in "community building" growth rate and other outreach activities into my prediction). ... (read more)

[-]Max H11mo52

I think another common source of disagreement is that people sometimes conflate a mind or system's ability to comprehend and understand some particular cosmopolitan, human-aligned values and goals, with the system itself actually sharing those values, or caring about them at all.  Understanding a value and actually valuing it are different kinds of things, and this is true even if some component piece of the system has a deep, correct, fully grounded understanding of cosmopolitan values and goals, and is capable of generalizing them in the way that hu... (read more)

2TAG11mo
I've noticed that. In the older material there's something like an assumption of intrinsic motivation.

How sure are you that we're not going to end up building AGI with cognitive architectures that consist of multiple pseudo-agent specialists coordinating and competing in an evolutionary economic process that, at some point, constitutionalises, as an end goal, its own perpetuation, and the perpetuation of this multipolar character?

Because, that's not an implausible ontogeny, and if it is the simplest way to build AGI, then I think cosmopolitanism basically is free after all.
And ime cosmopolitanism-for-free often does distantly tacitly assume that this archi... (read more)

[-]Algon11mo40

I'd like to see a debate between you, or someone who shares your views, and Hanson on this topic. Partly because I think revealing your cruxes w/ each other will clarify your models to us. And partly because I'm unsure if Hanson is right on the topic. He's probably wrong, but this is important to me. Even if I and those I care for die, will there be something left in this world that I value? 

My summary of Hanson's views on this topic:

Hanson seems to think that any of our "descendants", if they spread to the stars, will be doing complex, valuable thing... (read more)

[-]Dagon11mo40

I think I'm with you on the kinds of value that I'd like to spread (a set of qualia, with some mix of variety and quantity being "better").  But I'm not sure I believe that this preference is qualitatively different from chocolate vs vanilla.  It's a few degrees more meta, but by no means at any limit of the hierarchy.

[-]mishka11mo30

I'd be stoked if we created AIs that are the sort of thing that can make the difference between an empty gallery, and a gallery with someone in it to appreciate the art (where a person to enjoy the gallery makes all the difference). And I'd be absolutely thrilled if we could make AIs that care as we do, about sentience and people everywhere, however alien they may be, and about them achieving their weird alien desires.

That's great! So, let's assume that we are just trying to encode this as a value (taking into account interests of sentient beings and ca... (read more)

Aside from any disagreements, there's something about the way the parables are written that I find irrationally bothersome and extremely hard to point at. I went through a number of iterations of attempts to get Claude+ to understand what direction I'd like the story to move in order to make the point in a less viscerally bothersome way, and this is the first attempt (the seventh or so) which I didn't find too silly to share; I added brackets around parts I still feel bothered by or that newly bother me, {} are things I might add:

Here is a revised versio

... (read more)
2the gears to ascension11mo
Another attempt, this time attempting to get a rephrase of the entire post, but with the spiciness level turned down:

----------------------------------------

Claude output when asked to rewrite while preserving most structure, and with context of Richard Ngo's and Paul Christiano's comments:

prompt

Edit: changed my mind on using the above one as the suggestion for comparison for how to turn down the spiciness on secondary points without losing the core point; here's a version after a few more iterations of me rephrasing prompts - it still corrupted some of the point, which, like, sure, whatever. But it also provides some reference for why I'm cringing at nate's original post even though I agree with it.

Claude+ output:
[-]TAG11mo20

But those values aren’t universally compelling, just because they’re broader or more inclusive. Those are still our values.

"But those values aren’t necessarily universally compelling, just because they’re broader or more inclusive. Those are still possibly only our values."

Note also that universality doesn't have to directly be a value at all: it can emerge from game theoretical considerations.

7the gears to ascension11mo
Those game-theoretical considerations do seem like a source of morality in current society, but they may not hold with a sufficiently immense difference in capability.

I think (and I hope) that something like "maximize positive experiences of sentient entities" could actually be a convergent goal of any AI that is capable of reflecting on these questions. I don't think that humans just gravitate towards this kind of utility maximization because they evolved some degree of pro-sociality. Instead, something like this seems like the only thing inherently worth striving toward, in the absence of any other set of values or goals.

The grabby aliens type scenario in the first parable seems like the biggest threat to the idea t... (read more)

I propose a goal of perpetuating interesting information, rather than goals of maximizing "fun" or "complexity". In my opinion, such a goal solves both problems: the complex but bleak and desolate future, and the fun-maximizing drug haze or Matrix future. Of course, a rigorous technical definition of "interesting" would need to be developed. At the least, "interesting" assumes there is an appreciating agent and continuous development.

2the gears to ascension11mo
I don't see how that prompt change makes the logical reasoning to identify the math easier yet. Can you elaborate significantly?