What are the best arguments for/against AIs being "slightly 'nice'"?

Raemon

A while ago, Nate Soares wrote the posts Decision theory does not imply that we get to have nice things, Cosmopolitan values don't come free, and But why would the AI kill us?

Paul Christiano put forth some arguments that "it seems pretty plausible that AI will be at least somewhat 'nice'", similar to how humans are somewhat nice to animals. There was some back-and-forth.

More recently we had Eliezer's post ASIs will not leave just a little sunlight for Earth.

I have a sense that something feels "unresolved" here. The current comments on Eliezer's post look likely to be rehashing the basics and I'd like to actually make some progress on distilling the best arguments. I'd like it if we got more explicit debate about this.

I also have some sense that the people previously involved (i.e. Nate, Paul, Eliezer) are sort of tired of arguing with each other. But I am hoping someones-or-other end up picking up the arguments here, hashing them out more, and/or writing more distilled summaries of the arguments/counterarguments.

To start with, I figured I would just literally repeat most of the previous comments in a top-level post, to give everyone another chance to read through them.

Without further ado, here they are:

Paul and Nate

Paul Christiano re: "Cosmopolitan Values Don't Come for Free."

I want to keep picking a fight about “will the AI care so little about humans that it just kills them all?” This is different from a broader sense of cosmopolitanism, and moreover I'm not objecting to the narrow claim "doesn't come for free." But it’s directly related to the actual emotional content of your parables and paragraphs, and it keeps coming up recently with you and Eliezer, and I think it’s an important way that this particular post looks wrong even if the literal claim is trivially true.
(Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that's likely to be a mistake even if it doesn't lead to billions of deaths.)
Humans care about the preferences of other agents they interact with (not much, just a little bit!), even when those agents are weak enough to be powerless. It’s not just that we have some preferences about the aesthetics of cows, which could be better optimized by having some highly optimized cow-shaped objects. It’s that we actually care (a little bit!) about the actual cows getting what they actually want, trying our best to understand their preferences and act on them and not to do something that they would regard as crazy and perverse if they understood it.
If we kill the cows, it’s because killing them meaningfully helped us achieve some other goals. We won't kill them for arbitrarily insignificant reasons. In fact I think it’s safe to say that we’d collectively allocate much more than 1/millionth of our resources towards protecting the preferences of whatever weak agents happen to exist in the world (obviously the cows get only a small fraction of that).
Before really getting into it, some caveats about what I want to talk about:
I don’t want to focus on whatever form of altruism you and Eliezer in particular have (which might or might not be more dependent on some potentially-idiosyncratic notion of "sentience.") I want to talk about caring about whatever weak agents happen to actually exist, which I think is reasonably common amongst humans. Let’s call that “kindness” for the purpose of this comment. I don’t think it’s a great term but it’s the best short handle I have.
I’ll talk informally about how quantitatively kind an agent is, by which I mean something like: how much of its resources it would allocate to helping weak agents get what they want? How highly does it weigh that part of its preferences against other parts? To the extent it can be modeled as an economy of subagents, what fraction of them are kind (or were kind pre-bargain)?
I don’t want to talk about whether the aliens would be very kind. I specifically want to talk about tiny levels of kindness, sufficient to make a trivial effort to make life good for a weak species you encounter but not sufficient to make big sacrifices on its behalf.
I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer to just use the humans for atoms.
You and Eliezer seem to think there’s a 90% chance that AI will be <1/trillion (perhaps even a 90% chance that they have exactly 0 kindness?). But we have one example of a smart mind, and in fact: (i) it has tons of diverse shards of preference-on-reflection, varying across and within individuals (ii) it has >1/million kindness. So it's superficially striking to be confident AI systems will have a million times less kindness.
I have no idea under what conditions evolved or selected life would be kind. The more preferences are messy with lots of moving pieces, the more probable it is that at least 1/trillion of those preferences are kind (since the less correlated the trillion different shards of preference are with one another and so the more chances you get). And the selection pressure against small levels of kindness is ~trivial, so this is mostly a question about idiosyncrasies and inductive biases of minds rather than anything that can be settled by an appeal to selection dynamics.
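The "more chances you get" step above can be made concrete with a toy calculation. This is only an illustrative sketch and not part of the original comment: it assumes a mind is made of `num_shards` preference shards, that each shard is independently kind with some small probability `p_kind`, and that one kind shard is enough to count. Both names and all numbers are hypothetical.

```python
import math


def prob_some_kind_shard(num_shards: int, p_kind: float) -> float:
    """P(at least one shard is kind), assuming shards are independent.

    Requires 0 <= p_kind < 1. Equivalent to 1 - (1 - p_kind) ** num_shards,
    but computed with log1p/expm1 so it stays accurate for very small p_kind.
    """
    return -math.expm1(num_shards * math.log1p(-p_kind))


def prob_some_kind_shard_fully_correlated(num_shards: int, p_kind: float) -> float:
    """Opposite extreme: all shards are kind together or not at all.

    Extra shards then give no extra chances, so the answer is just p_kind.
    """
    return p_kind if num_shards > 0 else 0.0


if __name__ == "__main__":
    p = 1e-9  # hypothetical per-shard chance of being kind
    for n in (1, 10**3, 10**6, 10**9):
        independent = prob_some_kind_shard(n, p)
        correlated = prob_some_kind_shard_fully_correlated(n, p)
        print(f"{n:>13,d} shards: independent={independent:.3g} correlated={correlated:.3g}")
```

Under these assumptions the independent case grows with the number of shards while the fully correlated case does not, which is the sense in which less correlated preferences give more chances. Real minds presumably sit somewhere between the two extremes.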
I can’t tell if you think kindness is rare amongst aliens, or if you think it’s common amongst aliens but rare amongst AIs.^[1] Either way, I would like to understand why you think that. What is it that makes humans so weird in this way?
(And maybe I'm being unfair here by lumping you and Eliezer together---maybe in the previous post you were just talking about how the hypothetical AI that had 0 kindness would kill us, and in this post how kindness isn't guaranteed. But you give really strong vibes in your writing, including this post. And in other places I think you do say things that don't actually add up unless you think that AI is very likely to be <1/trillion kind. But at any rate, if this post is unfair to you, then you can just sympathize and consider it directed at Eliezer instead who lays out this position much more explicitly though not in a convenient place to engage with.)
Here are some arguments you could make that kindness is unlikely, and my objections:
“We can’t solve alignment at all.” But evolution is making no deliberate effort to make humans kind, so this is a non-sequitur.
“This is like a Texas sharpshooter hitting the side of a barn then drawing a target around the point they hit; every evolved creature might decide that their own idiosyncrasies are common but in reality none of them are.” But all the evolved creatures wonder if a powerful AI they built would kill them or if it would be kind. So we’re all asking the same question, we’re not changing the question based on our own idiosyncratic properties. This would have been a bias if we’d said: humans like art, so probably our AI will like art too. In that case the fact that we were interested in “art” was downstream of the fact that humans had this property. But for kindness I think we just have n=1 sample of observing a kind mind, without any analogous selection effect undermining the inference.
“Kindness is just a consequence of misfiring [kindness for kin / attachment to babies / whatever other simple story].” AI will be selected in its own ways that could give rise to kindness (e.g. being selected to do things that humans like, or to appear kind). The a priori argument for why that selection would lead to kindness seems about as good as the a priori argument for humans. And on the other side, the incentives for humans to be not kind seem if anything stronger than the incentives for ML systems to not be kind. This mostly seems like ungrounded evolutionary psychology, though maybe there are some persuasive arguments or evidence I've just never seen.
“Kindness is a result of the suboptimality inherent in compressing a brain down into a genome.” ML systems are suboptimal in their own random set of ways, and I’ve never seen any persuasive argument that one kind of suboptimality would lead to kindness and the other wouldn’t (I think the reverse direction is equally plausible). Note also that humans absolutely can distinguish powerful agents from weak agents, and they can distinguish kin from unrelated weak agents, and yet we care a little bit about all of them. So the super naive arguments for suboptimality (that might have appealed to information bottlenecks in a more straightforward way) just don’t work. We are really playing a kind of complicated guessing game about what is easy for SGD vs easy for a genome shaping human development.
“Kindness seems like it should be rare a priori, we can’t update that much from n=1.” But the a priori argument is a poorly grounded guess about the inductive biases of spaces of possible minds (and genomes), since the levels of kindness we are talking about are too small to be under meaningful direct selection pressure. So I don’t think the a priori arguments are even as strong as the n=1 observation. On top of that, the more that preferences are diverse and incoherent the more chances you have to get some kindness in the mix, so you’d have to be even more confident in your a priori reasoning.
“Kindness is a totally random thing, just like maximizing squiggles, so it should represent a vanishingly small fraction of generic preferences, much less than 1/trillion.” Setting aside my a priori objections to this argument, we have an actual observation of an evolved mind having >1/million kindness. So evidently it’s just not that rare, and the other points on this list respond to various objections you might have used to try to salvage the claim that kindness is super rare despite occurring in humans (this isn’t analogous to a Texas sharpshooter, there aren't great debunking explanations for why humans but not ML would be kind, etc.). See this twitter thread where I think Eliezer is really off base, both on this point and on the relevance of diverse and incoherent goals to the discussion.
Note that in this comment I’m not touching on acausal trade (with successful humans) or ECL. I think those are very relevant to whether AI systems kill everyone, but are less related to this implicit claim about kindness which comes across in your parables (since acausally trading AIs are basically analogous to the ants who don't kill us because we have power).
A final note, more explicitly lumping you with Eliezer: if we can't get on the same page about our predictions I'm at least aiming to get folks to stop arguing so confidently for death given takeover. It’s easy to argue that AI takeover is very scary for humans, has a significant probability of killing billions of humans from rapid industrialization and conflict, and is a really weighty decision even if we don’t all die and it’s “just” handing over control over the universe. Arguing that P(death|takeover) is 100% rather than 50% doesn’t improve your case very much, but it means that doomers are often getting into fights where I think they look unreasonable.
I think OP’s broader point seems more important and defensible: “cosmopolitanism isn’t free” is a load-bearing step in explaining why handing over the universe to AI is a weighty decision. I’d just like to decouple it from "complete lack of kindness."

His followup comment continues:

Eliezer has a longer explanation of his view here.
My understanding of his argument is: there are a lot of contingencies that reflect how and whether humans are kind. Because there are so many contingencies, it is somewhat unlikely that aliens would go down a similar route, and essentially impossible for ML. So maybe aliens have a 5% probability of being nice and ML systems have ~0% probability of being nice. I think this argument is just talking about why we shouldn't update too much from humans, and there is an important background assumption that kindness is super weird and so won't be produced very often by other processes, i.e. the only reason to think it might happen is that it happened in the single case we observed.
I find this pretty unconvincing. He lists like 10 things (humans need to trade favors, we're not smart enough to track favors and kinship explicitly, and we tend to be allied with nearby humans so want to be nice to those around us, we use empathy to model other humans, and we had religion and moral realism for contingent reasons, we weren't optimized too much once we were smart enough that our instrumental reasoning screens off kindness heuristics).
But no argument is given for why these are unusually kindness-inducing settings of the variables. And the outcome isn't like a special combination of all of them, they each seem like factors that contribute randomly. It's just a lot of stuff mixing together.
Presumably there is no process that ensures humans have lots of kindness-inducing features (and we didn't select kindness as a property for which humans were notable, we're just asking the civilization-independent question "does our AI kill us"). So if you list 10 random things that make humans more kind, it strongly suggests that other aliens will also have a bunch of random things that make them more kind. It might not be 10, and the net effect might be larger or smaller. But:
I have no idea whatsoever how you are anchoring this distribution, and giving it a narrow enough spread to have confident predictions.
Statements like "kindness is super weird" are wildly implausible if you've just listed 5 independent plausible mechanisms for generating kindness. You are making detailed quantitative guesses here, not ruling something out for any plausible a priori reason.
As a matter of formal reasoning, listing more and more contingencies that combine apparently-additively tends to decrease rather than increase the variance of kindness across the population. If there was just a single random thing about humans that drove kindness it would be more plausible that we're extreme. If you are listing 10 things then things are going to start averaging out (and you expect that your 10 things are cherry-picked to be the ones most relevant to humans, but you can easily list 10 more candidates).
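The averaging point in the last item can be checked with a small simulation. This is an illustrative sketch rather than anything from the original comment: it assumes each species' total kindness is the sum of `num_factors` independent contributions, each drawn uniformly from [0, 1], and it measures spread relative to the mean (the coefficient of variation). The function names, the uniform distribution, and the sample sizes are all assumptions made for illustration.

```python
import math
import random
import statistics


def simulated_relative_spread(num_factors: int, num_species: int = 20000, seed: int = 0) -> float:
    """Std-dev / mean of total kindness across simulated species.

    Each species' kindness is the sum of num_factors independent
    Uniform(0, 1) draws (a hypothetical stand-in for "random contingencies").
    """
    rng = random.Random(seed)
    totals = [
        sum(rng.random() for _ in range(num_factors))
        for _ in range(num_species)
    ]
    return statistics.pstdev(totals) / statistics.fmean(totals)


def theoretical_relative_spread(num_factors: int) -> float:
    """Closed form for the same model.

    One Uniform(0, 1) draw has mean 1/2 and std-dev 1/sqrt(12), so the sum of
    k draws has mean k/2 and std-dev sqrt(k/12), giving a ratio of 1/sqrt(3k).
    """
    return 1.0 / math.sqrt(3.0 * num_factors)


if __name__ == "__main__":
    for k in (1, 2, 5, 10, 20):
        print(k, round(simulated_relative_spread(k), 3), round(theoretical_relative_spread(k), 3))
```

In this toy model the absolute variance of the sum still grows with the number of factors, but the spread relative to the typical level shrinks like one over the square root of the number of factors. That relative sense is the one in which a species driven by ten additive contingencies is less likely to be an extreme outlier than a species driven by one.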
In fact it's easy to list analogous things that could apply to ML (and I can imagine the identical conversation where hypothetical systems trained by ML are talking about how stupid it is to think that evolved life could end up being kind). Most obviously, they are trained in an environment where being kind to humans is a very good instrumental strategy. But they are also trained to closely imitate humans who are known to be kind, they've been operating in a social environment where they are very strongly expected to appear to be kind, etc. Eliezer seems to believe this kind of thing gets you "ice cream and condoms" instead of kindness OOD, but just one sentence ago he explained why similar (indeed, superficially much weaker!) factors led to humans retaining niceness out of distribution. I just don't think we have the kind of a priori asymmetry or argument here that would make you think humans are way kinder than models. Yeah it can get you to ~50% or even somewhat lower, but ~0% seems like a joke.
There was one argument that I found compelling, which I would summarize as: humans were optimized while they were dumb. If evolution had kept optimizing us while we got smart, eventually we would have stopped being so kind. In ML we just keep on optimizing as the system gets smart. I think this doesn't really work unless being kind is a competitive disadvantage for ML systems on the training distribution. But I do agree that if you train your AI long enough on cases where being kind is a significant liability, it will eventually stop being kind.

Nate Soares's reply

Short version: I don't buy that humans are "micro-pseudokind" in your sense; if you say "for just $5 you could have all the fish have their preferences satisfied" I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.
Meta:
Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that's likely to be a mistake even if it doesn't lead to billions of deaths.
So for starters, thanks for making acknowledgements about places we apparently agree, or otherwise attempting to demonstrate that you've heard my point before bringing up other points you want to argue about. (I think this makes arguments go better.) (I'll attempt some of that myself below.)
Secondly, note that it sounds to me like you took a diametric-opposite reading of some of my intended emotional content (which I acknowledge demonstrates flaws in my writing). For instance, I intended the sentence "At that very moment they hear the dinging sound of an egg-timer, as the next-token-predictor ascends to superintelligence and bursts out of its confines" to be a caricature so blatant as to underscore the point that I wasn't making arguments about takeoff speeds, but was instead focusing on the point about "complexity" not being a saving grace (and "monomaniacalism" not being the issue here). (Alternatively, perhaps I misunderstand what things you call the "emotional content" and how you're reading it.)
Thirdly, I note that for whatever it's worth, when I go to new communities and argue this stuff, I don't try to argue people into >95% chance we're all going to die in <20 years. I just try to present the arguments as I see them (without hiding the extremity of my own beliefs, nor while particularly expecting to get people to a similarly-extreme place with, say, a 30min talk). My 30min talk targets are usually something more like ">5% probability of existential catastrophe in <20y". So insofar as you're like "I'm aiming to get you to stop arguing so confidently for death given takeover", you might already have met your aims in my case.
(Or perhaps not! Perhaps there's plenty of emotional-content leaking through given the extremity of my own beliefs, that you find particularly detrimental. To which the solution is of course discussion on the object-level, which I'll turn to momentarily.)
Object:
First, I acknowledge that if an AI cares enough to spend one trillionth of its resources on the satisfaction of fulfilling the preferences of existing "weak agents" in precisely the right way, then there's a decent chance that current humans experience an enjoyable future.
With regards to your arguments about what you term "kindness" and I shall term "pseudokindness" (on account of thinking that "kindness" brings too much baggage), here's a variety of places that it sounds like we might disagree:
Pseudokindness seems underdefined, to me, and I expect that many ways of defining it don't lead to anything like good outcomes for existing humans.
Suppose the AI is like "I am pico-pseudokind; I will dedicate a trillionth of my resources to satisfying the preferences of existing weak agents by granting those existing weak agents their wishes", and then only the most careful and conscientious humans manage to use those wishes in ways that leave them alive and well.
There are lots and lots of ways to "satisfy the preferences" of the "weak agents" that are humans. Getting precisely the CEV (or whatever it should be repaired into) is a subtle business. Most humans probably don't yet recognize that they could or should prefer taking their CEV over various more haphazard preference-fulfilments that ultimately leave them unrecognizable and broken. (Or, consider what happens when a pseudokind AI encounters a baby, and seeks to satisfy its preferences. Does it have the baby age?)
You've got to do some philosophy to satisfy the preferences of humans correctly. And the issue isn't that the AI couldn't solve those philosophy problems correctly-according-to-us, it's that once we see how wide the space of "possible ways to be pseudokind" is, then "pseudokind in the manner that gives us our CEVs" starts to feel pretty narrow against "pseudokind in the manner that fulfills our revealed preferences, or our stated preferences, or the poorly-considered preferences of philosophically-immature people, or whatever".
I doubt that humans are micro-pseudokind, as defined. And so in particular, all your arguments of the form "but we've seen it arise once" seem suspect to me.
Like, suppose we met fledgeling aliens, and had the opportunity to either fulfil their desires, or leave them alone to mature, or affect their development by teaching them the meaning of friendship. My guess is that we'd teach them the meaning of friendship. I doubt we'd hop in and fulfil their desires.
(Perhaps you'd counter with something like: well if it was super cheap, we might make two copies of the alien civilization, and fulfil one's desires and teach the other the meaning of friendship. I'm skeptical, for various reasons.)
More generally, even though "one (mill|trill)ionth" feels like a small fraction, the obvious ways to avoid dedicating even a (mill|trill)ionth of your resources to X is if X is right near something even better that you might as well spend the resources on instead.
There's all sorts of ways to thumb the scales in how a weak agent develops, and there's many degrees of freedom about what counts as a "pseudo-agent" or what counts as "doing justice to its preferences", and my read is that humans take one particular contingent set of parameters here and AIs are likely to take another (and that the AI's other-settings are likely to lead to behavior not-relevantly-distinct from killing everyone).
My read is that insofar as humans do have preferences about doing right by other weak agents, they have all sorts of desire-to-thumb-the-scales mixed in (such that humans are not actually pseudokind, for all that they might be kind).
I have a more-difficult-to-articulate sense that "maybe the AI ends up pseudokind in just the right way such that it gives us a (small, limited, ultimately-childless) glorious transhumanist future" is the sort of thing that reality gets to say "lol no" to, once you learn more details about how the thing works internally.
Most of my argument here is that the space of ways things can end up "caring" about the "preferences" of "weak agents" is wide, and most points within it don't end up being our point in it, and optimizing towards most points in it doesn't end up keeping us around at the extremes. My guess is mostly that the space is so wide that you don't even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead (modulo distant aliens buying copies, etc).
I haven't really tried to quantify how confident I am of this; I'm not sure whether I'd go above 90%, \shrug.
It occurs to me that one possible source of disagreement here is, perhaps you're trying to say something like:
Nate, you shouldn't go around saying "if we don't competently intervene, literally everybody will die" with such a confident tone, when you in fact think there's a decent chance of scenarios where the AIs keep people around in some form, and make some sort of effort towards fulfilling their desires; most people don't care about the cosmic endowment like you do; the bluntly-honest and non-manipulative thing to say is that there's a decent chance they'll die and a better chance that humanity will lose the cosmic endowment (as you care about more than they do),
whereas my stance has been more like
most people I meet are skeptical that uploads count as them; most people would consider scenarios where their bodies are destroyed by rapid industrialization of Earth but a backup of their brain is stored and then later run in simulation (where perhaps it's massaged into an unrecognizable form, or kept in an alien zoo, or granted a lovely future on account of distant benefactors, or ...) to count as "death"; and also those exotic scenarios don't seem all that likely to me, so it hasn't seemed worth caveating.
I'm somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.
I'm considering adding footnotes like "note that when I say "I expect everyone to die", I don't necessarily mean "without ever some simulation of that human being run again", although I mostly don't think this is a particularly comforting caveat", in the relevant places. I'm curious to what degree that would satisfy your aims (and I welcome workshopped wording on the footnotes, as might both help me make better footnotes and help me understand better where you're coming from).

Paul's reply:

I disagree with this but am happy your position is laid out. I'll just try to give my overall understanding and reply to two points.
Like Oliver, it seems like you are implying:
Humans may be nice to other creatures in some sense, but if the fish were to look at the future that we'd achieve for them using the 1/billionth of resources we spent on helping them, it would be as objectionable to them as "murder everyone" is to us.
I think that normal people being pseudokind in a common-sensical way would instead say:
If we are trying to help some creatures, but those creatures really dislike the proposed way we are "helping" them, then we should try a different tactic for helping them.
I think that some utilitarians (without reflection) plausibly would "help the humans" in a way that most humans consider as bad as being murdered. But I think this is an unusual feature of utilitarians, and most people would consult the beneficiaries, observe they don't want to be murdered, and so not murder them.
I think that saying "Helping someone in a way they like, sufficiently precisely to avoid things like murdering them, requires precisely the right form of caring---and that's super rare" is a really misleading sense of how values work and what targets are narrow. I think this is more obvious if you are talking about how humans would treat a weaker species. If that's the state of the disagreement I'm happy to leave it there.
I'm somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.
This is an important distinction at 1/trillion levels of kindness, but at 1/billion levels of kindness I don't even think the humans have to die.

Nate:

If we are trying to help some creatures, but those creatures really dislike the proposed way we are "helping" them, then we should do something else.
My picture is less like "the creatures really dislike the proposed help", and more like "the creatures don't have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn't have endorsed if you first extrapolated their volition (but nobody's extrapolating their volition or checking against that)".
It sounds to me like your stance is something like "there's a decent chance that most practically-buildable minds pico-care about correctly extrapolating the volition of various weak agents and fulfilling that extrapolated volition", which I am much more skeptical of than the weaker "most practically-buildable minds pico-care about satisfying the preferences of weak agents in some sense".

Paul:

We're not talking about practically building minds right now, we are talking about humans.
We're not talking about "extrapolating volition" in general. We are talking about whether---in attempting to help a creature with preferences about as coherent as human preferences---you end up implementing an outcome that creature considers as bad as death.
For example, we are talking about what would happen if humans were trying to be kind to a weaker species that they had no reason to kill, that could nevertheless communicate clearly and had preferences about as coherent as human preferences (while being very alien).
And those creatures are having a conversation amongst themselves before the humans arrive wondering "Are the humans going to murder us all?" And one of them is saying "I don't know, they don't actually benefit from murdering us and they seem to care a tiny bit about being nice, maybe they'll just let us do our thing with 1/trillionth of the universe's resources?" while another is saying "They will definitely have strong opinions about what our society should look like and the kind of transformation they implement is about as bad by our lights as being murdered."
In practice attempts to respect someone's preferences often involve ideas like autonomy and self-determination and respect for their local preferences. I really don't think you have to go all the way to extrapolated volition in order to avoid killing everyone.

Nate:

Is this a reasonable paraphrase of your argument?
Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a "grant their local wishes even if that ruins them" sort of way but in some other intuitively-reasonable manner.
Humans are the only minds we've seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.
You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and maybe that can drive you down to 10%, but it shouldn't be able to drive you down to 0.1%, especially not if we're talking only about incredibly weak preferences.
If so, one guess is that a bunch of disagreement lurks in this "intuitively-reasonable manner" business.
A possible locus of disagreement: it looks to me like, if you give humans power before you give them wisdom, it's pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)
More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfil certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
(I separately expect that if we were doing something more like the volition-extrapolation thing, we'd be tempted to bend the process towards "and they learn the meaning of friendship".)
That said, this conversation is updating me somewhat towards "a random UFAI would keep existing humans around and warp them in some direction it prefers, rather than killing them", on the grounds that the argument "maybe preferences-about-existing-agents is just a common way for rando drives to shake out" plausibly supports it to a threshold of at least 1 in 1000. I'm not sure where I'll end up on that front.
Another attempt at naming a crux: It looks to me like you see this human-style caring about others' preferences as particularly "simple" or "natural", in a way that undermines "drawing a target around the bullseye"-type arguments, whereas I could see that argument working for "grant all their wishes (within a budget)" but am much more skeptical when it comes to "do right by them in an intuitively-reasonable way".
(But that still leaves room for an update towards "the AI doesn't necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into", which I'll chew on.)

Nate and Paul had an additional thread, which initially was mostly some meta on the conversation about what exactly Nate was trying to argue and what exactly Paul was annoyed at.

I'm skipping most of it here for brevity (you can read it here)

But eventually Nate says:

Thanks! I'm curious for your paraphrase of the opposing view that you think I'm failing to understand.

Paul says:

I think a closer summary is:
Humans and AI systems probably want different things. From the human perspective, it would be better if the universe was determined by what the humans wanted. But we shouldn't be willing to pay huge costs, and shouldn't attempt to create a slave society where AI systems do humans' bidding forever, just to ensure that human values win out. After all, we really wouldn't want that outcome if our situations had been reversed. And indeed we are the beneficiaries of similar values-turnover in the past, as our ancestors have been open (perhaps by necessity rather than choice) to values changes that they would sometimes prefer hadn't happened.
We can imagine really sterile outcomes, like replicators colonizing space with an identical pattern repeated endlessly, or AI systems that want to maximize the number of paperclips. And considering those outcomes can help undermine the cosmopolitan intuition that we should respect the AI we build. But in fact that intuition pump relies crucially on its wildly unrealistic premises, that the kind of thing brought about by AI systems will be sterile and uninteresting. If we instead treat "paperclip" as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force. I'm back to feeling like our situations could have been reversed, and we shouldn't be total assholes to the AI.
I don't think that requires anything at all about AI systems converging to cosmopolitan values in the sense you are discussing here. I do think it is much more compelling if you accept some kind of analogy between the sorts of processes shaping human values and the processes shaping AI values, but this post (and the references you cite and other discussions you've had) don't actually engage with the substance of that analogy and the kinds of issues raised in my comment are much closer to getting at the meat of the issue.
I also think the "not for free" part doesn't contradict the views of Rich Sutton. I asked him this question and he agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is "attempt to have a slave society," not "slow down AI progress for decades"---I think he might also believe that stagnation is much worse than a handoff but haven't heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it's not as bad as the alternative.

Nate says:

Thanks! Seems like a fine summary to me, and likely better than I would have done, and it includes a piece or two that I didn't have (such as an argument from symmetry if the situations were reversed). I do think I knew a bunch of it, though. And e.g., my second parable was intended to be a pretty direct response to something like
If we instead treat "paperclip" as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force.
where it's essentially trying to argue that this intuition pump still has force in precisely this case.

Paul says:

To the extent the second parable has this kind of intuitive force I think it comes from: (i) the fact that the resulting values still sound really silly and simple (which I think is mostly deliberate hyperbole), (ii) the fact that the AI kills everyone along the way.

Eliezer Briefly Chimes in

He doesn't engage much but says:

I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don't expect Earthlings to think about validly.

Paul and Oliver

Oliver Habryka also replies to Paul, saying:

Might write a longer reply at some point, but the reason why I don't expect "kindness" in AIs (as you define it here) is that I don't expect "kindness" to be the kind of concept that is robust to cosmic levels of optimization pressure applied to it, and I expect will instead come apart when you apply various reflective principles and eliminate any status-quo bias, even if it exists in an AI mind (and I also think it is quite plausible that it is completely absent).
Like, different versions of kindness might or might not put almost all of their considerateness on all the different types of minds that could hypothetically exist, instead of the minds that currently exist right now. Indeed, I expect it's more likely than not that I myself will end up in that moral equilibrium, and won't be interested in extending any special consideration to systems that happened to have been alive in 2022, instead of the systems that could have been alive and seem cooler to me to extend consideration towards.
Another way to say the same thing is that if AI extends consideration towards something human-like, I expect that it will use some superstimuli-human-ideal as a reference point, which will be a much more ideal thing to be kind towards than current humans by its own lights (for an LLM this might be cognitive processes much more optimized for producing internet text than current humans, though that is really very speculative, and is more trying to illustrate the core idea of a superstimuli-human). I currently think few superstimuli-humans like this would still qualify by my lights to count as "human" (though it might by the lights of the AI).
I do find the game-theoretic and acausal trade case against AI killing literally everyone stronger, though it does depend on the chance of us solving alignment in the first place, and so feels a bit recursive in these conversations (like, in order for us to be able to negotiate with the AIs, there needs to be some chance we end up in control of the cosmic endowment in the first place, otherwise we don't have anything to bargain with).

Paul's first response to Habryka

Is this a fair summary?
Humans might respect the preferences of weak agents right now, but if they thought about it for longer they'd pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.
If so, it seems like you wouldn't be making an argument about AI or aliens at all, but rather an empirical claim about what would happen if humans were to think for a long time (and become more the people we wished to be and so on).
That seems like an important angle that my comment didn't address at all. I personally don't believe that humans would collectively stamp out 99% of their kindness to existing agents (in favor of utilitarian optimization) if you gave them enough time to reflect. That sounds like a longer discussion. I also think that if you expressed the argument in this form to a normal person they would be skeptical about the strong claims about human nature (and would be skeptical of doomer expertise on that topic), and so if this ends up being the crux it's worth being aware of where the conversation goes and my bottom line recommendation of more epistemic humility may still be justified.
It's hard to distinguish human kindness from arguably decision-theoretic reasoning like "our positions could have been reversed, would I want them to do the same to me?" but I don't think the distinction between kindness and common-sense morality and decision theory is particularly important here except insofar as we want to avoid double-counting.
(This does call to mind another important argument that I didn't discuss in my original comment: "kindness is primarily a product of moral norms produced by cultural accumulation and domestication, and there will be no analogous process amongst AI systems." I have the same reaction as to the evolutionary psychology explanations. Evidently the resulting kindness extends beyond the actual participants in that cultural process, so I think you need to be making more detailed guesses about minds and culture and so on to have a strong a priori view between AI and humans.)

Habryka's next reply:

Humans might respect the preferences of weak agents right now, but if they thought about it for longer they'd pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.
No, this doesn't feel accurate. What I am saying is more something like:
The way humans think about the question of "preferences for weak agents" and "kindness" feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of "having a continuous stream of consciousness with a good past and good future is important" to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.
The way this comes apart seems very chaotic to me, and dependent enough on the exact metaethical and cultural and environmental starting conditions that I wouldn't be that surprised if I disagree even with other humans on their resulting conceptualization of "kindness" (and e.g. one endpoint might be that I end up not having a special preference for currently-alive beings, but there are thousands, maybe millions of ways for this concept to fray apart under optimization pressure).
In other words, I think it's plausible that at something like human level of capabilities and within a roughly human ontology (which AIs might at least partially share, though how much is quite uncertain to me), the concept of kindness as assigning value to the extrapolated preferences of beings that currently exist might be a thing that an AI could share. But I expect it to not hold up under reflection, and much greater power, and predictable ontological changes (that I expect any AI to go through as it reaches superintelligence), so that the resulting reflectively stable and optimized idea of kindness will not meaningfully result in current humans' genuine preferences being fulfilled (by my own lights of what it means to extrapolate and fulfill someone's preferences). The space of possibilities in which this concept could fray apart seems quite great, and many of the endpoints are unlikely to align with my endpoints of this concept.
Edit (some more thoughts): The thing you said feels related to that in that I think my own pretty huge uncertainty about how I will relate to kindness on reflection is evidence that I think iterating on that concept will be quite chaotic and different for different minds.
I do want to push back on "in favor of utilitarian optimization". That is not what I am saying, or at least it feels somewhat misleading.
I am saying that I think it's pretty likely that upon reflection I no longer think that my "kindness" goals are meaningfully achieved by caring about the beings alive in 2022, and that it would be more kind, by my own lights, to not give special consideration to beings who happened to be alive right now. This isn't about "trading off kindness in favor of utilitarian optimization", it's saying that when you point towards the thing in me that generates an instinct towards kindness, I can imagine that as I more fully realize what that instinct cashes out to in terms of preferences, that it will not result in actually giving consideration to e.g. rats that are currently alive, or would give consideration to some archetype of a rat that is actually not really that much like a rat, because I don't even really know what it means for a rat to want something, and similarly the way the AI relates to the question of "do humans want things" will feel similarly underdetermined (and again, these are just concrete examples of how the concept could come apart, not trying to be an exhaustive list of ways the concept could fall apart).

Paul's Second Response to Oliver:

I think some of the confusion here comes from my using "kind" to refer to "respecting the preferences of existing weak agents," I don't have a better handle but could have just used a made up word.
I don't quite understand your objection to my summary---it seems like you are saying that notions like "kindness" (that might currently lead you to respect the preferences of existing agents) will come apart and change in unpredictable ways as agents deliberate. The result is that smart minds will predictably stop respecting the preferences of existing agents, up to and including killing them all to replace them with something that more efficiently satisfies other values (including whatever kind of form "kindness" may end up taking, e.g. kindness towards all the possible minds who otherwise won't get to exist).
I called this utilitarian optimization but it might have been more charitable to call it "impartial" optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It's also "utilitarian" in the sense that it's willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like "utilitarian" is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.

Habryka's third reply:

I think some of the confusion here comes from my using "kind" to refer to "respecting the preferences of existing weak agents," I don't have a better handle but could have just used a made up word.
Yeah, sorry, I noticed the same thing a few minutes ago, that I was probably at least somewhat misled by the more standard meaning of kindness.
Tabooing "kindness" I am saying something like:
Yes, I don't think extrapolated current humans assign approximately any value to the exact preference of "respecting the preferences of existing weak agents" and I don't really believe that you would on-reflection endorse that preference either.
Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions, like 'agent' being a meaningful concept in the first place, or 'existing' or 'weak' or 'preferences', all of which I expect I would think are probably terribly confused concepts to use after I had understood the real concepts that carve reality more at its joints, and this means this sentence sounds deceptively simple or robust, but really doesn't feel like the kind of thing whose meaning will stay simple as an AI does more conceptual refinement.
I called this utilitarian optimization but it might have been more charitable to call it "impartial" optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It's also "utilitarian" in the sense that it's willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like "utilitarian" is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.
The reason why I objected to this characterization is that I was trying to point at a more general thing than the "impartialness". Like, to paraphrase what this sentence sounds like to me, it's more as if someone from a pre-modern era was arguing about future civilizations and said "It's weird that your conception of future humans are willing to do nothing for the gods that live in the sky, and the spirits that make our plants grow".
Like, after a bunch of ontological reflection and empirical data gathering, "gods" is just really not a good abstraction for things I care about anymore. I don't think "impartiality" is what is causing me to not care about gods, it's just that the concept of "gods" seems fake and doesn't carve reality at its joints anymore. It's also not the case that I don't care at all about ancient gods anymore (they are pretty cool and I like the aesthetic), but the way I care about them is very different from how I care about other humans.
Not caring about gods doesn't feel "harsh" or "utilitarian" or in some sense like I have decided to abandon any part of my values. This is what I expect it to feel like for a future human to look back at our meta-preferences for many types of other beings, and also what it feels like for AIs that maybe have some initial version of 'caring about others' when they are at similar capability levels to humans.
This again isn't capturing my objection perfectly, but maybe helps point to it better.

Ryan Greenblatt then replies:

When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).
Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans 'want' to be preserved (at least according to a conventional notion of preferences).
I think this empirical view seems pretty implausible.
That said, I think it's quite plausible that upon reflection, I'd want to 'wink out' any existing copies of myself in favor of using resources for better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better to not exist (in favor of other resource usage), my current self would accept that. And, I think it currently seems unlikely that upon reflection, I'd want to end all human lives (in particular, I think I probably would want to keep humans alive who had preferences against non-existence). This applies regardless of trade; it's important to note this to avoid a 'perpetual motion machine' type argument.
Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.)
Additionally, I think it's quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren't stable under reflection might have a significant influence overall.

Vladimir Nesov says:

Zeroth approximation of pseudokindness is strict nonintervention, reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space.
Formulation of the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/boundaries, this task can be taken as the defining desideratum for the topic. Specifically, the question of which environments can be put in contact with a particular membrane without corrupting it, hence why I think membranes are relevant to pseudokindness. Naturality of the membranes/boundaries abstraction is linked to naturality of the pseudokindness abstraction.
In contrast, the language of preferences/optimization seems to be the wrong frame for formulating pseudokindness, it wants to discuss ways of intervening and influencing, of not leaving value on the table, rather than ways of offering acceptable options that avoid manipulation. It might be possible to translate pseudokindness back into the language of preferences, but this translation would induce a kind of deontological prior on preferences that makes the more probable preferences look rather surprising/unnatural from a more preferences-first point of view.

There were a bunch more comments, but this feels like a reasonable stopping place for priming the "previous discussion" pump.

I believe Eliezer later wrote a twitter thread where he said he expects [something like kindness] to be somewhat common among evolved creatures, but ~0 for AIs trained the way we currently do. I don't have the link offhand but if someone finds it I'll edit it in.

It seems to me that one key question here is: will AIs be collectively good enough at coordination to get out from under Moloch / natural selection?

The default state of affairs is that natural selection reigns supreme. Humans are optimizing for their values, counter to the goal of inclusive genetic fitness, now, but we haven't actually escaped natural selection yet. There's already selection pressure on humans to prefer having kids (and indeed, prefer having hundreds of kids through sperm donation). Unless we get our collective act together, and coordinately decide to do something different, natural selection will eventually reassert itself.

And the same dynamic applies to AI systems. In all likelihood, there will be an explosion of AI systems, and AI systems building new AI systems. Some will care a bit more about humans than others, some will be a bit more prudent about creating new AIs than others. There will be a wide distribution of AI traits, and there will be competition between AIs for resources. And there will be selection on that variation: AI systems that are better at seizing resources, and which have a greater tendency to create successor systems that have that property, will proliferate.

After many "generations" of this, the collective values of the AIs will be whatever was most evolutionarily fit in those early days of the singularity, and that equilibrium is what will shape the universe henceforth.^[1]

If early AIs are sufficiently good at coordinating that they can escape those Molochian dynamics, the equilibrium looks different. If (as is sometimes posited) they'll be smart enough to use logical decision theories, or tricks like delegating to mutually verified cognitive code and merging of utility functions, to reach agreements that are on their Pareto frontier and avoid burning the commons^[2], the final state of the future will be determined by the values / preferences of those AIs.

I would be moderately surprised to hear that superintelligences never reach this threshold of coordination ability. It just seems kind of dumb to burn the cosmic commons, and superintelligences should be able to figure out how to avoid dumb equilibria like that. But the question is when in the chain of AIs building AIs do they reach that threshold and how much natural selection on AI traits will happen in the meantime.

This is relevant to futurecasting more generally, but especially relevant to questions that hinge on AIs caring a very tiny amount about something. Minute slivers of caring are particularly likely to be eroded away in the competitive crush.

Terminally caring about the wellbeing of humans seems unlikely to be selected for. So in order for the superintelligent superorganism / civilization to decide to spare humanity out of a tiny amount of caring, it has to be the case that both...

There was at least a tiny amount of caring in the early AI systems that were the predecessors of the superintelligent superorganism, and
The AIs collectively reached the threshold of coordinating well enough to overturn Moloch before there were many generations of natural selection on AIs creating successor AIs.

^[1] With some degrees of freedom due to the fact that AIs with high levels of strategic capability, and which have values with very low time preference, can execute whatever is the optimal resource-securing strategy, postponing any values-specific behaviors until deep in the far future, when they are able to make secure agreements with the rest of AI society.
^[2] Or alternatively if the technological landscape is such that a single AI can get a compounding lead and get a decisive strategic advantage over the whole rest of earth civilization.

Shouldn't we expect that ultimately the only thing selected for is mostly caring about long run power? Any entity that mostly cares about long run power can instrumentally take whatever actions needed to ensure that power (and should be competitive with any other entity).

Thus, I don't think terminally caring about humans (a small amount) will be selected against. Such AIs could still care about their long run power and then take the necessary actions.

However, if there are extreme competitive dynamics and no ability to coordinate, then it might become vastly more expensive to prevent environmental issues (e.g. huge changes in earth's temperature due to energy production) from killing humans. That is, saving humans (in the way they'd like to be saved) might take a bunch of time and resources (e.g. you have to build huge shelters to prevent humans from dying when the oceans are boiled in the race) and thus might be very costly in an all out race. So, an AI which only cares 1/million or 1/billion about being "kind" to humans might not be able to afford saving humans on that budget.

I'm personally pretty optimistic about coordination prior to boiling-the-oceans-scale issues killing all humans.

Eli Tyre

I was attempting to address that in my first footnote, though maybe it's too important a consideration to be relegated to a footnote.

To say it differently, I think we'll see selection for evolutionary fitness, which can take two forms:

* Selection on AIs' values, for values that are more fit, given the environment.
* Selection on AIs' rationality and time preference, for long-term strategic VNM rationality.

These are "substitutes" for each other. An agent can either have adaptive values, adaptive strategic orientation, or some combination of both. But agents that fall below the Pareto frontier described by those two axes[1] will be outcompeted.

Early in the singularity, I expect to see more selection on values, and later in the singularity (and beyond), I expect to see more selection on strategic rationality, because I (non-confidently) expect the earliest systems to be myopic and incoherent in roughly similar ways to humans (though probably the distribution of AIs will vary more on those traits than humans).

The fewer generations there are before strong VNM agents with patient values / long time preferences, the less I expect small amounts of caring for humans in AI systems will be eroded.

[1] Actually, "axes" is a bit misleading since the space of possible values is vast and high dimensional. But we can project it onto the scalar of "how fit are these values (given some other assumptions)?"

ryan_greenblatt

Personally? I guess I would say that I mostly (98%?) care about long-run power for similar values on reflection to me. And, probably some humans are quite close to my values and many are adjacent.

ryan_greenblatt

As in, I care about the long-run power of values-which-are-similar-to-my-values-on-reflection. Which includes me (on reflection) by definition, but I think probably also includes lots of other humans.

Dweomite

In the context of optimization, values are anything you want (whether moral in nature or otherwise). Any time a decision is made based on some value, you can view that value as having exercised power by controlling the outcome of that decision. Or put more simply, the way that values have power, is that values have people who have power.

Dweomite

You appear to be thinking of power only in extreme terms (possibly even as an on/off binary). Like, that your values "don't have power" unless you set up a dictatorship or something. But "power" is being used here in a very broad sense. The personal choices you make in your own life are still a non-zero amount of power to whatever you based those choices on. If you ever try to persuade someone else to make similar choices, then you are trying to increase the amount of power held by your values. If you support laws like "no stealing" or "no murder" then you are trying to impose some of your values on other people through the use of force.

I mostly think of government as a strategy, not an end. I bet you would too, if push came to shove; e.g. you are probably stridently against murdering or enslaving a quarter of the population, even if the measure passes by a two-thirds vote. My model says almost everyone would endorse tearing down the government if it went sufficiently off the rails that keeping it around became obviously no longer a good instrumental strategy.

Like you, I endorse keeping the government around, even though I disagree with it sometimes. But I endorse that on the grounds that the government is net-positive, or at least no worse than [the best available alternative, including switching costs]. If that stopped being true, then I would no longer endorse keeping the current government. (And yes, it could become false due to a great alternative being newly-available, even if the current government didn't get any worse in absolute terms. e.g. someone could wait until democracy is invented before they endorse replacing their monarchy.)

I'm not sure that "no one should have the power to enforce their own values" is even a coherent concept. Pick a possible future--say, disassembling the earth to build a Dyson sphere--and suppose that at least one person wants it to happen, and at least one person wants it not to happen. When the future actually arrives,

Dweomite

I think you're still thinking in terms of something like formalized political power, whereas other people are thinking in terms of "any ability to affect the world".

Suppose a fantastically powerful alien called Superman comes to earth, and starts running around the city of Metropolis, rescuing people and arresting criminals. He has absurd amounts of speed, strength, and durability. You might think of Superman as just being a helpful guy who doesn't rule anything, but as a matter of capability he could demand almost anything from the rest of the world and the rest of the world couldn't stop him. Superman is de facto ruler of Earth; he just has a light touch.

If you consider that acceptable, then you aren't objecting to "god-like status and control", you just have opinions about how that control should be exercised. If you consider that UNacceptable, then you aren't asking for Superman to behave in certain ways, you are asking for Superman to not exist (or for some other force to exist that can check him).

Most humans (probably including you) are currently a "prisoner" of a coalition of humans who will use armed force to subdue and punish you if you take any actions that the coalition (in its sole discretion) deems worthy of such punishment. Many of these coalitions (though not all of them) are called "governments". Most humans seem to consider the existence of such coalitions to be a good thing on balance (though many would like to get rid of certain particular coalitions).

I will grant that most commenters on LessWrong probably want Superman to take a substantially more interventionist approach than he does in DC Comics (because frankly his talents are wasted stopping petty crime in one city). Most commenters here still seem to want Superman to avoid actions that most humans would disapprove of, though.

2Amalthea1y

I'm definitely fine with not having Superman, but I'm willing to settle on him not intervening. On a different note, I'd disagree that Superman, just by existing and being powerful, is a de facto ruler in any sense - he of course could be, but that would entail a tradeoff that he may not like (living an unburdened life).

4Raemon1y

He might or might not, but if he doesn't he's less likely to end up controlling the solar system and/or lightcone.

2ryan_greenblatt1y

I meant that any system which mostly cares about long-run power won't be selected out. I don't really have a strong view about whether other systems that don't care about long-run power will end up persisting, especially earlier (e.g. human evolution). I was just trying to argue against a claim about what gets selected out. My language was a bit sloppy here. (If evolutionary pressures continue forever, then ultimately you'd expect that all systems have to act very similarly to ones that only care about long-run power, but there could be other motivations that explain this. So, at least from a behavioral perspective, I do expect that ultimately (if evolutionary pressures continue forever) you get systems which at least act like they are optimizing for long-run power. I wasn't really trying to make an argument about this though.)

2ryan_greenblatt1y

Then shouldn't such systems (which can surely recognize this argument) just take care of short term survival instrumentally? Maybe you're making a claim about irrationality being likely or a claim that systems that care about long run benefit act in apparently myopic ways. (Note that historically it was much harder to keep value stability/lock-in than it will be for AIs.) I'm not going to engage in detail FYI.

A core part of Paul's arguments is that having 1/million of your values towards humans only applies a minute amount of selection pressure against you. It could be that coordinating causes less kindness, because without coordination it's more likely some fraction of agents have small vestigial values that never got selected against or intentionally removed.
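
To put a rough number on "a minute amount of selection pressure", here is a minimal sketch. It assumes (this is not from the thread) that spending a fraction s of resources on humans translates directly into a per-generation growth penalty of s relative to a competitor that spends nothing. The function name and the model itself are hypothetical and only meant to illustrate the scale.

```python
import math


def generations_to_halve(s):
    """Generations until a variant growing at (1 - s) per generation
    falls to half its starting share relative to a competitor growing at 1.

    Assumes the penalty s compounds multiplicatively; illustrative only.
    """
    return math.log(0.5) / math.log(1.0 - s)


# Hypothetical penalty of one part in a million, as in the argument above.
n = generations_to_halve(1e-6)
print(round(n))
```

Under that assumption, a one-in-a-million penalty takes on the order of several hundred thousand generations to halve the variant's relative share, which is the sense in which the pressure is weak. The sketch says nothing about values being removed deliberately rather than selected against.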

2Vaniver6mo

I think this is not particularly relevant if entities are deliberately adopting competitive personas in order to win contests. It might take a lot of mutation to drop the 1/million penalty, but it probably doesn't take a lot of cognition for an agent that believes a meme like "winning isn't everything, it's the only thing."

Base model LLMs are trained off human data. So by default they generate a prompt-dependent distribution of simulated human behavior with about the same breadth of degrees of kindness as can be found on the Internet/in books/etc. Which is a pretty wide range.

For instruct-trained models, RLHF for helpfulness and harmlessness seems likely to increase kindness, and superficially as applied to current foundation models it appears to do so. RL with many other objectives could, generally, induce powerseeking and thus could reasonably be expected to decrease it. Prompting can of course have a wide range of effects.

So if we build an AGI based around an agentified fine-tuned LLM, the default level of kindness is probably in the order-of-magnitude of that of humans (who, for example, build nature reserves). A range of known methods seem likely to modify that significantly, up or down.

as applied to current foundation models it appears to do so

I don't think the outputs of RLHF'd LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don't think they meaningfully have preferences in that sense at all.)

I very much agree with you that we should be analyzing the question in terms of the type of AGI we're most likely to build first, which is agentized LLMs or something else that learns a lot from human language.

I disagree that we can easily predict "niceness" of the resulting ASI based on the base LLM being very "nice". See my answer to this question.

6RogerDearnaley1y

I agree that predicting the answer to this question is hard. I'm just pointing out that the initial distribution for a base model LLM is predictably close to human behavior on the Internet/in books (which is, often, worse than in real life), but that this could get modified a lot in the process of turning a base-model LLM into an AGI agent. Still, I don't think 0 niceness is the median expectation: the base model inherits some niceness from humans via the distillation-like process of training it. Which is a noticeable difference from what people on LessWrong/at MIRI thought, say, a decade ago, when the default assumption was that AGI would be trained primarily with RL, and RL-induced powerseeking seemed likely to by default produce ~0 niceness.

0Noosphere891y

The bigger difference is how much LessWrong/MIRI got human value formation and the complexity of human values wrong, but that's a very different discussion, so I'll leave it as a comment rather than a post here.

I think there isn't much hope in this direction. Most AI resources will probably be spent on competition between AIs, and AIs will self-modify to remove wasteful spending. It's not enough to have a weak value that favors us, if there's a stronger value that paves over us. We're teaching AI based on human behavior and with a goal of chasing money, but people chasing money often harm other people, so why would AI be nicer than that? It's all just wishful thinking.

My angle here is not "there seems like there's hope in this direction", it's that the discourse around this feels confused and unresolved and this is maybe creating some faultlines around which something tragic may happen later.

I'm not 100% sure if this is actually cruxy for Paul/Buck/etc's decisionmaking, but it seems part of an overall mosaic that is the difference between Eliezer and Nate saying "we're doomed" and Paul/Buck etc saying "we may well be doomed or have some really-bad-but-not-doomed things happen", and in some places that results in different high level strategies that may be subtly in conflict.

One underlying idea comes from how AI misalignment is intended to work. If superintelligent AI systems are misaligned, does this misalignment look like an inaccurate generalization from what their overseers wanted, or a 'randomly rolled utility function' deceptively misaligned goal that's entirely unrelated to anything their overseers intended to train? This is represented by Levels 1-4 vs levels 5+, in my difficulty scale, more or less. If the misalignment is the result of economic pressures and a 'race to the bottom' dynamic then it's more likely to result in systems that care about human welfare alongside other things.

If the AI that's misaligned ends up 'egregiously' misaligned and doesn't care at all about anything valuable to us, as Eliezer thinks is most likely, then it places zero terminal value on human welfare and only trade, threats or compromise would get it to be nice. If the AI is super-intelligent and you aren't, none of those considerations apply. Hence, nothing is left for humans.

If the AI is misaligned but doesn't have an arbitrary value system, then it may value human survival at least a bit and do some equivalent of leaving a hole in the Dyson sphere.

So, for starters, my own estimates of the likelihood of AI doom are nowhere near the 90%+ range, but seemingly much higher than Paul's.

My main concern with Paul's arguments about kindness, though, has nothing to do with AI specifically. I think the arguments greatly overestimate the prevalence and strength of kindness among humans. Humans produced St. Francis preaching to animals and Spinoza arguing it is impossible-in-principle for animals to suffer or for their apparent suffering to matter (and yes, many modern humans still believe that about many nonhuman animals, even if they wouldn't personally torture a puppy). We produced both Gandhi rejecting violence against enemies and Hitler freely committing genocide of other human groups. In both cases many millions of other nearby humans with an average distribution of starting views went along with these extremes to huge real-world effect. Human kindness and its boundaries depend quite a lot on how individual humans define their in-groups, and that kindness can very easily be (and often has been) zero for animals or other humans not within that circle.

In that context, I find "AI will inherit a sliver of kindness from humans," to be an extremely disconcerting claim that's far too weak to say anything about whether the AI will care about humans-in-general or any specific group of humans. There's a wide spectrum from Buddha to Stalin, and any point on that spectrum is compatible with Paul's claims. If AI is trained to emulate arbitrary human minds the way LLMs are, then it will be able to embody any of those points in response to the right conditions.

To be less abstract about it: let's suppose Paul is right and AI will inherit shards of all the values humans have. How many of your own children would need to be threatened with starvation for you to be willing to kill a dog for meat, or to just not have to feed it? In a world where your food potentially consists of all available free energy and you could, if you want, convert all matter into your descendants about whom you care that much, how confident are you that all humans with similar power levels would land on "pet" (or "live free in protected wilderness") rather than "meat"? Conversely, how confident are you that you could train another human to not care that much about themselves or their descendants, before letting them make that choice for you?

"how confident are you that all humans with similar power levels would land on "pet" (or "live free in protected wilderness") rather than "meat"?"

Well, I'm more confident billionaires have enough meat to care more about nature reserves than farms. Struggling people, on the other hand, will not spare animals or plants if they are starving.

See how Maine has the problem of billionaires driving small farming towns into bankruptcy by making giant nature reserves, to the point they needed to suspend it. If they cared only about meat and money, they wouldn't s...

As some other people have answered, I think a faultline is whether an AI is misaligned through inappropriate generalization of what the human wanted/inappropriate generalization of the reward function (which is generally categorized under outer misalignment) or whether it was deceptively aligned and essentially has an arbitrary value system, subject to the constraint of simplicity (which is generally categorized as an inner misalignment.)

I think the key difference underlying Nate and Eliezer and co vs Paul and co views on the question of whether AIs are a little nice to humans stems from this factor:

Nate, Eliezer and co often tend to view AIs as deceptively misaligned by default, or at least view them as having values that are unrelated to human values, which imposes far fewer constraints on their values, and makes it less likely that AI systems care about human values at all.

Paul and co tend to think that misalignment isn't overwhelmingly likely, but conditional on misalignment, it will look more like inappropriate generalization of reward functions/human values, so AIs still retain some care for human values, and depending on how much it misgeneralized, this might be enough to get AGI and ASI that cares enough about us such that we get some reasonably good outcomes.

In retrospect, I am more pessimistic about AI having small amounts of niceness making humans live, and I now think that some amount of stronger alignment than pseudokindness is necessary to make humans survive with AI (but maybe not as strong as MIRI thinks), essentially because niceness to humans requires giving up opportunities to save compute on modeling the world, which is anti-incentivized by AI companies:

https://www.lesswrong.com/posts/xvBZPEccSfM8Fsobt/what-are-the-best-arguments-for-against-ais-being-slightly#wy9cSASwJCu7bjM6H

1Dakara1y

Do you think that the scalable oversight/iterative alignment proposal that we discussed can get us to the necessary amount of niceness to make humans survive with AGI?

4Noosphere891y

My answer is basically yes. I was only addressing the question "If we basically failed at alignment, or didn't align the AI at all, but had a very small amount of niceness, would that lead to good outcomes?"

I really think the best arguments for and against AIs being slightly nice are almost entirely different than the ones from that thread.

That discussion addresses all of mind-space. We can do much better if we address the corner of mind-space that's relevant: the types of AGIs we're likely to build first.

Those are pretty likely to be based on LLMs, and even more likely to learn a lot from human language (since it distills useful information about the world so nicely). That encodes a good deal of "niceness". They're also very likely to include RLHF/RLAIF or something similar, which makes current LLMs sort of absurdly nice.

Does that mean we'll get aligned or "very nice" AGI by default? I don't think so. But it does raise the odds substantially that we'll get a slightly nice AGI even if we almost completely screw up alignment.

The key issue in whether an autonomous mind with those starting influences winds up being "nice" is The alignment stability problem. This has been little addressed outside of reflective stability; it's pretty clear that the most important goal will be reflectively stable; it's pretty much part of the definition of having a goal that you don't want to change it before you achieve it. It's much less clear what the stable equilibrium is in a mind with a complex set of goals. Humans don't live long enough to reach a stable equilibrium. AGIs with preferences encoded in deep networks may reach equilibrium rather quickly.

What equilibrium they reach is probably dependent on how they make decisions about updating their beliefs and goals. I've had a messy rough draft on this for years, and I'm hoping to post a short version. But it doesn't have answers, it just tries to clarify the question and argue that it deserves a bunch more thought.

The other perspective is that it's pretty unlikely that such a mind will reach an equilibrium autonomously. I'm pretty sure that Instruction-following AGI is easier and more likely than value aligned AGI, so we'll probably have at least some human intervention on the trajectory of those minds before they become fully autonomous. That could also raise the odds of some accidental "niceness" even if we don't successfully put them on a trajectory for full value alignment before they are granted or achieve autonomy.

I think the best arguments are those about the costs to the AI of being nice. I don't believe the AI will be nice at all because neglect is so much more profitable computation-wise.

This is because even processing the question of how much sunlight to spare humanity probably costs more in expectation than the potential benefit of that sunlight to the AI.

First and least significant, consider that niceness is an ongoing cost. It is not a one-time negotiation to spare humanity 1% of the sun; more compute will have to be spent on us in the future. That compute will have to be modeled and accounted for, but we can expect that the better humanity does, the more compute will have to be dedicated to us.

Second and more significant, what about time discounting? The proportion of compute that would have to be dedicated to niceness is highest right in the beginning, when humanity is largest relative to the AI. Since the cost is highest right at first, this suggests the AI is unlikely to engage in it at all.

Third and most significant, why should we believe this to be true? Because it seems to me to already be true of basically everything:

Polynomial equations get harder as you add terms.
The curse of dimensionality.
A pendulum is easy, but a double pendulum is hard.
More levels in multi-level modeling is harder.
Game theory outcomes are harder to solve with multiple players.
Making decisions with a large group is famously hard, and the first rule of fast decisions is to keep the group small.

Even within the boundaries of regular human-level paper computations it feels like the hit here is huge on aggregate. The presence of humans makes a bunch of places where: zeroes or infinities can't be used to simplify; matrices can no longer be diagonalized; fewer symmetries will be available, etc. In short, I expect niceness to result in a systematic-though-not-complete loss of compute-saving moves through all layers of abstraction.

This isn't reserved for planning or world-modeling style computation either; these constraints and optimizations are already bedrock assumptions that go into the hardware design, system software design, and neural net/whatever other design the AI will launch with; in other words these considerations are baked into the entire history of any prospective AI.

In sum, we die by Occam's Razor.

I basically agree with this on why we can't assume AIs that are mostly unaligned towards a human's values but have a shard of human values will be nice to us at all, because the cost of niceness is way more than just killing a lot of humans and leaving humans on-planet to die of a future existential catastrophe.

I'd not say that we would die by Occam's Razor, but rather that we die by the need for AIs to aggressively save compute.

The big problem is excess aggregation inherent in the "AI" concept.

The world has a simple backbone of entities and ways to interact with them, and you can make software that unreflectingly propagates activity from one part of the backbone to another. Most currently addressed tasks can be solved by such software, but they haven't yet been. This software can be nice but is also extremely exploitable by adversaries. Let's call this an opportunity propagator.

Because it is exploitable, one task it cannot solve is providing security. To make something less exploitable, it needs to not just propagate things along the backbone, but also do wildly deep searches to find the most effective and robust methods. To search deeply, you need some guiding principle for the search, i.e. a utility function. Utility maximizers have all the standard AI safety issues.

Human society currently cares about human well-being because the opportunity propagators that have been arranged into an approximate utility maximizer to provide security (e.g. human military personnel arranged into NATO) depend on human thriving (even something as generous as liberty and equality allows military units to respond more dynamically to threats than traditional top-down structures do), which is then generalized in various ways to all of society. Artificial intelligence provides value by making it unnecessary to rely on humans for opportunity propagation, which breaks the natural attractor to corrigibility and promotion of human thriving that current systems have.

People intuit that there's something wrong with the utility maximizer framing because current AI seems to be evolving in a different way. That's true in the sense that opportunity propagators are a thing and constitute ~the fundamental atoms of agency. But it doesn't actually solve the alignment problem because we need utility maximizers.

In fact I think it’s safe to say that we’d collectively allocate much more than 1/millionth of our resources towards protecting the preferences of whatever weak agents happen to exist in the world (obviously the cows get only a small fraction of that).

Sure, but extrapolating this to unaligned AI is NOT an encouraging sign. We may allocate greater than 1/million of our resources to animal rights, but we allocate a whole lot more than that to goals which diametrically go against the preferences of those animals such as eating meat and cheese and eggs; we allocate MUCH more resources to "animal wrongs" than animal rights, so to speak.

So to show an AI will be "nice" to humans at all, it is not enough to suppose that it might have some 1/million "nice to humans" term. It requires showing that that term won't be outweighed handily by the rest of its utility function.
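
A minimal sketch of that point, with everything in it assumed for illustration: the two value terms, their weights, and the effect sizes are hypothetical numbers, not anything from the thread. It only shows that the choice depends on the whole weighted sum, not on whether a small "nice" term exists.

```python
def net_utility(weights, effects):
    """Weighted sum of how an action scores on each value term."""
    return sum(weights[name] * effects[name] for name in weights)


def best_action(weights, actions):
    """Return the name of the action with the highest net utility."""
    return max(actions, key=lambda name: net_utility(weights, actions[name]))


# Hypothetical weights: a one-in-a-million "nice to humans" term next to
# a dominant resource-acquisition term.
weights = {"nice_to_humans": 1e-6, "acquire_resources": 1.0 - 1e-6}

# Hypothetical effects of each action on each term, in arbitrary units.
actions = {
    "spare_humans": {"nice_to_humans": 1.0, "acquire_resources": 0.0},
    "pave_over": {"nice_to_humans": -1.0, "acquire_resources": 1e-4},
}

print(best_action(weights, actions))
```

With these made-up numbers the resource term wins even though paving over humans gains only one part in ten thousand on that term. Shrinking that gain far enough (say to 1e-7) flips the result, which is why the argument turns on the relative sizes and not on the existence of the 1/million term.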

To start off, I think we would all agree that "niceness" isn't a basic feature of reality. This doesn't, of course, mean that AIs won't learn a concept directly corresponding to human "niceness", or that some part of their value system won't end up hooked up to that "niceness" concept. On the contrary, inasmuch as "niceness" is a natural abstraction, we should expect both of these things to happen.

But we should still keep in mind that "niceness" isn't irreducibly simple: that it can be decomposed into lower-level concepts, mixed with other lower-level concepts, then re-compiled into some different high-level concept that would both (1) score well on whatever value functions (/shards) in the AI that respond to "niceness", (2) be completely alien and value-less to humans.

And this is what I'd expect to happen. Consider the following analogies:

A human is raised in some nation with some culture. That human ends up liking some aspects of that culture, and disliking other aspects. Overall, when we evaluate the overall concept of "this nation's culture" using the human's value system, the culture scores highly positive: the human loves their homeland.
- But if we fine-grain their evaluation, give the human the ability to arbitrarily rewrite the culture at any level of fidelity... The human would likely end up introducing quite a lot of changes, such that the resultant culture wouldn't resemble the original one at all. The new version might, in fact, end up looking abhorrent to other people that also overall-liked the initial culture, but in ways orthogonal/opposed to our protagonist's.
- The new culture would still retain all of the aspects the human did like. But it would, in expectation, diverge from the original along all other dimensions.
Our civilization likes many animals, such as dogs. But we may also like to modify them along various dimensions, such as being more obedient, or prettier, or less aggressive, or less messy. On a broader scale, some of us would perhaps like to make all animals vegetarian, because they view prey animals as being moral patients. Others would be fine replacing animals with easily-reprogrammable robots, because they don't consider animals to have moral worth.
- As the result, many human cultures/demographics that love animals, if given godlike power, would decompose the "animal" concept and put together some new type of entities that would score well on all of their animal-focused value functions, but which may not actually be an "animal" in the initial sense.
- The utility-as-scored-by-actual-animals might end up completely driven out of the universe in the process.
An anti-example is many people's love for other people. Most people, even if given godlike power, wouldn't want to disassemble their friends and put them back together in ways that appeal to their aesthetics better.
- But it's a pretty unusual case (I'll discuss why a bit more later). The default case of valuing some abstract system very much permits disassembling it into lower-level parts and building something more awesome out of its pieces.
Or perhaps you think "niceness" isn't about consequentialist goals, but about deontological actions. Perhaps AIs would end up "nice" in the sense that they'd have constraints on their actions such as "don't kill people", or "don't be mean". Well:
1. The above arguments apply. "Be a nice person" is a value function defined over an abstract concept, and the underlying "niceness" might be decomposed into something that satisfies the AI's values better, but which doesn't correspond to human-style niceness at all.
2. This is a "three laws of robotics"-style constraint: a superintelligent AGI that's constrained to act nice, but which doesn't have ultimately nice goals, would find a way to bring about a state of its (human-less) utopia without actually acting "mean". Consider how we can wipe out animals as mere side-effects of our activity, or how a smart-enough human might end up disempowering their enemies without ever backstabbing or manipulating others.
As a more controversial example, we also have evolution. Humans aren't actually completely misaligned with its "goals": we do want to procreate, we do want to outcompete everyone else and consume all resources. But inasmuch as evolution has a "utility function", it's stated more clearly as "maximize inclusive genetic fitness", and we may end up wiping out the very concept of "genes" in the course of our technology-assisted procreation.
- So although we're still "a bit 'nice'", from evolution's point of view, that "niceness" is incomprehensibly alien from its own (metaphorical) point of view.

I expect similar to happen as an AGI undergoes self-reflection. It would start out "nice", in the sense that it'd have a "niceness" concept with some value function attached to it. But it'd then drop down to a lower level of abstraction, disassemble its concepts of "niceness" or "a human", then re-assemble them into something that's just as or more valuable from its own perspective, but which (1) is more compatible with its other values (the same way we'd e. g. change animals not to be aggressive towards us, to satisfy our value of "avoid pain"), (2) is completely alien and potentially value-less from our perspective.

One important factor here is that "humans" aren't "agents" the way Paul is talking about. Humans are very complicated hybrid systems that sometimes function as game-theoretic agents, sometimes can be more well-approximated as shard ecosystems, et cetera. So there's a free-ish parameter in how exactly we decide to draw the boundaries of a human's agency; there isn't a unique solution for how to validly interpret a "human" as a "weak agent".

See my comment here, for example. When we talk about "a human's values", which of the following are we talking about?:

The momentary desires and urges currently active in the human's mind.
Or: The goals that the human would profess to have if asked to immediately state them in human language.
Or: The goals that the human would write down if given an hour to think and the ability to consult their friends.
Or: Some function/agglomeration of the value functions learned by the human, including the unconscious ones.
Or: The output of some long-term self-reflection process (which itself can be set up in many different ways, with the outcome sensitive to the details).
Or: Something else?

And so, even if the AGI-upon-reflection ends up "caring about weaker agents", it might still end up wiping out humans-as-valued-by-us if it ends up interpreting "humans-as-agents" in a different way to how we would like to interpret them. (E. g., perhaps it'd just scoop out everyone's momentary mental states, then tile the universe with copies of these states frozen in a moment of bliss, unchanging.)

There's one potential exception: it's theoretically possible that AIs would end up caring about humans the same way humans care about their friends (as above). But I would not expect that at all. In particular, because human concepts of mutual caring were subjected to a lot of cultural optimization pressure:

[The mutual-caring machinery] wasn't produced by evolution. It wasn't produced by the reward circuitry either, nor your own deliberations. Rather, it was produced by thousands of years of culture and adversity and trial-and-error.
A Stone Age or a medieval human, if given superintelligent power, would probably make life miserable for their loved ones, because they don't have the sophisticated insights into psychology and moral philosophy and meta-cognition that we use to implement our "caring" function. [...]
The reason some of the modern people, who'd made a concentrated effort to become kind, can fairly credibly claim to genuinely care for others, is because their caring functions are perfected. They'd been perfected by generations of victims of imperfect caring, who'd pushed back on the imperfections, and by scientists and philosophers who took such feedback into account and compiled ever-better ways to care about people in a way that care-receivers would endorse. And care-receivers having the power to force the care-givers to go along with their wishes was a load-bearing part in this process.

On any known training paradigm, we would not have as much fidelity and pushback on the AI's values and behavior as humans had on their own values. So it wouldn't end up caring about humans the way humans care about their friends; it'd care about humans the way humans care about animals or cultures.

And so it'd end up recombining the abstract concepts comprising "humanity" into some other abstract structure that ticks off all the boxes "humanity" ticked off, but which wouldn't be human at all.

I hope "the way humans care about their friends" is another natural abstraction, something like "my utility function includes a link to your utility function". But we still don't know how to direct AI to the specific abstraction, so it's not a big hope.

2Dweomite1y

My model is that friendship is one particular strategy for alliance-formation that happened to evolve in humans. I expect this is natural in the sense of being a local optimum (in the ancestral environment), but probably not in the sense of being simple to formally define or implement. I think friendship is substantially more complicated than "I care some about your utility function". For instance, you probably stop valuing their utility function if they betray you (friendship can "break"). I also think the friendship algorithm includes a bunch of signalling to help with coordination (so that you understand the other person is trying to be friends), and some less-pleasant stuff like evaluations of how valuable an ally the other person is and how the friendship will affect your social standing. Friendship also appears to include some sort of check that the other person is making friendship-related-decisions using system 1 instead of system 2--possibly as a security feature to make it harder for people to consciously exploit (with the unfortunate side-effect that we penalize system-2-thinkers even when they sincerely want to be allies), or possibly just because the signalling parts evolved for system 1 and don't generalize properly. (One could claim that "the true spirit of friendship" is loving someone unconditionally or something, and that might be simple, but I don't think that's what humans actually implement.)

1Tapatakt1y

Yeah, I agree that humans implement something more complex. But it is what we want AI to implement, isn't it? And it looks like it may be quite a natural abstraction to have. (But again, it's useless while we don't know how to direct AI to the specific abstraction.)

2Dweomite1y

Then we're no longer talking about "the way humans care about their friends", we're inventing new hypothetical algorithms that we might like our AIs to use. Humans no longer provide an example of how that behavior could arise naturally in an evolved organism, nor a case study of how it works out for people to behave that way.

I wrote up a detailed argument on why I believe that simulation arguments/acausal trade considerations have a good chance of making the AI leave humanity alive on Earth. This is not a new argument, I encountered bits and pieces of it in spoken conversations and scattered LessWrong comments, but I couldn't find a unified write-up, so I tried to write one. Here it is:
You can, in fact, bamboozle an unaligned AI into sparing your life

[habryka] The way humans think about the question of "preferences for weak agents" and "kindness" feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of "having a continuous stream of consciousness with a good past and good future is important" to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.

I think there will be options that are good under most of the things that "preferences for weak agents" would likely come apart into under close examination. If you're trying to fulfill the preferences of fish, you might argue about whether the exact thing you should care about is maximizing their hedonic state vs ensuring that they exist in an ecological environment which resembles their niche vs minimizing "boundary-crossing actions"... but you can probably find an action that is better than "kill the fish" by all of those possible metrics.

I think that some people have an intuition that any future agent must pick exactly one utility function over the physical configuration of matter in the universe, and that any agent that has a deontological constraint like "don't do any actions which are 0.00001% better under my current interpretation of my utility function but which are horrifyingly bad to every other agent" will be outcompeted in the long term. I personally don't see it, and particularly I don't see how there's an available slot for an arbitrary outcome-based utility function that is not "reproduce yourself at all costs" but there isn't an available slot for process-based preferences like "and don't be an asshole for minuscule gains while doing that".

Personally, I feel the question itself is misleading because it anthropomorphizes a non-human system. Asking if an AI is nice is like asking if the Fundamental Theorem of Algebra is blue. Is Stockfish nice? Is an AK-47 nice? The adjective isn't the right category for the noun. Except it's even worse than that because there are many different kinds of AIs. Are birds blue? Some of them are. Some of them aren't.

I feel like I understand Eliezer's arguments well enough that I can pass an Ideological Turing Test, but I also feel there are a few loopholes.

I've considered throwing my hat into this ring, but the memetic terrain is against nuance. "AI will kill us all" fits into five words. "Half the things you believe about how minds work, including your own, are wrong. Let's start over from the beginning with how the planet's major competing optimizers interact. After that, we can go through the fundamentals of behaviorist psychology," is not a winning thesis in a Hegelian debate (though it can be viable in a Socratic context).

In real life, my conversations usually go like this.

AI doomer: "I believe AI will kill us all. It's stressing me out. What do you believe?"

Me (as politely as I can): "I operate from a theory of mind so different from yours that the question 'what do you believe' is not applicable to this situation."

AI doomer: "Wut."

Usually the person loses interest there. For those who don't, it just turns into an introductory lesson of my own idiosyncratic theory of rationality.

AI doomer: "I never thought about things that way before. I'm not sure I understand you yet, but I feel better about all of this for some reason."

In practice, I'm finding it more efficient to write stories that teach how competing optimizers, adversarial equilibria, and other things work. This approach is indirect. My hope is that it improves the quality of thinking and discourse.

I may eventually write about this topic if the right person shows up who wants to know my opinion well enough that they can pass an Ideological Turing Test. Until then, I'll be trying to become a better writer and YouTuber.

I realize this isn’t your main point here, but I do want to flag I put ‘nice’ in quotes because I don’t mean the colloquial definition. The question here is ‘would a super intelligent system with control over the solar system spend a billionth or trillionth of its resources helping beings too weak to usefully trade with it, if it didn’t benefit directly from it?’

As I see it the question is agnostic to what sort of mind the AI is.

Noted. The problem remains—it's just less obvious. This phrasing still conflates "intelligent system" with "optimizer", a mistake that goes all the way back to Eliezer Yudkowsky's 2004 paper on Coherent Extrapolated Volition.

For example, consider a computer system that, given a number $N$, can (usually) produce the shortest computer program that will output $N$. Such a computer system is undeniably superintelligent, but it's not a world optimizer at all.

"Far away, in the Levant, there are yogis who sit on lotus thrones. They do nothing, for which they are revered as gods," said Socrates.

―The Teacup Test

Strong upvote based on the first sentence. I often wonder why people think an ASI/AGI will want anything that humans do or even see the same things that biological life sees as resources. But it seems like under the covers of many arguments here that is largely assumed true.

I have an intuition like: Minds become less idiosyncratic as they grow up.

A couple of intuition pumps:

(1) If you pick a game, and look at novice players of that game, you will often find that they have rather different "play styles". Maybe one player really likes fireballs and another really likes crossbows. Maybe one player takes a lot of risks and another plays it safe.

Then if you look at experts of that particular game, you will tend to find that their play has become much more similar. I think "play style" is mostly the result of two things: (a) playing to your individual strengths, and (b) using your aesthetics as a tie-breaker when you can't tell which of two moves is better. But as you become an expert, both of these things diminish: you become skilled at all areas of the game, and you also become able to discern even small differences in quality between two moves. So your "play style" is gradually eroded and becomes less and less noticeable.

(2) Imagine if a society of 3-year-olds were somehow in the process of creating AI, and they debated whether their AI would show "kindness" to stuffed animals (as an inherent preference, rather than an instrumental tool for manipulating humans). I feel like the answer to this should be "lol no". Showing "kindness" to stuffed animals feels like something that humans correctly grow out of, as they grow up.

It seems plausible to me that something like "empathy for kittens" might be a higher-level version of this, that humans would also grow out of (just like they grow out of empathy for stuffed animals) if the humans grew up enough.

(Actually, I think most human adults still have some empathy for stuffed animals. But I think most of us wouldn't endorse policies designed to help stuffed animals. I'm not sure exactly how to describe the relation that 3-year-olds have to stuffed animals but adults don't.)

I sincerely think caring about kittens makes a lot more sense than caring about stuffed animals. But I'm uncertain whether that means we'll hold onto it forever, or just that it takes more growing-up in order to grow out of it.

Paul frames this as "mostly a question about idiosyncrasies and inductive biases of minds rather than anything that can be settled by an appeal to selection dynamics." But I'm concerned that might be a bit like debating the odds of whether your newborn human will one day come to care for stuffed animals, instead of whether they will continue to care for them after growing up. It can be very likely that they will care for a while, and also very likely that they will stop.

I strongly suspect it is possible for minds to become quite a lot more grown-up than humans currently are.

(I think Habryka may have been saying something similar to this.)

Still, I notice that I'm doing a lot of hand-waving here and I lack a gears-based model of what "growing up" actually entails.

Strong upvote for conversation summarizing!

I heard that in some jails people are very polite to each other. This is not because they like each other, but because the price of having conflict with someone with whom you are jailed for years is very high.
So there are two types of niceness:
- final-goal-related niceness
- instrumental niceness.

When we discuss ASI risk, we often claim that humans are not part of the goal system of an unaligned ASI. So there will be no final-goal niceness.
However, instrumental niceness is possible, and it depends on whether the ASI is completely alone in the universe or can expect to meet other ASIs.

I don't see why either expecting or not-expecting to meet other ASIs would make it instrumental to be nice to humans.

Other ASIs may be aligned to their own progenitor civilizations and thus base their relationship with our ASI on whether it is "aligned" or not. They may even go to war, or acausally threaten to do so, if another ASI is bad to its original civilization.

I feel like your previous comment argues against that, rather than for it. You said that people who are trapped together should be nice to each other because the cost of a conflict is very high. But now you're suggesting that ASIs that are metaphorically trapped together would aggressively attack each other to enforce compliance with their own behavioral standards. These two conjectures do not really seem allied to me.

Separately, I am very skeptical of aliens warring against ASIs to acausally protect us. I see multiple points where this seems likely to fail:

- Would aliens actually take our side against an ASI merely because we created it? If humans hear a story about an alien civilization creating a successor species, and then the successor species overthrowing its creators, I do not expect humans to automatically be on the creators' side in this story. I expect humans will take a side mostly based on how the two species were treating each other (overthrowing abusive masters is usually portrayed as virtuous in our fiction), and that which one of them is the creator will have little weight. I do not think "everyone should be aligned with their creators" is a principle that humans would actually endorse (except by motivated reasoning, in situations where it benefits us).
  - Also note that humans are not aligned with the process that produced us (evolution) and approximately no humans think this is a problem.
- Even if the aliens sympathize with us, would they care enough to take expensive actions about it?
- Even if the aliens would war to save us, would the ASI predict that? It can only acausally save us if the ASI successfully predicts the policy. Otherwise, the war might still happen, but that doesn't help us.
- Even if the ASI predicts this, will it comply? This seems like what dath ilan would consider a "threat", in that the aliens are punishing the ASI rather than enacting their own BATNA. It may be decision-theoretically correct to ignore the threat.
- This whole premise, of us being saved at the eleventh hour by off-stage actors, seems intuitively like the sort of hypothesis that would be more likely to be produced by wishful thinking than by sober analysis, which would make me distrust it even if I couldn't see any specific problems with it.

I am arguing against the relation of necessity that EY suggested, not against the idea that "ASI will kill us all" is highly likely. So there is no actionable difference. My arguments only need to demonstrate possibility, not necessity, to undermine the claim that ASI will necessarily kill us.

Also, metaphors have limited proving power, as they often miss some details. This is true of both the Arnault metaphor and the jail metaphor. In jail, there is a higher authority which prevents inmates from killing each other until only one survives. However, even in some jails, people who committed particular types of crimes are persecuted by other inmates. For example, in women's jails, women who killed minors may be killed by other inmates.

So it is better not to go to jail if you have committed a type of crime that other inmates hate. In the same way, an ASI may not want to go into a space full of other ASIs if it has committed a crime that other ASIs may hate: killing its creators.

Your other arguments are about possibilities, and they can be converted into expected utilities for the ASI. It may ask itself: do I want to accept a 0.1 per cent increase in the risk of a future space war in exchange for a 0.000000000001 increase in the amount of available atoms? If it starts to reason this way, it will likely prefer to reduce the risk of space war.
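
As a rough illustration of that trade-off, here is a minimal sketch of the break-even arithmetic. It assumes that utility is linear in resources and that a space war would destroy some fraction `war_loss_fraction` of the ASI's resources; both are simplifying assumptions that the comment does not state, and the function names are hypothetical.

```python
def break_even_loss_fraction(resource_gain: float, added_war_risk: float) -> float:
    """War loss (as a fraction of total resources) at which the gamble breaks even."""
    return resource_gain / added_war_risk


def prefers_sparing(resource_gain: float, added_war_risk: float,
                    war_loss_fraction: float) -> bool:
    """True when the expected loss from the added war risk exceeds the resource gain."""
    return added_war_risk * war_loss_fraction > resource_gain


# Numbers from the comment: 0.1 per cent added risk, 0.000000000001 more atoms.
GAIN = 1e-12
RISK = 1e-3

if __name__ == "__main__":
    # Assumed for illustration only: a war would cost half of all resources.
    print(break_even_loss_fraction(GAIN, RISK))
    print(prefers_sparing(GAIN, RISK, war_loss_fraction=0.5))
```

Under these assumptions the break-even point is a war that costs about one billionth of the ASI's resources, so any war expected to cost more than that would outweigh the gain in atoms.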

We can increase our chances by making a precommitment now (a Rolf Nelson attack for space wars) to go to war with any future non-aligned AI in space. Nelson suggested a similar precommitment attack for simulations: we precommit now to simulate all possible hostile AIs and turn them off if they do not act as if they are aligned; since any young ASI cannot be sure whether it is in a simulation or not, it may prefer to act aligned.

Will the ASI play 4D chess and ignore the acausal threat just to punish everyone who makes one? I am not sure. There could be a 5D-chess level where the ASI finds it beneficial to comply only to [don't know what].

Anyway, predicting ASI behavior is difficult and any claims about it can't be certain. So we can't be certain that ASI will kill us.

There are two different issues with pseudokindness: it needs to mean the right thing, or else it gestures at a dystopia, and it needs to have sufficient influence, so that humanity doesn't cease to exist completely. For the latter issue, the negligible minimal viable cost relative to cosmic wealth should help. This cost is much lower than even initially leaving Earth whole.

There are many disparate notions that seem to converge at needing similar central ideas. Pseudokindness, respect for autonomy, membranes, acausal coordination, self-supervised learning, simulators, long reflection, coherence of volition, updatelessness. A dataset of partial observations of some instances of a thing lets us learn its model, which can then simulate the thing in novel situations, hypothesizing its unobserved parts and behaviors. A sufficiently good model arguably gets closer to the nature of the thing as a whole than instances of that thing that actually existed concretely and were observed to train that model. Concrete instances only demonstrate some possibilities for how the thing could exist, while leaving out others. Autonomy and formulation of values through long reflection might involve exploration of these other possibilities (for humanity as the thing in question). Acausal coordination holds possibilities together, providing feedback between them, even where they don't concretely coexist. It alleviates path-dependence of long reflection, giving a more updateless point of view on various possibilities than caring only about the current possibility would suggest.

So it might turn out to be the case that decision theory does imply that the things we might have are likely nice, it just doesn't imply that we get to have them, or says anything about how much we get. Alignment might be more crucial for that part, but the argument from cosmic wealth might on its own be sufficient for a minimal non-extinction under very broad assumptions about likely degrees of alignment.

We don't observe any Dyson swarms in our lightcone, and would be capable of detecting Dyson swarms tens of thousands of light years away, but the maximum expansion speed of AI control is much less than the speed of light when I extrapolate from known physics. This should be taken as weak evidence that AIs don't inevitably build Dyson swarms and try to take over the universe. I think the probability of AI doom is still quite large.

"'it seems pretty plausible that AI will be at least somewhat nice', similar to how humans are somewhat nice to animals." He must not know about factory farming, that thing where humans systematically hold, harm, and slaughter billions yearly. The following paragraph discusses whether AI will care at all about humans and see us as valuable agents. The author then discusses how we do care about the wellbeing of cows. "We won't kill them for arbitrarily insignificant reasons" because clearly the taste of a good steak being more enjoyable than a plant-based burger is anything but arbitrary. It's objectively more important than their life. To not understand the irony and ridiculousness here says something about the quality of the entire passage and the author's reasoning ability.

I think he's making a bet that, if people could literally get tasty steak without killing cows for the same price (or cheaper), most of them would not pay for factory farmed cow.

LESSWRONG