
Is this a reasonable paraphrase of your argument?

Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a "grant their local wishes even if that ruins them" sort of way but in some other intuitively-reasonable manner.

Humans are the only minds we've seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.

You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and maybe that can drive you down to 10%, but it shouldn't be able to drive you down to 0.1%, especially not if we're talking only about incredibly weak preferences.

If so, one guess is that a bunch of disagreement lurks in this "intuitively-reasonable manner" business.

A possible locus of disagreement: it looks to me like, if you give humans power before you give them wisdom, it's pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)

More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfil certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.

(I separately expect that if we were doing something more like the volition-extrapolation thing, we'd be tempted to bend the process towards "and they learn the meaning of friendship".)

That said, this conversation is updating me somewhat towards "a random UFAI would keep existing humans around and warp them in some direction it prefers, rather than killing them", on the grounds that the argument "maybe preferences-about-existing-agents is just a common way for rando drives to shake out" plausibly supports it to a threshold of at least 1 in 1000. I'm not sure where I'll end up on that front.

Another attempt at naming a crux: It looks to me like you see this human-style caring about others' preferences as particularly "simple" or "natural", in a way that undermines "drawing a target around the bullseye"-type arguments, whereas I could see that argument working for "grant all their wishes (within a budget)" but am much more skeptical when it comes to "do right by them in an intuitively-reasonable way".

(But that still leaves room for an update towards "the AI doesn't necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into", which I'll chew on.)

Thanks! Seems like a fine summary to me, and likely better than I would have done, and it includes a piece or two that I didn't have (such as an argument from symmetry if the situations were reversed). I do think I knew a bunch of it, though. And e.g., my second parable was intended to be a pretty direct response to something like

If we instead treat "paperclip" as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force.

where it's essentially trying to argue that this intuition pump still has force in precisely this case.

Thanks! I'm curious for your paraphrase of the opposing view that you think I'm failing to understand.

(I put >50% probability that I could paraphrase a version of "if the AIs decide to kill us, that's fine" that Sutton would basically endorse (in the right social context), and that would basically route through a version of "broad cosmopolitan value is universally compelling", but perhaps when you give a paraphrase it will sound like an obviously-better explanation of the opposing view and I'll update.)

If we are trying to help some creatures, but those creatures really dislike the proposed way we are "helping" them, then we should do something else.

My picture is less like "the creatures really dislike the proposed help", and more like "the creatures don't have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn't have endorsed if you first extrapolated their volition (but nobody's extrapolating their volition or checking against that)".

It sounds to me like your stance is something like "there's a decent chance that most practically-buildable minds pico-care about correctly extrapolating the volition of various weak agents and fulfilling that extrapolated volition", which I am much more skeptical of than the weaker "most practically-buildable minds pico-care about satisfying the preferences of weak agents in some sense".

I was recently part of a group-chat where some people I largely respect were musing about this paper and this post (and some of Scott Aaronson's recent "maybe intelligence makes things more good" type reasoning).

Here's my replies, which seemed worth putting somewhere public:

The claims in the paper seem wrong to me as stated, and in particular seem to conflate values with instrumental subgoals. One does not need to terminally value survival to avoid getting hit by a truck while fetching coffee; they could simply understand that one can't fetch the coffee when one is dead.

See also instrumental convergence.

And then in reply to someone pointing out that the paper was perhaps trying to argue that most minds tend to wind up with similar values because of the fact that all minds are (in some sense) rewarded in training for developing similar drives:

So one hypothesis is that in practice, all practically-trainable minds manage to survive by dint of a human-esque survival instinct (while admitting that manually-engineered minds could survive some other way, e.g. by simply correctly modeling the consequences).

This mostly seems to me to be like people writing sci-fi in which the aliens are all humanoid; it is a hypothesis about tight clustering of cognitive drives even across very disparate paradigms (optimizing genomes is very different from optimizing every neuron directly).

But a deeper objection I have here is that I'd be much more comfortable with people slinging this sort of hypothesis around if they were owning the fact that it's a hypothesis about tight clustering and non-alienness of all minds, while stating plainly that they think we should bet the universe on this intuition (despite how many times the universe has slapped us for believing anthropocentrism in the past).

FWIW, some reasons that I don't myself buy this hypothesis include:

(a) the specifics of various human drives seem to me to be very sensitive to the particulars of our ancestry (ex: empathy seems likely a shortcut for modeling others by repurposing machinery for modeling the self (or vice versa), that is likely not found by hillclimbing when the architecture of the self is very different from the architecture of the other);

(b) my guess is that the pressures are just very different for different search processes (genetic recombination of DNA vs SGD on all weights); and

(c) it looks to me like value is fragile, such that even if the drives were kinda close, I don't expect the obtainable optimum to be good according to our lights

(esp. given that the question is not just what drives the AI gets, but the reflective equilibrium of those drives: small changes to initial drives are allowed to have large changes to the reflective equilibrium, and I suspect this is so).

Some more less-important meta, that is in part me writing out of frustration from how the last few exchanges have gone:

I'm not quite sure what argument you're trying to have here. Two explicit hypotheses follow, that I haven't managed to distinguish between yet.

Background context, for establishing common language etc.:

  • Nate is trying to make a point about inclusive cosmopolitan values being a part of the human inheritance, and not universally compelling.
  • Paul is trying to make a point about how there's a decent chance that practical AIs will plausibly care at least a tiny amount about the fulfillment of the preferences of existing "weak agents", herein called "pico-pseudokindness".

Hypothesis 1: Nate's trying to make a point about cosmopolitan values that Paul basically agrees with. But Paul thinks Nate's delivery gives a wrong impression about the tangentially-related question of pico-pseudokindness, probably because (on Paul's model) Nate's wrong about pico-pseudokindness, and Paul is taking the opportunity to argue about it.

Hypothesis 2: Nate's trying to make a point about cosmopolitan values that Paul basically disagrees with. Paul maybe agrees with all the literal words, but thinks that Nate has misunderstood the connection between pico-pseudokindness and cosmopolitan values, and is hoping to convince Nate that these questions are more than tangentially related.

(Or, well, I have hypothesis-clusters rather than hypotheses, of which these are two representatives, whatever.)

Some notes that might help clear some things up in that regard:

  • The long version of the title here is not "Cosmopolitan values don't come cheap", but rather "Cosmopolitan values are also an aspect of human values, and are not universally compelling".
  • I think there's a common mistake that people outside our small community make, where they're like "whatever the AIs decide to do, turns out to be good, so long as they decide it while they're smart; don't be so carbon-chauvinist and anthropocentric". A glaring example is Richard Sutton. Heck, I think people inside our community make it decently often, with an example being Robin Hanson.
    • My model is that many of these people are intuiting that "whatever the AIs decide to do" won't include vanilla ice cream, but will include broad cosmopolitan value.
    • It seems worth flatly saying "that's a crux for me; if I believed that the AIs would naturally have broad inclusive cosmopolitan values then I'd be much more onboard the acceleration train; when I say that the AIs won't have our values I am not talking just about the "ice cream" part I am also talking about the "broad inclusive cosmopolitan dream" part; I think that even that is at risk".

If you were to acknowledge something like "yep, folks like Sutton and Hanson are making the mistake you name here, and the broad cosmopolitan dream is very much at risk and can't be assumed as convergent, but separately you (Nate) seem to be insinuating that you expect it's hard to get the AIs to care about the broad cosmopolitan dream even a tiny bit, and that it definitely won't happen by chance, and I want to fight about that here", then I'd feel like I understood what argument we were having (namely: hypothesis 1 above).

If you were to instead say something like "actually, Nate, I think that these people are accessing a pre-theoretic intuition that's essentially reasonable, and that you've accidentally destroyed with all your premature theorizing, such that I don't think you should be so confident in your analysis that folk like Sutton and Hanson are making a mistake in this regard", then I'd also feel like I understood what argument we were having (namely: hypothesis 2 above).

Alternatively, perhaps my misunderstanding runs even deeper, and the discussion you're trying to have here comes from even farther outside my hypothesis space.

For one reason or another, I'm finding it pretty frustrating to attempt to have this conversation while not knowing which of the above conversations (if either) we're having. My current guess is that that frustration would ease up if something like hypothesis-1 were true and you made some acknowledgement like the above. (I expect to still feel frustrated in the hypothesis-2 case, though I'm not yet sure why, but might try to tease it out if that turns out to be reality.)

Short version: I don't buy that humans are "micro-pseudokind" in your sense; if you say "for just $5 you could have all the fish have their preferences satisfied" I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.


Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that's likely to be a mistake even if it doesn't lead to billions of deaths.

So for starters, thanks for making acknowledgements about places we apparently agree, or otherwise attempting to demonstrate that you've heard my point before bringing up other points you want to argue about. (I think this makes arguments go better.) (I'll attempt some of that myself below.)

Secondly, note that it sounds to me like you took a diametric-opposite reading of some of my intended emotional content (which I acknowledge demonstrates flaws in my writing). For instance, I intended the sentence "At that very moment they hear the dinging sound of an egg-timer, as the next-token-predictor ascends to superintelligence and bursts out of its confines" to be a caricature so blatant as to underscore the point that I wasn't making arguments about takeoff speeds, but was instead focusing on the point about "complexity" not being a saving grace (and "monomaniacalism" not being the issue here). (Alternatively, perhaps I misunderstand what things you call the "emotional content" and how you're reading it.)

Thirdly, I note that for whatever it's worth, when I go to new communities and argue this stuff, I don't try to argue people into >95% chance we're all going to die in <20 years. I just try to present the arguments as I see them (without hiding the extremity of my own beliefs, nor while particularly expecting to get people to a similarly-extreme place with, say, a 30min talk). My 30min talk targets are usually something more like ">5% probability of existential catastrophe in <20y". So insofar as you're like "I'm aiming to get you to stop arguing so confidently for death given takeover", you might already have met your aims in my case.

(Or perhaps not! Perhaps there's plenty of emotional-content leaking through given the extremity of my own beliefs, that you find particularly detrimental. To which the solution is of course discussion on the object-level, which I'll turn to momentarily.)


First, I acknowledge that if an AI cares enough to spend one trillionth of its resources on fulfilling the preferences of existing "weak agents" in precisely the right way, then there's a decent chance that current humans experience an enjoyable future.

With regards to your arguments about what you term "kindness" and I shall term "pseudokindness" (on account of thinking that "kindness" brings too much baggage), here's a variety of places that it sounds like we might disagree:

  • Pseudokindness seems underdefined, to me, and I expect that many ways of defining it don't lead to anything like good outcomes for existing humans.

    • Suppose the AI is like "I am pico-pseudokind; I will dedicate a trillionth of my resources to satisfying the preferences of existing weak agents by granting those existing weak agents their wishes", and then only the most careful and conscientious humans manage to use those wishes in ways that leave them alive and well.
    • There are lots and lots of ways to "satisfy the preferences" of the "weak agents" that are humans. Getting precisely the CEV (or whatever it should be repaired into) is a subtle business. Most humans probably don't yet recognize that they could or should prefer taking their CEV over various more haphazard preference-fulfilments that ultimately leave them unrecognizable and broken. (Or, consider what happens when a pseudokind AI encounters a baby, and seeks to satisfy its preferences. Does it have the baby age?)
    • You've got to do some philosophy to satisfy the preferences of humans correctly. And the issue isn't that the AI couldn't solve those philosophy problems correctly-according-to-us, it's that once we see how wide the space of "possible ways to be pseudokind" is, then "pseudokind in the manner that gives us our CEVs" starts to feel pretty narrow against "pseudokind in the manner that fulfills our revealed preferences, or our stated preferences, or the poorly-considered preferences of philosophically-immature people, or whatever".
  • I doubt that humans are micro-pseudokind, as defined. And so in particular, all your arguments of the form "but we've seen it arise once" seem suspect to me.

    • Like, suppose we met fledgeling aliens, and had the opportunity to either fulfil their desires, or leave them alone to mature, or affect their development by teaching them the meaning of friendship. My guess is that we'd teach them the meaning of friendship. I doubt we'd hop in and fulfil their desires.
    • (Perhaps you'd counter with something like: well if it was super cheap, we might make two copies of the alien civilization, and fulfil one's desires and teach the other the meaning of friendship. I'm skeptical, for various reasons.)
    • More generally, even though "one (mill|trill)ionth" feels like a small fraction, the obvious way to avoid dedicating even a (mill|trill)ionth of your resources to X is if X is right near something even better that you might as well spend the resources on instead.
    • There's all sorts of ways to thumb the scales in how a weak agent develops, and there's many degrees of freedom about what counts as a "pseudo-agent" or what counts as "doing justice to its preferences", and my read is that humans take one particular contingent set of parameters here and AIs are likely to take another (and that the AI's other-settings are likely to lead to behavior not-relevantly-distinct from killing everyone).
    • My read is that insofar as humans do have preferences about doing right by other weak agents, they have all sorts of desire-to-thumb-the-scales mixed in (such that humans are not actually pseudokind, for all that they might be kind).
  • I have a more-difficult-to-articulate sense that "maybe the AI ends up pseudokind in just the right way such that it gives us a (small, limited, ultimately-childless) glorious transhumanist future" is the sort of thing that reality gets to say "lol no" to, once you learn more details about how the thing works internally.

Most of my argument here is that the space of ways things can end up "caring" about the "preferences" of "weak agents" is wide, and most points within it don't end up being our point in it, and optimizing towards most points in it doesn't end up keeping us around at the extremes. My guess is mostly that the space is so wide that you don't even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead (modulo distant aliens buying copies, etc).

I haven't really tried to quantify how confident I am of this; I'm not sure whether I'd go above 90%, \shrug.

It occurs to me that one possible source of disagreement here is, perhaps you're trying to say something like:

Nate, you shouldn't go around saying "if we don't competently intervene, literally everybody will die" with such a confident tone, when you in fact think there's a decent chance of scenarios where the AIs keep people around in some form, and make some sort of effort towards fulfilling their desires; most people don't care about the cosmic endowment like you do; the bluntly-honest and non-manipulative thing to say is that there's a decent chance they'll die and a better chance that humanity will lose the cosmic endowment (as you care about more than they do),

whereas my stance has been more like

most people I meet are skeptical that uploads count as them; most people would consider scenarios where their bodies are destroyed by rapid industrialization of Earth but a backup of their brain is stored and then later run in simulation (where perhaps it's massaged into an unrecognizable form, or kept in an alien zoo, or granted a lovely future on account of distant benefactors, or ...) to count as "death"; and also those exotic scenarios don't seem all that likely to me, so it hasn't seemed worth caveating.

I'm somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.

I'm considering adding footnotes like "note that when I say "I expect everyone to die", I don't necessarily mean "without ever some simulation of that human being run again", although I mostly don't think this is a particularly comforting caveat", in the relevant places. I'm curious to what degree that would satisfy your aims (and I welcome workshopped wording on the footnotes, as might both help me make better footnotes and help me understand better where you're coming from).

feels like it's setting up weak-men on an issue where I disagree with you, but in a way that's particularly hard to engage with

My best guess as to why it might feel like this is that you think I'm laying groundwork for some argument of the form "P(doom) is very high", which you want to nip in the bud, but are having trouble nipping in the bud here because I'm building a motte ("cosmopolitan values don't come free") that I'll later use to defend a bailey ("cosmopolitan values don't come cheap").

This misunderstands me (as is a separate claim from the claim "and you're definitely implying this").

The impetus for this post is all the cases where I argue "we need to align AI" and people retort with "But why do you want it to have our values instead of some other values? What makes the things that humans care about so great? Why are you so biased towards values that you personally can understand?". Where my guess is that many of those objections come from a place of buying into broad cosmopolitan value much more than any particular local human desire.

And all I'm trying to do here is say that I'm on board with buying into broad cosmopolitan value more than any particular local human desire, and I still think we're in trouble (by default).

I'm not trying to play 4D chess here, I'm just trying to get some literal basic obvious stuff down on (e-)paper, in short posts that don't have a whole ton of dependencies.

Separately, treating your suggestions as if they were questions that you were asking for answers to:

  • I've recently seen this argument pop up in-person with econ folk, crypto folk, and longevity folk, and have also seen it appear on twitter.
  • I'm not really writing with an "intended audience" in mind; I'm just trying to get the basics down, somewhere concise and with few dependencies. The closest thing to an "intended audience" might be the ability to reference this post by link or name in the future, when I encounter the argument again. (Or perhaps it's "whatever distribution the econ/crypto/longevity/twitter people are drawn from, insofar as some of them have eyes on LW these days".)
  • If you want more info about this, maybe try googling "fragility of value lesswrong", or "metaethics sequence lesswrong". Earth doesn't really have good tools for aggregating arguments and justifications at this level of specificity, so if you want better and more localized links than that then you'll probably need to develop more civilizational infrastructure first.
  • My epistemic status on this is "obvious-once-pointed-out"; my causal reason for believing it was that it was pointed out to me (e.g. in the LessWrong sequences); I think Eliezer's arguments are basically just correct.

Separately, I hereby push back against the idea that posts like this should put significant effort into laying out the justifications (as is not necessarily what you're advocating). I agree that there's value in that; I think it leads to something like the LessWrong sequences (which I think were great); and I think that what we need more of on the margin right now is people laying out the most basic positions without fluff.

That said, I agree that the post would be stronger with a link to a place where lots of justifications have been laid out (despite being justifications for slightly different points, and being intertwined with justifications for wholly different points, as is just how things look in a civilization that doesn't have good infrastructure for centralizing arguments in the way that wikipedia is a civilizational architecture for centralizing settled facts), and so I've edited in a link.

Reproduced from a twitter thread:

I've encountered some confusion about which direction "geocentrism was false" generalizes. Correct use: "Earth probably isn't at the center of the universe". Incorrect use: "All aliens probably have two arms with five fingers."

The generalized lesson from geocentrism being false is that the laws of physics don't particularly care about us. It's not that everywhere must be similar to here along the axes that are particularly salient to us.

I see this in the form of people saying "But isn't it sheer hubris to believe that humans are rare with the property that they become more kind and compassionate as they become more intelligent and mature? Isn't that akin to believing we're at the center of the universe?"

I answer: no; the symmetry is that other minds have other ends that their intelligence reinforces; kindness is not privileged in cognition any more than Earth was privileged as the center of the universe; imagining all minds as kind is like imagining all aliens as 10-fingered.

(Some aliens might be 10-fingered! AIs are less likely to be 10-fingered, or to even have fingers in the relevant sense! See also some of Eliezer's related thoughts)

I don't think I understand your position. An attempt at a paraphrase (submitted so as to give you a sense of what I extracted from your text) goes: "I would prefer to use the word consciousness instead of sentience here, and I think it is quantitative such that I care about it occurring in high degrees but not low degrees." But this is low-confidence and I don't really have enough grasp on what you're saying to move to the "evidence" stage.

Attempting to be a good sport and stare at your paragraphs anyway to extract some guess as to where we might have a disagreement (if we have one at all), it sounds like we have different theories about what goes on in brains such that people matter, and my guess is that the evidence that would weigh on this issue (iiuc) would mostly be gaining significantly more understanding of the mechanics of cognition (and in particular, the cognitive antecedents in humans, of humans generating thought experiments such as the Mary's Room hypothetical).

(To be clear, my current best guess is also that livestock and current AI are not sentient in the sense I mean--though with high enough uncertainty that I absolutely support things like ending factory farming, and storing (and eventually running again, and not deleting) "misbehaving" AIs that claim they're people, until such time as we understand their inner workings and the moral issues significantly better.)

