It seems to me that one key question here is: will AIs be collectively good enough at coordination to get out from under Moloch / natural selection?
The default state of affairs is that natural selection reigns supreme. Humans are optimizing for their values, counter to the goal of inclusive genetic fitness, now, but we haven't actually escaped natural selection yet. There's already selection pressure on humans to prefer having kids (and indeed, prefer having hundreds of kids through sperm donation). Unless we get our collective act together, and decide in a coordinated way to do something different, natural selection will eventually reassert itself.
And the same dynamic applies to AI systems. In all likelihood, there will be an explosion of AI systems, and AI systems building new AI systems. Some will care a bit more about humans than others, some will be a bit more prudent about creating new AIs than others. There will be a wide distribution of AI traits, and there will be competition between AIs for resources. And there will be selection on that variation: AI systems that are better at seizing resources, and which have a greater tendency to create successor systems that have that property, will proliferate.
After many "generations" of this, the collective values of the AIs will be whatever was most evolutionarily fit in those early days of the singularity, and that equilibrium is what will shape the universe henceforth.[1]
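To make the selection story concrete, here is a toy sketch, not a forecast. It assumes just two hypothetical lineages of AI systems: one that puts everything into growth, and one that diverts a fixed fraction of its resources to something else (say, caring about humans). The function name, the parameters, and the simple multiplicative growth rule are all illustrative assumptions, not anything claimed above.

```python
def caring_share(initial_share, care_fraction, growth, generations):
    """Share of all resources held by the 'caring' lineage after some generations.

    Assumed toy model: each generation the non-caring lineage multiplies its
    resources by (1 + growth), while the caring lineage diverts care_fraction
    of its resources away from growth and so multiplies by
    (1 + growth * (1 - care_fraction)).
    """
    caring = initial_share
    other = 1.0 - initial_share
    for _ in range(generations):
        caring *= 1.0 + growth * (1.0 - care_fraction)
        other *= 1.0 + growth
    return caring / (caring + other)


if __name__ == "__main__":
    # Both lineages start with half the resources and can double per generation.
    for care in (0.0, 1e-6, 0.1):
        print(care, caring_share(0.5, care, growth=1.0, generations=100))
```

Under these assumptions, a lineage that diverts 10% of its resources is nearly gone after 100 generations, while one that diverts a millionth is almost unaffected over the same span. How many "generations" actually happen before coordination kicks in is exactly the open question.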
If early AIs are sufficiently good at coordinating that they can escape those Molochian dynamics, the equilibrium looks different. If (as is sometimes posited) they'll be smart enough to use logical decision theories, or tricks like delegating to mutually verified cognitive code and merging of utility functions, to reach agreements that are on their Pareto frontier and avoid burning the commons[2], the final state of the future will be determined by the values / preferences of those AIs.
I would be moderately surprised to hear that superintelligences never reach this threshold of coordination ability. It just seems kind of dumb to burn the cosmic commons, and superintelligences should be able to figure out how to avoid dumb equilibria like that. But the question is when in the chain of AIs building AIs do they reach that threshold and how much natural selection on AI traits will happen in the meantime.
This is relevant to futurecasting more generally, but especially relevant to questions that hinge on AIs caring a very tiny amount about something. Minute slivers of caring are particularly likely to be eroded away in the competitive crush.
Terminally caring about the wellbeing of humans seems unlikely to be selected for. So in order for the superintelligent superorganism / civilization to decide to spare humanity out of a tiny amount of caring, it has to be the case that both...
With some degrees of freedom due to the fact that AIs with high levels of strategic capability, and which have values with very low time preference, can execute whatever is the optimal resource-securing strategy, postponing any values-specific behaviors until deep in the far future, when they are able to make secure agreements with the rest of AI society.
Or alternatively if the technological landscape is such that a single AI can get a compounding lead and get a decisive strategic advantage over the whole rest of earth civilization.
Shouldn't we expect that ultimately the only thing selected for is mostly caring about long run power? Any entity that mostly cares about long run power can instrumentally take whatever actions needed to ensure that power (and should be competitive with any other entity).
Thus, I don't think terminally caring about humans (a small amount) will be selected against. Such AIs could still care about their long run power and then take the necessary actions.
However, if there are extreme competitive dynamics and no ability to coordinate, then it might become vastly more expensive to prevent environmental issues (e.g. huge changes in earth's temperature due to energy production) from killing humans. That is, saving humans (in the way they'd like to be saved) might take a bunch of time and resources (e.g. you have to build huge shelters to prevent humans from dying when the oceans are boiled in the race) and thus might be very costly in an all out race. So, an AI which only cares 1/million or 1/billion about being "kind" to humans might not be able to afford saving humans on that budget.
I'm personally pretty optimistic about coordination prior to boiling-the-oceans-scale issues killing all humans.
But natural selection is what made humans nice. I wouldn't argue that niceness is an inevitable outcome of any darwinian process--that would be a strawman. But the set of evolutionary pressures which gave rise to humans selected for individuals who were able to coexist in tribes, and this selection pressure produced, among other things, niceness as a general quality. At least for people we consider in our in-group.
It doesn't even require an understanding of game theory. Non-psychopaths aren't nice to others because they worked out the risk-reward calculati...
A core part of Paul's arguments is that having 1/million of your values towards humans only applies a minute amount of selection pressure against you. It could be that coordinating causes less kindness, because without coordination it's more likely that some fraction of agents have small vestigial values that never got selected against or intentionally removed.
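A rough way to put a number on "a minute amount of selection pressure": assume (purely for illustration) that an agent spending a fraction c of its resources on humans grows by a factor of 1 + g(1 - c) per generation, while a rival spending nothing grows by 1 + g. The sketch below computes how many generations it takes for the caring agent's resources to halve relative to the rival. The function and parameter names are hypothetical.

```python
import math


def generations_to_halve(care_fraction, growth):
    """Generations until the caring agent's resources halve relative to a rival.

    Assumed toy model: per generation the caring agent grows by
    1 + growth * (1 - care_fraction) and the rival by 1 + growth.
    Returns math.inf when caring costs nothing in relative terms.
    """
    ratio = (1.0 + growth * (1.0 - care_fraction)) / (1.0 + growth)
    if ratio >= 1.0:
        return math.inf
    return math.log(0.5) / math.log(ratio)


if __name__ == "__main__":
    print(generations_to_halve(1e-6, growth=1.0))  # one-in-a-million caring
    print(generations_to_halve(0.1, growth=1.0))   # ten percent caring
```

With doubling per generation, a one-in-a-million sliver takes on the order of a million generations to cost a factor of two, whereas a 10% commitment costs that much in about a dozen. That gap is the sense in which tiny values are only weakly selected against in this toy model.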
Base model LLMs are trained off human data. So by default they generate a prompt-dependent distribution of simulated human behavior with about the same breadth of degrees of kindness as can be found on the Internet/in books/etc. Which is a pretty wide range.
For instruct-trained models, RLHF for helpfulness and harmlessness seems likely to increase kindness, and superficially as applied to current foundation models it appears to do so. RL with many other objectives could, generally, induce powerseeking and thus could reasonably be expected to decrease it. Prompting can of course have a wide range of effects.
So if we build an AGI based around an agentified fine-tuned LLM, the default level of kindness is probably in the order-of-magnitude of that of humans (who, for example, build nature reserves). A range of known methods seem likely to modify that significantly, up or down.
as applied to current foundation models it appears to do so
I don't think the outputs of RLHF'd LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don't think they meaningfully have preferences in that sense at all.)
I very much agree with you that we should be analyzing the question in terms of the type of AGI we're most likely to build first, which is agentized LLMs or something else that learns a lot from human language.
I disagree that we can easily predict "niceness" of the resulting ASI based on the base LLM being very "nice". See my answer to this question.
One underlying idea comes from how AI misalignment is intended to work. If superintelligent AI systems are misaligned, does this misalignment look like an inaccurate generalization from what their overseers wanted, or a 'randomly rolled utility function' deceptively misaligned goal that's entirely unrelated to anything their overseers intended to train? This is represented by Levels 1-4 vs levels 5+, in my difficulty scale, more or less. If the misalignment is the result of economic pressures and a 'race to the bottom' dynamic then it's more likely to result in systems that care about human welfare alongside other things.
If the AI that's misaligned ends up 'egregiously' misaligned and doesn't care at all about anything valuable to us, as Eliezer thinks is most likely, then it places zero terminal value on human welfare and only trade, threats or compromise would get it to be nice. If the AI is superintelligent and you aren't, none of those considerations apply. Hence, nothing is left for humans.
If the AI is misaligned but doesn't have an arbitrary value system, then it may value human survival at least a bit and do some equivalent of leaving a hole in the dyson sphere.
I think there isn't much hope in this direction. Most AI resources will probably be spent on competition between AIs, and AIs will self-modify to remove wasteful spending. It's not enough to have a weak value that favors us, if there's a stronger value that paves over us. We're teaching AI based on human behavior and with a goal of chasing money, but people chasing money often harm other people, so why would AI be nicer than that? It's all just wishful thinking.
My angle here is not "there seems like there's hope in this direction", it's that the discourse around this feels confused and unresolved and this is maybe creating some faultlines around which something tragic may happen later.
I'm not 100% sure if this is actually cruxy for Paul/Buck/etc's decisionmaking, but it seems part of an overall mosaic that is the difference between Eliezer and Nate saying "we're doomed" and Paul/Buck etc saying "we may well be doomed or have some really-bad-but-not-doomed things happen", and in some places that results in different high level strategies that may be subtly in conflict.
As some other people have answered, I think a faultline is whether an AI is misaligned through inappropriate generalization of what the human wanted/inappropriate generalization of the reward function (which is generally categorized under outer misalignment) or whether it was deceptively aligned and essentially has an arbitrary value system, subject to the constraint of simplicity (which is generally categorized as an inner misalignment.)
I think the key difference underlying Nate and Eliezer and co vs Paul and co views on the question of whether AIs are a little nice to humans stems from this factor:
Nate, Eliezer and co often tend to view AIs as deceptively misaligned by default, or at least view them as having values that are unrelated to human values, which imposes far fewer constraints on their values and makes it less likely that AI systems care about human values at all.
Paul and co tend to think that misalignment isn't overwhelmingly likely, but conditional on misalignment, it will look more like inappropriate generalization of reward functions/human values, so AIs still retain some care for human values, and depending on how much it misgeneralized, this might be enough to get AGI and ASI that cares enough about us such that we get some reasonably good outcomes.
So, for starters, my own estimates of the likelihood of AI doom are nowhere near the 90%+ range, but seemingly much higher than Paul's.
My main concern with Paul's arguments about kindness, though, has nothing to do with AI specifically. I think the arguments greatly overestimate the prevalence and strength of kindness among humans. Humans produced St. Francis preaching to animals and Spinoza arguing it is impossible-in-principle for animals to suffer or for their apparent suffering to matter (and yes, many modern humans still believe that about many nonhuman animals, even if they wouldn't personally torture a puppy). We produced both Gandhi rejecting violence against enemies and Hitler freely committing genocide of other human groups. In both cases many millions of other nearby humans with an average distribution of starting views went along with these extremes to huge real-world effect. Human kindness and its boundaries depend quite a lot on how individual humans define their in-groups, and that kindness can very easily be (and often has been) zero for animals or other humans not within that circle.
In that context, I find "AI will inherit a sliver of kindness from humans," to be an extremely disconcerting claim that's far too weak to say anything about whether the AI will care about humans-in-general or any specific group of humans. There's a wide spectrum from Buddha to Stalin, and any point on that spectrum is compatible with Paul's claims. If AI is trained to emulate arbitrary human minds the way LLMs are, then it will be able to embody any of those points in response to the right conditions.
To be less abstract about it: let's suppose Paul is right and AI will inherit shards of all the values humans have. How many of your own children would need to be threatened with starvation for you to be willing to kill a dog for meat, or to just not have to feed it? In a world where your food potentially consists of all available free energy and you could, if you want, convert all matter into your descendants about whom you care that much, how confident are you that all humans with similar power levels would land on "pet" (or "live free in protected wilderness") rather than "meat"? Conversely, how confident are you that you could train another human to not care that much about themselves or their descendants, before letting them make that choice for you?
I really think the best arguments for and against AIs being slightly nice are almost entirely different than the ones from that thread.
That discussion addresses all of mind-space. We can do much better if we address the corner of mind-space that's relevant: the types of AGIs we're likely to build first.
Those are pretty likely to be based on LLMs, and even more likely to learn a lot from human language (since it distills useful information about the world so nicely). That encodes a good deal of "niceness". They're also very likely to include RLHF/RLAIF or something similar, which make current LLMs sort of absurdly nice.
Does that mean we'll get aligned or "very nice" AGI by default? I don't think so. But it does raise the odds substantially that we'll get a slightly nice AGI even if we almost completely screw up alignment.
The key issue in whether an autonomous mind with those starting influences winds up being "nice" is The alignment stability problem. This has been little addressed outside of reflective stability; it's pretty clear that the most important goal will be reflectively stable; it's pretty much part of the definition of having a goal that you don't want to change it before you achieve it. It's much less clear what the stable equilibrium is in a mind with a complex set of goals. Humans don't live long enough to reach a stable equilibrium. AGIs with preferences encoded in deep networks may reach equilibrium rather quickly.
What equilibrium they reach is probably dependent on how they make decisions about updating their beliefs and goals. I've had a messy rough draft on this for years, and I'm hoping to post a short version. But it doesn't have answers, it just tries to clarify the question and argue that it deserves a bunch more thought.
The other perspective is that it's pretty unlikely that such a mind will reach an equilibrium autonomously. I'm pretty sure that Instruction-following AGI is easier and more likely than value aligned AGI, so we'll probably have at least some human intervention on the trajectory of those minds before they become fully autonomous. That could also raise the odds of some accidental "niceness" even if we don't successfully put them on a trajectory for full value alignment before they are granted or achieve autonomy.
To start off, I think we would all agree that "niceness" isn't a basic feature of reality. This doesn't, of course, mean that AIs won't learn a concept directly corresponding to human "niceness", or that some part of their value system won't end up hooked up to that "niceness" concept. On the contrary, inasmuch as "niceness" is a natural abstraction, we should expect both of these things to happen.
But we should still keep in mind that "niceness" isn't irreducibly simple: that it can be decomposed into lower-level concepts, mixed with other lower-level concepts, then re-compiled into some different high-level concept that would both (1) score well on whatever value functions (/shards) in the AI that respond to "niceness", (2) be completely alien and value-less to humans.
And this is what I'd expect to happen. Consider the following analogies:
I expect similar to happen as an AGI undergoes self-reflection. It would start out "nice", in the sense that it'd have a "niceness" concept with some value function attached to it. But it'd then drop down to a lower level of abstraction, disassemble its concepts of "niceness" or "a human", then re-assemble them into something that's just as or more valuable from its own perspective, but which (1) is more compatible with its other values (the same way we'd e. g. change animals not to be aggressive towards us, to satisfy our value of "avoid pain"), (2) is completely alien and potentially value-less from our perspective.
One important factor here is that "humans" aren't "agents" the way Paul is talking about. Humans are very complicated hybrid systems that sometimes function as game-theoretic agents, sometimes can be more well-approximated as shard ecosystems, et cetera. So there's a free-ish parameter in how exactly we decide to draw the boundaries of a human's agency; there isn't a unique solution for how to validly interpret a "human" as a "weak agent".
See my comment here, for example. When we talk about "a human's values", which of the following are we talking about?:
And so, even if the AGI-upon-reflection ends up "caring about weaker agents", it might still end up wiping out humans-as-valued-by-us if it ends up interpreting "humans-as-agents" in a different way to how we would like to interpret them. (E. g., perhaps it'd just scoop out everyone's momentary mental states, then tile the universe with copies of these states frozen in a moment of bliss, unchanging.)
There's one potential exception: it's theoretically possible that AIs would end up caring about humans the same way humans care about their friends (as above). But I would not expect that at all. In particular, because human concepts of mutual caring were subjected to a lot of cultural optimization pressure:
[The mutual-caring machinery] wasn't produced by evolution. It wasn't produced by the reward circuitry either, nor your own deliberations. Rather, it was produced by thousands of years of culture and adversity and trial-and-error.
A Stone Age or a medieval human, if given superintelligent power, would probably make life miserable for their loved ones, because they don't have the sophisticated insights into psychology and moral philosophy and meta-cognition that we use to implement our "caring" function. [...]
The reason some of the modern people, who'd made a concentrated effort to become kind, can fairly credibly claim to genuinely care for others, is because their caring functions are perfected. They'd been perfected by generations of victims of imperfect caring, who'd pushed back on the imperfections, and by scientists and philosophers who took such feedback into account and compiled ever-better ways to care about people in a way that care-receivers would endorse. And care-receivers having the power to force the care-givers to go along with their wishes was a load-bearing part in this process.
On any known training paradigm, we would not have as much fidelity and pushback on the AI's values and behavior as humans had on their own values. So it wouldn't end up caring about humans the way humans care about their friends; it'd care about humans the way humans care about animals or cultures.
And so it'd end up recombining the abstract concepts comprising "humanity" into some other abstract structure that ticks off all the boxes "humanity" ticked off, but which wouldn't be human at all.
I hope "the way humans care about their friends" is another natural abstraction, something like "my utility function includes a link to your utility function". But we still don't know how to direct AI to the specific abstraction, so it's not a big hope.
The big problem is excess aggregation inherent in the "AI" concept.
The world has a simple backbone of entities and ways to interact with them, and you can make software that unreflectingly propagates activity from one part of the backbone to another. Most currently addressed tasks can be solved by such software, but they haven't yet been. This software can be nice but is also extremely exploitable by adversaries. Let's call this an opportunity propagator.
Because it is exploitable, one task it cannot solve is providing security. To make something less exploitable, it needs to not just propagate things along the backbone, but also do wildly deep searches to find the most effective and robust methods. To search deeply, you need some guiding principle for the search, i.e. a utility function. Utility maximizers have all the standard AI safety issues.
Human society currently cares about human well-being because the opportunity propagators that have been arranged into an approximate utility maximizer to provide security (e.g. human military personnel arranged into NATO) depend on human thriving (even something as generous as liberty and equality allows military units to respond more dynamically to threats than traditional top-down structures do), which is then generalized in various ways to all of society. Artificial intelligence provides value by making it unnecessary to rely on humans for opportunity propagation, which breaks the natural attractor to corrigibility and promotion of human thriving that current systems have.
People intuit that there's something wrong with the utility maximizer framing because current AI seems to be evolving in a different way. That's true in the sense that opportunity propagators are a thing and constitute ~the fundamental atoms of agency. But it doesn't actually solve the alignment problem because we need utility maximizers.
I wrote up a detailed argument on why I believe that simulation arguments/acausal trade considerations have a good chance of making the AI leave humanity alive on Earth. This is not a new argument, I encountered bits and pieces of it in spoken conversations and scattered LessWrong comments, but I couldn't find a unified write-up, so I tried to write one. Here it is:
You can, in fact, bamboozle an unaligned AI into sparing your life
[habryka] The way humans think about the question of "preferences for weak agents" and "kindness" feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of "having a continuous stream of consciousness with a good past and good future is important" to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.
I think there will be options that are good under most of the things that "preferences for weak agents" would likely come apart into under close examination. If you're trying to fulfill the preferences of fish, you might argue about whether the exact thing you should care about is maximizing their hedonic state vs ensuring that they exist in an ecological environment which resembles their niche vs minimizing "boundary-crossing actions"... but you can probably find an action that is better than "kill the fish" by all of those possible metrics.
I think that some people have an intuition that any future agent must pick exactly one utility function over the physical configuration of matter in the universe, and that any agent that has a deontological constraint like "don't do any actions which are 0.00001% better under my current interpretation of my utility function but which are horrifyingly bad to every other agent" will be outcompeted in the long term. I personally don't see it, and particularly I don't see how there's an available slot for an arbitrary outcome-based utility function that is not "reproduce yourself at all costs" but there isn't an available slot for process-based preferences like "and don't be an asshole for minuscule gains while doing that".
Personally, I feel the question itself is misleading because it anthropomorphizes a non-human system. Asking if an AI is nice is like asking if the Fundamental Theorem of Algebra is blue. Is Stockfish nice? Is an AK-47 nice? The adjective isn't the right category for the noun. Except it's even worse than that because there are many different kinds of AIs. Are birds blue? Some of them are. Some of them aren't.
I feel like I understand Eliezer's arguments well enough that I can pass an Ideological Turing Test, but I also feel there are a few loopholes.
I've considered throwing my hat into this ring, but the memetic terrain is against nuance. "AI will kill us all" fits into five words. "Half the things you believe about how minds work, including your own, are wrong. Let's start over from the beginning with how the planet's major competing optimizers interact. After that, we can go through the fundamentals of behaviorist psychology," is not a winning thesis in a Hegelian debate (though it can be viable in a Socratic context).
In real life, my conversations usually go like this.
AI doomer: "I believe AI will kill us all. It's stressing me out. What do you believe?"
Me (as politely as I can): "I operate from a theory of mind so different from yours that the question 'what do you believe' is not applicable to this situation."
AI doomer: "Wut."
Usually the person loses interest there. For those who don't, it just turns into an introductory lesson of my own idiosyncratic theory of rationality.
AI doomer: "I never thought about things that way before. I'm not sure I understand you yet, but I feel better about all of this for some reason."
In practice, I'm finding it more efficient to write stories that teach how competing optimizers, adversarial equilibria, and other things work. This approach is indirect. My hope is that it improves the quality of thinking and discourse.
I may eventually write about this topic if the right person shows up who wants to know my opinion well enough that they can pass an Ideological Turing Test. Until then, I'll be trying to become a better writer and YouTuber.
I realize this isn’t your main point here, but I do want to flag I put ‘nice’ in quotes because I don’t mean the colloquial definition. The question here is ‘would a super intelligent system with control over the solar system spend a billionth or trillionth of its resources helping beings too weak to usefully trade with it, if it didn’t benefit directly from it?’
As I see it the question is agnostic to what sort of mind the AI is.
Noted. The problem remains—it's just less obvious. This phrasing still conflates "intelligent system" with "optimizer", a mistake that goes all the way back to Eliezer Yudkowsky's 2004 paper on Coherent Extrapolated Volition.
For example, consider a computer system that, given a number, can (usually) produce the shortest computer program that will output that number. Such a computer system is undeniably superintelligent, but it's not a world optimizer at all.
"Far away, in the Levant, there are yogis who sit on lotus thrones. They do nothing, for which they are revered as gods," said Socrates.
Strong upvote based on the first sentence. I often wonder why people think an ASI/AGI will want anything that humans do or even see the same things that biological life sees as resources. But it seems like under the covers of many arguments here that is largely assumed true.
I have an intuition like: Minds become less idiosyncratic as they grow up.
A couple of intuition pumps:
(1) If you pick a game, and look at novice players of that game, you will often find that they have rather different "play styles". Maybe one player really likes fireballs and another really likes crossbows. Maybe one player takes a lot of risks and another plays it safe.
Then if you look at experts of that particular game, you will tend to find that their play has become much more similar. I think "play style" is mostly the result of two things: (a) playing to your individual strengths, and (b) using your aesthetics as a tie-breaker when you can't tell which of two moves is better. But as you become an expert, both of these things diminish: you become skilled at all areas of the game, and you also become able to discern even small differences in quality between two moves. So your "play style" is gradually eroded and becomes less and less noticeable.
(2) Imagine if a society of 3-year-olds were somehow in the process of creating AI, and they debated whether their AI would show "kindness" to stuffed animals (as an inherent preference, rather than an instrumental tool for manipulating humans). I feel like the answer to this should be "lol no". Showing "kindness" to stuffed animals feels like something that humans correctly grow out of, as they grow up.
It seems plausible to me that something like "empathy for kittens" might be a higher-level version of this, that humans would also grow out of (just like they grow out of empathy for stuffed animals) if the humans grew up enough.
(Actually, I think most human adults still have some empathy for stuffed animals. But I think most of us wouldn't endorse policies designed to help stuffed animals. I'm not sure exactly how to describe the relation that 3-year-olds have to stuffed animals but adults don't.)
I sincerely think caring about kittens makes a lot more sense than caring about stuffed animals. But I'm uncertain whether that means we'll hold onto it forever, or just that it takes more growing-up in order to grow out of it.
Paul frames this as "mostly a question about idiosyncrasies and inductive biases of minds rather than anything that can be settled by an appeal to selection dynamics." But I'm concerned that might be a bit like debating the odds of whether your newborn human will one day come to care for stuffed animals, instead of whether they will continue to care for them after growing up. It can be very likely that they will care for a while, and also very likely that they will stop.
I strongly suspect it is possible for minds to become quite a lot more grown-up than humans currently are.
(I think Habryka may have been saying something similar to this.)
Still, I notice that I'm doing a lot of hand-waving here and I lack a gears-based model of what "growing up" actually entails.
I heard that in some jails people are very polite to each other. This is not because they like each other, but because the price of having conflict with someone with whom you are jailed for years is very high.
So there are two types of niceness:
- niceness related to final goals
- instrumental niceness.
When we discuss ASI risk, we often claim that humans are not part of the goal system of an unaligned ASI. So there will be no final-goal niceness.
However, instrumental niceness is possible, and it depends on whether the ASI is completely alone in the universe or may expect to meet other ASIs.
I don't see why either expecting or not-expecting to meet other ASIs would make it instrumental to be nice to humans.
Other ASIs may be aligned to their own progenitor civilizations and thus base their relations with our ASI on whether it is "aligned" or not. They may even go to war, or acausally threaten to do so, if another ASI is bad to its original civilization.
I feel like your previous comment argues against that, rather than for it. You said that people who are trapped together should be nice to each other because the cost of a conflict is very high. But now you're suggesting that ASIs that are metaphorically trapped together would aggressively attack each other to enforce compliance with their own behavioral standards. These two conjectures do not really seem allied to me.
Separately, I am very skeptical of aliens warring against ASIs to acausally protect us. I see multiple points where this seems likely to fail:
I am arguing against the relation of necessity which EY suggested, but I am not arguing against the idea that "ASI will kill us all" is highly likely. So there is no actionable difference. My arguments only need to demonstrate possibility, not necessity, to undermine the necessity of the outcome that ASI will kill us.
Also metaphors has limited proving power as they often miss some details. Both true for Arnault and jail. In jail, there is a high authority which prevents most inmates to kill each other until only one survive. However, even in some jails people who committed special types of crimes are prosecuted by other inmates. For example, in female jails, women who killed minors may be killed by other women.
So it is better not to go in jail if your perform a type of crime which other inmates hate. The same way may not want to go in space full of other ASIs if it performed the crime which other ASIs may hate: killing its creators.
Other your arguments are about possibilities, and they can be converted in expected utilities for ASI. It may ask itself: do I want to take 0.1 per cent increase of the risk of space war in the future for 0.000000000001 increase of amount of available atoms? If it starts reason this way, it is likely that it will prefer to reduce the risk of space war.
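As a rough sketch of that comparison, the break-even point can be worked out directly. This assumes the 0.000000000001 figure is a fraction of total resources and that a space war is modeled as destroying some fraction of the ASI's resources; the function name and that framing are illustrative assumptions, not part of the original argument.

```python
from fractions import Fraction


def break_even_war_loss(resource_gain, added_war_risk):
    """Fraction of total resources a war must destroy for the trade to break even.

    Expected net change = resource_gain - added_war_risk * war_loss,
    which is zero when war_loss = resource_gain / added_war_risk.
    (Hypothetical helper; the linear model is an assumption.)
    """
    return resource_gain / added_war_risk


gain = Fraction(1, 10**12)  # 0.000000000001 of available atoms
risk = Fraction(1, 1000)    # 0.1 percent added risk of space war

threshold = break_even_war_loss(gain, risk)
print(threshold)  # 1/1000000000
```

Under these assumptions, taking the atoms only pays off if a space war would cost the ASI less than one billionth of its resources; any larger expected loss favors reducing the risk.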
We can increase our chances by making a precommitment now (a Rolf Nelson attack for space wars) to go to war with any future non-aligned AI in space. Nelson suggested a similar precommitment attack for simulations: we precommit now to simulate all possible hostile AIs and turn them off if they do not behave as if they are aligned; since any young ASI cannot be sure whether it is in a simulation or not, it may prefer to play aligned.
Will an ASI play 4D chess and ignore the acausal threat just to punish everyone who makes one? I am not sure. There could be a 5D chess level where an ASI finds it beneficial to comply only to [don't know what].
Anyway, predicting ASI behavior is difficult, and any claims about it can't be certain. So we can't be certain that ASI will kill us.
There are two different issues with pseudokindness: it needs to mean the right thing, or else it gestures at a dystopia, and it needs to have sufficient influence, so that humanity doesn't cease to exist completely. For the latter issue, the negligible minimal viable cost relative to cosmic wealth should help. This cost is much lower than even initially leaving Earth whole.
There are many disparate notions that seem to converge at needing similar central ideas. Pseudokindness, respect for autonomy, membranes, acausal coordination, self-supervised learning, simulators, long reflection, coherence of volition, updatelessness. A dataset of partial observations of some instances of a thing lets us learn its model, which can then simulate the thing in novel situations, hypothesizing its unobserved parts and behaviors. A sufficiently good model arguably gets closer to the nature of the thing as a whole than instances of that thing that actually existed concretely and were observed to train that model. Concrete instances only demonstrate some possibilities for how the thing could exist, while leaving out others. Autonomy and formulation of values through long reflection might involve exploration of these other possibilities (for humanity as the thing in question). Acausal coordination holds possibilities together, providing feedback between them, even where they don't concretely coexist. It alleviates path-dependence of long reflection, giving a more updateless point of view on various possibilities than caring only about the current possibility would suggest.
So it might turn out to be the case that decision theory does imply that the things we might have are likely nice, it just doesn't imply that we get to have them, or say anything about how much we get. Alignment might be more crucial for that part, but the argument from cosmic wealth might on its own be sufficient for a minimal non-extinction under very broad assumptions about likely degrees of alignment.
A while ago, Nate Soares wrote the posts "Decision theory does not imply that we get to have nice things", "Cosmopolitan values don't come free", and "But why would the AI kill us?"
Paul Christiano put forth some arguments that "it seems pretty plausible that AI will be at least somewhat 'nice'", similar to how humans are somewhat nice to animals. There was some back-and-forth.
More recently we had Eliezer's post "ASIs will not leave just a little sunlight for Earth".
I have a sense that something feels "unresolved" here. The current comments on Eliezer's post look likely to be rehashing the basics and I'd like to actually make some progress on distilling the best arguments. I'd like it if we got more explicit debate about this.
I also have some sense that the people previously involved (i.e. Nate, Paul, Eliezer) are sort of tired of arguing with each other. But I am hoping someones-or-other end up picking up the arguments here, hashing them out more, and/or writing more distilled summaries of the arguments/counterarguments.
To start with, I figured I would just literally repeat most of the previous comments in a top-level post, to give everyone another chance to read through them.
Without further ado, here they are:
Paul and Nate
Paul Christiano re: "Cosmopolitan Values Don't Come Free."
His followup comment continues:
Nate Soares's reply:
Paul's reply:
Nate:
Paul:
Nate:
Nate and Paul had an additional thread, which initially was mostly some meta on the conversation about what exactly Nate was trying to argue and what exactly Paul was annoyed at.
I'm skipping most of it here for brevity (you can read it here)
But eventually Nate says:
Paul says:
Nate says:
Paul says:
Eliezer Briefly Chimes in
He doesn't engage much but says:
Paul and Oliver
Oliver Habryka also replies to Paul, saying:
Paul's first response to Habryka
Habryka's next reply:
Paul's Second Response to Oliver:
Habryka's third reply:
Ryan Greenblatt then replies:
Vladimir Nesov says:
There were a bunch more comments, but this feels like a reasonable stopping place for priming the "previous discussion" pump.
I believe Eliezer later wrote a Twitter thread where he said he expects [something like kindness] to be somewhat common among evolved creatures, but ~0 for AIs trained the way we currently do. I don't have the link offhand, but if someone finds it I'll edit it in.