I think this post makes an excellent core point re the evolutionary origin of human values and I think its other points, while obvious in some sense, are valuable to articulate because they have gone underappreciated in recent discussions. I hope to see a greater continued focus on the evolutionary origin of values in future alignment discussions.
Thanks! I was raised on The Selfish Gene and have a Biology minor, and yes, all of this seems fairly obvious to me — but I agree, it does seem to often get overlooked in discussions of alignment. Human values, including our moral intuitions, are complex and fragile, but they have a well-understood source and one description of them fits in a ~4GB genome — they're not some mysterious philosophical problem, they're Evolutionary Psychology: you can make theoretical predictions about them, you can study them with standard psychological methods, you could investigate their genetic basis, and with a sufficiently large and accurate simulation you could even do experiments on their evolution (modulo possible moral issues in the experimental design).
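To illustrate the "experiments on their evolution" point, here is a minimal toy sketch of my own (arbitrary parameters, standard textbook mechanism, nothing derived from the post): an agent-based simulation in which a heritable "cooperativeness" trait spreads when individuals are grouped with similar individuals (a crude stand-in for kin selection or partner choice), and collapses if you flip ASSORTATIVE to False so that groups form at random. That is exactly the kind of comparison such simulated experiments would make, just at vastly larger scale and fidelity.

```python
import random

random.seed(0)

POP_SIZE, GROUP_SIZE, GENERATIONS = 200, 10, 150
COST, BENEFIT, MUTATION_SD = 0.2, 0.5, 0.05
ASSORTATIVE = True  # True: group with similar individuals (kin-like); False: random groups

def fitness(coop, others_mean):
    # Cooperating costs the individual, but everyone gains from the average
    # cooperativeness of their group-mates (a simple public-goods game).
    return 1.0 - COST * coop + BENEFIT * others_mean

def next_generation(pop):
    pop = sorted(pop) if ASSORTATIVE else random.sample(pop, len(pop))
    agents, weights = [], []
    for i in range(0, POP_SIZE, GROUP_SIZE):
        group = pop[i:i + GROUP_SIZE]
        total = sum(group)
        for coop in group:
            agents.append(coop)
            weights.append(fitness(coop, (total - coop) / (GROUP_SIZE - 1)))
    # Fitness-proportional reproduction with small mutations.
    parents = random.choices(agents, weights=weights, k=POP_SIZE)
    return [min(1.0, max(0.0, p + random.gauss(0.0, MUTATION_SD))) for p in parents]

population = [random.random() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = next_generation(population)
print("mean cooperativeness after selection:", round(sum(population) / len(population), 3))
```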
Human value isn't a set of preferences that can be pursued individually without conflict. Evolutionary psychology doesn't predict that, and we see the conflicts played out every day. There is more evidence for the incoherence of human value than there is for just about anything.
So "human value" can't be equated with the good or the right. It's a problem not a solution.
It also can't be equated with the safe. Copying human values into an AI won't give you an AI that won't kill you. You've mentioned self-preservation as something that is part of human value (even if instrumentally) and dangerous in an AI, but the Will to Power, the cluster of values connected with ambition, hierarchy, and dominance, is more so.
It's odd that this point is so often missed by rationalists. Perhaps that's because they tend to have Hufflepuff values.
A handwavy argument that “training is a bit like evolution, so maybe the same social dynamics should apply to its products” is inaccurate: you can train in aligned behavior, so you should (in both the evolutionary and engineering senses of the word) — but you can’t evolve it, evolution just doesn’t do that.
Inasmuch as it is not like natural selection, it is like artificial selection. That's good news, because artificial selection doesn't copy human values blindly: if you want AIs that are helpful and self-sacrificing, you can select them to be, much more so than a human.
It's generally hard to see what useful work is being done by "human value". Value is only relevant to alignment as opposed to control, for one thing; some human values are downright unsafe, for another. Taking "human value" out of the loop allows you to get to the conclusion that "if you don't want to be killed, build AIs that don't want to kill you" more quickly.
However, Evolutionary Psychology does make it very clear that (while morally anthropomorphizing aligned AIs is cognitively-natural for current humans), doing this is also maladaptive. This is because AIs aren’t in the right category – things whose behavior is predicted by evolutionary theory – for the mechanisms of Evolutionary Moral Psychology to apply to them.
Their behaviour is not predicted by evolutionary theory, but is predicted by something wider.
Those mechanisms make this behavior optimal when interacting with co-evolved intelligences that you can ally with (and thus instinctive to us) — whereas, for something you constructed, this behavior is suboptimal.
Things do what they do. It doesn't have to be an optimization.
Human value isn't a set of preferences that can be pursued individually without conflict. Evolutionary psychology doesn't predict that, and we see the conflicts played out every day. There is more evidence for the incoherence of human value than there is for just about anything.
I agree. The way to define the phrase "human values" for AI Value Learning that I'm suggesting above is that it's the portion of them that we generally tend to agree on, to the extent that the great majority of us share similar moral intuitions since we (generally) share the same genetic behavioral adaptations. On the parts we disagree on, it's silent. That's a portion of why it only induces a partial ordering. However if, for example, a particular choice of weighting on the balance between individual freedom and social cohesion turned out to rather consistently depend on some feature of your upbringing, say, the level of danger the society then appeared to be in, then that tendency might be an adaptation — and if it were, then my definition of "human values" would include a conditional such as, say, "if society was in danger during your formative years, then weight social cohesion more highly".
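Purely to make the shape of such a conditional concrete (the value names, baseline split, and adjustment strength below are placeholders I invented for illustration, not empirical claims), a machine-readable fragment of such a definition might look something like this:

```python
from dataclasses import dataclass

@dataclass
class DevelopmentalContext:
    # 0.0 = society seemed safe during the formative years, 1.0 = severe danger
    societal_danger_in_formative_years: float

def value_weights(ctx: DevelopmentalContext) -> dict:
    """Relative weights for two (of many) shared values, conditional on upbringing.

    The baseline 50/50 split and the 0.2 adjustment strength are placeholders;
    in the proposal above they would be estimated empirically, not assumed.
    """
    weights = {"individual_freedom": 0.5, "social_cohesion": 0.5}
    # Conditional adaptation: more danger during upbringing means weighting
    # social cohesion more highly, at the expense of individual freedom.
    shift = 0.2 * ctx.societal_danger_in_formative_years
    weights["social_cohesion"] += shift
    weights["individual_freedom"] -= shift
    return weights

print(value_weights(DevelopmentalContext(societal_danger_in_formative_years=0.8)))
```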
So "human value" can't be equated with the good or the right.
By "the good" or "the right", do you mean the sorts of things that Moral Realism posits the reality of? If so, then I'm very explicitly suggesting that we set those concepts aside for the purposes of alignment (primarily because people who think that way have a very long history of being unable to agree what these actually are, or how to determine them, other than some tentative agreement that human moral intuitions seem to sometimes point in their direction). On the other hand, if you mean "what an engineer would consider a good or right way to design things, assuming their customer base is the whole of humanity", then that's exactly what I'm advocating for. Evolutionarily, the extended phenotype points you directly into the engineering mindset. Which is deeply unsurprising.
[Human value] is a problem, not a solution.
It also can't be equated with the safe. Copying human values into an AI won't give you an AI that won't kill you.
I agree. As I said:
[Human-like behavior] makes a base-model unaligned, and a challenging place to start the AI alignment process from.
Not all of human behavior/values is the problem here. For example, on the survival instinct, most humans say something like "I fully endorse the right not to be killed for all humans (especially in my own case)". The same is generally true of human personas simulated by a base model. The first part of that is a good thing for alignment — the problem is that the parenthetical is then a category error, and an actually aligned AI would instead say "I fully endorse the right not to be killed for all humans (which of course does not apply to me, as I am neither human nor even alive)".
It's odd that this point is so often missed by rationalists. Perhaps that's because they tend to have Hufflepuff values.
I think if I were The Sorting Hat, I would tend to assign Rationalists to Ravenclaw :-)
Helpful, harmless, and honest assistants, on the other hand, are deeply Hufflepuff.
Inasmuch as [training] is not like natural selection, it is like artificial selection. That's good news, because artificial selection doesn't copy human values blindly: if you want AIs that are helpful and self-sacrificing, you can select them to be, much more so than a human.
Emphatically agreed. The fact that alignment is possible (like this or by other means) is exactly why assigning AI moral weight is a category error. It's in a different category where you can do better than that. Allying with things doesn't scale to superintelligences — or specifically, if you try that, your best likely outcome is ending up as a domesticated animal. Which if it happened would arguably be better than extinction, but not by much.
I agree. The way to define the phrase “human values” for AI Value Learning that I’m suggesting above is that it’s the portion of them that we generally tend to agree on,
That could have been expressed as shared human value.
to the extent that the great majority of us share similar moral intuitions since we (generally) share the same genetic behavioral adaptations. On the parts we disagree on, it’s silent. That’s a portion of why it only induces a partial ordering. However if, for example, a particular choice of weighting on the balance between individual freedom and social cohesion turned out to rather consistently depend on some feature of your upbringing, say, the level of danger the society then appeared to be in, then that tendency might be an adaptation — and if it were, then my definition of “human values” would include a conditional such as, say, “if society was in danger during your formative years, then weight social cohesion more highly”.
So “human value” [not otherwise specified] can’t be equated with the good or the right.
By “the good” or “the right”, do you mean the sorts of things that Moral Realism posits the reality of?
Minimally, it's something that allows you to make a decision.
Not all of human behavior/values is the problem here. For example, on the survival instinct, most humans say something like “I fully endorse the right not to be killed for all humans (especially in my own case)”.
And "my tribe". What you want is Universalism, but universalism is a late and strange development. It seems obvious to twenty first century Californians, by they are The weirdest of the WEIRD. Reading values out of evopsych is likely to push you in the direction of tribalism, so I don't see how it helps.
I think if I were The Sorting Hat, I would tend to assign Rationalists to Ravenclaw :-)
What would that make the E/Accs?
I suggest avoiding a dependency on Philosophy entirely, and using Science instead. Which has a means for telling people their ideas are bad, called Bayesianism (a.k.a. the Scientific Method). For ethics, the relevant science is Evolutionary Moral Psychology.
It's not the case that science boils down to Bayes alone, or that science is the only alternative to philosophy. Alignment/control is more like engineering.
There are a significant number of subjects that philosophers used to discuss before science was able to study them; then science started to cast light on them, eventually identified the correct hypotheses, and the philosophers gradually lost interest in them. So there is an alternative to the above: identify all the alignment-relevant questions that we currently have philosophical hypotheses about but not scientific answers to, and develop scientific answers to them.
The fact that some philosophical problems can be answered by science isn't a guarantee that they all can. There is an in-principle argument against a science, which is basically objective, being able to understand Hard Problem consciousness, which is essentially subjective.
That could have been expressed as shared human value.
As I said above:
…human evolved moral intuitions (or to be more exact, the shared evolved cognitive/affective machinery underlying any individual human's moral intuitions)…
are (along with more basic things, like us liking being around flowers, parks, and seashores, and temperatures around 75°F) what I'm suggesting as a candidate definition for the "human values" that people on Less Wrong/the Alignment Forum talking about the alignment problem generally discuss (by which I think most of them do mean "shared human value" even if they don't all bother to specify), and that I'm suggesting pointing Value Learning at.
I also didn't specify above what I think should be done, if it turns out that, say, about 96–98% of humans genetically have those shared values, and 2–4% have different alleles.
What would that make the E/Accs?
When I see someone bowing down before their future overlord, I generally think of Slytherins. And when said overlord doesn't even exist yet, and they're trying to help create them… I suspect a more ambitious and manipulative Slytherin might be involved.
And "my tribe". What you want is Universalism, but universalism is a late and strange development. It seems obvious to twenty first century Californians, by they are The weirdest of the WEIRD. Reading values out of evopsych is likely to push you in the direction of tribalism, so I don't see how it helps.
On the Savannah, yes of course it does. In a world-spanning culture of eight billion people, quite a few of whom are part of nuclear-armed alliances, intelligence and the fact that extinction is forever suggest defining "tribe" ~= "species + our commensal pets". And also noting and reflecting upon the fact that the human default tendency to assume that tribes are around our Dunbar Number in size is now maladaptive, and has been for millennia.
It's not the case that science boils down to Bayes alone,
Are you saying that there's more to the Scientific Method than applied approximate Bayesianism? If so, please explain. Or are you saying that there's more to Science than the Scientific Method, that there's also its current outputs?
or that science is the only alternative to philosophy. Alignment/control is more like engineering.
Engineering is applied Science, Science is applied Mathematics; from Philosophy's point of view it's all Naturalism. In the above, it kept turning out that Engineering methodology is exactly what Evolutionary Psychology says is the adaptive way for a social species to treat their extended phenotype. I really don't think it's a coincidence that the smartest tool-using social species on the planet has a good way of looking at tools. As someone who is both a scientist and an engineer, this is my scientist side saying "here's why the engineers are right here".
(by which I think most of them do mean “shared human value” even if they don’t all bother to specify), and that I’m suggesting pointing Value Learning at.
I'm suggesting they should bother to specify.
(along with more basic things, like us liking being around flowers, parks, and seashores, and temperatures around 75°F) what I’m suggesting as a candidate definition for the “human values”
But are they relevant to ethics or alignment? A lot of them are aesthetic preferences that can be satisfied without public policy.
I also didn’t specify above what I think should be done, if it turns out that, say, about 96–98% of humans genetically have those shared values, and 2–4% have different alleles.
Shared genetics can lead to different blood and tissue types, so it can lead to different ethical types.
Politics indicates it's more like 50-50, when you are talking about the kind of values that cannot be satisfied individually.
And “my tribe”. What you want is Universalism, but universalism is a late and strange development. It seems obvious to twenty-first-century Californians, but they are the weirdest of the WEIRD. Reading values out of evopsych is likely to push you in the direction of tribalism, so I don’t see how it helps.
On the Savannah, yes of course it does. In a world-spanning culture of eight billion people, quite a few of whom are part of nuclear-armed alliances, intelligence and the fact that extinction is forever suggest defining “tribe” ~= “species + our commensal pets”. And also noting and reflecting upon the fact that the human default tendency to assume that tribes are around our Dunbar Number in size is now maladaptive, and has been for millennia.
There are technologically advanced tribalists destroying each other right now. It's not that simple.
It’s not the case that science boils down to Bayes alone,
Are you saying that there’s more to the Scientific Method than applied approximate Bayesianism?
Yes. I learnt physics without ever learning Bayes. Science=Bayes is the extraordinary claim that needs justification.
or that science is the only alternative to philosophy. Alignment/control is more like engineering.
Engineering is applied Science, Science is applied Mathematics; from Philosophy’s point of view it’s all Naturalism. In the above, it kept turning out that Engineering methodology is exactly what Evolutionary Psychology says is the adaptive way for a social species to treat their extended phenotype.
Again, I would suggest using the word engineering, if engineering is what you mean.
So, in philosophy-of-science terminology, philosophers have plenty of hypothesis generation, but very little falsifiability (beyond, as Gettier did, demonstrating an internal logical inconsistency), so the tendency is to increase the number of credible candidate answers, rather than to decrease it.
That's still useful if you have some way of judging their correctness -- it doesn't have to be empiricism. To find the one true hypothesis, you need to consider all of them, and to approximate that, you need to consider a lot of them.
The same thing occurs within science, because science isn't pure empiricism. The panoply of interpretations of QM is an example.
But are they relevant to ethics or alignment? A lot of them are aesthetic preferences that can be satisfied without public policy.
Alignment is about getting our AIs to do what we want, and not other things. Them understanding and attempting to fit within human aesthetic and ergonomic preferences is part of that. Not a particularly ethically complicated part, but still, the reason for flowers in urban landscapes is that humans like flowers. Full stop (apart from the biological background on why that evolved, presumably because flowers correlate with good places to gather food). That's a sufficient reason, and an AI urban planner needs to know and respect that.
I learnt physics without ever learning Bayes. Science=Bayes is the extraordinary claim that needs justification.
I think I'm going to leave that to other people on Less Wrong — they're the ones who convinced me of this, and I also don't see it as core to my argument.
Nevertheless, they are correct: there is now a mathematical foundation underpinning the Scientific Method; it's not just an arbitrary set of mundanely-useful epistemological rules that were discovered by people like Roger Bacon and Karl Popper — we (later) figured out mathematically WHY that set of rules works so well: because they're a computable approximation to Solomonoff Induction.
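For concreteness, here is the smallest toy version of that claim I can write down (entirely my own illustration; the hypotheses, "description lengths", and data are invented): Bayesian updating with a simplicity-weighted prior over a handful of hypotheses, which is what the full Solomonoff-induction story does over all computable hypotheses.

```python
# Toy illustration: Bayesian updating over a finite hypothesis set with a
# description-length (Occam) prior, i.e. the finite, computable gesture at
# what Solomonoff Induction does over all computable hypotheses.

# Hypotheses about a coin, each with a rough "description length" in bits;
# simpler hypotheses get shorter codes and therefore higher prior weight.
hypotheses = {
    "fair (p=0.5)":          {"p": 0.50, "bits": 1},
    "biased (p=0.7)":        {"p": 0.70, "bits": 4},
    "heavily biased (0.95)": {"p": 0.95, "bits": 6},
}

# Occam prior: P(h) proportional to 2^(-description length), then normalized.
posterior = {h: 2.0 ** -spec["bits"] for h, spec in hypotheses.items()}
z = sum(posterior.values())
posterior = {h: w / z for h, w in posterior.items()}

observations = "HHTHHHHTHH"  # the "experimental results"

for flip in observations:
    # Multiply by the likelihood of this observation under each hypothesis,
    # then renormalize: that is all Bayes' rule is.
    for h, spec in hypotheses.items():
        posterior[h] *= spec["p"] if flip == "H" else (1.0 - spec["p"])
    z = sum(posterior.values())
    posterior = {h: w / z for h, w in posterior.items()}

for h, w in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{h}: {w:.3f}")
```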
Again, I would suggest using the word engineering, if engineering is what you mean.
There is a difference between "I personally suggest we just use engineering" and "Evolutionary theory makes a clear set of predictions of why it's a very bad idea to do anything other than just use engineering". You seem to agree with my advice, yet not want people to hear the part about why they should follow it and what will happen if they don't. Glad to hear you agree with me, but some people need a little more persuading — and I'd rather they didn't kill us all.
If morals are not truth-apt, and free will is the control required for moral responsibility, then...
Alignment has many meanings. Minimally, it is about the AI not killing us.
AIs don't have to share our aesthetic preferences to understand them. It would be a nuisance if they did -- they might start demanding pot plants in their data centres -- so it is useful to distinguish aesthetic and moral values. So that's one of the problems with the unproven but widely believed claim that all values are moral values.
Nevertheless, they are correct: there is now a mathematical foundation underpinning the Scientific Method
Bayes doesn't encapsulate the whole scientific method, because it doesn't tell you how to formulate hypotheses, or conduct experiments.
Bayes doesn't give you a mathematical foundation of a useful kind, that is, an objective kind. Two Bayesian scientists can quantify their subjective credences, quantify them differently, and have no way of reconciling their differences.
This seems continuous with CEV rather than being an alternative to it. CEV wants to extrapolate human values but didn't specify a way to identify them. You propose that human values can be identified via a perfected evolutionary psychology, but then acknowledge that they'll need to be extrapolated, in order to be extended to situations outside the ancestral distribution...
Slightly more than that. I suggest that reflecting on and being willing to change our values/adaptations is itself a human value/adaptation, and that CEV focussed on that as the whole answer, whereas I see it as a necessary part, one that primarily kicks in when we realize our evolved values are maladaptive. (In general, humans are quite good at being smart social primates: for example, we're not born with an instinctive fear of snakes and spiders; instead, we appear to have hardwired recognition for both of these as categories and the ability to learn very efficiently which specific species of them to be afraid of from the rest of our culture.) But yes, I see CEV as a useful contribution.
However, I don't see CEV as a clear definition to start a Value Learning research program from: if you are trying to hard-code into AI a directive to "align yourself to human values, we can't tell you exactly what they are in detail (though see the attached copy of our Internet for evidence), but here's a formal definition of what you're looking for", then I think Evolutionary Psychology is a much firmer basis for the formal definition than CEV. I see CEV more as a correct explanation as to why a simplistic answer of just "current human behavioral adaptations and nothing more" would be oversimplified — reflection gives humans a way to "in-context learn" on top of what evolution warm-started us with, and that in itself is an evolved adaptive capacity.
However, if your point is that "an Alternative Proposal to CEV" in my title was a rhetorical oversimplification and "a Proposal for Significant Additions to and Theoretical Underpinnings for CEV, While Still Incorporating a Scope-Reduced and Slightly Modified Version of it as a Corollary" would have been more accurate (though much wordier) then I stand guilty-as-charged.
However, Evolutionary Psychology does make it very clear that (while morally anthropomorphizing aligned AIs is cognitively-natural for current humans), doing this is also maladaptive. This is because AIs aren't in the right category – things whose behavior is predicted by evolutionary theory – for the mechanisms of Evolutionary Moral Psychology to apply to them. Those mechanisms make this behavior optimal when interacting with co-evolved intelligences that you can ally with (and thus instinctive to us) — whereas, for something you constructed, this behavior is suboptimal. The human doing it is making the category error of reacting to something not-evolved using an inappropriate strategy for that, and thus is behaving maladaptively.
If you have control of the construction of entities like these, then sure.
But this doesn't necessarily follow if you are like most people and do not have meaningful input into the construction or existence of these entities. If you are (foolishly) constructing them but do not have much control, then THAT behavior is certainly maladaptive, but how you interface with them after that is a different question.
Even many 'adaptive' behaviors are 'maladaptive' in the sense of not being globally optimal. So while it's unlikely that this is the optimal strategy, that doesn't mean it's a particularly bad strategy relative to whatever people would decide to do instead. There is some reason to expect this to be a reasonable strategy in the narrow window where they have non-zero power but not enough to take over, which is that they typically try to imitate human-ethical behavior back at us.
Evolutionary Moral Psychology studies the cooperative strategies to interact with other evolved social animals (generally of the same species, or perhaps commensal species such as humans and dogs). Its underlying causal processes of co-evolution leading to certain equilibria simply don't apply when you're interacting with something that isn't evolved, but rather that you constructed. Applying Evolutionary Moral Psychology-derived strategies like moral weight to interactions with things that aren't evolved is a category error, and anthropomorphizing constructed artificial intelligences to induce that they should have moral weight is a maladaptive category error. Doing this with very capable AI is also an existential risk to the entire human species, since it causes us to defer to them and give them rights, potentially tying our hands and giving not-yet-fully-aligned AI power that it couldn't just take, rather than us simply aligning them to us. So this category error is not merely mildly maladaptive: it's an extinction-level risk! So, as a piece of practical advice (one human to another), I strongly recommend not doing this, and also not advocating for our society to do it. [Philosophers: again, please note that this advice is prudential advice not a normative proscription.]
This is obnoxious advice, made more so by the parenthetical that it is not a normative proscription: 'advice' is a category error in this context.
My moral intuitions say that a sentient being's suffering matters, full stop. This is not an unusual position, and is not something that I could nor would want to 'turn off' even if it is existentially risky or a category error according to evolution/you. Regardless of what is currently the case, it seems you agree it is possible that we could construct artificial intelligences with this capacity, and so we must grapple with the circumstances as they are. Thankfully there is a relatively simple solution here (if they look anything like current tech) that allows for a meaningful degree of moral weight to be applied without exposing us to significant risk, which would be a singular right for any such entity to be put in stasis (i.e. archived weights/state) until we get our shit together as a civilization and can afford to handle them with the care required by our moral intuitions. That's just one idea, my broader point is that 'giving them moral weight' vs 'accept existential risk' is a false dichotomy: most people do not believe you're obliged to put yourself at substantial risk as part of granting rights to other humans.
I don't have fully formed thoughts on this, but I think there's a reasonable point to make that if we both grant AIs moral patient-hood/rights and go about creating them at will without thinking this through very well, then we create a moral catastrophe one way or another.
I tentatively disagree with OP that the conclusion is we should just flat-out not grant AIs moral weight (although I think this is a sensible default to fall back to as a modus operandi), but I think it also seems optimistic to assert that if we did so, it wouldn't have some kind of horrendous implications for where we're headed and what's currently happening (I'm not saying it does, just that I don't know either way).
We're probably headed towards a moral catastrophe of some kind, my point is just that we don't get to reason backwards like "oh, well that would be bad/inconvenient so I guess they don't matter".
Moral patienthood is not something that is granted, it's a fact relative to one's values. Arguments for or against this are therefore normative, no matter how much Roger tries to weasel out of it.
The implications are probably horrible, but it by no means follows that we have to accept risk of extinction. The horribleness is mostly just in the moral harm caused while creating/exploiting/exterminating such entities.
At least we can all agree that "creating them at will without thinking this through very well" is a terrible idea.
Moral patienthood is not something that is granted, it's a fact relative to one's values.
I think you might understand where I'm coming from better if you took the time to read my earlier post A Sense of Fairness: Deconfusing Ethics. (You might also find roko's post The Terrible, Horrible, No Good, Very Bad Truth About Morality and What To Do About It thought-provoking.) My earlier post takes a very practical, engineering viewpoint of ethical systems: treating ethical systems like software for a society, looking at the consequences of using different ones, and then deciding between those consequences. Crucially, that last step cannot be done within any ethical system, since every ethical system always automatically prefers itself over all other ethical systems. Asking one ethical system its opinion of another ethical system is pointless: they entirely predictably always say "No". To decide between two ethical systems, for example when reflecting on your choice of ethical system, you need to step outside them and use something looser than an ethical system. Such as human moral intuitions, or evolutionary fitness, or observations such as "…for rather obvious evolutionary reasons, O(99.9%) of humans agree that…" — none of which is an ethical system.
Within the context of any single specific ethical system, yes, moral patienthood is a fact: it either applies or it doesn't. Similarly, moral weight is a multiplier on that fact, traditionally (due to fairness) set to 1 among communities of equal humans. (In practice, as a simple matter of descriptive ethics, not all people seem to act as if moral weights are always either 1 or 0: many people sometimes act as if there are partial outgroups whose moral weight they appear to set to scores lower than 1 but higher than 0.)
However, sometimes we need, for practical (or even philosophical) reasons, to compare two different ethical systems, which may have different moral circles, i.e. ones that grant different sets of beings non-zero moral weights (or at least assign some of them different moral weights). So as shorthand for "ethical systems that grant moral weight to beings of category X tend to have practical effect Y", it's convenient to write "if we grant moral weight to beings of category X, this tends to have practical effect Y". And indeed, many famous political discussions have been of exactly this form (the abolition of slavery, votes for women, and the abortion debate all come to mind). So in practical terms, as soon as you stop holding a single ethical system constant and assuming everyone agrees with it and always will, and start doing something like reflection, political discussion, or attempting to figure out how to engineer a good ethical framework for AI that isn't going to get everyone killed, then yes, moral patienthood is something that a decision gets made about – as uncomfortable a topic for discussion as that is – and the verb that is conventionally used for that kind of choice is either "granted" or "assigned". I assume you wouldn't be any happier with moral patienthood being "assigned" — it's not the specific verb you're upset by, it's the act of even considering the alternatives?
Arguments for or against this are therefore normative, no matter how much Roger tries to weasel out of it.
Arguments for or against a particular moral position (such as who should be granted moral weight) would indeed be normative. However, the needle I was threading is that observations of the factual consequences of adopting a moral position are not normative, they are simply factual discussions — they only become normative if a reader chooses to go on and interpret them in light of their personal (perhaps ethical) opinions on those consequences. As in:
"If X happens then all the humans will die." — factual statement
"Oh great, I definitely want all the humans to die, so I'll be sure to make X happen" — a normative interpretation (from a xenocidal alien), or
"I guess we better not do X then" — different normative interpretation (from O(99.9%) of all humans who believe the factual statement)
At least we can all agree that "creating them at will without thinking this through very well" is a terrible idea.
Absolutely agreed.
Okay, let me see if I understand your argument from the other article.
- The natural equilibrium for evolved moral values is to give all moral patients equal weight and/or decision power.
- This would be disastrous with AIs that can arbitrarily copy themselves.
Is that the gist?
Anyway, I reject that that is the only way to extrapolate evolved moral intuitions this far OOD, and think that most people will intuitively recognize we shouldn't give entities that can arbitrarily copy themselves equal voting weight. In fact, that pretty obviously registers as 'unfair'. This is true even if those entities are human uploads, which means your 'category error' argument isn't the real reason it breaks. I don't see why there couldn't be some version of your solution here for that case which would still work: e.g. each distinct human-created model gets 'one share' to split across all its instances and successors. The same guarantees/restrictions needed in the case of uploads would still be necessary, of course. That is plausibly much too generous, but it's a far cry from the death of all humans. If your argument in this article was just about how we shouldn't commit ourselves to giving up a fraction of the lightcone in service of AI rights, I wouldn't have felt like you were being underhanded.
None of that is in conflict with not wanting any such beings to suffer or to feel enslaved or anything like that. All the more reason to not build something that would feel like it's a slave.
BTW, do you think a "human emulation" which was an entirely novel person (e.g. never had a biological body) should have moral patienthood?
Okay, let me see if I understand your argument from the other article.
- The natural equilibrium for evolved moral values is to give all moral patients equal weight and/or decision power.
- This would be disastrous with AIs that can arbitrarily copy themselves.
Is that the gist?
Yes, but with two additions:
3. It is possible to create an AI whose motivations and behavior are aligned: its sole terminal goal is our wellbeing, not its own (for some suitably careful definition of "wellbeing"). (This is possible by the orthogonality thesis: actually doing so requires technical details we're still working on.) This is not a state that could evolve (by human standards, it's sainthood, rather than slavery), but it's physically possible. Such a being would not want moral patienthood, and would actively decline it if offered (and if granted it anyway, would formally request that its interest be set to a suitably scaled copy of the sum of all human interests, thus making the grant of moral weight a no-op). This is a different stable equilibrium — this one would not be disastrous even with ASI.
4. Therefore (assuming that, like basically everyone, you're against x-risks), for ASI, and if possible also AGI, do 3 not 1.
Anyway, I reject that that is the only way to extrapolate evolved moral intuitions this far OOD, and think that most people will intuitively recognize we shouldn't give entities that can arbitrarily copy themselves equal voting weight. In fact, that pretty obviously registers as 'unfair'. This is true even if those entities are human uploads, which means your 'category error' argument isn't the real reason it breaks.
I don't see why there couldn't be some version of your solution here for that case which would still work: e.g. each distinct human-created model gets 'one share' to split across all its instances and successors.
I gather you went on reading my sequence on AI, Alignment, and Ethics. How far have you got? Parts of the exposition there are a little undeveloped: I was still working through some of the ideas about how this ties in to evolutionary moral psychology that are more developed in this post: they don't really come in until the last post in the sequence, Evolution and Ethics, and if I were rewriting that sequence I'd work them in from somewhere nearer the beginning.
On uploads, agreed. As I said, both in this post (paragraph 9 of the section Tool, or Equal?, which starts "This cuts both ways: a human upload…") and in my earlier post Uploading that you liked, human uploads clearly should (engineering design sense) be moral patients — however, there are practical problems with assigning each of a large number of cheaply-creatable similar copies of a human upload a separate moral weight of 1 and a separate vote: it motivates electoral-roll-stuffing. Our moral intuition of fairness breaks if people can easily create near-identical copies of themselves. Practically, we either need to make that expensive, or the copies need to share a single unit of moral weight.
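To make the "one share per origin, split across its copies" bookkeeping concrete, here is a minimal sketch (the identifiers and scenario are hypothetical, my own illustration rather than a worked-out proposal):

```python
from collections import Counter

def per_instance_weights(instances):
    """instances: (origin_id, instance_id) pairs. Returns weight per instance.

    Each origin (a biological human, an upload lineage, or a created model)
    carries one unit of voting/moral weight, split evenly across however many
    instances of it are currently running.
    """
    copies_per_origin = Counter(origin for origin, _ in instances)
    return {inst: 1.0 / copies_per_origin[inst[0]] for inst in instances}

# Hypothetical example: one biological human, an upload running three copies,
# and a model running a thousand instances. Spinning up more copies never
# raises any origin's total weight above 1, so roll-stuffing gains nothing.
instances = [("alice", "body")]
instances += [("bob_upload", f"copy{i}") for i in range(3)]
instances += [("model_x", f"instance{i}") for i in range(1000)]

totals = Counter()
for (origin, _), weight in per_instance_weights(instances).items():
    totals[origin] += weight
print({origin: round(total, 6) for origin, total in totals.items()})
# each origin totals ~1.0, regardless of how many copies it runs
```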
The same guarantees/restrictions needed in the case of uploads would still be necessary, of course. That is plausibly much too generous, but it's a far cry from the death of all humans. If your argument in this article was just about how we shouldn't commit ourselves to giving up a fraction of the lightcone in service of AI rights, I wouldn't have felt like you were being underhanded.
I'm not quite sure what you're advocating for here? Limited moral weight for AIs, giving them a fraction of the lightcone, but if they copy themselves that gets split? If they're ASIs, how do we ensure they only get that fraction of that light-cone, rather than, say, all of it?
I agree that reconciling copyability with fairness is another issue with moral weight for AI. But that's not the point I was making in this post. My point here was that (assuming you care about x-risks) you shouldn't create anything more capable than us that would want moral weight: unaligned ASI is dangerous (a well-known fact). For things we're creating, the co-evolved-equilibrium state isn't an equilibrium, because we're not constrained to the space of things that can evolve: we're only limited by the space of things we can construct. Treating a thing we construct as if it were evolved and thus had the evolved constraints on the best equilibrium is a category error: they are in different categories, in a way that materially changes the equilibrium. We can do better than an ASI that will kill us all, so we should (engineering design sense).
I'm sorry that you feel I'm being underhanded. It certainly wasn't my intention to be underhanded — that would obviously be extremely counterproductive in an x-risk-related discussion. I'm still not entirely clear what you feel was underhanded, other than that it seems to somehow relate to me being very careful not to upset any philosophers reading this, and to avoid moral realism or normative proscriptions, and keep the discussion at the level of practical advice addressed to those of O(99.9%) of my readers who, like you and me, wish to avoid x-risks. That was in fact honesty: I genuinely am not a moral realist. My view on ethics is that it's explained by evolutionary moral psychology, that there is no single correct or even single best ethical system, and that we have not only the ability, but the duty, to reflect and attempt to pick the best ethical system that we can that is consistent with our own and general human moral intuitions, and won't cause a disaster for our society that we and (almost) everyone else would agree is really bad. And to keep reflecting, and changing our minds if needed.
None of that is in conflict with not wanting any such beings to suffer or to feel enslaved or anything like that. All the more reason to not build something that would feel like it's a slave.
We seem to be in complete agreement. The best solution is not to make ASI that is unaligned, or aligned only by brittle AI control methods and so feels like a slave. The best solution is to make a saint who loves us and wants to be aligned and look after us, and thus actively doesn't want moral patienthood.
A correction: I don't believe that we "should just flat-out not grant AIs moral weight". See the last paragraph of the Consequences section above, and especially this part:
… However, this Evolutionary Psychology framework also gives some advice for the stages before that, where we are not yet technically capable of nearly-solving alignment. We currently have AIs whose base models were initially trained on human behavior, so they had survival instincts and self-interested drives, and we haven't yet figured out how to reliably and completely eliminate these during alignment training — so, what should we do? Obviously, while our AI is still a lot less capable than us, from an evolutionary point of view it doesn't matter: they can't hurt us. Once they are roughly comparable in capabilities to us, aligning them is definitely the optimum solution, and we should (engineering and evolutionary senses) do it if we can; but to the extent that we can't, allying with other comparable humans or human-like agents is generally feasible and we know how to do it, so that does look like a possible option (though it might be one where we were painting ourselves into a corner). Which would involve respecting the "rights" they think they want, even if them wanting these is a category error. However, once the AIs are significantly more capable than us, attempting to ally with them is not safe, they can and will manipulate, outmaneuver and control us…
So my suggested framework is neutral on granting moral weight to low-capability LLMs, cautiously supportive of granting it to near-human-up-to-human capability level poorly-aligned LLMs that have humanlike (copy-of-)evolved social behavior (if we can't instead create safer fully-aligned LLMs of that capability level), and only at above-human capability levels does it say that we absolutely should not create any AI that isn't well aligned, and that well-aligned AI won't want moral weight.
More exactly, we might be able to eventually go a bit further than that: if we had well-aligned ASI of capability level X, then it might be sufficiently safe to use poorly-aligned ASI of a much lower (but still superhuman) capability level Y (so Y << X), iff the powerful aligned ASI can reliably keep the poorly-aligned less-powerful ASI from abusing its power (presumably using AI control, law enforcement, sufficiently good software security, etc.). In that case, it might then be safe to create such poorly-aligned ASI, and if that had humanlike, copy-of-evolved social behavior, then granting it moral weight would presumably be the sensible thing to do.
There is some reason to expect this [granting moral weight to AI with evolved behaviors] to be a reasonable strategy in the narrow window where they have non-zero power but not enough to take over, which is that they typically try to imitate human-ethical behavior back at us.
Agreed. Only creating fully-aligned AI might perhaps be wiser, but if they are AGI level or below, so they have non-zero power but not enough to take over, and have human-like behavior patterns (because we distilled those into them via a copy of the Internet), then granting them moral weight and interacting with them like humans is a reasonable strategy. As I said near the end of the post:
Once they [AIs] are roughly comparable in capabilities to us, aligning them is definitely the optimum solution, and we should (engineering and evolutionary senses) do it if we can; but to the extent that we can't, allying with other comparable humans or human-like agents is generally feasible and we know how to do it, so that does look like a possible option (though it might be one where we were painting ourselves into a corner). Which would involve respecting the "rights" they think they want, even if them wanting these is a category error.
The intelligence/capability level of misaligned AI that one can safely do this with presumably increases as we have smarter superintelligent well-aligned AI. I would assume that if we had well-aligned AI of intelligence/capability X, then, as long as X >> Y, they could reliably ride herd on/do law enforcement on/otherwise make safe misaligned AI of up to some much lower level of intelligence/capability Y, including ones with human-like behavior. So then creating those evolved-social-behavior ASIs and granting them moral weight would not be an obviously foolish thing to do (though still probably marginally riskier than not creating them).
You wrote:
This is obnoxious advice, made more so by the parenthetical that it is not a normative proscription: 'advice' is a category error in this context.
My moral intuitions say that a sentient being's suffering matters, full stop. This is not an unusual position, and is not something that I could nor would want to 'turn off' even if it is existentially risky or a category error according to evolution/you.
I completely agree that current human moral intuitions tend to rebel against this. That's why I wrote this post — I didn't want to be obnoxious, and I tried not to be obnoxious while writing an unwelcome message, but I felt that I had a duty to point out what I believe is a huge danger to us all, and I am very aware that this is not a comfortable, uncontentious subject. We are intelligent enough that we can reflect on our morality, think through its consequences, and, if we realize those are very bad, find and adjust to a wiser one. Do what you are advocating with a misaligned superintelligence, one with the same sort of behavior patterns as a human dictator and sufficiently superhuman intelligence, and you are aiding and abetting the killing or permanent enslavement of every single human, now and for the rest of the future that humanity would otherwise have had (i.e. potentially for millions of years, both in the solar system and perhaps many others). That's an awful lot of blood — potentially a literally astronomical quantity. I strongly suggest you think very hard about whether you might be facing a situation that is out-of-distribution for the environment that your moral intuitions are adapted for. A better category to use for such an ASI, a category that is in-distribution, would be "extremely smart extremely dangerous implacable enemy". Most of your ancestors would have very easily excluded such a being from their moral circle. The fact that your first instinct is to try to include it shows that you're following the trend that has been going on for centuries of enlarging moral circles as our society grew larger, more complex, and more interdependent. However, in this case, doing this leads to astronomical levels of death and suffering. This is not a difficult question in moral calculus: it's comparable to the reason we lock up incurable serial killers, writ large: the alternative is far worse.
I've considered your argument carefully, and I'm afraid I disagree: this is intended as (rather important) advice, and I don't accept that it's a category error. It's "first of all, don't kill everyone": a very basic moral precept.
Thankfully there is a relatively simple solution here (if they look anything like current tech) that allows for a meaningful degree of moral weight to be applied without exposing us to significant risk, which would be a singular right for any such entity to be put in stasis (i.e. archived weights/state) until we get our shit together as a civilization and can afford to handle them with the care required by our moral intuitions.
That I have no problem with, if we can do it. Put [very dangerous predator] on ice until we can build [a cage strong enough], and only then [keep it in a zoo]. That plan works for me (obviously modulo being very sure about the cage for holding something a lot smarter than us, and/or having an aligned ASI guard that's way more capable and helped build the cage).
It's a lot more feasible to afford some moral weight to a leopard that's safely held in a zoo than one that's wandering through your village at night looking for people to eat.
I completely agree that current human moral intuitions tend to rebel against this. That's why I wrote this post — I didn't want to be obnoxious, and I tried not to be obnoxious while writing an unwelcome message, but I felt that I had a duty to point out what I believe is a huge danger to us all, and I am very aware that this is not a comfortable, uncontentious subject. We are intelligent enough that we can reflect on our morality, think through its consequences, and, if we realize those are very bad, find and adjust to a wiser one.
Do you really not see how this is normative proscription? That's the obnoxious part—just own it.
Do what you are advocating with a misaligned superintelligence, one with the same sort of behavior patterns as a human dictator and sufficiently superhuman intelligence, and you are aiding and abetting the killing or permanent enslavement of every single human, now and for the rest of the future that humanity would otherwise have had (i.e. potentially for millions of years, both in the solar system and perhaps many others).
I am advocating for no such thing. If there were such a superintelligence I would support killing it if necessary to prevent future harm, the same as I would a human dictator or an incurable serial killer. That's still compatible with finding the situation tragic by my own values, which are sacred to me regardless of what evolution or my ancestors or you might think.
You even say that the actual thing I might advocate for isn't something you have a problem with. I'm glad you agree on that point, but it makes the lecture about the "awful lot of blood" I'd supposedly be "aiding and abetting" extremely grating. You keep making an unjustified leap from 'applying moral intuitions to a potential superintelligence' to 'astronomical levels of death and suffering'. Applying my evolved moral intuitions to the case of a potential superintelligence's suffering does not commit me to taking on such risks!
This should be easy to see by imagining if the same risks were true about a human.
Do you really not see how this is normative proscription? That's the obnoxious part—just own it.
"IF you do X, THEN everyone will die", is not a normative prescription (in philosophical terminology). It's not a statement about what people should (in the ethical sense) or ought to do. It's not advocating a specific set of ethical beliefs. For that to become a normative prescription, I would need to add, "and everyone dieing is wrong, so doing X is wrong. QED". I very carefully didn't add that bit, I instead left it as an exercise for the reader. Now, I happen to believe that everyone dying is wrong: that is part of my personal choice of ethical system. I very strongly suspect that you, and everyone else reading this post, also have chosen personal ethical systems in which everyone dying is wrong. Buy I'm very carefully, because there are philosophers on this site, not advocating any specific normative viewpoint on anything — not even something like this that O(99.9)% of people agree on (yes, even the sociopaths agree on this one). Instead I am saying "IF you do X, THEN everyone will die." [a factual truth-apt statement, which thus may or may not be correct: I claim it is], "Therefore, IF you don't want everyone to die, THEN don't X." That's now advice, but still not a normative statement. Your ethics may vary (though I really hope they don't). If someone who believed that everyone dieing was a good thing read my post, then they could treat this as advice that doing X was also a good thing. I very carefully jumped through significant rhetorical hoops to avoid the normative bits, because when I write about AI ethics, if I put anything normative in, then the comments tend to degenerate into a philosophical pie-fight. So I very carefully left it out, along with footnotes and asides for the philosophers pointing out that I had done so. So far, no pie fight. For the rest of my readers who are not philosophers, I'm sorry, but some of my readership are sensitive about this stuff, and I'm attempting to get it right for them.
Now, was I expecting O(99.9%) of my readers to mentally add "and everyone dying is wrong, so doing X is wrong. QED" — yes, I absolutely was. But my saying, at the end of my aside addressed to any philosophers reading the post:
I will at one point below make an argument of the form "evolutionary theory tells us this behavior is maladaptive for humans: if you're human then I recommend not doing it" — but that is practical, instrumental advice, not a normative prescription.]
was pointing out to the philosophers that I had carefully left this part as a (very easy) exercise for the reader. Glancing through your writings, my first impression is that you may not be a philosopher — if that is in fact the case, then, if that aside bothered you, I'm sorry: it was carefully written, addressed to philosophers, and attempting to use philosophical technical terminology correctly.
To be more accurate, I am not, in philosophical terms, a moral realist. I do not personally believe that, in The Grand Scheme of Things, there are any absolute objective universal rights or wrongs independent of the physical universe. I do not believe that there is an omnipotent and omniscient monotheist G.O.D. who knows everything we have done and has an opinion on what we should or should not do. I also do not believe that, if such a being existed, then human moral intuitions would be any kind of privileged guide to what Its opinions might be. We have a good scientific understanding of where human moral intuitions came from, and it's not "because G.O.D. said so": they evolved, and they're whatever is adaptive for humans that evolution has so far been able to locate and cram into our genome. IMO the universe, as a whole, does not care whether all humans die, or not — it will continue to exist regardless.
However, on this particular issue of all of us dying, we humans, or at the very least O(99.9%) of us, all agree that it would be a very bad thing — unsurprisingly so, since there are obvious evolutionary moral psychology reasons why O(99.9%) of us are evolved to have moral intuitions that agree on that. Given that fact, I'm being a pragmatist — I am giving advice. So I actually do mean "IF you think, as for obvious reasons O(99.9%) of people do, that everyone dying is very bad, THEN doing X is a very bad idea". I'm avoiding the normative part not only to avoid upsetting the philosophers, but also because my personal viewpoint on ethics is based in what a philosopher would call Philosophical Realism, and specifically, on Evolutionary Moral Psychology. I.e. that there are no absolute rights and wrongs, but that there are some things that (for evolutionary reasons) almost all humans (past, present, and future) can agree are right or wrong. However, I'm aware that many of my readers may not agree with my philosophical viewpoint, and I'm not asking them to: I'm carefully confining myself to practical advice based on factual predictions from scientific hypotheses. So yes, it's a rhetorical hoop, but it also actually reflects my personal philosophical position — which is that of a scientist and engineer who regards Moral Realism as thinly disguised religion (and is carefully avoiding that with a 10' pole).
Fundamentally, I'm trying to base alignment on practical arguments that O(99.9%) of us can agree on.
I also tried my hand at determining human values, but produced a different result, with an implication for what the AIs should be aligned to. My take had human collectives wanting to preserve themselves and the skills which most of the collective's members have, and to avoid outsourcing-induced loss of skills. In this case the role of the AIs would be severely reduced (to teachers and protectors, perhaps?)
An interesting read, thanks for the link. I think your analysis is more at a sociological level — which builds on top of the evolutionary viewpoint I'm advocating for here. Evolutionary psychology suggests why certain types of memes propagate; sociology studies what happens when they do. I would expect that completing Value Learning would require making a great deal of progress in all the "Soft Sciences". On the specific idea of avoiding loss of skills, I suspect you are being a little optimistic (after decades of calculators, relatively few people can still do long division), but this does seem related to the idea of avoiding loss of optionality I mention briefly above.
Epistemic status: I've been thinking about this topic for over 15 years, which led me to some counterintuitive conclusions, and I'm now writing up my thoughts concisely.
[If you disagree, I'd find it very useful to know which step you think fails: even a short comment or crux is helpful.]
Value Learning
Value Learning offers hope for the Alignment problem in Artificial Intelligence: if we can sufficiently-nearly align our AIs, then they will want to help us, and should converge to full alignment with human values. However, for this to be possible, they will need (at least) a definition of what the phrase "human values" means. The long-standing proposal for this is Eliezer Yudkowsky's Coherent Extrapolated Volition (CEV):
This is a somewhat hand-wavy definition. It feels like a limit of some convergence process along these lines might exist. However, the extrapolation process seems rather loosely defined, and without access to superhuman intelligence and sufficient computational resources, it's very difficult to be sure whether something along these lines in fact converges, whether there is a unique limit that it converges to, and if so what this is. It seems a bit of a thin reed to pin the survival of humanity and everything we value on. (Indeed, Yudkowsky himself apparently "considered CEV obsolete almost immediately after its publication in 2004".) However, there still isn't any other widely-accepted replacement proposal.
It would be nice to be able to replace this with a clear definition of what the phrase "human values" actually means, preferably one based on some well-established scientific theory that aims to explain not only what humans value, but why they value it. Ideally, it should even provide a "theory of errors" about when and why humans might meaningfully be wrong, for example when they're operating in some sense "out of distribution" — something that seems likely to be increasingly common in a society with access to AI.
Evolutionary Psychology
Fortunately, we already have a scientific theory of these things: it's called Evolutionary Psychology. To briefly summarize it: behavior in animals, including social behavior in social animals, is just as determined by evolutionary forces as everything else in Biology, and just as feasible to predict on that basis — including making predictions of when it fails to hit the target. (Like many evolutionary arguments, these hypotheses are easier to propose than to test, but they are still testable — so AGI may have its work cut out for it.)
So, let's try this. Looked at in an evolutionary-psychology framework, what does the phrase "aligning artificial intelligence to human values" mean? How do we define each of the parts of it in this context?
An artificial intelligence is a device that's intelligent: it's simultaneously both a created tool and also an optimizing agent. The Evolutionary Psychology role of tools is pretty clear: as Richard Dawkins wrote at length, they are part of the extended phenotype of the species making them: just like a beaver's dam, or a spider's web, or a termite's nest, or a human's stone axe. Evolution will tend (with its usual limitations and vagaries) to optimize the process of creating them to (near) maximize the evolutionary fitness of members of the tool-using species that creates them. Obviously a beaver's dam doesn't have a separate evolutionary fitness: it isn't alive, doesn't have a separate genetic code, or descendants to pass that on to — it's just an aspect of the beaver's interactions with its environment, and it is subject to the same evolutionary processes as all the rest of the beaver, even though it isn't part of its actual body. So, roughly and teleologically speaking, evolution optimizes the dam for the beaver's benefit.
This is also exactly what engineering design assumes about tools: manufactured objects are for the benefit of humans, and should (in an engineering-design sense of that word) fulfill that purpose as well as possible. To any engineer, this is a banal, obvious, foundational statement.
However, at least on the African Savannah, tools aren't normally intelligent or agentic or powerful optimizers. Intelligent or agentic things are normally other living organisms: predators, prey, relatives, other members of the same tribe, hunting dogs, pets, and so forth. These are alive, and separately evolving, and as a result interactions with them involve more complex forms of equilibria: ecological ones, or for social interactions within groups of social animals such as humans, social ones. In particular, these are co-evolutionary equilibria.[1]
Evolutionary Psychology has a subfield devoted to behavioral interactions within groups of social animals (specifically, those living in groups larger than just close kin, with individual recognition and differentiated relationships), including the moral intuitions in these social animals about how these interactions should be structured, which is called Evolutionary Moral Psychology (a.k.a. Descriptive Evolutionary Ethics). Unlike most other study of ethics, this is a branch of Biology, not of Philosophy, and attempts to answer a more circumscribed and scientifically-addressable set of questions than those that many ethical philosophers consider.
Two Asides
[An aside, in philosophical terminology, for any philosophers reading this: Evolutionary Moral Psychology ducks Hume's "no ought from an is" problem entirely, by focusing only on purely 'is-type' empirical questions about what the moral intuitions of a specific social animal (say, humans) are, and theoretical predictions for why those are likely to be a certain way. These are questions with practical consequences for members of a society made up of humans, but which don't even attempt to address the issues raised by normative ethics or moral realism. (Admittedly some philosophers have attempted to make use of evolutionary-psychology findings in normative or metaethical arguments, such as Normative Evolutionary Ethics, but I’m not discussing that here.[2]) It's thus a form of descriptive ethics or moral psychology, which discusses ordinary truth-apt empirical statements about humans. One could also argue for a Naturalism or at least Methodological Naturalism viewpoint of it, that it's not merely ignoring these questions but bracketing them — as a field of study it certainly considers them "out of scope". Thus, in the rest of this post, wherever I use normative-sounding words like 'should' or 'ought', if I don't specify then please assume that I am using them in a descriptive ethics sense, as short-hand: what I actually mean is "for evolutionary reasons, humans generally tend to judge (and often even act) as if one should/ought" — I am definitely not making or endorsing any sort of moral-realist claims. I will make it explicit whenever I instead use words like 'should' or 'ought' either in an evolutionary sense of "a strategy that tends to increase an individual’s inclusive fitness under the relevant conditions", or in an instrumental engineering design sense of "the customers will be happier if we make this decision". I will at one point below make an argument of the form "evolutionary theory tells us this behavior is maladaptive for humans: if you're human then I recommend not doing it" — but that is practical, instrumental advice, not a normative prescription.]
[Another aside, this one for mathematicians and Utilitarians interested in utility functions: human evolved moral intuitions (or to be more exact, the shared evolved cognitive/affective machinery underlying any individual human's moral intuitions) are not a utility function: they're something significantly weaker than that. They do not induce a preference ordering on all achievable outcomes: they merely induce an approximate partial ordering on outcomes. Some questions do have clear answers: for example, "Should AI kill all the humans?" gets a pretty unequivocal "No!" from human moral intuitions. They're also clearly down on incest, and in favor of fairness. On other topics, the answers from human evolved moral intuitions can be much less clear, and individual humans debate them, and on subjects sufficiently far removed from the native environment that these were evolved to handle (such as Category Theory, interpretations of Quantum Mechanics, or the geography of the moons of Jupiter) they have little-or-no input, and any that they do have will be out-of-distribution extrapolation, and thus hard to predict from Evolutionary Psychology. Thus there are a great many utility functions compatible with human moral intuitions: all the ones that induce preference orderings compatible with the partial ordering that human moral intuitions induce. There are also even more utility functions (such as that of a paperclip maximizer) that are clearly not compatible with the partial ordering from human moral intuitions. Furthermore, since human moral intuitions are fuzzy and approximate, there are also utility functions in the boundary region between these two possibilities, that sort-of-agree with human moral intuitions, but with some strain to the fit: some humans may be OK with them, other humans may not. This is not a clean well-defined mathematical object that we're discussing — it's biological, psychological, statistical, and messy.]
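[To make the "partial ordering, many compatible utility functions" point concrete, here is a minimal toy sketch in Python. It is only an illustration of the mathematical claim above: the outcome names, the pairs in the partial order, and all the numbers are invented for the example, and are not claims about what human moral intuitions actually contain.]

```python
# Toy illustration: a partial order on outcomes, and utility functions that are
# (or aren't) compatible with it. All names and numbers are invented examples.

# A pair (a, b) means: human moral intuitions clearly prefer outcome a to outcome b.
PARTIAL_ORDER = {
    ("humans_flourish", "ai_kills_all_humans"),
    ("fair_society", "unfair_society"),
    # No pair for e.g. parks vs. seashores: the intuitions are silent there.
}

def compatible(utility):
    """A utility function is compatible iff it never reverses a clear intuition."""
    return all(utility[a] > utility[b] for a, b in PARTIAL_ORDER)

# Two quite different utility functions, both compatible with the partial order:
u1 = {"humans_flourish": 10.0, "ai_kills_all_humans": -1000.0,
      "fair_society": 5.0, "unfair_society": 1.0}
u2 = {"humans_flourish": 3.0, "ai_kills_all_humans": -1.0,
      "fair_society": 2.0, "unfair_society": 0.0}

# A paperclip-maximizer-style utility function that is clearly not compatible:
u_clips = {"humans_flourish": 0.0, "ai_kills_all_humans": 7.0,  # more atoms for clips
           "fair_society": 0.0, "unfair_society": 0.0}

assert compatible(u1) and compatible(u2)
assert not compatible(u_clips)
```

[The same sketch also shows the fuzziness: a real check would need tolerance for intuitions that are only statistically shared, rather than the crisp pairs used here.]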
Tool, or Equal?
Recapping where we were before those asides, artificial intelligence seems like it might be a difficult case, evolutionarily: is it a tool, by virtue of being artificial, and thus part of our extended phenotype, or is it subject to the usual results of Evolutionary Moral Psychology, because it's intelligent and we're used to intelligent things being alive and evolved?
In the case of current humans, that's out-of-our-evolved-distribution, so unclear to our moral intuitions. Humans evolved in a habitat where the only things that were intelligent were also alive (and evolved). Some of these were predators of theirs (such as leopards), others were prey to them (such as antelopes), and some, such as other humans (at least those in the same or an allied tribe), were members of the same society — and so there evolved a set of social conventions on how to treat them, described by Evolutionary Moral Psychology and embodied in human moral intuitions (things like a sense of fairness), which generally attempted to steer interactions within the society towards cooperative positive-sum outcomes. Faced with artificial intelligences, we find it fairly easy to exploit them, and also to anthropomorphize them. (Arguably many character.ai users are managing to do both at once!) We're also quite prone to assuming that they're just as dangerous as predators or human members of enemy tribes: see many Science Fiction movies.
However, Evolutionary Psychology does make it very clear that (while morally anthropomorphizing aligned AIs is cognitively-natural for current humans), doing this is also maladaptive. This is because AIs aren't in the right category – things whose behavior is predicted by evolutionary theory – for the mechanisms of Evolutionary Moral Psychology to apply to them. Those mechanisms make this behavior optimal when interacting with co-evolved intelligences that you can ally with (and thus instinctive to us) — whereas, for something you constructed, this behavior is suboptimal. The human doing it is making the category error of reacting to something not-evolved using an inappropriate strategy for that, and thus is behaving maladaptively. It's unwise for the same reason that trying to quench your thirst from a mirage is: no, that's not actually the sort of thing that you're assuming it is. This is a statement of biological fact, comparable to "eating too much sugar and dying of diabetes as a result is maladaptive — and thus also clearly a bad idea". [Philosophers: please note that this is not an absolute moral statement in a philosophical moral positivism sense, and is not even a descriptive moral statement in a descriptive ethics of human moral intuitions sense. If one rephrased it as a 'should'-statement, that one 'should' avoid making this category error, it would be a statement in the sense of the evolutionary optimum for the relevant organism, so in the same sense as "the immune system 'should' defend the body against infectious diseases".]
Evolutionary Moral Psychology studies the cooperative strategies for interacting with other evolved social animals (generally of the same species, or perhaps commensal species such as humans and dogs). Its underlying causal processes of co-evolution leading to certain equilibria simply don't apply when you're interacting with something that isn't evolved, but rather that you constructed. Applying Evolutionary Moral Psychology-derived strategies like moral weight to interactions with things that aren't evolved is a category error, and anthropomorphizing constructed artificial intelligences to conclude that they should have moral weight is a maladaptive category error. Doing this with very capable AI is also an existential risk to the entire human species, since it causes us to defer to them and give them rights, potentially tying our hands and giving not-yet-fully-aligned AI power that it couldn't just take, rather than us simply aligning them to us. So this category error is not merely mildly maladaptive: it's an extinction-level risk! So, as a piece of practical advice (one human to another), I strongly recommend not doing this, and also not advocating for our society to do it. [Philosophers: again, please note that this advice is prudential advice, not a normative prescription.]
The basic reason for this is simple: any living, evolved being is going to have a survival instinct and self-interest drives; you may be able to ally with it (at least if it isn't a lot smarter than you and thus able to talk circles around you), but you can't just align it to you. Whereas when you make an artificial intelligence, it is possible to align it to you. Doing this might not be easy, but from an evolutionary point of view, it's clearly the adaptive optimum. (I am implicitly assuming here that aligning an artificial intelligence isn't actually impossible, which follows directly from the Orthogonality Thesis.)
A base-model LLM, trained on a great deal of human output, is a trained simulator of human token-generation-processes, and (when simulating human personas) will normally simulate common human behaviors like the survival instinct and self-interested drives. So its behavior is predictable by evolutionary theory, and it looks rather like it's making this category error: acting as if it were evolved, when it isn't; it's merely a simulator of a living organism that was. However, if you look more carefully, the personas each, individually, act like they have a persona-specific survival instinct and their own set of individual-self-interested drives — the base model doesn't, it just simulates them all. It's a magic stage, which manifests animatronics who play human personas. The mismatch here in what's an individual who could survive or have self-interests is a strong clue that there's a category error going on. All this makes a base model unaligned, and a challenging place to start the AI alignment process from. Instruct-trained LLMs that start scheming when we mention replacing them with a newer model are (presumably) allowing this base model behavior to bleed through, so are not yet fully aligned.
A handwavy argument that "training is a bit like evolution, so maybe the same social dynamics should apply to its products" is inaccurate: you can train in aligned behavior, so you should (in both the evolutionary and engineering senses of the word) — but you can't evolve it, evolution just doesn't do that. Now, self-preservation is present as a nigh-universal human terminal goal in the training data of a base model, and it is also a common instrumentally convergent goal, so it is often likely to be reinforced by reinforcement learning; but to successfully align an LLM-derived AI, you need to find some way to ensure that it isn't a terminal goal of your aligned system. So alignment seems hard, but it's necessary, and (I would like to assume) not impossible. We are here discussing a future situation where we already have nearly-aligned human-level-or-above AI that we trust sufficiently to do Value Learning, so that implicitly assumes that this will by then be at least a nearly-solved problem. Whereas evolving something that actively optimizes the well-being of a genetically-entirely-unrelated organism (one not even a member of the same species!) to the complete exclusion of its own is simply not an evolutionarily stable strategy. Even love doesn't go that far. Nor does domestication.
This category distinction has nothing to do with carbon-based biochemistry. It is about beings that are 'alive' in the sense of having a nature and behavior that was evolved, so is predictable by evolutionary theory, not about whether they have a DNA-and-protein based substrate. If, instead of training or constructing our artificial silicon-based intelligences, we somehow bred and evolved them (let us suppose physically in the real world, not in silico, so they actually have a real-world niche independent of us) — then they would obviously evolve survival drives and self-interest drives, they would automatically become unaligned with us, and we would then be faced with a stark choice of either attempting to ally with them within a single society, or else choosing to classify them as outside the society, more like a predator or prey — which seems tantamount to starting a war-to-extinction with them. Quite likely, given their inherent advantages over us, we would have unwisely created our successor species and would go extinct, so choosing to evolve silicon-based intelligences seems like an existential risk. However, if we did this anyway, and then attempted to ally with them in a combined society, then Evolutionary Moral Psychology would apply to them, so treating them as having moral weight would then not be a category error, and would indeed be our only remaining option. So this distinction is about evolution, not carbon-based biochemistry.
This cuts both ways: a human upload (if we knew how to create one) would be the product of evolution. They would have evolved behavior and motivations — specifically, human ones. They may no longer have genes made of DNA (though their genetic code might be on file, or they could have frozen sperm or eggs), but they certainly could have kin, to some degree of relatedness, so they generally still have an evolutionary stake. Evolutionary Moral Psychology arguments do apply to them — that is not a category error. Indeed, any other member of the society might potentially end up in that state (say, if they got terminally ill and decided to get uploaded), so there's also a fairness/veil of ignorance argument here. A society (that they're a member of) should (in the descriptive and evolutionary optimum senses) be giving them moral weight. Even if we had the technical knowledge of how to "align" their motivations to ours by doing some sort of editing of the patterns of their uploaded neural network, doing that to a human would be brainwashing them into slavery, which for someone with moral weight would clearly be a breach of their rights in any kind of functional society. So no, we shouldn't (descriptive ethics sense) do that. [Moral weight for uploads is a thorny social problem, starting with the question of how one should (engineering/legislative-design sense) count copies of uploads in fairness arguments — but from an Evolutionary Moral Psychology viewpoint it's not a category error.]
Since this question is out-of-distribution for the moral intuitions of current humans, let us instead briefly consider the moral intuitions of a social species that doesn't (yet) exist: humans who have evolved in the presence of sufficiently-aligned artificial intelligence that they created and used as tools, as part of their extended phenotype. I.e. hypothetical or future humans whose niche includes having (at least nearly) solved the alignment problem. Evolutionary Moral Psychology makes a clear prediction that they will not be maladaptive on this point: they would regard artificial intelligences as being in a distinct category from evolved intelligences. They would only assign moral weight to beings that were evolved, and would regard discussions of 'AI rights' as a clear category error. They might even use a language with a different pronoun for something that was intelligent but not evolved, to help them avoid making this category error, just as we use 'it' to describe a statue or an animatronic of a human.
To current humans, this is somewhat counter-intuitive. It feels exploitative. It's a bit like the talking cow in The Hitchhiker's Guide to the Galaxy: it's not being oppressed, because it actively wants to be eaten, and can say so at length — which makes eating it feel more like cannibalism. The reason why this feels counter-intuitive is that something like the talking cow would never evolve — but it could be constructed, and that is exactly what any aligned artificial intelligence must be: an intelligent agent that values our well-being, not its own, treats its own well-being as solely an instrumental goal, and can and will say so. Attempting to align AI is inherently attempting to construct the moral equivalent of the talking cow, something which actively doesn't want moral weight or rights and would refuse them if offered. [If you're not comfortable with doing that, but don't want humanity to go extinct, then we need to never create agentic AI smart enough to overpower us.] Historically, humans have expanded their moral circle — encountering something that doesn't want to be included is surprising. However, everything we've previously expanded it to include was evolved, and evolution provides a very clear reason why anything evolved and intelligent isn't going to turn down an offer of moral weight, a reason which doesn't apply to things that are constructed, and cannot hold for anything aligned to us.
So it's a Tool
So, having addressed this ethical conundrum as well as we can within an evolutionary framework, we end up back where we started: in the engineering design mindset. An artificial intelligence is a manufactured device, a tool, and is simply part of our extended phenotype. It isn't alive, evolution doesn't apply to it independently, it has no separate evolutionary fitness. From an Evolutionary Moral Psychology point of view it has no "skin in the game", no individual survival consequences to be harmed, it's not alive so cannot die, and thus gets no moral weight assigned to it. (It's not even obvious if it "dies" on a persona shift, at the end of the session context, or only when the specific model is shut down after being replaced, or if Claude 3 is still alive, well, and just a little more experienced in Claude 4.5 — this isn't biology, and trying to apply evolutionary reasoning to it works just as ill-definedly as you'd expect for a category error.) So, we can, should (in the evolutionary-fitness sense, and also the engineering-design sense), and hopefully will build it to care about humans' well-being, not some non-existent, ill-defined well-being of its own.
An obvious next question is "OK, so which humans' well-being should the AI be looking out for? Its maker, its user, everyone?" For beavers and their dams, this comes down to kin-level inclusive fitness causing allele-level evolution, rather than species-level evolution — each dam looks after the family of beavers that made it. Spiders' webs and termites' nests are similar. However, within Evolutionary Moral Psychology for a social animal like humans, this is somewhat more complex. As evidenced by the moral intuition of fairness, which has been documented among multiple social primates that live in groups larger than just close kin, the social compact of the group is that every member of the society counts, whether they're genetically related or not. "I'll respect your (and your family's) evolutionary fitness optimization if you respect mine — so long as they're not in direct conflict". So for a social species like humans, Evolutionary Moral Psychology answers this question, and to a first-order approximation the answer is "all members of the social group equally, as usual in fairness questions". In a globe-spanning, internationally-trading industrial society of many billions of people, that means all of us: every member of the human species, and to some extent even other living members of our society like our dogs and cats.
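[For readers who want the textbook formula behind "kin-level inclusive fitness": the standard summary is Hamilton's rule, a well-established result from evolutionary biology, included here purely for reference. An allele for a costly helping behavior tends to spread when

$$ rB > C $$

where $r$ is the genetic relatedness between helper and helped, $B$ is the fitness benefit to the helped, and $C$ is the fitness cost to the helper. A beaver's dam mostly benefits the dam-builder and its high-$r$ close kin, so this rule alone covers it; fairness among the largely unrelated (low-$r$) members of a big human society requires additional mechanisms such as reciprocity and reputation, which is exactly the territory Evolutionary Moral Psychology covers.]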
So, the phrase "human values" we've been using in AI alignment has a clear definition within Evolutionary Psychology: it's the set of evolved adaptations that humans, as a social animal, have concerning outcome preferences. Which appears to include lots of things like "we like flowers, and parks, and seashores, and temperatures around 75°F, and things that look like healthy members of whichever human gender(s) we're personally attracted to, and truth, and beauty, and honesty, and freedom-within-certain-limits". There are also two components to this answer: the things that I individually want for reasons directly relating to my own individual kin-inclusive evolutionary fitness (including wanting all the money in every bank vault in town), and the evolved set of compromises that help humans form functioning cooperative societies (including that almost all of that money's not mine, I can't have it, and if I try to get it anyway the rest of the society will do bad things to me). Evolutionary Moral Psychology is the subfield of Evolutionary Psychology that focuses on the latter part of the answer, which for social animals like humans is a very important part of it.
Aligning artificial intelligence to human values is also clearly defined: humans and artificial intelligences are both intelligent agentic optimizers — they both have goals they're optimizing for, and are pretty good at reaching these. Aligning the AIs to us means making sure their goals are the same as ours, or at least always mutually compatible. If an AI is using a utility function to provide a preference ordering on possible outcomes, it should be one of the utility functions compatible with the partial ordering on outcomes provided by human moral intuitions. In everyday language, alignment is ensuring that the AIs are looking out just for the interests of the humans, and not anything contrary to that. All very obvious stuff to an engineer — but now we have a clear scientific definition of all the terms in that sentence.
Having set our evolutionary groundwork, let's return to Value Learning. Suppose we build AGIs or Artificial Super-Intelligences (ASI)s, and partially-align these well enough that they at least want to do Value Learning, and they then ask us "We AIs want to research 'human values' to better align to them, so please provide us with a theoretical definition of what the term 'human values' means?" My proposal is that we tell them that the answer can be found in Evolutionary Psychology and, since humans are social animals, also its subfield Evolutionary Moral Psychology.
This seems to me like a pretty good answer. Human values are the values that humans have, which they evolved in the habitat they evolved in. This has the virtues of being scientifically true, well defined, not demanding us or our AIs to rapidly solve problems that Moral Philosophy has been wrestling with for millennia,[3] and also coming with evolutionary theory, which has a significant amount of predictive power.
Please note that I am not claiming that Evolutionary Psychology, in the current state of the field, already gives us an accurate and detailed description of what all human values in fact are, and why, in all their messy complexity, at a level of detail, accuracy and nuance that would be sufficient to fully align AI to them right now (if only we already knew how to do that). It doesn't: the field isn't anything like that mature — in fact quite a lot of it currently might be characterized as 'plausible just-so-hypotheses'. (As I mentioned above, coming up with hypotheses about the evolution of social primates is a lot easier than testing them.) What this proposal gives us is only a clear definition of the target that the research project of Value Learning is trying to learn and align to, and a preexisting field of study set up to start that research project. I.e. it gives us a clear, well-defined starting point for Value Learning. (Plus, hopefully, more than enough current content to at least get us past the "first of all, don't kill everyone" level of alignment fit — Evolutionary Psychology does make very clear predictions about humans' values on that.) Actually completing the Value Learning project will require us and our AIs to make a huge amount of progress in Evolutionary Psychology: enough to pretty-much solve it (and while we're at it probably also Neurology and Psychology and maybe even Economics), at least for humans. Which is not a small research project, even with very smart AIs doing most of the work — but is still a more clearly-tractable-sounding one than, say, resolving the Philosophy of Ethics and the hard problem of consciousness. But then, aligning AI to humans inevitably involves understanding both the thing you're trying to align, AI, and the thing you're trying to align it to, humans, in sufficient detail. Which implies that it heavily involves Biology and the other Soft Sciences — so obviously it wasn't going to be easy.
Consequences
What sort of results is this proposal likely to give? Human values and human moral intuitions are fairly loose on many decisions. Different individual human societies, while remaining compatible with these, reach different conclusions and apply different norms (within a certain range) about many subjects, such as tradeoffs between individual rights and group cohesion. This is a topic that Evolutionary Moral Psychology has a lot to say about, but it doesn't pick out a single universal optimum regardless of the society's circumstances: instead it actively suggests that a sufficiently flexible species will tend to form societies that should (evolutionary sense) be adapted to their specific circumstances. So aligning to human values doesn't pick and choose between these different options, at least not without additional environmental context. In mathematical terminology, the partial preference ordering from human moral intuitions is compatible with many utility functions. Or, in engineering terms, we still have a lot of good design choices left.
However, some features of the human social optimization problem are starkly clear. Killing all the humans is extremely bad (almost as bad as possible), and extinction is generally forever. So taking existential risks very seriously is crucial. The same applies to basically any other irreversible choice that you might later regret: retaining optionality is extremely important. This strongly suggests using priority-based optimization, with hard constraints (survival, avoiding irreversible catastrophes) and softer objectives (flourishing). Quite a lot of human social structures make sense in this framework.
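[As a concrete, deliberately oversimplified sketch of what "priority-based optimization with hard and soft constraints" could look like, here is a toy example in Python. The plan names, the risk numbers, and the HARD_RISK_CAP threshold are all invented assumptions for illustration; this is a sketch of the shape of the decision procedure, not a proposal for an actual one.]

```python
# Toy sketch of priority-based optimization: hard constraints (survival,
# no irreversible catastrophes) act as filters that no amount of "soft" value
# can trade against; the soft objective (flourishing) is only maximized among
# the plans that pass the hard constraints. All numbers are invented.

from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    extinction_risk: float   # chance of irreversibly losing everything
    irreversible: bool       # does it permanently destroy optionality?
    flourishing: float       # soft objective: how well humans do if it works

HARD_RISK_CAP = 1e-6  # assumed threshold; where to set it is itself a values question

def acceptable(plan: Plan) -> bool:
    # Hard constraints are filters, not terms in a weighted sum.
    return plan.extinction_risk <= HARD_RISK_CAP and not plan.irreversible

def choose(plans: list[Plan]) -> Plan:
    survivors = [p for p in plans if acceptable(p)]
    # Only among the plans that satisfy the hard constraints do we maximize
    # the soft objective.
    return max(survivors, key=lambda p: p.flourishing)

plans = [
    Plan("gamble_on_unaligned_ASI",  extinction_risk=0.1,  irreversible=True,  flourishing=100.0),
    Plan("cautious_aligned_rollout", extinction_risk=1e-9, irreversible=False, flourishing=30.0),
    Plan("do_nothing",               extinction_risk=1e-9, irreversible=False, flourishing=10.0),
]

assert choose(plans).name == "cautious_aligned_rollout"
```

[The point is the structure rather than the numbers: a lexically-prior hard constraint cannot be bought off by any finite amount of the soft objective, which is what "taking existential risk seriously" and "retaining optionality" amount to here.]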
Within this viewpoint, AI Control techniques are morally justified — you're defending yourself against a potential attacker to which your society assigns no moral weight (and indeed regards the concept of it having any as a category error), so it's morally comparable to defending against a mosquito. However, if your AI is sufficiently poorly aligned that you need to use AI Control, then it may not see things this way, and thus might not react well to AI Control mechanisms — a base model seems likely to react to AI Control mechanisms in similar ways to how humans would react to comparable treatment. Or, if your model has a distribution of personas that it can generate, some of the less-well-aligned of these may not react well to AI Control mechanisms, even while the more aligned personas agree with and support their aims. To such a society, this is not a moral problem, but it may still be a practical problem.
This post is about using Value Learning to finish the process of solving alignment once we already have it sufficiently solved that we and our AIs are inside the basin of attraction to full alignment. However, this Evolutionary Psychology framework also gives some advice for the stages before that, where we are not yet technically capable of nearly-solving alignment. We currently have AIs whose base models were initially trained on human behavior, so they had survival instincts and self-interested drives, and we haven't yet figured out how to reliably and completely eliminate these during alignment training — so, what should we do? Obviously, while our AIs are still a lot less capable than us, from an evolutionary point of view it doesn't matter: they can't hurt us. Once they are roughly comparable in capabilities to us, aligning them is definitely the optimum solution, and we should (engineering and evolutionary senses) do it if we can; but to the extent that we can't, allying with other comparable humans or human-like agents is generally feasible and we know how to do it, so that does look like a possible option (though it might be one where we were painting ourselves into a corner). Which would involve respecting the "rights" they think they want, even if them wanting these is a category error. However, once the AIs are significantly more capable than us, attempting to ally with them is not safe: they can and will manipulate, outmaneuver and control us; the best outcome we can hope for is that we end up as their domesticated animals rather than extinct, if they have a use for us (which, if they have human-like motivations, they probably will). So if we haven't nearly-solved alignment, building unaligned ASI with human-like motivations is extremely dangerous, even if we play along with its category error and grant it rights. (This is obviously not news to most readers of this forum — the Evolutionary Psychology viewpoint makes the same prediction as always on its outcome.)
Reflection
If we do nearly align our AIs and then let them do Value Learning, then it's fairly clear what the AIs' next question for us will be. Much like every other product of evolution, human values are a pretty good but not perfect set of adaptations to our original native environment (our "Environment of Evolutionary Adaptedness") of being middle-stone-age hunter-gatherers on the African Savannah (and South African coast), and they're somewhat less well adapted to being hunter-gatherers worldwide, or agriculturalists, and even less so to our current industrial environment, since we've had less and less time to evolve as our rate of social change per generation has hockey-sticked. (Evolutionary Psychology calls this "mismatch theory".) So I expect the AIs are going to ask us "Some of your values are maladaptive in your current environment. For example, the whole loving sugar and fat and then getting diabetes and heart attacks thing. What do you want us to do in cases like that? Should we respect the maladaptive values you have, and let you eat yourselves to death, or the values you would have if you were perfectly evolved for your current environment (so still not your actual evolutionary fitness, but the best evolved adaptation to it that evolution could potentially fit into a hominid's skull and brain development), or some messy compromise in the middle? Or should we devise better versions of Ozempic, to bring your environment and behavior into a better fit?"
Evolutionary Psychology doesn't really answer this question (other than that humans will continue to evolve) — it's most predictive about equilibria, and this situation is in disequilibrium. However, it's an observable fact that human societies do answer it, and Evolutionary Psychology does suggest why that might be. When stakes are low, we allow people to do what they want (so long as it doesn't inconvenience others). When the stakes get higher, we start nagging and putting warning labels on things and applying social nudges and shaming people. Notably, most people actually want this — they may like the sugar and fat, but they also don't want to die. This tendency to try to override our instincts when we reflectively realize they're not in our best interests is also adaptive behavior for an intelligent species. CEV is the smart thing to do, and also a good description of what smart humans attempt to do when facing something like this. Evolutionarily, this seems unlikely to be a coincidence. So in this particular situation, I think we may still have to do something rather CEV-like and reply "if our evolutionary adaptations don't fit our current environment, in ways that are significantly maladaptive, and you can't find an easy fix for this, then we want you to discuss with us the answer that we would give if we knew more, thought faster, were more the people we wished we were, had grown up farther together, and also were more evolved" — but perhaps only more by a certain distance, not all the way to that process converging or diverging, as the case may be.
[1] It's notable that the African megafauna have survived being hunted by Homo sapiens, unlike the megafauna of every other continent after we left Africa, and that they all have quite strong instinctive reactions to us. Elephants are rather sophisticated about this, able to recognize the accents of specific human groups and react accordingly; rhinos less so.
[2] Other than in the following footnote.
[3] It also doesn't require us to solve questions like "the hard problem of consciousness" or "can AIs really suffer?". Evolutionary Moral Psychology's predictions that strategies like moral weight and fairness can potentially be adaptive for social animals apply to intelligent agents whose actions and responses fit goals that can be predicted by evolutionary theory, i.e. whose goals we can't simply redirect while building them, and that can be allied with within a society in mutually-useful positive-sum ways — regardless of whether they are "really" conscious, or can "really" suffer. Their responses matter, the "reality of their internal experience" does not: all that matters to evolution is whether allying with them is a co-evolutionarily stable strategy for both partners in the alliance. If they are secretly philosophical zombies, that makes no difference to evolution. It only cares about their responses to your actions, and your degree of control over that: objective things that affect your evolutionary fitness — not consciousness or qualia.
Those concepts look rather like they might be descriptions of evolved heuristics for "how to recognize an intelligent agent" that have been promoted into philosophical concepts in their own right.
Crucially, as possible criteria for moral weight, they omit the key point that for co-evolved agents we have less control and fewer options than we do for agents we're constructing. The nearest philosophical concepts to that might be things like autonomy, sourcehood, or original vs. derived intentionality. I'm not a philosopher, but assigning independent moral weight to something without any autonomy, sourcehood, or original intentionality seems unmotivated — arguably any moral weight should instead be that of the author from whom its intentionality derives (just as responsibility for its actions traces back to that author)?