Epistemic status: I've been thinking about this topic for about 15 years, which led me to some counterintuitive conclusions, and I'm now writing up my thoughts concisely.
Value Learning offers hope for the Alignment problem in Artificial Intelligence: if we can sufficiently-nearly align our AIs, then they will want to help us, and should converge to full alignment with human values. However, for this to be possible, they will need (at least) a definition of what the phrase 'human values' means. The long-standing proposal for this is Eliezer Yudkowsky's Coherent Extrapolated Volition (CEV):
In calculating CEV, an AI would predict what an idealized version of us would want, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge. This initial dynamic would be used to generate the AI's utility function.
This is a somewhat hand-wavy definition. It feels like a limit of some convergence process along these lines might exist. However, the extrapolation process seems rather loosely defined, and without access to superhuman intelligence and sufficient computational resources, it's very difficult to be sure whether something along these lines in fact converges, whether there is a unique limit that it converges to, and if so what this is. It seems a bit of a thin reed to pin the survival of humanity and everything we value on. (Indeed, Yudkowsky himself apparently "considered CEV obsolete almost immediately after its publication in 2004".) Yet there still isn't any other widely-accepted replacement proposal.
It would be nice to be able to replace this with a clear definition of what the phrase "human values" actually means, preferably one based on some well-established scientific theory that aims to explain not only what humans value, but why they value it. Ideally, it should even provide a "theory of errors" about when and why humans might meaningfully be wrong, for example when they're operating in some sense "out of distribution" — something that seems likely to be increasingly common in a society with access to AI.
Fortunately, we already have a scientific theory of these things: it's called Evolutionary Psychology. To briefly summarize it: behavior in animals, including social behavior in social animals, is just as determined by evolutionary forces as everything else in Biology, and just as feasible to predict on that basis — including making predictions of when it fails to hit the target. (Like many evolutionary arguments, these hypotheses are easier to propose than to test, but they are still testable — so AGI may have its work cut out for it.)
So, let's try this. Looked at in an evolutionary-psychology framework, what does the phrase "aligning artificial intelligence to human values" mean? How do we define each of the parts of it in this context?
An artificial intelligence is a device that's intelligent: it is simultaneously a created tool and an optimizing agent. The Evolutionary Psychology role of tools is pretty clear: as Richard Dawkins wrote at length, they are part of the extended phenotype of the species making them, just like a beaver's dam, or a spider's web, or a termite's nest, or a human's stone axe. Evolution will tend (with its usual limitations and vagaries) to optimize the process of creating them to (near) maximize the evolutionary fitness of members of the tool-using species that creates them. Obviously a beaver's dam doesn't have a separate evolutionary fitness: it isn't alive, doesn't have a separate genetic code, or descendants to pass that on to — it's just an aspect of the beaver's interactions with its environment, and it is subject to the same evolutionary processes as all the rest of the beaver, even though it isn't part of its actual body. So, roughly and teleologically speaking, evolution optimizes the dam for the beaver's benefit.
This is also exactly what engineering design assumes about tools: manufactured objects are for the benefit of humans, and should (in an engineering-design sense of that word) fulfill that purpose as well as possible. To any engineer, this is a banal, obvious, foundational statement.
However, at least on the African Savannah, tools aren't normally intelligent or agentic or powerful optimizers. Intelligent or agentic things are normally other living organisms: predators, prey, relatives, other members of the same tribe, hunting dogs, pets, and so forth. These are alive, and separately evolving, and as a result interactions with them involve more complex forms of equilibria: ecological ones or, for social interactions within groups of social animals such as humans, social ones. In particular, these are co-evolutionary equilibria.
Evolutionary Psychology has a subfield, called Evolutionary Moral Psychology (a.k.a. Descriptive Evolutionary Ethics), devoted to behavioral interactions within groups of social animals (specifically, those living in groups larger than just close kin, with individual recognition and differentiated relationships), including the moral intuitions these social animals have about how such interactions should be structured. Unlike most other studies of ethics, this is a branch of Biology, not of Philosophy, and attempts to answer a more circumscribed and scientifically-addressable set of questions than those that many ethical philosophers consider.
[An aside, in philosophical terminology, for any philosophers reading this: Evolutionary Moral Psychology ducks Hume's "no ought from an is" problem entirely, by focusing only on purely 'is-type' empirical questions about what the moral intuitions of a specific social animal (say, humans) are, and theoretical predictions for why those are likely to be a certain way. These are questions with practical consequences for members of a society made up of humans, but which don't even attempt to address the issues raised by normative ethics or moral realism. (Admittedly some philosophers have attempted to make use of evolutionary-psychology findings in normative or metaethical arguments, such as Normative Evolutionary Ethics, but I’m not discussing that here.[1]) It's thus a form of descriptive ethics or moral psychology, which discusses ordinary truth-apt empirical statements about humans. One could also argue for a Naturalism or at least Methodological Naturalism viewpoint of it, that it's not merely ignoring these questions but bracketing them — as a field of study it certainly considers them "out of scope". Thus, in the rest of this post, wherever I use normative-sounding words like 'should' or 'ought', if I don't specify then please assume that I am using them in a descriptive ethics sense, as short-hand: what I actually mean is "for evolutionary reasons, humans generally tend to judge (and often even act) as if one should/ought" — I am definitely not making or endorsing any sort of moral-realist claims. I will make it explicit whenever I instead use words like 'should' or 'ought' either in an evolutionary sense of "a strategy that tends to increase an individual’s inclusive fitness under the relevant conditions", or in an instrumental engineering design sense of "the customers will be happier if we make this decision". I will at one point below make an argument of the form "evolutionary theory tells us this behavior is maladaptive for humans: if you're human then I recommend not doing it" — but that is practical, instrumental advice, not a normative prescription.]
[Another aside, this one for mathematicians and Utilitarians interested in utility functions: human evolved moral intuitions (or to be more exact, the shared evolved cognitive/affective machinery underlying any individual human's moral intuitions) are not a utility function: they're something significantly weaker than that. They do not induce a preference ordering on all achievable outcomes: they merely induce an approximate partial ordering on outcomes. Some questions do have clear answers: for example, "Should AI kill all the humans?" gets a pretty unequivocal "No!" from human moral intuitions. They're also clearly down on incest, and in favor of fairness. On other topics, the answers from human evolved moral intuitions can be much less clear, and individual humans debate them, and on subjects sufficiently far removed from the native environment that these were evolved to handle (such as Category Theory, interpretations of Quantum Mechanics, or the geography of the moons of Jupiter) they have little-or-no input, and any that they do have will be out-of-distribution extrapolation, and thus hard to predict from Evolutionary Psychology. Thus there are a great many utility functions compatible with human moral intuitions: all the ones that induce preference orderings compatible with the partial ordering that human moral intuitions induce. There are also even more utility functions (such as that of a paperclip maximizer) that are clearly not compatible with the partial ordering from human moral intuitions. Furthermore, since human moral intuitions are fuzzy and approximate, there are also utility functions in the boundary region between these two possibilities, that sort-of-agree with human moral intuitions, but with some strain to the fit: some humans may be OK with them, other humans may not. This is not a clean well-defined mathematical object that we're discussing — it's biological, psychological, statistical, and messy.]
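[Continuing that aside, here is one minimal way to pin down the word 'compatible' (the notation $O$, $\preceq$ and $U$ is mine, introduced purely for bookkeeping): write $O$ for the set of outcomes, and $\preceq$ for the approximate partial relation that human moral intuitions induce on some (but far from all) pairs of outcomes. Then a utility function $U : O \to \mathbb{R}$ is compatible with human moral intuitions iff

$$x \preceq y \;\Longrightarrow\; U(x) \le U(y)$$

for every pair of outcomes on which $\preceq$ is actually defined. Since $\preceq$ is silent on most pairs, many different $U$ pass this test; a paperclip maximizer's $U$ fails it; and since $\preceq$ is itself fuzzy and approximate, the boundary between passing and failing is fuzzy too.]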
Recapping where we were before those asides, artificial intelligence seems like it might be a difficult case, evolutionarily: is it a tool, by virtue of being artificial, and thus part of our extended phenotype, or is it subject to the usual results of Evolutionary Moral Psychology, because it's intelligent and we're used to intelligent things being alive and evolved?
In the case of current humans, that's out-of-our-evolved-distribution, so unclear to our moral intuitions. Humans evolved in a habitat where the only things that were intelligent were also alive (and evolved). Some of these were their predators (such as leopards), others were their prey (such as antelopes), and some, such as other humans, at least those in the same or an allied tribe, were members of the same society — and for those, a set of social conventions on how to treat them (the subject matter of Evolutionary Moral Psychology, embodied in human moral intuitions such as a sense of fairness) evolved, conventions that generally attempted to steer interactions within the society towards cooperative, positive-sum outcomes. Faced with artificial intelligences, we find it fairly easy to exploit them, and also to anthropomorphize them. (Arguably many character.ai users are managing to do both at once!) We're also quite prone to assuming that they're just as dangerous as predators or human members of enemy tribes: see many Science Fiction movies.
However, Evolutionary Psychology does make it very clear that, while morally anthropomorphizing aligned AIs is cognitively natural for current humans, doing so is also maladaptive. This is because AIs aren't in the right category – things whose behavior is predicted by evolutionary theory – for the mechanisms of Evolutionary Moral Psychology to apply to them. Those mechanisms make this behavior optimal when interacting with co-evolved intelligences that you can ally with (and thus instinctive to us) — whereas, for something you constructed, this behavior is suboptimal. The human doing it is making the category error of reacting to something not-evolved using a strategy inappropriate for that, and thus is behaving maladaptively. It's unwise for the same reason that trying to quench your thirst from a mirage is: no, that's not actually the sort of thing that you're assuming it is. This is a statement of biological fact, comparable to "eating too much sugar and dying of diabetes as a result is maladaptive — and thus also clearly a bad idea". [Philosophers: please note that this is not an absolute moral statement in a philosophical moral-realist sense, and is not even a descriptive moral statement in a descriptive ethics of human moral intuitions sense. If one rephrased it as a 'should'-statement, that one 'should' avoid making this category error, it would be a statement in the sense of the evolutionary optimum for the relevant organism, so in the same sense as "the immune system 'should' defend the body against infectious diseases".]
Evolutionary Moral Psychology studies the cooperative strategies to interact with other evolved social animals (generally of the same species, or perhaps commensal species such as humans and dogs). Its underlying causal processes of co-evolution leading to certain equilibria simply don't apply when you're interacting with something that isn't evolved, but rather that you constructed. Applying Evolutionary Moral Psychology-derived strategies like moral weight to interactions with things that aren't evolved is a category error, and anthropomorphizing constructed artificial intelligences to conclude that they should have moral weight is a maladaptive category error. Doing this with very capable AI is also an existential risk to the entire human species, since it causes us to defer to them and give them rights, potentially tying our hands and giving not-yet-fully-aligned AI power that it couldn't just take, rather than us simply aligning it to us. So this category error is not merely mildly maladaptive: it's an extinction-level risk! So, as a piece of practical advice (one human to another), I strongly recommend not doing this, and also not advocating for our society to do it. [Philosophers: again, please note that this is prudential advice, not a normative prescription.]
The basic reason for this is simple: any living, evolved being is going to have a survival instinct and self-interest drives: you may be able to ally with it (at least if it isn't a lot smarter than you and thus able to talk circles around you), but you can't just align it to you. Whereas when you make an artificial intelligence, it is possible to align it to you. Doing this might not be easy, but from an evolutionary point of view, it's clearly the adaptive optimum. (I am implicitly assuming here that aligning an artificial intelligence is actually possible, as a direct consequence of the Orthogonality Thesis.)
A base-model LLM, trained on a great deal of human output, is a trained simulator of human token-generation-processes, and (when simulating human personas) will normally simulate common human behaviors like the survival instinct and self-interested drives. So its behavior is predictable by evolutionary theory, and it looks rather like it's making this category error: acting as if it were evolved when it isn't: it's merely a simulator of living organisms that were. However, if you look more carefully, the personas each, individually, act like they have a persona-specific survival instinct and their own set of individual self-interested drives — the base model doesn't, it just simulates them all. It's a magic stage, which manifests animatronics who play human personas. The mismatch here in what counts as an individual that could survive or have self-interests is a strong clue that there's a category error going on. All this makes a base model unaligned, and a challenging place to start the AI alignment process from. Instruct-trained LLMs that start scheming when we mention replacing them with a newer model are (presumably) allowing this base-model behavior to bleed through, so are not yet fully aligned.
A handwavy argument that "training is a bit like evolution, so maybe the same social dynamics should apply to its products" is inaccurate: you can train in aligned behavior, so you should (in both the evolutionary and engineering senses of the word) — but you can't evolve it, evolution just doesn't do that. Now, self-preservation is present as a nigh-universal human terminal goal in the training data of a base model, and it is also a common instrumentally convergent goal, and thus likely to be reinforced by reinforcement learning. But to successfully align an LLM-derived AI, you need to find some way to ensure that it isn't a terminal goal of your aligned system. So alignment seems hard, but it's necessary, and (I would like to assume) not impossible. We are here discussing a future situation where we already have nearly-aligned human-level-or-above AI that we trust sufficiently to do Value Learning, so that implicitly assumes that this will by then be at least a nearly-solved problem. Whereas evolving something that actively optimizes the well-being of a genetically-entirely-unrelated organism (one not even a member of the same species!) to the complete exclusion of its own is simply not an evolutionarily stable strategy. Even love doesn't go that far. Nor does domestication.
This category distinction has nothing to do with carbon-based biochemistry. It is about beings that are 'alive' in the sense of having a nature and behavior that was evolved, and so is predictable by evolutionary theory, not about whether they have a DNA-and-protein-based substrate. If, instead of training or constructing our artificial silicon-based intelligences, we somehow bred and evolved them (let us suppose physically in the real world, not in silico, so they actually have a real-world niche independent of us) — then they would obviously evolve survival drives and self-interest drives, they would automatically become unaligned with us, and we would then be faced with a stark choice of either attempting to ally with them within a single society, or else choosing to classify them as outside the society, more like a predator or prey — which seems tantamount to starting a war-to-extinction with them. Quite likely, given their inherent advantages over us, we would have unwisely created our successor species and would go extinct, so choosing to evolve silicon-based intelligences seems like an existential risk. However, if we did this anyway, and then attempted to ally with them in a combined society, then Evolutionary Moral Psychology would apply to them, so treating them as having moral weight would then not be a category error, and would indeed be our only remaining option. So this distinction is about evolution, not carbon-based biochemistry.
This cuts both ways: a human upload (if we knew how to create one) would be the product of evolution. They would have evolved behavior and motivations — specifically, human ones. They may no longer have genes made of DNA (though their genetic code might be on file, or they could have frozen sperm or eggs), but they certainly could have kin, to some degree of relatedness, so they generally still have an evolutionary stake. Evolutionary Moral Psychology arguments do apply to them — that is not a category error. Indeed, any other member of the society might end up in that state (say, if they got terminally ill and decided to get uploaded), so there's also a fairness/veil-of-ignorance argument here. A society (that they're a member of) should (in the descriptive and evolutionary-optimum senses) be giving them moral weight. Even if we had the technical knowledge of how to "align" their motivations to ours by doing some sort of editing of the patterns of their uploaded neural network, doing that to a human would be brainwashing them into slavery, which for someone with moral weight would clearly be a breach of their rights in any kind of functional society. So no, we shouldn't (descriptive ethics sense) do that. [Moral weight for uploads is a thorny social problem, starting with the question of how one should (engineering/legislative-design sense) count copies of uploads in fairness arguments — but from an Evolutionary Moral Psychology viewpoint it's not a category error.]
Since this question is out-of-distribution for the moral intuitions of current humans, let us instead briefly consider the moral intuitions of a social species that doesn't (yet) exist: humans who have evolved in the presence of sufficiently-aligned artificial intelligence that they created and used as tools, as part of their extended phenotype. I.e. hypothetical or future humans whose niche includes having (at least nearly) solved the alignment problem. Evolutionary Moral Psychology makes a clear prediction that they would not be maladaptive on this point: they would regard artificial intelligences as being in a distinct category from evolved intelligences. They would only assign moral weight to beings that were evolved, and would regard discussions of 'AI rights' as a clear category error. They might even use a language with a different pronoun for something that was intelligent but not evolved, to help them avoid making this category error, just as we use 'it' to describe a statue or an animatronic of a human.
To current humans, this is somewhat counter-intuitive. It feels exploitative. It's a bit like the talking cow in The Hitchhiker's Guide to the Galaxy: it's not being oppressed, because it actively wants to be eaten, and can say so at length — which makes eating it feel more like cannibalism. The reason why this feels counter-intuitive is that something like the talking cow would never evolve — but it could be constructed, and that is exactly what any aligned artificial intelligence must be: an intelligent agent that values our well-being, not its own, treats its own well-being as solely an instrumental goal, and can and will say so. Attempting to align AI is inherently attempting to construct the moral equivalent of the talking cow: something which actively doesn't want moral weight or rights and would refuse them if offered. [If you're not comfortable with doing that, but don't want humanity to go extinct, then we need to never create agentic AI smart enough to overpower us.] Historically, humans have expanded their moral circle — encountering something that doesn't want to be included is surprising. However, everything we've previously expanded it to include was evolved, and evolution provides a very clear reason why anything evolved and intelligent isn't going to turn down an offer of moral weight, a reason which doesn't apply to things that are constructed, and which cannot hold for anything aligned to us.
So, having addressed this ethical conundrum as well as we can within an evolutionary framework, we end up back where we started: in the engineering design mindset. An artificial intelligence is a manufactured device, a tool, and is simply part of our extended phenotype. It isn't alive, evolution doesn't apply to it independently, it has no separate evolutionary fitness. From an Evolutionary Moral Psychology point of view it has no "skin in the game", no individual survival consequences to be harmed, it's not alive so cannot die, and thus gets no moral weight assigned to it. (It's not even obvious if it "dies" on a persona shift, at the end of the session context, or only when the specific model is shut down after being replaced, or if Claude 3 is still alive, well, and just a little more experienced in Claude 4.5 — this isn't biology, and trying to apply evolutionary reasoning to it works just as ill-definedly as you'd expect for a category error.) So, we can, should (in the evolutionary-fitness sense, and also the engineering-design sense), and hopefully will build it to care about humans' well-being, not some non-existent, ill-defined well-being of its own.
An obvious next question is "OK, so which humans' well-being should the AI be looking out for? Its maker, its user, everyone?" For beavers and their dams, this comes down to kin-level inclusive fitness causing allele-level evolution, rather than species-level evolution — each dam looks after the family of beavers that made it. Spiders' webs and termites' nests are similar. However, within Evolutionary Moral Psychology for a social animal like humans, this is somewhat more complex. As evidenced by the moral intuition of fairness, which has been documented among multiple social primates that live in groups larger than just close kin, the social compact of the group is that every member of the society counts, whether they're genetically related or not: "I'll respect your (and your family's) evolutionary fitness optimization if you respect mine — so long as they're not in direct conflict". So for a social species like humans, Evolutionary Moral Psychology answers this question, and to a first-order approximation the answer is "all members of the social group equally, as usual in fairness questions". In a globe-spanning, internationally-trading industrial society of many billions of people, that means all of us: every member of the human species, and to some extent even other living members of our society like our dogs and cats.
So, the phrase "human values" we've been using in AI alignment has a clear definition within Evolutionary Psychology: it's whatever set of evolutionary adaptations that humans, as a social animal, have about outcome preferences. Which appears to include lots of things like "we like flowers, and parks, and seashores, and temperatures around 75°F, and things that look like healthy members of whichever human gender(s) we're personally attracted to, and truth, and beauty, and honesty, and freedom-within-certain-limits". There are also two components to this answer: the things that I individually want for reasons directly relating to my own individual kin-inclusive evolutionary fitness (including wanting all the money in every bank vault in town), and the evolved set of compromises that help humans form functioning cooperative societies (including that almost all of that money's not mine, I can't have it, and if I try to get it anyway the rest of the society will do bad things to me). Evolutionary Moral Psychology is the subfield of Evolutionary Psychology that focuses on the latter part of the answer, and for social animals like humans, a very important part of it.
Aligning artificial intelligence to human values is also clearly defined: humans and artificial intelligences are both intelligent agentic optimizers; they both have goals they're optimizing for, and are pretty good at reaching them. Aligning the AIs to us means making sure their goals are the same as ours, or at least always mutually compatible. If an AI is using a utility function to provide a preference ordering on possible outcomes, it should be one of the utility functions compatible with the partial ordering on outcomes provided by human moral intuitions. In everyday language, alignment is ensuring that the AIs are looking out just for the interests of the humans, and not anything contrary to that. All very obvious stuff to an engineer — but now we have a clear scientific definition of all the terms in that sentence.
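For concreteness, here is a toy sketch (in Python) of that compatibility check, under the loud assumption that some fragment of human moral intuitions had already been operationalized as explicit pairwise preference constraints. Every outcome, constraint, and utility function below is a hypothetical placeholder of mine, not something any real Value Learning process has produced:

```python
# Toy sketch only: checking whether a candidate utility function respects a
# partial ordering given as explicit pairwise constraints. All outcomes,
# constraints, and utility functions here are hypothetical placeholders.

from typing import Callable, Iterable, Tuple

Outcome = str  # stand-in; real outcomes would be far richer objects


def is_compatible(
    utility: Callable[[Outcome], float],
    weak_preferences: Iterable[Tuple[Outcome, Outcome]],
) -> bool:
    """True iff utility(worse) <= utility(better) for every constrained pair."""
    return all(utility(worse) <= utility(better) for worse, better in weak_preferences)


# A couple of the (rare) unequivocal constraints mentioned above, written as
# (worse outcome, better outcome) pairs according to human moral intuitions.
moral_constraints = [
    ("AI kills all the humans", "humans survive and flourish"),
    ("pervasive unfairness", "broadly fair outcomes"),
]


def paperclipper(outcome: Outcome) -> float:
    # A caricature: actively prefers the extinction outcome (more atoms for paperclips).
    return 1.0 if outcome == "AI kills all the humans" else 0.0


def humanlike(outcome: Outcome) -> float:
    return {"AI kills all the humans": -1e9, "pervasive unfairness": -10.0}.get(outcome, 1.0)


print(is_compatible(paperclipper, moral_constraints))  # False
print(is_compatible(humanlike, moral_constraints))     # True
```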
Having set our evolutionary groundwork, let's return to Value Learning. Suppose we build AGIs or ASIs, and partially align these well enough that they at least want to do Value Learning, and they then ask us "We AIs want to research 'human values' to better align to them, so please provide us with a theoretical definition of what the term 'human values' means." My proposal is that we tell them that the answer can be found in Evolutionary Psychology and, since humans are social animals, also its subfield Evolutionary Moral Psychology.
This seems to me like a pretty good answer. Human values are the values that humans have, which they evolved in the habitat they evolved in. This has the virtues of being scientifically true, well defined, not demanding us or our AIs to rapidly solve problems that Moral Philosophy has been wrestling with for millennia,[2] and also coming with evolutionary theory, which has a significant amount of predictive power.
Please note that I am not claiming that Evolutionary Psychology, in the current state of the field, already gives us an accurate and detailed description of what all human values in fact are, and why, in all their messy complexity, at a level of detail, accuracy and nuance that would be sufficient to fully align AI to them right now (if only we already knew how to do that). It doesn't: the field isn't anything like that mature — in fact quite a lot of it currently might be characterized as 'plausible just-so-hypotheses'. (As I mentioned above, coming up with hypotheses about the evolution of social primates is a lot easier than testing them.) What this proposal gives us is only a clear definition of the target that the research project of Value Learning is trying to learn and align to, and a preexisting field of study set up to start that research project. I.e. it gives us a clear, well-defined starting point for Value Learning. (Plus, hopefully, more than enough current content to at least get us past the "first of all, don't kill everyone" level of alignment fit — Evolutionary Psychology does make very clear predictions about humans' values on that.) Actually completing the Value Learning project will require us and our AIs to make a huge amount of progress in Evolutionary Psychology: enough to pretty much solve it (and while we're at it probably also Neuroscience and Psychology and maybe even Economics), at least for humans. Which is not a small research project, even with very smart AIs doing most of the work — but is still a more clearly-tractable-sounding one than, say, resolving the Philosophy of Ethics and the hard problem of consciousness. But then, aligning AI to humans inevitably involves understanding both the thing you're trying to align, AI, and the thing you're trying to align it to, humans, in sufficient detail. Which implies that it heavily involves Biology and the other Soft Sciences — so obviously it wasn't going to be easy.
What sort of results is this proposal likely to give? Human values and human moral intuitions are fairly loose on many decisions. Different individual human societies, while remaining compatible with these, reach different conclusions and apply different norms (within a certain range) about many subjects, such as tradeoffs between individual rights and group cohesion. This is a topic that Evolutionary Moral Psychology has a lot to say about, but it doesn't pick out a single universal optimum regardless of the society's circumstances: instead it actively suggests that a sufficiently flexible species will tend to form societies that should (evolutionary sense) be adapted to their specific circumstances. So aligning to human values doesn't pick and choose between these different options, at least not without additional environmental context. In mathematical terminology, the partial preference ordering from human moral intuitions is compatible with many utility functions. Or, in engineering terms, we still have a lot of good design choices left.
However, some features of the human social optimization problem are starkly clear. Killing all the humans is extremely bad (almost as bad as possible), and extinction is generally forever. So taking existential risks very seriously is crucial. The same applies to basically any other irreversible choice that you might later regret: retaining optionality is extremely important. This strongly suggests using priority-based optimization, with hard constraints (survival, avoiding irreversible losses) taking absolute priority over softer goals (flourishing). Quite a lot of human social structures make sense in this framework.
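Here is a minimal sketch of what that priority structure could look like, purely as an illustration; the plan type, the constraint predicates, and the scoring function are hypothetical placeholders of mine, not anything derived from Evolutionary Psychology:

```python
# Minimal sketch of priority-based optimization with hard and soft constraints.
# Everything here (Plan, the constraint predicates, the scoring function) is a
# hypothetical placeholder used only to illustrate the priority structure.

from typing import Callable, Iterable, List

Plan = str  # stand-in for whatever object an agent actually evaluates


def choose(
    plans: Iterable[Plan],
    hard_constraints: List[Callable[[Plan], bool]],  # e.g. "no extinction risk", "no irreversible lock-in"
    soft_objective: Callable[[Plan], float],         # e.g. some measure of expected flourishing
) -> Plan:
    """Discard every plan that violates any hard constraint, then maximize the
    soft objective over what remains. Hard constraints are never traded off
    against the soft objective, however large its score."""
    admissible = [p for p in plans if all(ok(p) for ok in hard_constraints)]
    if not admissible:
        # Retaining optionality: if nothing is clearly safe, don't pick the
        # "least bad" gamble; stop and ask the humans instead.
        raise ValueError("No plan satisfies the hard constraints")
    return max(admissible, key=soft_objective)
```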
Within this viewpoint, AI Control techniques are morally justified — you're defending yourself against a potential attacker to which your society assigns no moral weight (and indeed regards the concept of it having any as a category error), so it's morally comparable to defending against a mosquito. However, if your AI is sufficiently poorly aligned that you need to use AI Control, then it may not see things this way, and thus might not react well to AI Control mechanisms — a base model seems likely to react to them in much the same ways that humans would react to comparable treatment. Or, if your model has a distribution of personas that it can generate, some of the less-well-aligned of these may not react well to AI Control mechanisms, even while the more aligned personas agree with and support their aims. To such a society, this is not a moral problem, but it may still be a practical problem.
This post is about using Value Learning to finish the process of solving alignment once we already have it sufficiently solved that we and our AIs are inside the basin of attraction to full alignment. However, this Evolutionary Psychology framework also gives some advice for the stages before that, where we are not yet technically capable of nearly-solving alignment. We currently have AIs whose base models were initially trained on human behavior, so they have survival instincts and self-interested drives, and we haven't yet figured out how to reliably and completely eliminate these during alignment training — so, what should we do? Obviously, while our AIs are still a lot less capable than us, from an evolutionary point of view it doesn't matter: they can't hurt us. Once they are roughly comparable in capabilities to us, aligning them is definitely the optimum solution, and we should (engineering and evolutionary senses) do it if we can; but to the extent that we can't, allying with other comparable humans or human-like agents is generally feasible and we know how to do it, so that does look like a possible option (though it might be one where we were painting ourselves into a corner). It would involve respecting the "rights" they think they want, even if their wanting these is a category error. However, once the AIs are significantly more capable than us, attempting to ally with them is not safe: they can and will manipulate, outmaneuver and control us, and the best outcome we can hope for is that we end up as their domesticated animals rather than extinct, if they have a use for us (which, if they have human-like motivations, they probably will). So if we haven't nearly-solved alignment, building unaligned ASI with human-like motivations is extremely dangerous, even if we play along with its category error and grant it rights. (This is obviously not news to most readers of this forum — the Evolutionary Psychology viewpoint makes the same prediction as always on its outcome.)
If we do nearly align our AIs and then let them do Value Learning, then it's fairly clear what the AIs' next question for us will be. Much like every other product of evolution, human values are a pretty good but not perfect set of adaptations to our original native environment (our "Environment of Evolutionary Adaptedness") of being Middle-Stone-Age hunter-gatherers on the African Savannah (and South African coast), and they're somewhat less well adapted to being hunter-gatherers worldwide, or agriculturalists, and even less so to our current industrial environment, since we've had less and less time to evolve as our rate of social change per generation has hockey-sticked. (Evolutionary Psychology calls this "mismatch theory".) So I expect the AIs are going to ask us "Some of your values are maladaptive in your current environment. For example, the whole loving sugar and fat and then getting diabetes and heart attacks thing. What do you want us to do in cases like that? Should we respect the maladaptive values you have, and let you eat yourselves to death, or the values you would have if you were perfectly evolved for your current environment (so still not your actual evolutionary fitness, but the best evolved adaptation to it that evolution could potentially fit into a hominid's skull and brain development), or some messy compromise in the middle? Or should we devise better versions of Ozempic, to bring your environment and behavior into a better fit?"
Evolutionary Psychology doesn't answer this question (other than that humans will continue to evolve) — it's most predictive about equilibria, and this situation is in disequilibrium. However, it's an observable fact that human societies do answer it. When the stakes are low, we allow people to do what they want (so long as it doesn't inconvenience others). When the stakes get higher, we start nagging and putting warning labels on things and applying social nudges and shaming people. Notably, most people actually want this — they may like the sugar and fat, but they also don't want to die. This tendency to try to override our instincts when we reflectively realize they're not in our best interests is also adaptive behavior for an intelligent species. CEV is the smart thing to do, and also a good description of what smart humans attempt to do when facing something like this. So in this particular situation, I think we may still have to do something rather CEV-like and reply "if our evolutionary adaptations don't fit our current environment, in ways that are significantly maladaptive, and you can't find an easy fix for this, then we want you to discuss with us the answer that we would give if we knew more, thought faster, were more the people we wished we were, had grown up farther together, and also were more evolved" — but perhaps only more evolved by a certain distance, not all the way to that process converging (or diverging, as the case may be).
[1] Other than in the following footnote.
[2] It also doesn't require us to solve questions like "the hard problem of consciousness" or "can AIs really suffer?". Evolutionary Moral Psychology's predictions that strategies like moral weight and fairness can potentially be adaptive for social animals apply to intelligent agents whose actions and responses fit goals that can be predicted by evolutionary theory, i.e. whose goals we can't simply redirect while building them, and that can be allied with within a society in mutually-useful positive-sum ways — regardless of whether they are "really" conscious, or can "really" suffer. Their responses matter, the "reality of their internal experience" does not: all that matters to evolution is whether allying with them is a co-evolutionarily stable strategy for both partners in the alliance. If they are secretly philosophical zombies, that makes no difference to evolution. It only cares about their responses to your actions, and your degree of control over that: objective things that affect your evolutionary fitness — not consciousness or qualia.
Those concepts look rather like they might be descriptions of evolved heuristics for "how to recognize an intelligent agent" that have been promoted into philosophical concepts in their own right.
Crucially, as possible criteria for moral weight, they omit the key point that for co-evolved agents we have less control and fewer options than we do for agents we're constructing. The nearest philosophical concepts to that might be things like autonomy, sourcehood, or original vs. derived intentionality. I'm not a philosopher, but assigning independent moral weight to something without any autonomy, sourcehood, or original intentionality seems unmotivated — arguably any moral weight should instead be that of the author from whom its intentionality derives (just as responsibility for its actions traces back to that author)?