TL;DR: The dynamics of human learning processes and reward circuitry are more relevant than evolution for understanding how inner values arise from outer optimization criteria.
This post is related to Steve Byrnes’ Against evolution as an analogy for how humans will create AGI, but more narrowly focused on how we should make inferences about values.
Thanks to Alex Turner, Charles Foster, and Logan Riggs for their feedback on a draft of this post.
How should we expect AGI development to play out?
True precognition appears impossible, so we use various analogies to AGI development, such as evolution, current-day humans, or current-day machine learning. Such analogies are far from perfect, but we may still be able to extract useful information by carefully examining them.
In particular, we want to understand how inner values relate to the outer optimization criteria. Human evolution is one possible source of data on this question. In this post, I’ll argue that human evolution actually provides very little usable evidence on AGI outcomes. In contrast, analogies to the human learning process are much more fruitful.
One way people motivate extreme levels of concern about inner misalignment is to reference the fact that evolution failed to align humans to the objective of maximizing inclusive genetic fitness. From Eliezer Yudkowsky’s AGI Ruin post:
16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about…
I don't think that "evolution -> human values" is the most useful reference class when trying to understand how outer optimization criteria relate to inner values. Evolution didn't directly optimize over our values. It optimized over our learning process and reward circuitry. Once you condition on a particular human's learning process + reward circuitry configuration + the human's environment, you screen off the influence of evolution on that human's values. So, there are really (at least) two classes of observations from which we can draw evidence: (1) "evolution -> human values" and (2) "human learning -> human values".
I will present five reasons why I think evidence from (2) “human learning -> human values” is more relevant to predicting AGI.
The relationship we want to make inferences about is "AI learning -> AI values".
I think that "AI learning -> AI values" is much more similar to "human learning -> human values" than it is to "evolution -> human values". Steve Byrnes makes this case in much more detail in his post on the matter. Two of the ways I think AI learning more closely resembles human learning, and not evolution, are:
"AI learning -> AI values", "human learning -> human values", and “evolution -> human values” each represent very different optimization processes, with many specific dissimilarities between any pair of them. However, I think the balance of dissimilarities points to "human learning -> human values" being the closer reference class for "AI learning -> AI values". As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the "human learning -> human values" analogy, not the "evolution -> human values" analogy.
Additionally, I think we have a lot more total empirical evidence from "human learning -> human values" than from "evolution -> human values". There are billions of humans, and each of them presumably has a somewhat different learning process / reward circuit configuration / learning environment. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once. Thus, evidence from "human learning -> human values" should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.
One common objection is that “human learning” represents a tiny region in the space of all possible mind designs, and so we cannot easily generalize our observations of humans to minds in general. This is, of course, true, and it greatly limits the strength of any AI-related conclusions we can draw from looking at "human learning -> human values". However, I again hold that inferences from "evolution -> human values" suffer from an even more extreme version of this same issue. "Evolution -> human values" represents an even more restricted look at the general space of optimization processes than we get from the observed variations in different humans' learning processes, reward circuit configurations, and learning environments.
Human evolution happened hundreds of thousands of years ago. We are deeply uncertain about the details of the human ancestral environment and which traits were under what selection pressure. We are still unsure about what precise selection pressure led humans to be so generally intelligent at all. We are very far away from being able to precisely quantify all the potentially values-related selection pressures in the ancestral environment, or how those selection pressures changed our reward systems or our tendencies to form downstream values.
In contrast, human within-lifetime learning is happening all the time, right now. It’s available for analysis and even experimental intervention. When one of two evidence sources about a phenomenon is much more accessible than the other, then, all else equal, the more accessible source should make up a greater fraction of our total information about that phenomenon. This is another reason why we should expect evidence from humans to account for a greater proportion of our total information about how inner values relate to outer optimization criteria.
I think that a careful account of how evolution shaped our learning process in the ancestral environment implies that evolution had next to no chance of aligning humans with inclusive genetic fitness.
There are no features of the ancestral environment which would lead to an ancestral human learning about the abstract idea of inclusive genetic fitness. There were no ancestral humans that held an explicit representation of inclusive genetic fitness. So, there was never an opportunity for evolution to select for humans who attached their values to an explicit representation of inclusive genetic fitness.
Regardless of how difficult it is, in general, to get learning systems to form values around different abstract concepts, evolution could not possibly have gotten us to form a value around the particular abstraction of inclusive genetic fitness, because we didn’t form such an abstraction in the ancestral environment. Ancestral humans had zero variance in their tendency to form values around inclusive genetic fitness. Evolution cannot select for traits that don’t vary across a population, so evolution could not have selected for humans that formed their values around inclusive genetic fitness.
In contrast, the sorts of things that we humans end up valuing are usually the sorts of things that are easy to form abstractions around. Thus, we are not doomed by the same difficulty that likely prevented evolution from aligning humans to inclusive genetic fitness.
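The zero-variance step of this argument can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions of my own: a single numeric trait, fitness that increases with the trait, one round of truncation selection, and no mutation. The function names and parameters are hypothetical and not taken from any population genetics library.

```python
import random
import statistics


def survivors_after_selection(traits, fitness, keep_fraction=0.5):
    """Truncation selection: keep the fittest `keep_fraction` of the population."""
    ranked = sorted(traits, key=fitness, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def selection_differential(traits, fitness, keep_fraction=0.5):
    """Mean trait of the survivors minus mean trait of the whole population."""
    kept = survivors_after_selection(traits, fitness, keep_fraction)
    return statistics.mean(kept) - statistics.mean(traits)


def fitness(trait):
    # Assumption for illustration: fitness rises with the trait value.
    return trait


rng = random.Random(0)

# A trait that varies across the population (e.g. a taste for sugary food).
varying_trait = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# A trait with zero variance: nobody has any tendency to value the abstraction.
constant_trait = [0.0] * 1000

shift_varying = selection_differential(varying_trait, fitness)
shift_constant = selection_differential(constant_trait, fitness)

print(f"shift in varying trait:  {shift_varying:.3f}")
print(f"shift in constant trait: {shift_constant:.3f}")
```

In this toy model, selection moves the mean of the varying trait upward, while the constant trait does not move at all, however strong the selection pressure. Real evolution also has mutation, which can introduce new variation; the claim in the text is that no nearby mutation would have produced variation in this particular trait.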
This point is extremely important. I want to make sure to convey it correctly, so I will quote two previous expressions of this point by other sources:
Risks from Learned Optimization notes that the lack of environmental data related to inclusive genetic fitness effectively increases the description length complexity of specifying an intelligence that deliberately optimizes for inclusive genetic fitness:
…description cost is especially high if the learned algorithm’s input data does not contain easy-to-infer information about how to optimize for the base objective. Biological evolution seems to differ from machine learning in this sense, since evolution’s specification of the brain has to go through the information funnel of DNA. The sensory data that early humans received didn’t allow them to infer the existence of DNA, nor the relationship between their actions and their genetic fitness. Therefore, for humans to have been aligned with evolution would have required them to have an innately specified model of DNA, as well as the various factors influencing their inclusive genetic fitness. Such a model would not have been able to make use of environmental information for compression, and thus would have required a greater description length. In contrast, our models of food, pain, etc. can be very short since they are directly related to our input data.
From Alex Turner (in private communication):
If values form because reward sends reinforcement flowing back through a person's cognition and reinforces the thoughts which (credit assignment judges to have) led to the reward, then if a person never thinks about inclusive reproductive fitness, they can never ever form a value shard around inclusive reproductive fitness. Certain abstractions, like lollipops or people, are convergently learned early in the predictive-loss-minimization process and thus are easy to form values around. But if there aren't local mutations which make a person more probable to think thoughts about inclusive genetic fitness before/while the person gets reward, then evolution can't instill this value. Even if the descendants of that person will later be able to think thoughts about fitness.
There are many sources of empirical evidence that can inform our intuitions regarding how inner goals relate to outer optimization criteria. My current (not very deeply considered) estimate of how to weight these evidence sources is roughly:
Edit: since writing this post, I've learned a lot more about inductive biases and what deep learning theory we currently have, so my relative weightings have shifted quite a lot towards "current results in machine learning".
I think that using "human learning -> human values" as our reference class for inner goals versus outer optimization criteria suggests a much more straightforward relationship between the two, as compared to the (lack of a) relationship suggested by "evolution -> human values". Looking at the learning trajectories of individual humans, it seems like a given person's values have a great deal in common with the sorts of experiences they've found rewarding in their lives up to that point in time. E.g., a person who grew up with and displayed affection for dogs probably doesn't want a future totally devoid of dogs, or one in which dogs suffer greatly.
Please note that I am not arguing that humans are inner aligned, or that looking at humans implies inner alignment is easy. Humans are misaligned with maximizing their outer reward source (activation of reward circuitry). I operationalize this misalignment as: "After a distributional shift from their learning environment, humans frequently behave in a manner that predictably fails to maximize reward in their new environment, specifically because they continue to implement values they'd acquired from their learning environment which are misaligned to reward maximization in the new environment".
For example, one way in which humans are inner misaligned is this: if you introduce a human into a new environment which has a button that will wirehead the human (thus maximizing reward in the new environment), but which has other consequences that are extremely bad by the lights of the human's preexisting values (e.g., killing a beloved family member), most humans won't push the button.
I also think this regularity in inner values is reasonably robust to large increases in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence. It's probably not as robust to your choice of which specific human to try this with. E.g., many people would screw themselves over with reckless self-modification. My point is that higher capabilities alone do not automatically render inner values completely alien to those demonstrated at lower capabilities.
(Part 2 will address whether the “sharp left turn” demonstrated by human capabilities with respect to evolution implies that we should expect a similar sharp left turn in AI capabilities.)
I would argue that you cannot weigh these things along a single metric. Say "evolution -> human values" really is only 4% of your value alignment. If that 4% is the fundamental core, then it's not just one part of the sum of all values, but a coefficient, or a base where the other stuff is the exponent. It's the hardware the software has to be loaded on, but it's not totally tabula rasa either.
Correct me if I'm wrong, but this would assume that if you could somehow make a snake with human-level intelligence and raise it in human society (let's pretend nobody considers it weird that there's a snake taking Chemistry class with them), then that snake would be 96% aligned with humanity?
My intuition would be along the lines of the parable of the scorpion and the frog.
The "4%" wasn't addressing the question of "where do humans get their values from?" It was addressing "When trying to make predictions about AGI outcomes, how much weight should we assign these various sources of evidence?"
My perspective isn't blank slatism. The genome has various levers by which it can influence the sorts of values that a human forms. E.g., the snake wouldn't have human-like reward circuitry, so it would probably learn to value very different things than a human which went through the same experiences. For more on this, see: “Learning from scratch” in the brain.
E.g., the snake wouldn't have human-like reward circuitry, so it would probably learn to value very different things than a human which went through the same experiences.
So in this case I think we then agree. But it seems a bit at odds with the 4% weighting of genetic roots. If we agree the snake would exhibit very different values despite experiencing the 'human learning' part then shouldn't this adjust the 60% weight you grant that? Seems the evolutionary roots made all the difference for the snake. Which is the whole point about initial AGI alignment having to be exactly right.
Otherwise I understand your post to be 'for humans, how much of human value is derived from evolution vs learning'. But that's using humans as evidence who are human to begin with.
This is a neat distillation of Steve's piece, and also a helpful and persuasive extension. I appreciated the arguments 2, 3, and 4 in particular ('2. We have more total evidence from human outcomes', '3. Human learning trajectories represent a broader sampling of the space of possible learning processes', '4. Evidence from humans are more accessible than evidence from evolution').
I wanted to raise two counterpoints, without a strong opinion on how much weight they deserve.
I'm confused. What is the outer optimization target for human learning?
My two top guesses below.
To me it looks like human values are the result of humans learning from their environment (which was influenced by earlier humans and includes current humans). So it's kind of like human values are, by definition, whatever humans learned. So observing that humans learned human values doesn't tell us anything.
Or maybe you mean something like parents / society / ... teaching new humans their values? I see some other problems there:
The outer optimization target for the human learning process is kind of indeterminate, but to the extent we can determine it, it's something like "learn the things that causally contributed to IGF in the ancestral environment." This isn't the same as IGF itself. It would include cooperation, sex drive, a fear of death, a taste for sugary and fatty foods, etc. We seem to be pretty well aligned from that perspective.
Also, if you view evolution from a wider perspective, we're not that misaligned, since it's just trying to find sticky patterns that reproduce themselves a lot, and it seems likely that human civilization will conquer the lightcone in some form or another fairly soon (even if it's misaligned AI doing it).
trying to find patterns that reproduce themselves a lot with minimal change in the patterns (but still some change) seems like a better model of evolution to me, and by that metric, if we solve ai alignment with us, I think we'll end up mostly solving our alignment with dna's values - much of what dna valued has been lost, but those who care about the environment for its own sake and beauty will represent a high enough capability group to construct the repair process. if given the chance to do so by an AI that respects their values, anyway.
5: Evolution could not have succeeded anyways
Evolution had to succeed. In order for evolution to be noticed and/or modeled by anything, the patterns of neurons had to align perfectly, even if there was a one-in-a-trillion chance of something like neurons randomly forming the correct general intelligence, anywhere, ever. The fact that we came from neuron brute forcing doesn't tell us that much about whether neuron brute forcing can create general intelligence.
Animals and insects aren't evidence at all; given that intelligence evolved, there would be plenty of offshoots.
By "evolution succeeds," the OP means "succeeds at aligning humans with caring about inclusive genetic fitness" – not at creating general intelligence.
The fact that we came from neuron brute forcing doesn't tell us that much about whether neuron brute forcing can create general intelligence.
The link you include mentions that anthropic updating on our observations can sometimes give us evidence on how hard something was likely to be initially (e.g., the cold war example where survival is evidence that things were less dangerous than we might have thought, all else equal). You can do something similar with the evolution of intelligence: This paper argues that if the evolution of human-level intelligence had been very unlikely, we'd be closer to the extremes of when Earth is no longer hospitable to big-brained life forms. The fact that the sun isn't going to expand (and make Earth uninhabitable) for a while longer, and that asteroid risks aren't massively overdue for us relative to evolutionary timescales, suggests that the evolution of general intelligence on Earth wasn't some freak accident that would almost never happen again under similar circumstances.