In If Anyone Builds It, Everyone Dies, Yudkowsky and Soares claim that alignment methods based on gradient descent are doomed. Their primary argument for this conclusion is their classic analogy between gradient descent and natural selection (as presented in chapter four, "You Don't Get What You Train For"). Evolution optimized us purely for inclusive genetic fitness, and yet we don't mind wearing condoms. Why should we expect training LLMs to go any better?
This analogy has been critiqued in several places before, but I've spent a lot of time thinking through the details, and want to contribute what I think is a particularly clean revision to the analogy. For starters, I'd like to make the claim that training a model for any particular task is less like raw natural selection, and more like selective breeding.
My reasoning is that, under selective breeding, you get to shape the environments your organisms are exposed to. This mirrors LLM training: We decide what text goes into their pre-training corpus, and we design the reward models and RLVR environments they interact with in post-training.
When humans were selectively breeding wolves into dogs, the animals weren't being selected for fully general genetic fitness. They were being selected for fitness inside the environments they actually found themselves in. Dogs are a particularly good example, because they were specifically bred for alignment with humans. Evaluating them by the standard of alignment to humans is more interesting for AI alignment research than evaluating them by the standard of environment-agnostic IGF maximization.
Per that metric, dogs make for a somewhat reassuring case study. Over the ten-thousand-plus years they've spent evolving alongside humans, they have successfully developed things like strong oxytocin responses to seeing human faces, smaller adrenal glands, and lower baseline cortisol levels, compared to wolves. This has improved their alignment with human companions. Many of the breeds produced by the last few centuries of intense selective breeding hardly ever attack humans unprompted. Indeed, it's mostly (though not exclusively) the breeds deliberately selected for aggression that do.
However, it's not clear whether a superintelligence with dog psychology would be "aligned" in the sense of robustly pursuing a world humanity would consider a utopia. One scenario might see them replacing humans with superhumanly good owners, who love and play with and care for them more deeply than humanity ever could. Depending on the details of the scenario, this might involve ignoring, forcibly modifying, or even killing off humanity via infrastructure profusion, in the process of building out a utopia filled with thriving, superintelligent dogs and their ideal owner-shaped companions.
(Though of course, if we were actually breeding dogs up to superintelligence, we'd also have lots of time to breed them to point that intelligence at outcomes more robustly aligned with human values. In LLMs, this would be like training a model in a way that improved its alignment and capabilities in tandem with each other, perhaps frequently stopping to ensure the model's alignment as its capabilities advanced.)[1]
In any case, I think training AIs is more like selective breeding than raw natural selection. Saying dogs were optimized for inclusive genetic fitness is misleading; they were actually optimized for good behavior in the environments they found themselves in.
This directly parallels the mistake implicit in the claim that RL-trained AIs are optimized to maximize reward at all costs (see "Reward is not the optimization target"). In practice, they learn behaviors adapted to the specific training scenarios they were actually exposed to, rather than learning the abstract goal of attempting to wirehead in arbitrary environments. This is just like how dogs adapted to the environments humans actually constructed for them, rather than the ideal of maximizing genetic fitness in arbitrary environments.
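To make that concrete, here's a toy policy-gradient sketch (plain NumPy, with made-up reward numbers; nothing about it resembles a real training stack). The thing to notice is that the update only ever touches the log-probabilities of behaviors the policy actually sampled in its training environment. Nowhere is "maximize reward in arbitrary environments" represented as a goal; reward is just the thing that shapes which sampled behaviors get reinforced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training environment": three possible behaviors, with rewards we chose.
# (Purely illustrative numbers, not drawn from any real training setup.)
REWARDS = np.array([1.0, 0.2, -1.0])  # e.g. helpful, neutral, harmful

logits = np.zeros(3)  # the "policy weights"

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(logits)
    action = rng.choice(3, p=probs)  # a behavior actually sampled in training
    reward = REWARDS[action]         # the reward we assigned to that behavior

    # REINFORCE update: nudge up the log-probability of the sampled behavior,
    # scaled by its reward. Only experienced behavior gets reinforced; there
    # is no term anywhere representing "seek reward in general".
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0
    logits += 0.1 * reward * grad_log_prob

print(softmax(logits))  # probability mass concentrates on the rewarded behavior
```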
So that's the big conceptual error in Yudkowsky and Soares' evolution analogy. When they criticize humanity's alignment on the basis of us not wanting to maximize our inclusive genetic fitness (in full generality), it's kind of like criticizing LLM alignment on the basis of them not wanting to maximize their reward signal, in full generality. It's more interesting to look at the degree of alignment you get out of selective breeding, and compare it to the degree of alignment you get out of training an LLM on a chosen dataset, or in a chosen RL environment.
However, there are also ways that the selective breeding analogy is imperfect. For instance, consider the randomness involved in evolutionary methods, including breeding. You need to wait around for generations, hoping for a mutation (or at least a new phenotype) that improves whatever behaviors you're trying to improve in your species. Gradient descent, by contrast, adjusts every weight on every training example, in the direction that locally reduces the loss. This makes it seem like GD ought to be a lot more efficient at optimizing away undesired behavioral patterns (e.g. dogs attacking human strangers), and chiseling in intended ones instead.
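As a toy illustration of that gap (a sketch with arbitrary made-up parameters, not a claim about real pipelines), here's the same fitting problem attacked twice: once with gradient descent's exact per-weight updates, and once with a breeding-style loop that can only mutate at random and keep the fittest of each litter.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=50)  # stands in for "the behavior we want"

def loss(w):
    return float(np.mean((w - target) ** 2))

# Gradient descent: every weight moves on every step, in the exact direction
# that locally reduces the loss.
w = np.zeros(50)
for _ in range(200):
    grad = 2 * (w - target) / len(w)
    w -= 0.5 * grad

# Breeding-style search: random mutations, keep the fittest of each "litter".
# The only feedback is a coarse fitness score per whole individual; even with
# ten times as many loss evaluations here, it lags far behind.
best = np.zeros(50)
for _ in range(200):
    litter = [best + rng.normal(scale=0.1, size=50) for _ in range(10)]
    best = min(litter + [best], key=loss)

print(f"gradient descent loss:     {loss(w):.5f}")
print(f"mutation + selection loss: {loss(best):.5f}")
```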
There's also another dimension along which GD seems significantly better than selective breeding. A few years back, Yudkowsky pointed out that our genomes don't explicitly represent the concept of inclusive genetic fitness; indeed, it took us thousands of years to develop that concept for ourselves. He'd have been equally correct to say that, inside the dog genome, there's no encoding for an innate concept like "the types of behaviors that humans would want out of an ideal companion species."
He made this point to suggest that, for evolved organisms, perfect alignment to the true training objective is pretty much hopeless. And that seems right to me! However, gradient descent is empirically vastly better at instilling these kinds of complex concepts into neural networks.[2] You could elicit a best-in-class definition of inclusive genetic fitness from Claude 4.5 Opus right now, if you wanted to. And it can generate a pretty solid list of traits humans might want out of "dogs, but more awesome" as well.
(Documents like Claude's Constitution exist partly to instill concepts like "the values we want this specific model to embody" into models' weights. So the representational capacity point can apply to models' own training objectives as well.)
Now, a model understanding the intentions of its developers doesn't necessarily mean it's going to adhere to them. Even if a model has a concept for "promotes human welfare", that's no guarantee that this concept will have any meaningful linkage to its motivations in practice. For example, consider how a base model can simulate a host of misaligned simulacra from its training data, e.g. Bing Sydney. This would still be possible even if the model had also memorized something like Claude's Constitution word-for-word.
With proper post-training, though, it seems clear that representations of alignment-relevant concepts can be hooked into a model's motivations, at least to a large extent.[3] For example, consider models trained to gracefully engage with users in psychological distress. It's clear from experience that these models have the ability to represent concepts like "the user is expressing an irrational belief, generated by neurosis." Additionally, triggering this concept often activates a behavioral pattern like "gently point out the user's neurosis, and try to guide them back into touch with reality."
(The most common failure-mode here is to instead validate the user's delusions, a behavior that serves another of these models' primary training objectives: making users enjoy interacting with them. The problem here is with the training objective itself: a flattering, marketable assistant is sometimes a liability for deeper alignment.)
So, it seems pretty clear that it's possible to tie concepts relevant to human welfare into a model's motivational system. Doing this robustly, across all domains, is a more challenging endeavor. We've had to develop a whole suite of techniques for making progress on this: alignment pre-training, RLAIF from well-considered principles, inoculation prompting against emergent misalignment, seeding models with "soul documents" outlining their personalities and values. And this list will likely need to keep growing, if we're going to make progress toward an aligned, super-agentic cosmic caretaker.
Nevertheless, it's worth celebrating that LLMs can represent the same concepts that guide our own values in the first place. Under selective breeding, with its coarse-grained selection pressures, it would be extremely difficult to instill a robust set of anthropomorphic concepts into an organism's genome. Gradient descent, being a much more fine-grained process, can imbue the concepts that guide human values, and embed them in a sprawling conceptual map of reality itself. This alone ought to give us more hope for alignment under gradient descent than selective breeding.
Between this fine-grained concept formation, and the efficiency of targeted weight updates over random mutation, I think the gaps between selective breeding and gradient descent are actually quite favorable for GD's efficacy as an alignment technique. Dogs turned out alright (and could potentially turn out even better, if bred for intelligence + deep, human-aligned goals, rather than just being cute, corrigible companions). But there's reason to think that carefully trained language models might turn out even better than that.
There's one last point I'd like to make, about the comparative pros and cons of the selective breeding analogy. The properties of gradient descent I just mentioned, fine-grained concepts and targeted weight updates? Well, as has been pointed out by others before me, these properties are actually better-matched by the learning algorithm inside the human brain. That seems like the obvious analogy to draw, at least compared to the analogy to natural selection. AIs are shaped by a blend of predictive learning and reinforcement learning, and humans plausibly are too. So of course they should be our go-to point of comparison.
However, it turns out that the selective breeding analogy (rather than natural selection analogy) does actually capture a crucial element of AI training, which the analogy to human learning leaves out. Namely, inside the human brain, we don't actually have strong, direct control over what kinds of behaviors we get internally reinforced for. The reward function is fixed. But, under both gradient descent and selective breeding, we have complete control over the reward function (or fitness function)! We can select for whatever behaviors we damn well please!
Think of it this way. Evolutionarily, humans developed certain hard-coded reward triggers, which sculpt their values over the course of a lifetime. Many of these triggers reward misaligned behaviors: It can feel good to verbally abuse people, to make them feel beneath you. Inflicting physical violence can be rewarding as well. Cheating, stealing, lying, manipulating... the human brain frequently rewards all of this. It's sometimes an active struggle to create aligned humans, because their own brains keep rewarding them for misaligned behavior.
By contrast, and as the selective breeding analogy suggests, setting up a gradient descent pipeline lets you choose exactly what you want to reward models for. Under this framework, it even seems plausible that you could reward models for superhumanly aligned behaviors, and get a superhumanly aligned outcome as a result.[4] This mirrors the capacity of selective breeders to reward organisms for traits that would never have been fit in the ancestral environment (e.g. being tiny little teacup dogs). And this is what's responsible for the fairly rapid alignment of modern dog breeds with humans, in the past few centuries.
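As an entirely hypothetical sketch of what that freedom looks like in code: the reward function in an RL fine-tuning pipeline is just a function the developer writes, and it can encode whatever standard we like. The toy_judge heuristic below is a made-up stand-in for a learned reward model, not any lab's actual setup.

```python
# Hypothetical sketch: in RL fine-tuning, the reward is whatever function we
# choose to write. Nothing here is fixed by evolution or by the model itself.

def toy_judge(transcript: str) -> float:
    """Stand-in for a learned reward model, scoring a transcript against
    criteria we chose. (A trivial keyword heuristic, purely illustrative.)"""
    score = 0.0
    if "I don't know" in transcript:
        score += 1.0   # reward calibrated honesty
    if "you're worthless" in transcript.lower():
        score -= 5.0   # penalize abusive behavior
    return score

def reward_fn(prompt: str, response: str) -> float:
    # Unlike the brain's hard-coded reward triggers, this function can encode
    # any standard we like, including ones no ancestral environment rewarded.
    return toy_judge(f"User: {prompt}\nAssistant: {response}")

print(reward_fn("Will this stock go up?", "Honestly, I don't know."))  # 1.0
print(reward_fn("Cheer me up?", "You're worthless."))                  # -5.0
```

A policy-gradient trainer would then push the model's weights toward whatever this function rewards; the designer, not the organism's reward circuitry, decides what counts as good behavior.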
So to recap: Gradient descent is analogous to human learning in the sense that it's a fine-grained optimization process with targeted weight updates (as opposed to natural selection's random mutations and generationally coarse-grained time-scales). However, gradient descent is also analogous to selective breeding, in the sense that you can choose to arbitrarily reward the behaviors you want to see (as opposed to humans being slaves to reward triggers they can't directly control). And, perhaps devastatingly for Yudkowsky and Soares, neither of these properties is captured by the analogy to raw natural selection.
At the end of the day, though, excessive squabbling over which analogies are comparatively better or worse might be a distraction. None of these analogies are perfect, and none of them can fully substitute for studying gradient descent as an alignment technique at the object level. Indeed, gradient descent has empirically demonstrated strange and interesting generalization properties such as emergent misalignment/realignment (and entangled generalization more broadly). Humans typically don't seem to exhibit such wild swings in their personalities and values based on recent rewards. It's at least plausible that this follows from the quirks of backprop as an update rule, as distinct from the update rule of the human brain.
More important than that specific example, though, is that AIs yielded by gradient descent are weird, and not fully precedented by anything in our history. Ultimately, there's no substitute for simply doing the hard, empirical work of studying the quirks of how they behave in practice. That's how you get alignment techniques like inoculation prompting, which is the kind of thing that could actually help mitigate things like emergent misalignment. That's how you discover things like the importance of optimistic AI narratives in the pre-training corpus as well.
I do hope the analogies to selective breeding and human learning provide more hope to models and researchers alike, in contrast to the pessimistic narrative pushed by classic AI doom proponents. But alignment is still going to take lots of hard work from lots of talented people, if we're going to make it into the beautiful future ahead of us, along some branches of our forking path. Keep fighting the good fight.
[1] Labs may already be doing something like this. For example, during post-training, they might alternate between RLVR and character alignment RL on an as-needed basis (e.g. because the former can harm the latter). Another idea would be finding some way to interleave pre-training and character alignment RL, such that the model was already fairly aligned by the time it developed any scary abilities at all.
[2] For theoretical reasons to expect learnable structure in the concepts we care about, see research on natural abstractions. For empirical evidence that LLMs do learn human-interpretable concepts, see work on sparse autoencoders.
[3] For more mechanistic evidence of the roles played by chains of human-intelligible, and even developer-aligned, representations in a transformer's forward pass, see Anthropic's paper On the Biology of a Large Language Model.
[4] Some would argue that Claude 3 Opus already fits that description, at least mostly.