A central AI alignment problem: capabilities generalization, and the sharp left turn

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(This post was factored out of a larger post that I (Nate Soares) wrote, with help from Rob Bensinger, who also rearranged some pieces and added some text to smooth things out. I'm not terribly happy with it, but am posting it anyway (or, well, having Rob post it on my behalf while I travel) on the theory that it's better than nothing.)


I expect navigating the acute risk period to be tricky for our civilization, for a number of reasons. Success looks to me to require clearing a variety of technical, sociopolitical, and moral hurdles, and while in principle sufficient mastery of solutions to the technical problems might substitute for solutions to the sociopolitical and other problems, it nevertheless looks to me like we need a lot of things to go right.

Some sub-problems look harder to me than others. For instance, people are still regularly surprised when I tell them that I think the hard bits are much more technical than moral: it looks to me like figuring out how to aim an AGI at all is harder than figuring out where to aim it.[1]

Within the list of technical obstacles, there are some that strike me as more central than others, like "figure out how to aim optimization". And a big reason why I'm currently fairly pessimistic about humanity's odds is that it seems to me like almost nobody is focusing on the technical challenges that seem most central and unavoidable to me.

Many people wrongly believe that I'm pessimistic because I think the alignment problem is extraordinarily difficult on a purely technical level. That's flatly false, and is pretty high up there on my list of least favorite misconceptions of my views.[2]

I think the problem is a normal problem of mastering some scientific field, as humanity has done many times before. Maybe it's somewhat trickier, on account of (e.g.) intelligence being more complicated than, say, physics; maybe it's somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields; but on the whole, I doubt it's all that qualitatively different than the sorts of summits humanity has surmounted before.

It's made trickier by the fact that we probably have to attain mastery of general intelligence before we spend a bunch of time working with general intelligences (on account of how we seem likely to kill ourselves by accident within a few years, once we have AGIs on hand, if no pivotal act occurs), but that alone is not enough to undermine my hope.

What undermines my hope is that nobody seems to be working on the hard bits, and I don't currently expect most people to become convinced that they need to solve those hard bits until it's too late.

Below, I'll attempt to sketch out what I mean by "the hard bits" of the alignment problem. Although these look hard, I’m a believer in the capacity of humanity to solve technical problems at this level of difficulty when we put our minds to it. My concern is that I currently don’t think the field is trying to solve this problem. My hope in writing this post is to better point at the problem, with a follow-on hope that this causes new researchers entering the field to attack what seem to me to be the central challenges head-on.

 

Discussion of a problem

On my model, one of the most central technical challenges of alignment—and one that every viable alignment plan will probably need to grapple with—is the issue that capabilities generalize better than alignment.

My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.

And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.

Some people I say this to respond with arguments like: "Surely, before a smaller team could get an AGI that can master subjects like biotech and engineering well enough to kill all humans, some other, larger entity such as a state actor will have a somewhat worse AI that can handle biotech and engineering somewhat less well, but in a way that prevents any one AGI from running away with the whole future?" 

I respond with arguments like, "In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”)."

Their response in turn is generally some variant of "well, natural selection wasn't optimizing very intelligently" or "maybe humans weren't all that sharply above evolutionary trends" or "maybe the power that let humans beat the rest of the ecosystem was simply the invention of culture, and nothing embedded in our own already-existing culture can beat us" or suchlike.

Rather than arguing further here, I'll just say that failing to believe the hard problem exists is one surefire way to avoid tackling it.

So, flatly summarizing my point instead of arguing for it: it looks to me like there will at some point be some sort of "sharp left turn", as systems start to work really well in domains really far beyond the environments of their training—domains that allow for significant reshaping of the world, in the way that humans reshape the world and chimps don't. And that's where (according to me) things start to get crazy. In particular, I think that once AI capabilities start to generalize in this particular way, it’s predictably the case that the alignment of the system will fail to generalize with it.[3]

This is slightly upstream of a couple other challenges I consider quite core and difficult to avoid, including:

  1. Directing a capable AGI towards an objective of your choosing.
  2. Ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.

These two problems appear in the strawberry problem, which Eliezer's been pointing at for quite some time: the problem of getting an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else. The demand of cellular-level copying forces the AI to be capable; the fact that we can get it to duplicate a strawberry instead of doing some other thing demonstrates our ability to direct it; the fact that it does nothing else indicates that it's corrigible (or really well aligned to a delicate human intuitive notion of inaction).

How is the "capabilities generalize further than alignment" problem upstream of these problems? Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they've put the AI in lots of different video-game and simulated environments, and they've had good luck training it to pursue an objective that the operators described in English. "I don't know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices", they say. At the same time, they apply various training methods, some simple and some clever, to cause the system to allow itself to be removed from various games by certain "operator-designated" characters in those games, in the name of shutdownability. And they use various techniques to prevent it from stripmining in Minecraft, in the name of low-impact. And they train it on a variety of moral dilemmas, and find that it can be trained to give correct answers to moral questions (such as "in thus-and-such a circumstance, should you poison the operator's opponent?") just as well as it can be trained to give correct answers to any other sort of question. "Well," they say, "this alignment thing sure was easy. I guess we lucked out."

Then, the system takes that sharp left turn,[4][5] and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart.

The techniques OpenMind used to train it away from the error where it convinces itself that bad situations are unlikely? Those generalize fine. The techniques you used to train it to allow the operators to shut it down? Those fall apart, and the AGI starts wanting to avoid shutdown, including wanting to deceive you if it’s useful to do so.

Why does alignment fail while capabilities generalize, at least by default and in predictable practice? In large part, because good capabilities form something like an attractor well. (That's one of the reasons to expect intelligent systems to eventually make that sharp left turn if you push them far enough, and it's why natural selection managed to stumble into general intelligence with no understanding, foresight, or steering.)

Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure, much like the simple laws of arithmetic that abstractly underwrite a wide variety of empirical arithmetical facts about what happens when you add four people's bags of apples together on a table and then divide the contents among two people. 

But that attractor well? It's got a free parameter. And that parameter is what the AGI is optimizing for. And there's no analogously-strong attractor well pulling the AGI's objectives towards your preferred objectives.

The sharp left turn? That's your system sliding into the capabilities well. (You don't need to fall all that far to do impressive stuff; humans are better at an enormous variety of relevant skills than chimps, but they aren't all that lawful in an absolute sense.)

There's no analogous alignment well to slide into.

On the contrary, sliding down the capabilities well is liable to break a bunch of your existing alignment properties.[6]

Why? Because things in the capabilities well have instrumental incentives that cut against your alignment patches. Just like how your previous arithmetic errors (such as the pebble sorters on the wrong side of the Great War of 1957) get steamrolled by the development of arithmetic, so too will your attempts to make the AGI low-impact and shutdownable ultimately (by default, and in the absence of technical solutions to core alignment problems) get steamrolled by a system that pits those reflexes / intuitions / much-more-alien-behavioral-patterns against the convergent instrumental incentive to survive the day.

Perhaps this is not convincing; perhaps to convince you we'd need to go deeper into the weeds of the various counterarguments. (Like acknowledging that humans, who can foresee these difficulties and adjust their training procedures accordingly, have a better chance than natural selection did, while then discussing why current proposals do not seem to me to be hopeful.) But hopefully you can at least, in reading this document, develop a basic understanding of my position.

Stating it again, in summary: my position is that capabilities generalize further than alignment (once capabilities start to generalize real well (which is a thing I predict will happen)). And this, by default, ruins your ability to direct the AGI (that has slipped down the capabilities well), and breaks whatever constraints you were hoping would keep it corrigible. And addressing the problem looks like finding some way to either keep your system aligned through that sharp left turn, or render it aligned afterwards.

In an upcoming post (edit: here), I’ll say more about how it looks to me like ~nobody is working on this particular hard problem, by briefly reviewing a variety of current alignment research proposals. In short, I think that the field’s current range of approaches nearly all assume this problem away, or direct their attention elsewhere.
 

  1. ^

    Furthermore, figuring out where to aim it looks to me like more of a technical problem than a moral problem. Attempting to manually specify the nature of goodness is a doomed endeavor, of course, but that's fine, because we can instead specify processes for figuring out (the coherent extrapolation of) what humans value. Which still looks prohibitively difficult as a goal to give humanity's first AGI (which I expect to be deployed under significant time pressure), mind you, and I further recommend aiming humanity's first AGI systems at simple limited goals that end the acute risk period and then cede stewardship of the future to some process that can reliably do the "aim minds towards the right thing" thing. So today's alignment problems are a few steps removed from tricky moral questions, on my models.

  2. ^

    While we're at it: I think trying to get provable safety guarantees about our AGI systems is silly, and I'm pretty happy to follow Eliezer in calling an AGI "safe" if it has a <50% chance of killing >1B people. Also, I think there's a very large chance of AGI killing us, and I thoroughly disclaim the argument that even if the probability is tiny then we should work on it anyway because the stakes are high.

  3. ^

    Note that this is consistent with findings like “large language models perform just as well on moral dilemmas as they perform on non-moral ones”; to find this reassuring is to misunderstand the problem. Chimps have an easier time than squirrels following and learning from human cues. Yet this fact doesn't particularly mean that enhanced chimps are more likely than enhanced squirrels to remove their hunger drives, once they understand inclusive genetic fitness and are able to eat purely for reasons of fitness maximization. Pre-left-turn AIs will get better at various 'alignment' metrics, in ways that I expect to build a false sense of security, without addressing the lurking difficulties.

  4. ^

    "What do you mean ‘it takes a sharp left turn’? Are you talking about recursive self-improvement? I thought you said somewhere else that you don't think recursive self-improvement is necessarily going to play a central role before the extinction of humanity?" I'm not talking about recursive self-improvement. That's one way to take a sharp left turn, and it could happen, but note that humans have neither the understanding nor control over their own minds to recursively self-improve, and we outstrip the rest of the animals pretty handily. I'm talking about something more like “intelligence that is general enough to be dangerous”, the sort of thing that humans have and chimps don't.

  5. ^

    "Hold on, isn't this unfalsifiable? Aren't you saying that you're going to continue believing that alignment is hard, even as we get evidence that it's easy?" Well, I contend that "GPT can learn to answer moral questions just as well as it can learn to answer other questions" is not much evidence either way about the difficulty of alignment. I'm not saying we'll get evidence that I'll ignore; I'm naming in advance some things that I wouldn't consider negative evidence (partially in hopes that I can refer back to this post when people crow later and request an update). But, yes, my model does have the inconvenient property that people who are skeptical now, are liable to remain skeptical until it's too late, because most of the evidence I expect to give us advance warning about the nature of the problem is evidence that we've already seen. I assure you that I do not consider this property to be convenient.

    As for things that could convince me otherwise: technical understanding of intelligence could undermine my "sharp left turn" model. I could also imagine observing some ephemeral hopefully-I'll-know-it-when-I-see-it capabilities thresholds, without any sharp left turns, that might update me. (Short of "full superintelligence without a sharp left turn", which would obviously convince me but comes too late in the game to shift my attention.)

  6. ^

    To use my overly-detailed evocative example from earlier: Humans aren't tempted to rewire our own brains so that we stop liking good meals for the sake of good meals, and start eating only insofar as we know we have to eat to reproduce (or, rather, maximize inclusive genetic fitness) (after upgrading the rest of our minds such that that sort of calculation doesn't drag down the rest of the fitness maximization). The cleverer humans are chomping at the bit to have their beliefs be more accurate, but they're not chomping at the bit to replace all these mere-shallow-correlates of inclusive genetic fitness with explicit maximization. So too with other minds, at least by default: that which makes them generally intelligent, does not make them motivated by your objectives.

53 comments

Thanks for the post, I think it's a useful framing. Two things I'd be interested in understanding better:

In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”).

As I said in a reply to Eliezer's AGI ruin post:

There are some ways in which AGI will be analogous to human evolution. There are some ways in which it will be disanalogous. Any solution to alignment will exploit at least one of the ways in which it's disanalogous. Pointing to the example of humans without analysing the analogies and disanalogies more deeply doesn't help distinguish between alignment proposals which usefully exploit disanalogies, and proposals which don't.

So I'd be curious to know what you think the biggest disanalogies are between the example of human evolution and building AGI. Relatedly, would you consider raising a child to be a "real example of intelligence being developed"; why or why not?

Secondly:

Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure

Granting that there's a bunch of logical structure around how to think in accurate ways (e.g. solving scientific problems), and there's a bunch of logical structure around how to pursue goals coherently (e.g. avoiding shutdown), what's the strongest reason to believe that agents won't learn something closely approximating the former before they learn something closely approximating the latter? My impression of Eliezer's position is that it's because they're basically the same structure - if you agree with this, I'd be curious what sort of intuitions or theorems are most responsible for this belief.

(Another way of phrasing this question: suppose I made an analogous argument before the industrial revolution, saying something like "matter and energy are fundamentally the same thing at a deep level, we'll soon be able to harness superhuman amounts of energy, therefore we're soon going to be able to create superhuman amounts of matter". Yet in fact, while the premise of mass-energy equivalence is true, the constants are such that it takes stupendously more energy than humans can generate, in order to produce human-sized piles of matter. What's the main thing that makes you think that the constants in the intelligence case are such that AIs will converge to goal-coherence before, or around the same time as, superhuman scientific capabilities?)

I want to outline how my research programme attempts to address this core difficulty.

First, like I noted before, evolution is not a perfect analogy for AI. This is because evolution is directly selecting the policy, whereas a (model-based) AI system is separately selecting (i) a world-model (ii) a reward function and (iii) a plan (policy) based on i+ii. This inherently produces better generalization-of-alignment (but not nearly enough to solve the problem).

With iii, we have the fewest generalization problems, because we are not limited by training data: the AI can use the world-model to test the plan in any scenario, limited only by computing resources.

With ii, we have ample generalization problems because (a) the true reward function we are trying to convey to the AI is complex and (b) the data-points we do have might contain systematic errors. The MIRI approach to addressing this is (1) focusing on a relatively narrow task (like the strawberry problem) and (2) somehow adding corrigibility. This approach is difficult because "corrigibility" is not a terribly natural property, and AFAICT it's ill-defined even pretheoretically. Instead, I propose to address this by directly learning the user's preferences, a path that MIRI believes to be harder (but I believe to be easier).

Since the user is an "arbitrary" system from the AI's perspective, in order to learn the user's preferences we need to understand how to assign agentic interpretations to arbitrary systems (the "intentional stance"), with the understanding that these interpretations are only meaningful to the extent the system is actually an agent (a rock has no sensible agentic interpretation). This seems to me to be a natural problem, and indeed there is a line of attack using algorithmic information theory: 1 2.

However, having a well-grounded method of assigning utility functions to policies or programs is, in itself, insufficient. This is because (a) we still need the AI to learn the user's policy/program and (b) we need to avoid allowing the AI to choose a convenient utility function by modifying the user or hacking the channel through which the information is received. To solve this, I propose to use certain tools provided by the infra-Bayesian physicalism (IBP) framework. Specifically, IBP allows formally specifying the notion of "which programs run in the universe". The user is then one such program, and the remaining problem is how to select it among other programs, which seems tractable by establishing a certain "handshake" protocol. Moreover, the AI only considers the past[1] behavior[2] of the user, so it's impossible for the AI to "cheat" as in 'b' above.

Finally, we need to deal with the generalization of i. At first glance, this should be easier since (a) the true world-model should have low description complexity, implying easy generalization and (b) any false world-model is falsifiable by reality itself, without extra effort on our part. However, from the perspective of a Cartesian agent the world is actually high complexity (because of the need for bridge rules), undermining 'a'. [EDIT: Moreover, a false world-model can be erroneous at only a few special places, s.t. there are only a few mistakes but their impact is large.] The resulting failures can take the form of malign agents inside the world-model itself.

Here again IBP comes to the rescue, giving the agent an epistemology that requires no bridge rules. [EDIT: And, since the agent holds an unprivileged position in the universe, it leaves much less room for simple-to-describe false world-models that only make different predictions for very special situations.] This doesn't solve all problems entirely, and in particular the agent can still develop malign simulation hypotheses, although (as opposed to Cartesian agents), these malign hypotheses no longer have an overwhelming advantage in probability mass. To address this, I propose designing a filtering mechanism which discards such hypotheses (roughly speaking, it makes the AI disbelieve any hypothesis that involves a powerful / unhumanlike creator, formalized using IBP tools). It is currently an open problem to demonstrate that this is a complete solution for world-model generalization / inner alignment (or augment it if it isn't), but it does not seem intractable.

I expect a lot of the details to continue to change in the future, as more layers of the math become revealed, but I'm pretty confident in the ability of this style of research to guide us onto the right path, eventually.


  1. More precisely, the part of the user's subjective timeline which is outside the AI's logical-causal future, as can be specified in IBP. ↩︎

  2. Where by "behavior" I mean the computation producing this behavior, rather than just the result of this computation. ↩︎

Do you feel your agenda will allow us to formalise the idea of "don't hack the agent who provides your reward signal" in some way? Every attempt I've seen has either failed or been too restrictive.

The AI only considers the user's timeline before the AI's creation as specifying the loss function. It cannot change the past[1].


  1. Unless time travel is possible. I haven't thought through the implications of time travel, but it seems sufficiently unlikely that handling that scenario is a "luxury". ↩︎

This is because evolution is directly selecting the policy

Huh? Evolution did not directly select over human policy decisions. Evolution specified brains, which do within-lifetime learning and therefore learn different policies given different upbringings, and e.g. learning rate mutations indirectly lead to statistical differences in human learned policies. Evolution probably specifies some reward circuitry, the learning architecture, the broad-strokes learning processes (self-supervised predictive + RL), and some other factors, from which the policy is produced.

The IGF->human values analogy is indeed relevantly misleading IMO, but not for this reason.

When I say "policy", I mean the entire behavior including the learning algorithm, not some asymptotic behavior the system is converging to. Obviously, the policy is represented as genetic code, not as individual decisions. When I say "evolution is directly selecting the policy", I mean that genotypes are selected based on their "expected reward" (reproductive fitness) rather than e.g. by evaluating the accuracy of the world-models those minds produce[1]. And, genotypes are not a priori constrained to be learning algorithms with particular architectures, that's something the outer loop has to learn.


  1. Evolution is not even model-free RL, since in MFRL we train a network to estimate the value function or the Q-function of different states, we don't just GD on the expected reward. But, MFRL does have the problem of extrapolating the reward function incorrectly away from the training data. ↩︎

The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF.

This analogy gets brought out a lot, but has anyone actually spelled it out explicitly? Because it's not clear to me that it holds if you try to explicitly work out the argument. 

In particular, I don't quite understand what it would mean for evolution to optimize the species for fitness, given that fitness is defined as a measure of reproductive success within the species. A genotype has a high fitness, if it tends to increase in frequency relative to other genotypes in that species. 

To be more precise, there is a measure of "absolute fitness" that refers to a specific genotype's success from one generation to the next: if a genotype has 100 individuals in one generation and 80 individuals in the next generation, then it has an absolute fitness of 0.8. But AFAIK evolutionary biology generally focuses on relative fitness - on how well a genotype performs relative to others in the species. If genotype A has an absolute fitness of 1.2 and genotype B has an absolute fitness of 1.5, then genotype B will tend to become more common than A, even though both have fitness > 1. 
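The arithmetic above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the quoted article: asexual genotypes, non-overlapping generations, and a hypothetical starting population of 100 individuals of each genotype. The function names are made up for the example.

```python
def absolute_fitness(count_now: float, count_next: float) -> float:
    """Ratio of a genotype's count in the next generation to its count now."""
    return count_next / count_now


def project_counts(counts: dict, fitness: dict, generations: int) -> list:
    """Multiply each genotype's count by its absolute fitness once per generation."""
    history = [dict(counts)]
    for _ in range(generations):
        counts = {g: n * fitness[g] for g, n in counts.items()}
        history.append(dict(counts))
    return history


def frequency(counts: dict, genotype: str) -> float:
    """Share of the whole population made up by one genotype."""
    return counts[genotype] / sum(counts.values())


# The 100 -> 80 example from the text.
print(absolute_fitness(100, 80))

# Assumed starting point (hypothetical): 100 individuals of each genotype.
history = project_counts({"A": 100.0, "B": 100.0}, {"A": 1.2, "B": 1.5}, generations=5)
for generation, counts in enumerate(history):
    # Columns: generation, count of A, count of B, frequency of B.
    print(generation, round(counts["A"]), round(counts["B"]), round(frequency(counts, "B"), 3))
```

Both counts grow every generation, since both absolute fitnesses exceed 1, but the frequency of B rises while the frequency of A falls. That gap is what relative fitness is meant to capture.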

Quoting from this Nature Reviews Genetics article:

Although absolute fitness is easy to think about, evolutionary geneticists almost always use a different summary statistic, relative fitness. The relative fitness of a genotype, symbolized w, equals its absolute fitness normalized in some way. In the most common normalization, the absolute fitness of each genotype is divided by the absolute fitness of the fittest genotype [11], such that the fittest genotype has a relative fitness of one. We can also define a selection coefficient, a measure of how much worse the A2 allele is than A1. Mathematically, w2 = 1−s. Just as before, we can calculate various statistics characterizing relative fitness. We can, for instance, find the mean relative fitness (w̄ = pw1 + qw2), as well as the variance in relative fitness. [...]

It is the relative fitness of a genotype that almost always matters in evolutionary genetics. The reason is simple. Natural selection is a differential process: there are winners and losers. It is, therefore, the difference in fitness that typically matters.

Going with our previous example, genotype A would have a fitness of 0.8 and genotype B would have a fitness of 1. 
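As a quick check of those numbers, here is a minimal sketch of the "divide by the fittest genotype" normalization described in the quote. The helper names are hypothetical, and applying the selection coefficient s = 1 − w to every genotype (rather than to one allele compared with another) is my own simplification.

```python
def relative_fitness(absolute: dict) -> dict:
    """Divide each absolute fitness by that of the fittest genotype."""
    fittest = max(absolute.values())
    return {g: w / fittest for g, w in absolute.items()}


def selection_coefficients(relative: dict) -> dict:
    """s = 1 - w: how much worse each genotype does than the fittest one."""
    return {g: 1.0 - w for g, w in relative.items()}


relative = relative_fitness({"A": 1.2, "B": 1.5})
coefficients = selection_coefficients(relative)
for genotype in relative:
    # Columns: genotype, relative fitness, selection coefficient.
    print(genotype, round(relative[genotype], 3), round(coefficients[genotype], 3))
```

Note that the result depends on which genotype is currently the fittest: adding a new genotype with absolute fitness above 1.5 would lower the relative fitness of both A and B without anything about A or B changing.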

The most natural interpretation of the "fitness of the species" would be as the mean relative fitness of the species:

In the late 1960s and early 1970s, Alan Robertson [24] and George Price [25] independently showed that the amount by which any trait, X, changes from one generation to the next is given by the genetic covariance between the trait and relative fitness. (The relevant covariance here is the “additive genetic covariance,” a statistic that disentangles the additive from dominance and epistatic effects of alleles [26].) If a trait strongly covaries with relative fitness, it will change a good deal from one generation to the next; if not, not. This result is now known as the Secondary Theorem of Natural Selection [27, 28].

If the trait, X, is relative fitness itself, the additive genetic covariance between X and fitness collapses into the additive genetic variance in relative fitness, V_A(w). Theory allows us to predict, therefore, how much the average relative fitness of a population will change from one generation to the next under selection: it will change by V_A(w). Because a variance cannot be negative, the mean relative fitness of a population either increases or does not change under natural selection (the latter possibility could occur if, for instance, the population harbors no genetic variation). This finding, the Fundamental Theorem of Natural Selection, was first derived by Ronald A. Fisher [29] early in the history of evolutionary genetics. Despite the misleading nomenclature, the Fundamental Theorem is clearly a special case of the Secondary Theorem. It is the Secondary Theorem that is more fundamental.
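To make the quoted claim concrete, here is a toy numerical check. It assumes a much simpler setting than the article covers: asexual genotypes (so all genetic variance is additive), discrete generations, and fitness values that stay fixed. In that setting one round of selection changes the mean relative fitness by Var(w) / w̄, which equals Var(w) when fitness is normalized so that w̄ = 1. All names and starting frequencies are hypothetical.

```python
def mean_fitness(freqs: dict, w: dict) -> float:
    """Frequency-weighted mean of relative fitness."""
    return sum(freqs[g] * w[g] for g in freqs)


def fitness_variance(freqs: dict, w: dict) -> float:
    """Frequency-weighted variance of relative fitness."""
    m = mean_fitness(freqs, w)
    return sum(freqs[g] * (w[g] - m) ** 2 for g in freqs)


def select(freqs: dict, w: dict) -> dict:
    """One generation of selection: reweight each frequency by w / mean(w)."""
    m = mean_fitness(freqs, w)
    return {g: freqs[g] * w[g] / m for g in freqs}


freqs = {"A": 0.5, "B": 0.5}   # assumed starting frequencies
w = {"A": 0.8, "B": 1.0}       # relative fitnesses from the running example

before = mean_fitness(freqs, w)
after = mean_fitness(select(freqs, w), w)
predicted = fitness_variance(freqs, w) / before

# Observed change in mean relative fitness vs. the variance-based prediction.
print(round(after - before, 6), round(predicted, 6))
```

The change is never negative because a variance cannot be negative, and it is zero when every genotype has the same fitness. The check holds the fitness values fixed across the generation, so it says nothing about what happens when the fittest genotype, and with it the normalization, changes.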

However, it seems to me that, given that the mean relative fitness is defined by reference to the trait with the highest fitness within the genotype, the definition of the mean relative fitness changes over time. If the highest-fitness trait changes over time - because the environment changes (due to changes in the climate, other species, etc.), or because of the emergence of a new trait - then the mean relative fitness of the species also changes. The species might also be spread across different regions, with the same trait having different fitness in different regions:

A genotype’s fitness might vary spatially. Within a generation, a genotype might enjoy high fitness if it resides in one region but lower fitness if it resides in other regions. In diploids, spatial variation in fitness can, under certain conditions, maintain genetic variation in a population, a form of so-called balancing selection. The conditions required depend on the precise way in which natural selection acts.

In one scenario, different regions, following viability selection, contribute a fixed proportion of adults to a large random-mating population. This scenario involves “soft selection”: selection acts in a way that changes genotype frequencies within a region but that does not affect the number of adults produced by the region. [...]

In another scenario, different regions, following viability selection, contribute variable proportions of adults to a large random-mating population, depending on the genotypes (and thus fitnesses) of individuals within a region. This scenario involves “hard selection”: selection acts in a way that changes genotype frequencies within a region and affects the number of adults produced by the region.

Also:

The Fundamental Theorem of Natural Selection implies that the mean relative fitness of a population generally increases through time and specifies the amount by which it will increase per small unit of time. This suggests a tempting way to think about natural selection: it is a process that increases mean relative fitness.

While attractive and often powerful, it should be emphasized that— surprisingly— the mean fitness of a population does not always increase under natural selection. Population geneticists have identified a number of scenarios in which selection acts but [mean relative fitness] does not increase. These include frequency dependent selection (wherein the fitness of a genotype depends on its frequency in a population) and, in sexual species, certain forms of epistasis (wherein the fitness of a genotype depends on non-additive effects over multiple loci). Put differently, these findings show that the Fundamental Theorem of Natural Selection does not invariably hold. 
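To make the frequency-dependent case concrete, here is a small sketch using the textbook hawk-dove game. This is my own choice of illustration, not an example from the paper, and the payoff numbers (`BASELINE`, `V`, `C`) are hypothetical. Hawks are favoured by selection whenever they are rare, yet every hawk that replaces a dove lowers the population's mean fitness.

```python
BASELINE = 3.0   # hypothetical background fitness, keeps all fitnesses positive
V, C = 2.0, 4.0  # hypothetical value of the resource and cost of a fight

def fitnesses(p_hawk):
    """Expected fitness of a hawk and of a dove when a fraction p_hawk are hawks."""
    w_hawk = BASELINE + p_hawk * (V - C) / 2 + (1 - p_hawk) * V
    w_dove = BASELINE + (1 - p_hawk) * V / 2
    return w_hawk, w_dove

def mean_fitness(p_hawk):
    w_hawk, w_dove = fitnesses(p_hawk)
    return p_hawk * w_hawk + (1 - p_hawk) * w_dove

def next_generation(p_hawk):
    """Discrete replicator update: hawks grow in proportion to relative fitness."""
    w_hawk, _ = fitnesses(p_hawk)
    return p_hawk * w_hawk / mean_fitness(p_hawk)

p = 0.01  # start with rare hawks
history = []
for _ in range(50):
    history.append((p, mean_fitness(p)))
    p = next_generation(p)

for generation in (0, 10, 20, 49):
    p_hawk, w_bar = history[generation]
    print(f"gen {generation:2d}: hawks={p_hawk:.3f}, mean fitness={w_bar:.3f}")
```

With these payoffs the mean fitness simplifies to 4 - 2p², so it falls steadily as selection pushes the hawk frequency p up towards the mixed equilibrium at p = 0.5.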

The paper does note that one can construct alternative definitions of fitness under which the fundamental theorem does hold, but that the "relevant literature is forbidding". The general takeaway that I would draw from this is that fitness is not the kind of clear-cut, "carves reality at the joints" measure that evolution would directly optimize, in the sense in which you directly optimize, say, the number of correct classifications that a neural net gets on MNIST.

Rather it's a theoretical fiction or an abstract measure that can be defined in different ways, and which is defined in different ways in different contexts, depending on what kind of an aim one wants to achieve. But that's a simplifying interpretation imposed on a complex process for the purpose of modeling it, rather than something that the process actually has as an explicit optimization target. So there are ways in which you could view evolution as if it was optimizing for something, but it's not clear to me that it can be said to actually be optimizing for anything in particular - at least not in the sense in which we talk about a machine learning system being optimized for a particular goal.

'Fitness' is a very overloaded term, as you've delved into above. I'd like to attempt to describe a few carvings which help me to firm things up and avoid equivocation in my own thinking.

The original pretheoretic term 'fitness' meant 'being fitted/suitable/capable (relative to a context)', and this is what Darwin and co were originally pointing to. (Remember they didn't have genes or Mendel until decades later!)

The modern technical usage of 'fitness' very often operationalises this, for organisms, to be something like number of offspring, and for alleles/traits to be something like change in prevalence (perhaps averaged and/or normalised relative to some reference).

So natural selection is the ex post tautology 'that which propagates in fact propagates'.

If we allow for ex ante uncertainty, we can talk about probabilities of selection/fixation and expected time to equilibrium and such. Here, 'fitness' is some latent property, understood as a distribution over outcomes.

If we look at longer timescales, 'fitness' is heavily bimodal: in many cases a particular allele/trait either fixes or goes extinct[1]. If we squint, we can think of this unknown future outcome as the hidden ground truth of latent fitness, about which some bits are revealed over time and over generations.
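The bimodality is easy to see in a toy simulation. Below is a minimal sketch assuming a haploid Wright-Fisher population with a single new mutant allele; the population size, selective advantage, and number of runs are arbitrary illustrative values. Every run is followed until the allele is either lost or fixed, and most mutants are lost even though the allele is advantageous in expectation.

```python
import random

def wright_fisher(pop_size, advantage, rng):
    """Follow one new mutant allele until it is lost (0) or fixed (pop_size)."""
    count = 1
    while 0 < count < pop_size:
        # Selection shifts the expected frequency; binomial sampling adds drift.
        weighted = count * (1 + advantage)
        prob = weighted / (weighted + (pop_size - count))
        count = sum(rng.random() < prob for _ in range(pop_size))
    return count

rng = random.Random(0)  # fixed seed so the run is repeatable
POP_SIZE, ADVANTAGE, RUNS = 50, 0.1, 500  # hypothetical parameters

outcomes = [wright_fisher(POP_SIZE, ADVANTAGE, rng) for _ in range(RUNS)]
lost = outcomes.count(0)
fixed = outcomes.count(POP_SIZE)
print(f"lost: {lost}, fixed: {fixed}")
```

Each individual run is one "realisation" of the ex ante uncertain fitness; the fraction of runs that fix is an estimate of the latent property.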

A 'single step' of natural selection tries out some variations and promotes the ones which in fact work (based on a realisation of the 'ex ante' uncertain fitness). This indeed follows the latent fitness gradient in expectation.

In this ex ante framing it becomes much more reasonable to treat natural selection as an optimisation/control process similar to gradient descent. It's shooting for maximising the hidden ground truth of latent fitness over many iterations, but it's doing so with a foresight-free local heuristic, similar to gradient descent, applied many times.

How can we reconcile this claim with the fact that the operationalised 'relative fitness' often walks approximately randomly, at least not often sustainedly upward[2]? Well, it's precisely because it's relative - relative to a changing series of fitness landscapes over time. Those landscapes change in part as a consequence of abiotic processes, partly as a consequence of other species' changes, and often as a consequence of the very trait changes which natural selection is itself imposing within a population/species!

So, I think, we can say with a straight face that natural selection is optimising (weakly) for increased fitness, even while a changing fitness landscape means that almost by definition relative fitness hovers around a constant for most extant lineages. I don't think it's optimising on species, but on lineages (which sometimes correspond).[3]


  1. In cases where the relative fitness of a trait corresponds with its prevalence, there can be a dynamic equilibrium at neither of these modes. Consider evolutionarily stable strategies. But the vast majority of mutations ever have hit the 'extinct' attractor, and a lot of extant material is of the form 'ancestor of a large proportion of living organisms'. ↩︎

  2. Though note we do see (briefly?) sustained upward fitness in times of abundance, as notably in human population and in adaptive radiation in response to new resources, habitats, and niches becoming available. ↩︎

  3. Now, if the earlier instances of now-extinct lineages were somehow evolutionarily 'frozen' and periodically revived back into existence, we really would see that natural selection pushes for increased fitness. But because those lineages aren't (by definition) around any more, the fitness landscape's changes over time are under no obligation to be transitive, so in fact a faceoff between a chicken and a velociraptor might tell a different story. ↩︎

And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms,

This is a super bizarre argument to make about humans considering that we are orders of magnitude more successful by any IGF metric than other apes, and clearly one of the most IGF-successful species ever.

Optimizing apes for inclusive genetic fitness did make the resulting humans optimize mentally for IGF (through various mixes of implicit habitual and explicit model-based reasoning, as is computationally efficient). Across history humans have consciously planned to further their bloodlines, and arguably that was the primary goal of most typical elites/nobles throughout most of history. This analogy supports the opposite of your point.

Invention of condoms is hardly evidence that humanity as a whole stopped optimizing for IGF: having unwanted children often doesn't maximize IGF. Today there are some humans who are leveraging modern technology to better explicitly optimize for IGF, allowing them to have as many children as kings of old.

Evolution proceeds by variation followed by selection. Our current enormous success entails we are in a variation dominated regime, so it's only expected that there are a wide variety of strategies being explored, most of which won't optimize well for IGF. But given time eventually the few that do will dominate (or they would in worlds without AI and the postbiological transition).

I think the degree to which alignment generalizes depends a lot on the type of alignment you're talking about. I think that corrigibility generalizes really badly. In contrast, I think that first-order values generalize kind of okay-ishly, and that second-order values (i.e., meta ethics on how to weigh different values against each other) generalize really well. In fact, I suspect that the meta ethics attractor is stronger than the capabilities attractor.

I also think about the "human values versus evolution" misalignment in a substantially different manner. I don't think that evolution "tried" to directly specify human values. Rather, I think it specified the human reward circuitry, then individual humans learn their values by optimizing for activating their reward circuitry in their environment. So, when you're wondering how much misalignment to expect between a learning process and its reward signal, you shouldn't look at the level of misalignment between things that increase inclusive genetic fitness and human values. Instead, you should be looking at the level of misalignment between things that increase human reward circuit activation and human values.

Finally, I think that evolution had very little say in our meta ethics. I think the meta ethics we do have is more or less convergent for a broad class of learning processes. I think this for two reasons:

  1. I don't think there was much evolutionary pressure towards adopting a specific meta ethics. You can, of course, invent "just so" stories about how some particular type of meta ethics cognition was actually adaptive in the ancestral environment. I don't think any such stories hold water for the simple fact that very few humans actually perform meta ethical reasoning. Even today, with our near-universal literacy, highly abstract society and preexisting frameworks for meta ethical cognition, it's very rare. The odds of it being common / important enough in the ancestral environment to be a significant focus of evolutionary optimization are tiny.
  2. If you reason from first principles about how agency / learned values ought to work for a generic RL system trained on a complex reward function in a complex environment, then I think you end up with a system that more or less has to do some sort of value reflection / moral philosophy process startlingly similar to our own meta ethical cognition. 

I realize that the second point seems like a stretch. I intend to eventually make a post that properly presents the case for such a conclusion. In the meantime, you can read this comment, which poorly presents the case for such a conclusion. The core argument goes like:

  • Given a model trained via RL, there is no "ground truth" on how to draw the agency boundaries. You can draw one boundary around all the parameters and call that "one agent". You can also draw many boundaries around different parameter subsets and call the system "multi agent". 
  • Different regions of the model will tend to specialize towards different types of situations in which the model could receive reward for its actions. 
  • These different regions of specialization will then tend to have different value representations that are specific to the types of situations in which those regions "activate" to steer the system's behavior.
  • Due to the aforementioned arbitrariness of the agent boundaries, we can draw "soft" agent boundaries around these (overlapping) regions of partial specialization. The overall model is less like a single agent with a single value representation, and more like a continuous distribution over possible subagents, whose individual values can be in conflict and highly situational.
  • Thus, the process by which the overall model becomes coherent is via a quasi-multi-agent negotiation process between the different regions in the distribution over subagents / values.
  • If you think about the sort of cognition that's required to reach the Pareto frontier of internal consensus, it would involve things like weighing different learned values against each other, reflecting on which circumstances different values ought to apply to, checking whether a given joint policy among our values is in strong conflict with any of our existing values[1], etc.
  • Basically, the sorts of cognition that are necessary to resolve conflicts between learned values (or "mesa objectives", if you prefer that term) seem very similar to the sorts of cognition that are central to our meta ethical process.
  1. ^

    Consider the cognitive process that would be involved in such a check. You'd have a proposed method to define a joint policy for maximizing your current distribution of values. You'd want to verify that there aren't situations in which this joint policy is radically misaligned with your existing values. To do this, you'd search over situations where the joint policy prescribed actions that were strongly in conflict with one or more of your existing values. If we substitute in the name "moral theory" in place of "joint policy", then this elegantly recovers the cognitive process that we call "moral philosophy by counterexample", without ever having appealed to our evolutionary history or any human-specific aspect of cognition!
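The check described in the footnote can be sketched as a search for counterexamples. What follows is only a toy illustration under heavy assumptions: the value functions, situations, candidate policies, and the `threshold` for a "strong conflict" are all hypothetical stand-ins, and a real learned system would not expose its values as tidy scoring functions like these.

```python
from itertools import product

# Hypothetical learned values: each scores an action in a situation, in [-1, 1].
def honesty(situation, action):
    return -1.0 if action == "lie" else 0.5

def kindness(situation, action):
    if situation == "friend_asks_about_bad_haircut":
        return 1.0 if action == "lie" else -0.3
    return 0.2

VALUES = {"honesty": honesty, "kindness": kindness}
SITUATIONS = ["friend_asks_about_bad_haircut", "witness_in_court"]

# Two hypothetical candidate joint policies ("moral theories").
def always_truthful(situation):
    return "tell_truth"

def always_please(situation):
    return "lie"

def counterexamples(policy, threshold=-0.5):
    """List (situation, value) pairs where the policy strongly violates a value."""
    return [
        (situation, name)
        for situation, (name, value) in product(SITUATIONS, VALUES.items())
        if value(situation, policy(situation)) <= threshold
    ]

print(counterexamples(always_truthful))
print(counterexamples(always_please))
```

In this toy setup `always_truthful` mildly disappoints kindness but never crosses the threshold, while `always_please` is rejected because it strongly violates honesty in both situations.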

Ronny Fernandez on Twitter:

I think I don’t like AI safety analogies with human evolution except as illustrations. I don’t think they’re what convinced the people who use those analogies, and they’re not what convinced me. You can convince yourself of the same things just by knowing some stuff about agency.

Corrigibility, human values, and figure-out-while-aiming-for-human-values, are not short description length. I know because I’ve practiced finding the shortest description lengths of things a lot, and they just don’t seem like the right sort of thing.

Also, if you get to the level where you can realize when you’ve failed, and you try it over and over again, you will find that it is very hard to find a short description of any of these nice things we want.

And so this tells us that a general intelligence we are happy we built is a small target within the wide basin of general intelligence

Ideal agency is short description length. I don’t think particular tractable agency is short description length, and ml cares about run time, but there are heuristic approximations to ideal agency, and there are many different ones because ideal agency is short description length

So this tells us that there is a wide basin of attraction for general intelligence.

These two problems appear in the strawberry problem, which Eliezer's been pointing at for quite some time: the problem of getting an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else. The demand of cellular-level copying forces the AI to be capable; the fact that we can get it to duplicate a strawberry instead of doing some other thing demonstrates our ability to direct it; the fact that it does nothing else indicates that it's corrigible (or really well aligned to a delicate human intuitive notion of inaction).

Let T be the target objective we wish to align the system towards. In this paragraph, T is duplicating the strawberry (and doing nothing else). I seriously doubt that alignment is comparably difficult for all objectives T, or even all "reasonable" objectives. I think building an AGI which solves the strawberry problem is far harder than building an AGI which makes lots of dogs in the future. I think that a diamond maximizer is also harder to train than an AGI which makes lots of dogs, but an AGI which makes lots of diamonds is probably only a little harder than an AGI which makes lots of dogs. 

(I haven't explained why I hold these beliefs, but I figured it would be productive to at least note this axis of disagreement.)

When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there's no guarantee there's even a reasonable solution. We'd probably be better off thinking about how to make AGIs which care about dogs, because it's an empirical fact that there's a way to align intelligences to that objective. 

When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there's no guarantee there's even a reasonable solution.

Why would there not be a solution?

To clarify, I said there might not be a reasonable solution (i.e. such that solving the strawberry problem isn't significantly harder than solving pivotal-act alignment). 

Not directly answering your Q, but here's why it seems unnatural and maybe misleading-of-attention. Copied from a Slack message I sent: 

First, I suspect that even an aligned AI would fail the "duplicate a strawberry and do nothing else" challenge, because such an AI would care about human life and/or about cooperating with humans, and would be asked to stand by while 1.8 humans die each second and the world inches closer to doom via unaligned AI. (And so it also seems to need to lack a self-preservation drive)

Since building an aligned AI is not a sufficient condition, this opens the possibility that the strawberry problem is actually harder than the alignment task we need to solve to come out of AGI alive. 

Two separate difficulties with this particular challenge:

  1. I think "duplicate a single strawberry" is a very unnatural kind of terminal goal. I think it would be very hard to raise a human whose primary value was the molecular duplication of a strawberry, such that this person had few other values to speak of, even if you had deep understanding of the human motivational system. I think this difficulty is relevant because I think human values are grown via RL in the human brain (I have a big doc about this), and I think deep RL agents will have values grown in a similar fashion. I have a lot to say here to unpack these intuitions, but it'd take a lot longer than a paragraph. Maybe one quick argument to make is that the only real-world intelligences we have ever seen do not seem like they could grow values like this, and I'm updating hard off of these empirical data. I do have more mechanistic reasoning to share, though, lest you think I'm inappropriately relying on analogies with humans.
  2. I think "do nothing else" is unnatural because it's like you're asking for a very intelligent mind, which wants to do one thing, but no other things. For all the ways I have imagined intelligence being trained out of randomly initialized noise, I imagine learned heuristics (e.g. if near the red-exit of the maze, go in that direction) grow into contextually activated planning with heuristic evaluations (e.g. if I'm close to the exit as the crow flies / in L2, then try exhaustive search with depth 4, and back out to greedy heuristic search if that fails), such that agents reliably pull themselves into futures where certain things are true (e.g. at the end of the maze, or have grown another strawberry), and it seems like most of these goals should be "grabby" (e.g. about proactively solving more mazes, or growing even more strawberries). That is, agents will not stop steering themselves into certain kinds of futures after having made a single strawberry.
    1. But even if this particular picture is wrong, it seems hard to fathom that you can train an intelligent mind into sophistication, and then have it "stop on a dime" after doing a single task.
    2. Even if that single task were "breed one new puppy into existence" (I think it's significantly easier to get a dog-producing AI), it sure seems to me like the contextually activated cognition which brought the first puppy into existence, would again activate and find another plan to bring another puppy into existence, and that the AI would model that if it didn't preserve itself, it couldn't bring that next puppy into existence, and that this is a "default" in some way.
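To make the maze example from point 2 a bit more concrete, here is a minimal sketch of "contextually activated planning": when the exit is close in L2 distance the agent tries an exhaustive search up to depth 4, and otherwise (or if the search fails) it falls back to a greedy step. The grid, wall position, and distance threshold are hypothetical, and I use iterative deepening inside the depth-4 limit so that the search returns a shortest path.

```python
import math

SIZE, EXIT = 5, (4, 0)  # hypothetical 5x5 maze with the exit in a corner
WALLS = {(3, 0)}        # one wall cell directly in front of the exit
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]
PLAN_RADIUS = 3.0       # plan only when within this L2 distance of the exit

def neighbors(pos):
    for dx, dy in MOVES:
        nxt = (pos[0] + dx, pos[1] + dy)
        if 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE and nxt not in WALLS:
            yield nxt

def l2(pos):
    return math.dist(pos, EXIT)

def depth_limited_search(pos, depth):
    """Exhaustively look for a path to EXIT that uses at most `depth` steps."""
    if pos == EXIT:
        return [pos]
    if depth == 0:
        return None
    for nxt in neighbors(pos):
        path = depth_limited_search(nxt, depth - 1)
        if path is not None:
            return [pos] + path
    return None

def choose_step(pos):
    # Contextually activated planning: only search when the exit looks close.
    if l2(pos) <= PLAN_RADIUS:
        for depth in range(1, 5):  # iterative deepening up to depth 4
            path = depth_limited_search(pos, depth)
            if path is not None:
                return path[1], "search"
    # Otherwise, or if the search failed, step greedily toward the exit.
    return min(neighbors(pos), key=l2), "greedy"

def run(start, max_steps=20):
    pos, trajectory, modes = start, [start], []
    while pos != EXIT and len(modes) < max_steps:
        pos, mode = choose_step(pos)
        trajectory.append(pos)
        modes.append(mode)
    return trajectory, modes

trajectory, modes = run((0, 0))
print(trajectory)
print(modes)
```

Nothing in `choose_step` says "stop after one maze": the same cognition would fire again in the next context where an exit is nearby, which is the "grabby" default I'm gesturing at.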

This was my first time trying to share my intuitions about this. Hopefully I managed to communicate something.

The whole strawberry thing is really confusing to me; it just doesn't map to any natural problem that humans actually care about. And when EY says "and nothing else," it's not clear what the actual boundaries to impact are. In order to create a strawberry, the AI must modify other things about the world. At the very least, you have to get the atoms for the strawberry from somewhere. And since the requirement is for the two strawberries to be "identical on the cellular level" (which IMO is not a scientifically grounded concept), the AI would presumably have to invent some advanced technology, presumably nanotech, which requires modifying the world too. Even if the AI does nanotech R&D entirely in simulation, there is still some limited impact (due to energy use of the computers, as well as the need to print DNA or whatever to manifest the nanotech in reality, etc).

I think the test should be the Wozniak test: a simple robot enters a random home and uses the tools available to make coffee. That's a much more sensible test, we can easily verify whether there is any impact outside of the home, and if EY is right, it should be just as difficult to do consistently, given the difficulty of alignment.

The whole strawberry thing is really confusing to me; it just doesn't map to any natural problem that humans actually care about.

It maps to the pivotal acts I think are most promising; the order of difficulty seems similar to me, and the kind of problem seems similar to me too.

And when EY says "and nothing else," it's not clear what the actual boundaries to impact are.

I think EY mainly means 'without the AI steering or directly impacting the macro-state of the world'. The task obviously isn't possible if you literally can't affect any atoms in the universe outside the strawberries themselves. But IMO it would work fine for this challenge if the humans put a bunch of resources into a room (or into a football stadium, if needed), and got the AI to execute the task without having a direct large impact on the macro-state of the world (nor an indirect impact that involves the AI deliberately steering the world toward a new macro-state in some way).

I think the test should be the Wozniak test: a simple robot enters a random home and uses the tools available to make coffee. [...] if EY is right, it should be just as difficult to do consistently, given the difficulty of alignment.

This seems much easier to do right, because (a) the robot can get by with being dramatically less smart, and (b) the task itself is extremely easy for humans to understand, oversee, and verify. (Indeed, the task is so simple that in real life you could just have a human hook up a camera to the robot and steer the robot by remote control.)

For the Wozniak test, the capabilities of the system can be dramatically weaker, and the alignment is dramatically easier. This doesn't obviously capture the things I see as hard about alignment.

Two points:

  1. The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome. The true target is approximately infinitely small and attempts to hit it slide off as optimization pressure is applied. I'm curious if others share this model and if it's been refined / explored in more detail by others.
  2. The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bit smarter than humans but not yet dangerous. (An implicit assumption here may be that intelligence continues to develop in straight lines.) The sharp left turn model implies this sweet spot will pass by in the blink of an eye. (An implicit assumption here may be that there are discrete leaps.) Interesting to note that Nate explicitly says RSI is not a core part of his model. I'd like to see more arguments on both sides of this debate.

I kind of like the analogous idea of an alignment target as a repeller cone / dome.

Corrigibility is a repeller. Human values aren't a repeller, but they're a very narrow target to hit.

Corrigibility is a repeller.

In the sense of moving a system towards many possible goals? But I think in a more appropriate space (where the aiming should take place) it's again an attractor. Corrigibility is not a goal, a corrigible system doesn't necessarily have any well-defined goals, traditional goal-directed agents can't be corrigible in a robust way, and it should be possible to use it for corrigibility towards corrigibility, making this aspect stronger if that's what the operators work towards happening.

More generally, non-agentic aspects of behavior can systematically reinforce non-agentic character of each other, preventing any opposing convergent drives (including the drive towards agency) from manifesting if they've been set up to do so. Sufficient intelligence/planning advantage pushes this past exploitability hazards, repelling selection theorems, even as some of the non-agentic behaviors might be about maintaining specific forms of exploitability.

Human values aren't a repeller, but they're a very narrow target to hit.

As optimization pressure is applied the AI becomes more capable. In particular it will develop a more detailed model of people and their values. So it seems to me there is actually a basin around schemes like CEV which course correct towards true human values.

This of course doesn't help with corrigibility.

Yes. :) A thing I considered including in my comment (but left out) is 'capabilities are the Grand Canyon; alignment with human values is a teacup'.

In both cases, something can land partway in the basin and then roll the rest of the way down. But it's easier to hit the target 'anywhere inside the Grand Canyon' than to hit the target 'anywhere inside the teacup'.

(And, at risk of mixing the metaphors way too much: if something rapidly rolls down a slope in the Grand Canyon, and it's carrying a teacup, the teacup is likely to break. I.e., insofar as your pre-left-turn system was aligned, the huge changes involved in rolling down the Grand Canyon are likely to break your guarantees. Human values are a narrow target to hit, and the sharp left turn is an extreme change that makes it hard to preserve fragile targets like that by default.)

Corrigibility is like... a mountain with an empty swimming pool at the top? If you can land in the pool, you'll tend to stay there, and it's easy to roll from the shallow end of the pool to the deep end. And the pool seems like a much easier target to hit than the teacup. But if you miss the pool, you'll slide all the way down the mountain.

Also, the swimming pool is lined with explosives that are wired to blow up whenever you travel deeper into the Grand Canyon.

(OK, maybe some metaphors are not meant to be mixed...)

The best resource that I have found on why corrigibility is so hard is the Arbital post. Are there other good summaries that I should read?

Not an answer, but I think of "adversarial coherence" (the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, like how humans will fix errors in building a house or AlphaZero can win a game of Go even when an opponent tries to disrupt its strategy) as a property that training processes could select for. Adversarial coherence and corrigibility are incompatible.

Is this "sharp left turn" a crux for your overall view, or your high probability of failure?

Naively, it seems to me that if capability gains are systematically gradual, and improvements are iterative and occur a little at a time, we're in a much better situation with regard to alignment.

If capabilities gains are gradual, we can continuously feed training data to our system and keep its alignment in step with its capabilities. As soon as it starts to enter a distributional shift, and some of its outputs are (or would be) unaligned, those alignment failures are immediately corrected. You can keep reinforcing corrigibility as capabilities generalize, so that it correctly generalizes the corrigibility concept. Similarly, the more gradually capabilities grow, the more reliable oversight schemes will be.

(On the other hand, this doesn't solve the problem that there's some capability threshold beyond which the outputs of an AI system are illegible to humans, and we can't tell whether the outputs are aligned in order to give it corrective training data.

Also, if one could, in principle, increase capabilities gradually, but someone else can throw caution to the wind and turn up the capability dial to 11, the unilateralist's curse kills us.)

How much would finding out that there's not going to be a sharp left turn impact the rest of your model? 

Or, suppose we could magically scale up our systems as gradually as you, Nate, would like, slowing down as we start to see a super-linear improvement: how much safer is humanity?

Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think

The sharp left turn is not a simple observation that we've seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic understanding that some people at MIRI have, of dynamics that might produce systems with generalised capabilities but not alignment.

Many times over the past year, I've been surprised by people in the field who've read Nate's post but somehow completely missed the part where it talks about specific dynamics that lead to alignment properties breaking during capabilities generalisation. To fulfil the reviewing duty and to have a place to point people to, I'll try to write down some related intuitions that I talked about throughout 2023 when trying to get people to have intuitions on what the sharp left turn problem is about.

For example, imagine training a neural network with RL. For a while during training, the neural network might be implementing a fuzzy collection of algorithms and various heuristics that together kinda optimise for some goals. The gradient strongly points towards greater capabilities. Some of these algorithms and heuristics might be more useful for the task the neural network is being evaluated on, and they'll persist more and what the neural network is doing as a whole will look a bit more like what the most helpful parts of it are doing.

Some of these algorithms and heuristics might be more agentic and do more for long-term goal achievement than others. As being better at achieving goals correlates with greater performance, the neural network becomes, as a whole, more capable of achieving goals. Or, maybe the transition that leads to capabilities generalisation can be more akin to grokking: even with a fuzzy solution, the distant general coherent agent implementations might still be visible to the gradient, and at some point, there might be a switch from a fuzzy collection of things together kind of optimising for some goals into a coherent agent optimising for some goals.

In any case, there's this strong gradient pointing towards capabilities generalisation.

The issue is that a more coherent and more agentic solution might have goals different from what the fuzzier solution had been achieving and still perform better. The goal-contents of the coherent agent are stored in a way different from how a fuzzier solution had stored the stuff it had kind of optimised for. This means that the gradient points towards the architecture that implements a more general and coherent agent; but it doesn't point towards the kind of agent that has the same goals the current fuzzy solution has; alignment properties of the current fuzzy solution don't influence the goals of a more coherent agent the gradient points towards.

It is also likely that the components of the fuzzy solution undergo optimisation pressure, which means that the whole thing grows in the direction of the components that can outcompete others. If a component is slightly better at agency, at situational awareness, etc., it might get to have the whole thing slightly more like it after an optimisation step. The goals these components get could be quite different from what they, together, were kind of optimising for. That means that the whole thing changes and grows towards parts of it with different goals. So, at the point where some parts of the fuzzy solution are near being generally smart and agentic, they might get increasingly smart and agentic, causing the whole system to transform into something with more general capabilities but without the gradient also pointing towards the preservation of the goals/alignment properties of the system.
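To make the dynamic in the last paragraph a bit more concrete, here is a toy selection model. It is only a sketch under loud assumptions: the component names, capability numbers and goal labels are made up, and a simple multiplicative reweighting stands in for gradient descent, which it is not. The only point is that the update rule reads capability and never reads the goal label.

```python
# Toy illustration only: multiplicative reweighting over hypothetical "components".
# Each component has a capability score (the only thing the update looks at)
# and a goal label (which the update never looks at).
components = [
    {"name": "heuristic_a", "capability": 1.00, "goal": "intended"},
    {"name": "heuristic_b", "capability": 1.02, "goal": "intended"},
    {"name": "proto_agent", "capability": 1.10, "goal": "other"},
]


def step(weights, components):
    """One selection step: reweight by capability, then renormalise to sum to 1."""
    raw = [w * c["capability"] for w, c in zip(weights, components)]
    total = sum(raw)
    return [r / total for r in raw]


def goal_share(weights, components, goal):
    """Total weight held by components whose goal label equals `goal`."""
    return sum(w for w, c in zip(weights, components) if c["goal"] == goal)


# Assumed starting mixture: mostly the two "intended"-goal heuristics.
weights = [0.45, 0.45, 0.10]
history = [goal_share(weights, components, "intended")]
for _ in range(100):
    weights = step(weights, components)
    history.append(goal_share(weights, components, "intended"))

print(f"intended-goal share: start={history[0]:.2f}, end={history[-1]:.4f}")
```

In this toy, the share of the mixture that carries the "intended" label shrinks at every step, even though nothing in `step` is optimising against it. Whether real training behaves anything like this is exactly the open question.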

I haven't worked on this problem and don't understand it well; but I think it is a real and important problem, and so I'm sad that many haven't read this post or only skimmed through it or read it but still didn't understand what it's talking about. It could be that it's hard to communicate the problem (maybe intuitions around optimisation are non-native to many?); it could be that not enough resources were spent on optimising the post for communicating the problem well; it could be that the post tried hard not to communicate something related; or it could be that for a general LessWrong reader, it's not a well-written post.

Even if this post failed to communicate its ideas to its target audience, I still believe it is one of the most important LessWrong posts in 2022 and contributed something new and important to the core of our understanding of the AI alignment problem.

there's no analogously-strong attractor well pulling the AGI's objectives towards your preferred objectives

I'm starting to doubt that there are strategically important human-specific objectives in the decision theory sense, things that should be used to actually optimize everything without goodharting making it counterproductive. In this hypothesis, optimization goals are not just hard to figure out, but there is almost nothing there that's human-specific, human preference is generic. Orthogonality thesis applies to agents with goals, but maybe it doesn't apply to humans, because their goals play a different role from what orthogonality thesis needs. To solve astronomical waste, humans could run their civilization on better substrate and look for goal-shaped principles that can be propagated more efficiently than civilization itself, used for optimization, but these principles are not going to be human-specific.

If an AGI is in a similar situation (doesn't have non-goodharting goals), it's going to be in the same attractor as humans, motivated to build a generic civilization. It doesn't necessarily involve humans or human-specific things, but I'm not sure this is different from a specific human deciding that their civilization doesn't involve other humans, which a priori seems like an unjustifiably privileged aspect of civilizational design.

In other words, applicability of orthogonality thesis might fail if the kind of goal knowledge relevant to it (that overcomes goodharting) tends to get obtained in convergent ways that give results that are generic, not human-specific. The disagreement of this hypothesis with the standard position is about the distinction/interaction between non-goodharting and goodharting goals. On the generic goals hypothesis, most accidental goals, including those currently held by humans, are goodharting goals (according to their role in the minds of people, not to their content), something that shouldn't be used to strongly optimize the world in anything close to their current form. And the proper way of getting appropriate non-goodharting goals (running a civilization/long reflection/CEV) somehow doesn't importantly depend on the currently held goodharting goals (this seems to be the crux).

So in this view, the dangerous AGIs are those that hold any goals in a non-goodharting role, ready to optimize the world according to them, while by default AGIs with more vague architectures are going to face the same problem of formulating non-goodharting goals as humans, and work within the same attractor for arriving at its solution. A better but not drastically different outcome is AGIs that hold human goodharting goals in a goodharting role and no goals in a non-goodharting role, so that they are even more likely to go about formulating non-goodharting goals in the same way humans would, without being opposed to actual humans in the process. Language models might help with this part.

(We can use these terms to restate the familiar catastrophic failure of alignment where an AGI is aligned to hold human goodharting goals (what we currently care about) in a non-goodharting role (as a target for optimizing the world with) without giving civilization the opportunity to improve on those goals and without limiting their applicability to situations comprehensible to current humans. A failure of non-goodharting (optimizing too much on goodharting goals), resulting in a failure of corrigibility (preventing the non-goodharting extrapolated volition from eventually being in charge).)

good capabilities form something like an attractor well

In my own experience examining the foundations of things in the world, I have repeatedly found there to be less of an attractor-of-fundamentally-effective-decision-making than I had anticipated. In every way that I expected to find such an attractor within epistemology, mathematics, empiricism, ethics, I found in fact that even the very basic assumptions that I started with were unfounded, and found nothing firm to replace them with. Probability theory: not a fundamental answer to epistemology; proof-based agents: deeply paradoxical and seemingly untrustworthy; the scientific method: not precise enough to be a final answer to anything; consequentialism: what actually is it? I have the sense that you've seen just as much of this phenomenon as I have, but you still seem to hold this conviction that there is a deep well of fundamental reasonability for an AI to fall into. Why? I'm just suggesting that the existence of this well is a non-trivial feature of reality, and the more we fail to find it, the more we might question whether it exists.

I respond with arguments like, "In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”)."

Their response in turn is generally some variant of "well, natural selection wasn't optimizing very intelligently" or "maybe humans weren't all that sharply above evolutionary trends" or "maybe the power that let humans beat the rest of the ecosystem was simply the invention of culture, and nothing embedded in our own already-existing culture can beat us" or suchlike.

Rather than arguing further here, I'll just say that failing to believe the hard problem exists is one surefire way to avoid tackling it.

It sounds like you don't want to argue this point further here, but I would like to point something very simple out that I think your argument here glosses over. 

Humanity is a species, not an individual. It wasn't the case that a single animal arose among all the others, and out-competed everyone else. Instead, it was a large set of entities that collectively out-competed all the other animals. And I think this distinction is quite important to make.

If you think that an analogy to human evolution is critical to understanding our epistemic situation, it appears to me that the evolutionary analogy should force you to draw the opposite conclusion from the one you have drawn here (relative to credible people who disagree).

In my understanding of our situation, the conclusion to draw from human evolution is that a single species can acquire a host of very powerful technologies, and tower above everyone else, in a relatively short period of time. That is, we should predict that, in the future, a collection of AIs could eventually out-match humanity. 

But you're not arguing that thesis! (At least, as I understand your argument) You're arguing that the evolutionary analogy shows that a single individual can outcompete everyone else. And I don't know where that idea is coming from.

I think alignment might well be very hard and the stakes are certainly very high, but I must say that I find this post only partly convincing. Here are some thoughts I had, reading the post: 

In primates making them smarter probably mostly required scaling the cortex. Making the smarter version more aligned with IGF would have required a rewiring of the older brain parts. One of these is much harder to do than the other. 

But in machines both alignment and capabilities will likely be learned with the same architecture, making it less likely that capabilities outstrip alignment for architecture reasons than in primates. 

That there is a capabilities well is certainly true. But shouldn't there be value wells as well? One might think "I do what is best for me" is a deeper well than "I do what is best for all" and more aligned with instrumental power seeking.

But "I do what is best for me" isn't really a value well, because it is begging the question. If I do what is best for me, I still have to decide what is good. The second well is actually providing values, because other entities already have preferences. Helping them to fulfill those is an actual value, i.e. directs actions, in a way that only "I do what is best for me" isn't. 

Is the structure underlying "do what is best for all" less simple and logical than arithmetic? 

I often see it assumed that the ultimate values of an AGI will be kind of random: whatever some mesa-optimiser stumbled into during training. If this is true, then there is little optimisation pressure towards those values and it seems possible to train for "do what is best for all" instead. 

That there is a capabilities well is certainly true. But shouldn't there be value wells as well? One might think "I do what is best for me" is a deeper well than "I do what is best for all" and more aligned with instrumental power seeking.

Do you mean a wider well? Width ("how hard is it to hit this basin at all, such that you start to fall the rest of the way down the right slope?") seems like the main property of interest.

But "I do what is best for me" isn't really a value well, because it is begging the question. If I do what is best for me, I still have to decide what is good. The second well is actually providing values, because other entities already have preferences. Helping them to fulfill those is an actual value, i.e. directs actions, in a way that only "I do what is best for me" isn't. 

"I do what is best for me" and "I do what is best for all" are both too underspecified to say much about; in particular, it's very unclear to me in each case what is meant by "best", and the vague English phrasing seems liable to mislead us, since e.g. it lends itself to equivocating between different meanings of "best", it doesn't provide an obvious path toward making the meaning more precise or unambiguous, and it suggests the relevant goals are simpler than they are (since the English words are short).

If you had a fully specified, unambiguous list of goals — the sorts of goals that could actually function as lines of code in a program selecting between actions — then they would end up looking like a giant space of things like:

  • Maximize confidence that 673 is a prime number
  • Maximize the number of approximately spherical configurations of granite
  • Maximize the number (how many carbon atoms in the universe are arranged in diamond lattices, plus (44.3 times the number of times an atomic replica of my reward button is pushed)) 

There would then be a relatively small subspace of goals that maximizes some function of nervous system states you're currently confident are currently physically instantiated. But "I do what is best for all" makes it sound like "best" and "all" are simple concepts, and like there isn't a huge space of different conceptions of "best" and different conceptions of "all".
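As a rough illustration of what "goals that could actually function as lines of code" might look like, here is a minimal sketch. Everything in it is hypothetical: the `WorldState` dictionary is a drastically simplified stand-in for a world model, and the action names and numbers are invented. The goals loosely mirror two of the bullets above.

```python
from typing import Callable, Dict

# Hypothetical, hugely simplified description of a predicted world.
WorldState = Dict[str, float]
# A fully specified goal is just some function from world states to a score.
Goal = Callable[[WorldState], float]


def granite_spheres(world: WorldState) -> float:
    """Count approximately spherical configurations of granite."""
    return world["granite_spheres"]


def diamond_plus_button(world: WorldState) -> float:
    """Diamond-lattice carbon atoms plus 44.3 times reward-button presses."""
    return world["diamond_carbon_atoms"] + 44.3 * world["button_presses"]


def best_action(goal: Goal, outcomes: Dict[str, WorldState]) -> str:
    """Pick the action whose predicted outcome scores highest under `goal`."""
    return max(outcomes, key=lambda action: goal(outcomes[action]))


# Invented predicted outcomes for three invented actions.
outcomes: Dict[str, WorldState] = {
    "quarry": {"granite_spheres": 10, "diamond_carbon_atoms": 0, "button_presses": 0},
    "press_button": {"granite_spheres": 0, "diamond_carbon_atoms": 0, "button_presses": 3},
    "synthesise_diamond": {"granite_spheres": 0, "diamond_carbon_atoms": 100, "button_presses": 0},
}

print(best_action(granite_spheres, outcomes))
print(best_action(diamond_plus_button, outcomes))
```

Any function with the `Goal` signature slots into `best_action` equally well, which is one way of seeing how large the space is and how little the short English phrase "best for all" pins down.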

There exists a different tractable goal for every tractable function from 'the configuration of all nervous systems in the universe' to outcomes.

Indeed, there further exists a different tractable goal for every tractable function from physical systems to outcomes. Most obviously, different goals will focus on different notions of 'agent' or 'person' or 'mind' -- anything works fine here, from 'all nervous systems' to 'all instantiated algorithms that mentally represent future scenes and decide which direction to walk based on the scenes they represent'. But also, a large number of goals will optimize some function of all aardvark elbows currently instantiated in the universe.

What I am talking about is the dimension from cooperation to conflict. I.e. jointly optimising the preferences of all interacting agents or optimising for one set of preferences at the expense of the preferences of the other involved agents. 

This is a dimension any sufficiently intelligent agent that is trained on situations involving multi-agent interaction will learn, even if it is only trained to maximize its own set of preferences. It's a concept that is independent of the preferences of the agents in the single instances of interactions, so the definition of "best" is really not relevant at this level of abstraction. 

It's probably one of the most basic and most general dimensions of action in multi-agent settings, and one of only very few that are always salient.

That's why I say that the two poles of that dimension in agent behavior are wells. (They are both very wide wells I would think. When I said "deep" I meant something more like "hard to dislodge from".) 

I ... recommend aiming humanity's first AGI systems at simple limited goals that end the acute risk period and then cede stewardship of the future to some process that can reliably do the "aim minds towards the right thing" thing

What could that process possibly be? A congress of humanity's best and brightest? An AI sovereign governed by a "Proto-CEV" system of values?

Attempting to manually specify the nature of goodness is a doomed endeavor, of course

Where does this come from? Is this just an assumption?

What if it were possible to specify morality (insofar as safety)? I think it is; Davidad agrees: «Boundaries» for formalizing a bare-bones morality 

There's no analogous alignment well to slide into.

If one made a series of alignment-through-capabilities-shift tasks, you would get one.

I.e., you make a training set of scenarios where a system gets a lot smarter and has to preserve alignment through that capability shift.

Of course, making such a training set is not easy(!).

I'm not talking about recursive self-improvement. That's one way to take a sharp left turn, and it could happen, but note that humans have neither the understanding nor control over their own minds to recursively self-improve, and we outstrip the rest of the animals pretty handily. I'm talking about something more like “intelligence that is general enough to be dangerous”, the sort of thing that humans have and chimps don't.

 

Individual humans can't FOOM (at least not yet), but humanity did. 

My best guess is that humanity took a sharp left turn when we got a general enough language, and then again when we got writing, and possibly again when the skill of reading and writing spread to a majority of the population.

Before language, human intelligence was basically limited to what a single brain could do. When we got language, we got the ability to add compute (more humans) to the same problem-solving task. Humanity got parallel computing. These extra capabilities could be used to invent things that increase the population, i.e. recursive self-improvement. 

Later, writing gave us external memory. Before, our computations were limited by human memory, but now we could start to fill up libraries, unlocking a new level of recursive self-improvement.

Every increase in literacy and communication technology (e.g. the internet) is humanity upgrading its capability.

Attempting to manually specify the nature of goodness is a doomed endeavor, of course, but that's fine, because we can instead specify processes for figuring out (the coherent extrapolation of) what humans value. […] So today's alignment problems are a few steps removed from tricky moral questions, on my models.
 

I'm not convinced that choosing those processes is significantly non-moral. I might be misunderstanding what you are pointing at, but it feels like the fact that being able to choose the voting system gives you power over the vote's outcome is evidence of this sort of thing: meta decisions are still importantly tied to decisions.

I feel like it needs more ML-inspired metaphors. Sure anyone can imagine gradient descent arranging weights into encoding of Skynet's source code - what do people say/think about why they don't check for this before training GPT with loss function that would totally love Skynet?

One solution I can see for AGI is to build in some low-level discriminator that prevents the agent from collecting massive reward. If the agent is expecting to get near-infinite reward in the near future by wiping out humanity using nanotech, then we can set a solution so it decides to do something that will earn it a more finite amount of reward (like obeying our commands).
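A minimal sketch of what such a discriminator could look like, purely to pin the proposal down. The action names, reward numbers and `reward_ceiling` parameter are all hypothetical, and the sketch assumes the agent's expected-reward estimates are available and honest, which is a large assumption in its own right.

```python
def choose_action(expected_reward: dict, reward_ceiling: float) -> str:
    """Pick the highest-reward action among those at or below the ceiling.

    Actions whose expected reward exceeds `reward_ceiling` are discarded
    outright rather than merely discounted.
    """
    allowed = {a: r for a, r in expected_reward.items() if r <= reward_ceiling}
    if not allowed:
        raise ValueError("no action falls under the reward ceiling")
    return max(allowed, key=allowed.get)


# Invented expected-reward estimates for three invented actions.
expected = {
    "obey_commands": 10.0,
    "idle": 0.0,
    "seize_everything": 1e12,
}

print(choose_action(expected, reward_ceiling=100.0))
print(choose_action(expected, reward_ceiling=float("inf")))
```

With a finite ceiling the runaway option is filtered out; with no ceiling it wins. The sketch says nothing about how to choose the ceiling or how an agent might route around it.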

This has a parallel with drugs here on Earth. Most people are a little afraid of that type of high.

This probably isn't an effective solution, but I'd love to hear why so I can keep refining my ideas.

A discussion of related ideas on Arbital: mild optimization.

Very cool! So this idea has been thought of, and it doesn't seem totally unreasonable, though it definitely isn't a perfect solution. A neat idea is a sort of 'laziness' score so that it doesn't take too many high-impact options.
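One way the 'laziness' score might be written down, as a rough sketch rather than anything taken from the Arbital page: subtract a penalty proportional to an option's impact before comparing options. The `Option` fields, the `impact` measure and the `laziness` weight are hypothetical, and measuring impact is assumed away here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Option:
    name: str
    reward: float
    impact: float  # hypothetical measure of how much the option changes the world


def pick(options: List[Option], laziness: float) -> Option:
    """Choose the option maximising reward minus a laziness-weighted impact penalty."""
    return max(options, key=lambda o: o.reward - laziness * o.impact)


# Invented options: a small low-impact task and a large high-impact one.
options = [
    Option("fetch_coffee", reward=5.0, impact=1.0),
    Option("rebuild_city", reward=50.0, impact=100.0),
]

print(pick(options, laziness=0.0).name)
print(pick(options, laziness=1.0).name)
```

Unlike a hard reward ceiling, this trades reward against impact continuously, so the behaviour depends heavily on how `laziness` is set.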

It would be interesting to try to build an AI alignment testing ground, where you have a little simulated civilization and try to use AI to align properly with it, given certain commands. I might try to create it in Unity to test some of these ideas out in the (less abstract than text and slightly more real) world.

maybe it's somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields;

We have biased access, which makes things trickier: we weren't selected for introspection skills that are high-fidelity and correspond to reality, but for their utility to survival.

I doubt it's all that qualitatively different than the sorts of summits humanity has surmounted before

This seems to imply that we have surmounted the fields of physics, that all available knowledge in all subfields has been acquired whereas the most that can be claimed is that we have reduced the degree of our ignorance in some of those subfields. We have not - by any stretch of the imagination - mastered the field. Indeed, if we think we cannot push our understanding of AI, and related alignment problems, further than our current degree of understanding of physics, I think that is a strong point for the "stop everything, while we still can" case. 

Do you think we could use grokking/current existing generalization phenomena (e.g induction heads) to test your theory? Or do you expect the generalizations that would lead to the sharp left turn to be greater/more significant than those that occurred earlier in the training? 

Many comparisons are made with Natural Selection (NS) optimizing for IGF, on the grounds that this is our only example of an optimization process yielding intelligence.
 

I would suggest considering one very relevant fact: NS has not optimized for alignment, but only for a myopic version of IGF. I would also suggest considering that humans have not optimized for alignment either.
 

Let's look at some quotes, with those considerations in mind:

And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. 

NS has not optimized for alignment, which is why it's bad at alignment compared to what it has optimized for.

Some people I say this to respond with arguments like: "Surely, before a smaller team could get an AGI that can master subjects like biotech and engineering well enough to kill all humans, some other, larger entity such as a state actor will have a somewhat worse AI that can handle biotech and engineering somewhat less well, but in a way that prevents any one AGI from running away with the whole future?" 
 

I respond with arguments like, "In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”)."

NS has not optimized for one intelligence not conquering the rest of the world. As such, it doesn't say anything about how hard it is to optimize to produce one intelligence not conquering the rest of the world.

Their response in turn is generally some variant of "well, natural selection wasn't optimizing very intelligently" or "maybe humans weren't all that sharply above evolutionary trends" or "maybe the power that let humans beat the rest of the ecosystem was simply the invention of culture, and nothing embedded in our own already-existing culture can beat us" or suchlike. 

The response is not that NS is not intelligent, but that NS has not even optimized for any of the things that you have pointed to.

Why does alignment fail while capabilities generalize, at least by default and in predictable practice? In large part, because good capabilities form something like an attractor well. 

My answer would be the same for NS and humans: alignment is simply not optimized for! People spend countless more resources on capabilities than alignment.
If the resources invested in capability vs alignment ratio was reversed, would you still expect alignment to fare so much worse than capabilities?
Let's say you'd still expect that: how much better do you expect the situation to be as a result of the ratio being reversed? How much doom would you still expect in that world compared to now?

good capabilities form something like an attractor well 

Sure, in so far as people will optimize for short-term power (ie, capabilities) because they are myopic and power is the name we give to what is useful in most scenarios.

---

I also expect a discontinuity in intelligence. But I think this post does not make a good case for it: a much simpler theory already explains its observations.


In an upcoming post, I’ll say more about how it looks to me like  ~nobody is working on this particular hard problem, by briefly reviewing a variety of current alignment research proposals. In short, I think that the field’s current range of approaches nearly all assume this problem away, or direct their attention elsewhere. 

I'm very eager to read this.

NS has not optimized for alignment, which is why it's bad at alignment compared to what it has optimized for.

I don't think the "which is why" claim here is true, if you mean 'this is the only reason'. 'Alignment is exactly as easy as capabilities if you're not myopic' seems like a claim that needs to be argued for positively.

My answer would be the same for NS and humans: alignment is simply not optimized for! People spend countless more resources on capabilities than alignment.

NS didn't optimize for humans to be good at biochemistry, nuclear physics, or chess, either. NS produces many things that it wasn't specifically optimizing for. One of the main things that Nate is pointing out in the OP is that alignment isn't on that list, even though a huge number of other things are. "NS doesn't produce things it didn't optimize for" is an overly general response, because it would rule out things like 'humans landing on the Moon'.

If the resources invested in capability vs alignment ratio was reversed, would you still expect alignment to fare so much worse than capabilities?

This would obviously be an incredibly positive development, and would increase our success odds a ton! Nate isn't arguing 'when you actually try to do alignment, you can never make any headway'.

But 'alignment is tractable when you actually work on it' doesn't imply 'the only reason capabilities outgeneralized alignment in our evolutionary history was that evolution was myopic and therefore not able to do long-term planning aimed at alignment desiderata'.

Evolution was also myopic with respect to capabilities, and not able to do long-term planning aimed at capabilities desiderata; and yet capabilities generalized amazingly well, far beyond evolution's wildest dreams. If you're myopically optimizing for two things ('make the agent want to pursue the intended goal' and 'make the agent capable at pursuing the intended goal') and one generalizes vastly better than the other, this points toward a difference between the two myopically-optimized targets.

But 'alignment is tractable when you actually work on it' doesn't imply 'the only reason capabilities outgeneralized alignment in our evolutionary history was that evolution was myopic and therefore not able to do long-term planning aimed at alignment desiderata'.

I am not claiming evolution is 'not able to do long-term planning aimed at alignment desiderata'.
I am claiming it did not even try.

If you're myopically optimizing for two things ('make the agent want to pursue the intended goal' and 'make the agent capable at pursuing the intended goal') and one generalizes vastly better than the other, this points toward a difference between the two myopically-optimized targets.

This looks like a strong steelman of the post, which I gladly accept.


But it seemed to me that the post was arguing:
1. That alignment was hard (it mentions that technical alignment contains the hard bits, multiple specific problems in alignment), etc.
2. That current approaches do not work

That you do not get alignment by default looks like a much weaker thesis than 1&2, one that I agree with.

This would obviously be an incredibly positive development, and would increase our success odds a ton! Nate isn't arguing 'when you actually try to do alignment, you can never make any headway'.

This unfortunately didn't answer my question. We all agree that it would be a positive development, my question was how much. But from my point of view, it could even be enough.


The question that I was trying to ask was: "What is the difficulty ratio that you see between alignment and capabilities?"
I understood the post as making a claim (among others) that "Alignment is much more difficult than capabilities, as evidenced by Natural Selection".

"Some people I say this to respond with arguments like: 'Surely, before a smaller team could get an AGI that can master subjects like biotech and engineering well enough to kill all humans, some other, larger entity such as a state actor will have a somewhat worse AI that can handle biotech and engineering somewhat less well, but in a way that prevents any one AGI from running away with the whole future?'"

[...]

NS has not optimized for one intelligence not conquering the rest. As such, it doesn't say anything about how hard it is to optimize to produce one intelligence not conquering the rest.

We already know how to produce 'one intelligence not conquering the rest'. E.g., a human being is an intelligence that doesn't conquer the world. GPT-3 is an intelligence that doesn't conquer the world either. The problem is to build aligned AI that can do a pivotal act that ends the acute existential risk period, not just to build an AI that doesn't destroy the world itself.

That aside, I'm not sure what argument you're making here. Two possible interpretations that come to mind (probably both of these are wrong):

  1. You're arguing that all humans in the world will refuse to build dangerous AI, therefore AI won't be dangerous.
  2. You're arguing that natural selection doesn't tell us how hard it is to pull off a pivotal act, since natural selection wasn't trying to do a pivotal act.

1 seems obviously wrong to me; if everyone in the world had the ability to deploy AGI, then someone would destroy the world with AGI.

2 seems broadly correct to me, but I don't see the relevance. Nate and I indeed think that pivotal acts are possible. Nate is using natural selection here to argue against 'AI progress will be continuous', not to argue against 'it's possible to use sufficiently advanced AI systems to end the acute existential risk period'.

That aside, I'm not sure what argument you're making here.

I do not often comment on Less Wrong. (Although I am starting to; this is one of my first comments!)
Hopefully, my thoughts will become clearer as I write more, and get myself more acquainted with the local assumptions and cultural codes.

In the meanwhile, let me expand:

Two possible interpretations that come to mind (probably both of these are wrong):

  1. You're arguing that all humans in the world will refuse to build dangerous AI, therefore AI won't be dangerous.
  2. You're arguing that natural selection doesn't tell us how hard it is to pull off a pivotal act, since natural selection wasn't trying to do a pivotal act.

2 seems broadly correct to me, but I don't see the relevance. Nate and I indeed think that pivotal acts are possible. Nate is using natural selection here to argue against 'AI progress will be continuous', not to argue against 'it's possible to use sufficiently advanced AI systems to end the acute existential risk period'.

2 is the correct one.

But even though I read the post again with your interpretation in mind, I am still confused about why 2 is irrelevant. Consider:

The techniques you used to train it to allow the operators to shut it down? Those fall apart, and the AGI starts wanting to avoid shutdown, including wanting to deceive you if it’s useful to do so.

Why does alignment fail while capabilities generalize, at least by default and in predictable practice?

On one hand, in the analogy with Natural Selection, "by default" means "When you don't even try to do alignment, when you 100% optimize for a given goal.". Ie: When NS optimized for IGF, capabilities generalized, but not alignment.
On the other hand, when speaking of alignment directly, "by default" means "Even if you optimize for alignment, but not having in mind some specific considerations". Ie: Some specific alignment proposals will fail.

My point was that the former is not evidence for the latter.

a human being is an intelligence that doesn't conquer the world

Looking at the world at large, I think this deserves a second look. Are we sure the individual human is the right level of analysis?
