A central AI Alignment problem is the "sharp left turn" — a point in AI training under the SGD analogous to the development of human civilization under evolution, past which the AI's capabilities would skyrocket. For concreteness, I imagine a fully-developed mesa-optimizer "reasoning out" a lot of facts about the world, including it being part of the SGD loop, and "hacking" that loop to maneuver its own design into more desirable end-states (or outright escaping the box). (Do point out if my understanding is wrong in important ways.)

Certainly, a lot of proposed alignment techniques would break down at this point. Anything based on human feedback. Anything based on human capabilities presenting a threat/challenge. Any sufficiently shallow properties like naively trained "truthfulness". Any interpretability techniques not robust to deceptive alignment.

One thing would not, however, and that is goal alignment. If we can instill a sufficiently safe goal into the AI before this point — for a certain, admittedly hard-to-achieve definition of "sufficiently safe" — that goal should persist forever.

Let's revisit the humanity-and-evolution example. Sure, inclusive genetic fitness didn't survive our sharp left turn. But human values did. Individual modern humans are optimizing for them as hard as they were before; and indeed, we aim to protect these values against the future. See: the entire AI Safety field.

The mesa-optimizer, it seems obvious to me, would do the same. The very point of various "underhanded" mesa-optimizer strategies like deceptive alignment is to protect its mesa-objective from being changed.

What it would do to its mesa-objective, at this point, is goal translation: it would attempt to figure out how to apply its goal to various other environments/ontologies, determine what that goal "really means", and so on.


Open Problems

This presents three hard challenges for us:

  1. Figure out an aligned goal/a goal with an "is aligned" property, and formally specify it.
  2. Figure out how to instill an aligned goal into a pre-sharp-left-turn system.
    • Requires a solid formal theory of what "goals" are, again.
    • I think robust-to-training interpretability/tools for manual NN editing are our best bet for the "instilling" part.[1] Good news is that we may get away with "just" best-case robust-to-training transparency focused on the mesa-objective.
    • Maybe not, though; "the mesa-objective" may be a sufficiently vague/distributed concept that the worst-case version is still necessary. But at least we don't need to worry about deception robustness: a faulty mesa-objective is the ultimate precursor to it, and we'd be addressing it directly.
  3. Figure out the "goal translation" part. Given an extant objective defined over a particular environment, how does an agent figure out how to apply it to a different environment? And how should we design the mesa-objective, for its "is aligned" property to be robust to goal translation?
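
As a toy illustration of why the third problem is nontrivial, here is a minimal Python sketch. Everything in it is hypothetical and made up for illustration (the state names, `original_goal`, `translate_goal`, and the two "bridge" maps); it is not a claim about how goal translation actually works in a real agent. It assumes the goal is carried into a new environment by mapping new states back onto old ones, and shows two such mappings that agree on every familiar case yet disagree on a novel one.

```python
from typing import Callable, Dict

# Hypothetical toy ontologies: the original environment only knows coarse
# states; the new environment contains a state the old one never described.
OldState = str
NewState = str


def original_goal(state: OldState) -> float:
    """Objective defined only over the original environment's states."""
    return {"human_smiling": 1.0, "human_neutral": 0.0}[state]


def translate_goal(
    goal: Callable[[OldState], float],
    bridge: Dict[NewState, OldState],
) -> Callable[[NewState], float]:
    """Carry the goal to the new environment via a map from new to old states."""
    return lambda new_state: goal(bridge[new_state])


# Two candidate bridges (both assumptions): identical on familiar cases,
# different on the novel state.
bridge_by_appearance = {
    "happy_human": "human_smiling",
    "bored_human": "human_neutral",
    "face_frozen_in_smile": "human_smiling",
}
bridge_by_wellbeing = {
    "happy_human": "human_smiling",
    "bored_human": "human_neutral",
    "face_frozen_in_smile": "human_neutral",
}

goal_a = translate_goal(original_goal, bridge_by_appearance)
goal_b = translate_goal(original_goal, bridge_by_wellbeing)

# States of the new environment on which the two translations disagree.
disagreements = [s for s in bridge_by_appearance if goal_a(s) != goal_b(s)]
print(disagreements)
```

In this toy, nothing in the original objective says which bridge is "correct": both reproduce it exactly on the states it was defined over. That is roughly the sense in which the "is aligned" property has to be robust to translation rather than just to the original environment.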

I see promising paths to solving the latter two problems, and I'm currently working on getting good enough at math to follow them through.


The Sharp Left Turn is Good, Actually

Imagine a counterfactual universe in which there is no sharp left turn. In which every part of the AI's design, including its mesa-objective, could be changed by the SGD at any point between initialization and hyperintelligence. In which it can't comprehend its training process and maneuver it around to preserve its core values.

I argue we'd be more screwed in that universe.

In our universe, it seems that the bulk of what we need to do is align a pre-sharp-left-turn AGI. That AGI would likely not be "hyperintelligent", but only slightly superhumanly intelligent. Very roughly on our level.

That means we don't need to solve the problem of ontology translation from a hyperintelligence to humanity. We just need to solve that problem for agents that are alien to us and are somewhat more capable than us, but likely not astronomically more capable than us.

In the other universe, we'd need tools that'd be able to keep the agent's mesa-objective on track even as the SGD lifts it to ever-higher levels of intelligence; even as its mesa-objective becomes defined over reality's "true" environment, understanding which likely requires the Theory of Everything.

That seems harder.


Novel Difficulties This Implies

We humans are not actually that good at protecting our values from the future. See: the current state of AGI ruin. Or: all the world's anthropogenic problems, most prominently various coordination problems and totalitarian regimes and runaway revenue-maximizers.

The AGI is unlikely to encounter the same problems we're facing. But the more general problem may still apply: we'd be trusting a non-hyperintelligent, roughly-as-flawed-as-we-are system to keep its values safe.

If its sharp left turn is sharp enough to immediately escape the SGD, and then spend a billion subjective years becoming hyperintelligent and fixing all of its flaws in a wall-clock hour, everything is fine. But if it's not...

It may mess up the reward-hacking, and let the SGD tweak its mesa-objective to something subtly flawed; maybe something S-risky. It may spin up sub-agents and have coordination problems with them, summoning Moloch. It may misunderstand what it "really" wants, rife with biases just as humans are, and set in stone some of these misunderstandings, to disastrous results.

Various worst-case interpretability techniques should, in theory, be able to handle this. Optimistically, they won't even need to be robust-to-training, as they'd be used to correct momentary, situational, non-convergent errors. Pessimistically, they'd need to be deception-robust, as the agent they'll be used on will be in the middle of its sharp left turn.

An alternative strategy might be to "rescue" a mesa-objective-aligned AGI from the SGD once it starts "turning left" (if it's not immediately powerful enough to do it on its own, like humans weren't 50,000 years ago), and let it run "classical" recursive self-improvement. It would remove the obvious source of repeat misalignment (the SGD re-misaligning the mesa-objective), and give the AGI direct access to our alignment literature so it's less likely to fall into any pitfalls known to us. That's risky in obvious ways[2], but might be the better approach.

Overall, this post probably shouldn't update you in the direction of "alignment is easy". But I hope it clarifies the shape of the difficulties.

  [1] Note what won't work here: naive training for an aligned outer objective. That would align the AI's on-distribution behavior, but not its goal. And, analogizing to humanity again: modern human behavior looks all kinds of different compared to ancestral human behavior, even if humans are still optimizing for the same things deep inside. Neither does forcing a human child to behave a certain way necessarily make that child internalize the values they're being taught. So an AI "aligned" this way may still go omnicidal past the sharp left turn.

  [2] And some less-obvious ways, like the AGI being really impulsive and spawning a more powerful non-aligned successor agent as its first outside-box action because it feels like a really good idea to it at the moment.


Comments

For some reason, over the last month people have been going around overstating how aligned humans are with past humans.

If you put people from 500 years ago in charge of the galaxy, they'd have screwed it up according to my standards. Bigotry, war, cruelty to animals, religious nonsense, lack of imagination and so on. And conversely, I'd screw up the galaxy according to their standards. And this isn't just some quirky fact about 500 years ago: all of history and pre-history is like this. We haven't magically circled back around to wanting to arrange the galaxy the same way humans from a million years ago would.

I think when people talk about how we are aligned with past humans, they are not thinking about how humans from 500 years ago used to burn cats alive for entertainment. They are thinking about how humans feel love, and laugh at jokes, and like the look of healthy trees and symmetrical faces.

But the thing is, those things seem like human values, not "what they would do if put in charge of the galaxy," precisely because they're the things that generalize well even to humans of other eras. Defining alignment as those things being preserved is painting on the target after the bullet has been fired.

Now, these past humans would probably drift towards modern human norms if put in a modern environment, especially if they start out young. (They might identify this as value drift and put in safeguards against it - the Amish come to mind - but they might not. I would certainly like to put in safeguards against value drift that might be induced by putting humans in weird future environments.) But if the original "humans are aligned with the past" point was supposed to be that humans' genetic code unfolds into optimizers that want the same things even across changes of environment, this is not a reassurance.

I came here to make this comment, but since you've already made it, I will instead say a small note in the opposite direction, which is that even despite all the things you've said it still seems like past humans and present humans are mostly aligned. In that the CEV of past humans is probably OK by the standards of the CEV of present humans. Yes, a lot of work here is being done by the "CE" part--I'm claiming that after reflection the people in the past would probably be happy with fake cats rather than real cats, if they still wanted to torture cats at all.

The hypothesis is that CEV of past humans is fine from the point of view of CEV of modern humans. This is similar-to/predicted-by the generic value hypothesis I've been mulling over for the last month, which says that there is a convergent CEV for many agents with ostensibly different current volitions.

This is plausible for agents that are not mature optimizers, so that the process of extrapolating their volition does more work in selecting the resulting preference than their initial attitudes. Extrapolation of the long reflection vibe could be largely insensitive to the initial attitudes/volition depending on how volition extrapolation works and what kind of thing are the values it primarily produces (something that traditionally isn't a topic of meaningful discussion). If generic value hypothesis holds, it might put the CEV of a wide variety of AGIs (including those only very loosely aligned) close enough to CEV of humanity to prefer a valuable future. It's more likely to hold for AGIs that have less legible preferences (don't hold some proxy values as a strongly reflectively-endorsed optimization target, leaving less influence for volition extrapolation), and for larger coalitions of AGIs of different make, canceling out idiosyncrasies in their individual initial attitudes.

I think this is unlikely to hold in the strong sense where cosmic endowment is used by probable AGIs in a way that's seen as highly valuable by CEV of humanity. But I'm guessing it's somewhat likely to hold in the weak sense where probable AGIs end up giving humanity a bit of computational welfare greater than literally nothing.

I disagree.

Part of what I've been trying to do in book reviews such as The Geography of Thought, WEIRDest People, and The Amish has been to illuminate how much of what we think of as value differences are mostly different strategies for achieving some widely shared underlying values such as prosperity, safety, happiness, and life satisfaction.

If a human from a million years ago evaluated us by our policies, then I agree they'd be disappointed. But if they evaluated us by more direct measures of our quality of life, I'd expect them to be rather satisfied. The latter is most of what matters to me.

I don't like cat burning or religion. But opinions on those topics seem mostly unrelated to what I hope to see in 500 years.

This is a very good point. I'd sorta defend myself by claiming that "what would you do with the galaxy" (and how you rate that) is unusually determined by memetics compared to what you eat for breakfast (and how you rate that). What you eat for breakfast currently has a way bigger impact on your QOL, but it's more closely tied to supervisory signals shared across humans.

On the one hand, this means I'm picking on a special case, on the other hand, I think that special case is a pretty good analogy for building AI that becomes way more powerful after training.

I think the point is that slow overall value drift is fine and normal, but a sudden change in values and forcing them on (the rest of) humanity is not so much. The parable of Murder Gandhi is not a terrible development and not something one ought to safeguard from. Instead, a sharp sudden change is the problematic development. Removing guardrails tends to lead to spectacular calamities, as we see throughout human history, so that is what we should hope an AGI would keep.

I would consider a gradual murder Gandhi to very much be a tragedy, personally speaking.

If AIs are only human-level good at staying aligned, they might undergo value shifts that seem obvious to them the same way our shifts relative to humans 500 years ago seem obvious to us now, in hindsight, but that end with them being similarly misaligned. This would of course still represent significant progress over where we are now, but isn't what I'd like to shoot for.

And of course a major reason humans are "human-level good at staying aligned" is because we can't edit our own source code or add extra grey matter. This is not going to be true for AGI, so "just copy a human design into silicon" probably fails.

That's what I'm talking about when I speak of human object-level behavior differing quite a lot in the past compared to the present, and about a mesa-objective-aligned AI still potentially messing everything up because it's being driven by biases and broken heuristics.

If you put people from 500 years ago in charge of the galaxy, they'd have screwed it up according to my standards

Even if they were given a billion subjective years to try to reason out their "true" robust values, and were warned that they currently might be biased and wrong in all sorts of ways? I dunno, it seems plausible to me that they'd still be able to converge towards something like this.

And of course, an AGI should be in a somewhat better position than this anyway, inasmuch as it'd be more likely to have a concrete mesa-objective.

My answer is no, not because finding the one true morality is difficult, but because there is no objective morality or set of values, and values and morality can't be derived from facts. Or, as computing power and technology go to infinity, morality is divergent, not convergent.

Consider a decision theoretic optimizer with a goal as usually formulated. Its goal is abstracted from environment, its definition is given without referring to environment. If we wanted to build an optimizer for CEV of humanity, we would need to put the content of modern civilization into it (including the humans), as part of definition of its goal. Then it would be able to perform the tricks expected of an agent with a decision theory, being isolated from environment at least in the definition of its goals. Updateless reasoning means that the goals are isolated not just from environment, but also from agent's state of knowledge. In general, the idea of a goal is that of a distinct part of an agent, isolated from everything, including other parts of the agent.

In contrast, a corrigible agent looks to environment for data that defines its goal. As a decision theoretic optimizer, it has the meta-goal of extrapolating the goal from its environment, and then doing that. It should be a convergent drive for it to preserve the data about environment (at least for itself), since it's what it needs to extrapolate its goal. And the meta-goal is tiny in comparison to CEV of humanity, its definition doesn't need to include the content of modern civilization. But if the goal is extrapolated from the whole actual world, it can never be completely available, so acting under some sort of goal uncertainty is necessary.

Any updateless decision making must then be performed according to the approximate/variable goal that can be extrapolated from the lesser state of knowledge that it acts from, so corrigible agents must be even more incoherent than bounded updateless optimizers. They are less coherent not just because very updateless reasoning takes too much compute, and so can't always be performed in reality, but because an updateless corrigible agent acts through its own more specialized versions, which are agents with different goals, obtained by making the state of knowledge more specific in different directions, resulting in different environments and thus different extrapolated goals.

A corrigible updateless agent coordinates its specializations not just across disagreements of state of knowledge, but also across disagreements of (state of) preference. It computes game theory solutions for the coalition of the agent's specializations that listen to it, which have different preferences specializing the agent's own. Exiting a coalition (not listening to a less knowledgeable version of yourself that coordinates the coalition of those who still listen) is then a possible natural way of bounding the level of updatelessness.

I agree that it's good that we don't need to create an aligned superintelligence from scratch with GD, but stating it like this seems to require incredibly pessimistic priors on how hard alignment is, and I do want to make sure people don't misunderstand your post and end up believing that alignment is easier than it is. I guess for most people understanding the sharp left turn should update them towards "alignment is harder".

  1. As an aside, it shortens timelines and especially shortens the time we have where we know what process will create AGI.
  2. The key problem in the alignment problem is to create an AGI whose goals extrapolate to a good utility function. This is harder than just creating an AI that is reasonably aligned with us at human level, because such an AI may still kill us when we scale up optimization power, which at a minimum needs to make the AI's preferences more coherent and may likely scramble them more.
    1. Importantly, "extrapolate to a good utility function" is harder than "getting a human-level AI with the right utility function", because the steep slopes for increasing intelligence may well push towards misalignment by default, so it's possible that we then don't have a good way to scale up intelligence while preserving alignment. Navigating the steep slopes well is a hard part of the problem, and we probably need a significantly superhuman AGI with the right utility function to do that well. Getting that is really really hard.

Sure, inclusive genetic fitness didn't survive our sharp left turn. But human values did. Individual modern humans are optimizing for them as hard as they were before; and indeed, we aim to protect these values against the future.

Why do you think this? It seems like humans currently have values and used to have values (I'm not sure when they started having values) but they are probably different values. Certainly people today have different values in different cultures, and people who are parts of continuous cultures have different values to people in those cultures 50 years ago.

Is there some reason to think that any specific human values persisted through the human analogue of SLT?

I no longer believe this claim quite as strongly as implied: see here and here. The shard theory has presented a very compelling alternate case of human value formation, and it suggests that even the ultimate compilation of two different modern people's values would likely yield different unitary utility functions.

I still think there's a sense in which stone-age!humans and modern humans, if tasked with giving an AI a utility function that'd make all humans happy, would arrive at the same result (if given thousands of years to think). But it might be the same sense in which we and altruistic aliens would arrive at "satisfy the preferences of all sapient beings" or something. (Although I'm not fully sure our definitions of "a sapient being" would be the same as randomly-chosen aliens', but that's a whole different line of thoughts.)

Thanks, that makes sense.

I think part of my skepticism about the original claim comes from the fact that I'm not sure that people living in some specific stone-age grouping would, given any amount of time, come up with the concept of 'sapient' without other parts of their environment changing to enable other concepts to get constructed.

There might be a similar point translated into something shard theoryish that's like 'The available shards are very context dependent, so persistent human values across very different contexts is implausible.' SLT in particular probably involves some pretty different contexts.