Review

This was an appendix of Inner and outer alignment decompose one hard problem into two extremely hard problems. However, I think the material is self-contained and worth sharing separately, especially since AGI Ruin: A List of Lethalities has become so influential. (I agree with most of the points made in AGI Ruin, but I'm going to focus on disagreements in this essay.) (Stricken on 1/9/24)


Here are some quotes with which I disagree, in light of points I made in Inner and outer alignment decompose one hard problem into two extremely hard problems (consult its TL;DR and detailed summary for a refresher, if need be).

List of Lethalities 

“Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.  This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again”

(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups. 


Strictly separately, it seems to me that people draw rather strong inferences from a rather loose analogy with evolution. I think that (evolution) → (human values) is far less informative for alignment than (human reward circuitry) → (human values). I don’t agree with a strong focus on the former, given the latter is available as a source of information.

We want to draw inferences about the mapping from (AI reward circuitry) → (AI values), which is an iterative training process using reinforcement learning and self-supervised learning. Therefore, we should consider existing evidence about the (human reward circuitry) → (human values) setup, which (AFAICT) also takes place using an iterative, local update process using reinforcement learning and self-supervised learning. 

Brain architecture and training are not AI architecture and training, so the evidence is going to be weakened. But for nearly every way in which (human reward circuitry) → (human values) is disanalogous to (AI reward circuitry) → (AI values), (evolution) → (human values) is even more disanalogous! For more on this, see Quintin's post.

Lethality #18: “When you show an agent an environmental reward signal, you are not showing it something that is a reliable ground truth about whether the system did the thing you wanted it to do; even if it ends up perfectly inner-aligned on that reward signal, or learning some concept that exactly corresponds to 'wanting states of the environment which result in a high reward signal being sent', an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (as seen by the operators).”

My summary: Sensory reward signals are not ground truth on the agent’s alignment to our goals. Even if you solve inner alignment, you’re still dead.

My response: We don’t want to end up with an AI which primarily values its own reward, because then it wouldn’t value humans. Beyond that, this item is not a “central” lethality (and a bunch of these central-to-EY lethalities are in fact about outer/inner). We don’t need a function of sensory input which is safe to maximize; that’s not the function of the reward signal. Reward chisels cognition. Reward is not necessarily (nor do we want it to be) a ground-truth signal about alignment.

Lethality #19: “Insofar as the current paradigm works at all, the on-paper design properties say that it only works for aligning on known direct functions of sense data and reward functions.  All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like 'kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward button forever after'.  It just isn't true that we know a function on webcam input such that every world with that webcam showing the right things is safe for us creatures outside the webcam.  This general problem is a fact about the territory, not the map; it's a fact about the actual environment, not the particular optimizer, that lethal-to-us possibilities exist in some possible environments underlying every given sense input.” 

My summary: The theory in the current paradigm only tells you how to, at best, align an agent to direct functions of sensory observables. Even if we achieve this kind of alignment, we die. It’s just a fact that sensory observables can’t discriminate between good and bad latent world-trajectories. 

My response: I understand “the on-paper design properties” and “insofar as the current paradigm works at all” to represent Eliezer’s understanding of the properties and the paradigm (he did describe these points as “central difficulties of outer and inner alignment”[1]). But on my view, this lethality does not seem very relevant or central to alignment. Use reward to supply good cognitive updates to the agent. I don't find myself thinking about reward as that which gets maximized, or which should get maximized.

Also, if you ignore the oft-repeated wrong/under-hedged claim that “RL agents maximize reward” or whatever, the on-paper design properties suggest that reward aligns agents to objectives in reality according to the computations which reward reinforces. I think that machine learning does not, in general, align agents to sense data and reward functions. I think that focusing on the sensory-alignment question can be misleading as to the nature of the reward-chiseling challenge which we confront.

It's true that we don't know how to reliably make superintelligent agents learn human-compatible values. However, by the same token (e.g. by the arguments in reward is not the optimization target), I can equally well ask "how do I get agents to care about sensory observables and reward data?". It's not like we know how to ensure deep learning-trained agents care about their sensory observables and reward data.

Lethality #21: “[...] hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of 'relative inclusive reproductive fitness' - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else.  This abstract dynamic is something you'd expect to be true about outer optimization loops on the order of both 'natural selection' and 'gradient descent'.  The central result:  Capabilities generalize further than alignment once capabilities start to generalize far.”

My summary: Perceived alignment on the training distribution is all we know how to run gradients over, but historically, alignment on training does not generalize to alignment on deployment. Furthermore, when the agent becomes highly capable, it will gain a flood of abilities and opportunities to competently optimize whatever vaguely good-seeming internal proxy objectives we entrained into its cognition. When this happens, the AI's capabilities will keep growing, but its alignment will not. 

My response: This perceived disagreement might be important, or maybe I just use words differently than Eliezer.

When I’m not thinking in terms of inner/outer, but “what cognition got chiseled into the AI?”, there isn’t any separate “tendency to fail to generalize alignment” in a deceptive misalignment scenario. The AI just didn’t have the cognition you thought or wanted. 

For simplicity, suppose you want the future to contain lots of bananas. Suppose you think your AI cares about bananas but actually it primarily cares about fruit in general and only pretended to primarily care about bananas, for instrumental reasons. Then it kills everyone and makes a ton of fruit (only some of which are bananas). In that scenario, we should have chiseled different cognition into the AI so that it would have valued bananas more strongly. (Similarly for "the AI cared about granite spheres and paperclips and...")

While this scenario involves misgeneralization, there’s no separate tendency of “alignment shalt not generalize.” 

But suppose you do get the AI to primarily care about bananas early in training, and it retains that banana value shard/decision-influencing-factor into mid-training. At this point, I think the banana-shard will convergently be motivated to steer the AI’s future training so that the AI keeps making bananas. So, if you get some of the early-/mid-training values to care about making bananas, then those early-/mid-values will, by instrumental convergence, reliably steer training to keep generalizing appropriately. If they did not, that would lead to fewer bananas, and the banana-shard would bid for a different path of capability gain! 

(This is not an airtight safety argument, but I think it's a reasonably strong a priori case.)

The main difficulty here still seems to be my already-central expected difficulty of “loss signals might chisel undesired values into the AI.”

“Okay, but as we all know, modern machine learning is like a genie where you just give it a wish, right?  Expressed as some mysterious thing called a 'loss function', but which is basically just equivalent to an English wish phrasing, right?” 

Interlude in AGI Ruin: A List of Lethalities

Eliezer is mockingly imitating a naive AI alignment researcher. My current read, however, is that the bolded part represents his real view. Given that: A loss function is not a “wish” or an expression of your desires. A loss function is a source of gradient updates; it is a chisel with which to shape the agent’s cognition.

Alignment doesn't hit back, the loss function hits back[15] and the loss function doesn't capture what you really want (eg because killing the humans and taking control of a reward button will max reward, deceiving human raters will increase ratings, etc).  If what we wanted was exactly captured in a loss function, alignment would be easier.  Not easy because outer optimization doesn't create good inner alignment, but easier than the present case.

To me, this statement seems weird and sideways of central alignment problems. I perceive Eliezer to be arguing "If only the loss function represented what we wanted, that'd be better." If he meant to connote "loss functions simply won't represent what you want, get over it, that's not how alignment works", we're more likely on the same page. 

My response:

  1. Type error in forcing conversion from "goals" to "gradient-providing function."
  2. The empirical contingency of the wisdom of the frame where the loss function "represents" the goal.

First, I want to say: type error: loss function not of type goal.[2] I imagine Eliezer understands this, at least on the more obvious level of the statement. But I'm going to explain my worldview here so as to better triangulate my meaning. 

I think there's potential for deep confusion here. Loss functions provide gradients to the way the AI thinks (i.e. computes forward passes). Trying to cast human values[3] into a loss function is a highly unnatural type conversion to attempt. Attempting to force the conversion anyways may well damage your view of the alignment problem. 
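To make the "type" point concrete, here is a minimal sketch, assuming a one-parameter toy model trained by plain gradient descent on squared error. The names (`forward`, `loss_gradient`, `train`) and the data are illustrative assumptions, not taken from any referenced post. The thing to notice is where the loss appears: only inside the update to `w`, never as an input to the forward pass.

```python
def forward(w, x):
    """The model's forward pass: it sees only its parameter and its input."""
    return w * x


def squared_loss(prediction, target):
    """Toy loss; used here only to define the gradient below."""
    return (prediction - target) ** 2


def loss_gradient(w, x, target):
    """d/dw of squared_loss(forward(w, x), target), derived by hand."""
    return 2.0 * (forward(w, x) - target) * x


def train(w, data, lr=0.1, epochs=50):
    """Plain per-example gradient descent on the toy loss."""
    for _ in range(epochs):
        for x, target in data:
            # The loss touches the model only here, as a parameter update.
            w -= lr * loss_gradient(w, x, target)
    return w


# Hypothetical data generated by the rule target = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_trained = train(0.0, data)
```

After training, what exists is a parameter value and the forward pass it induces. The loss function is not stored anywhere in the trained model; it was the tool that moved `w`.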

From Four usages of "loss" in AI:

3: Loss functions "representing" goals

I want a loss function which is aligned with the goal of "write good novels." 

This is an aspirational statement about achieving some kind of correspondence between the loss function and the goal of writing good novels. But what does this statement mean?

Suppose you tell me "I have written down a loss function which is perfectly aligned with the goal of 'write good novels'." What experiences should this claim lead me to anticipate?

  1. That an agent can only achieve low physical-loss if it has, in physical fact, written a good novel?
  2. That in some mathematical idealization of the learning problem, loss-minimization only occurs when the agent outputs text which would be found in what we rightly consider to be "good novels"? (But in which mathematical idealization?)
  3. That, as a matter of physical fact, if you train an agent on this loss function using some particular learning setup, then you produce a language model which can be easily prompted to output high-quality novels?

Much imprecision comes from loss functions not directly encoding goals. Loss signals are physically implemented parts of the AI's training process which (physically) update the AI's cognition in certain ways...

I think that talking about loss functions being "aligned" encourages bad habits of thought at best, and is nonsensical at worst. I think it makes way more sense to say how you want the agent to think and then act (e.g. "write good novels", the training goal in Evan Hubinger's training stories framework) and why you think you can use a given loss function to produce that cognition in the agent (the training rationale).

Second, we want to train a network which ends up doing what we want. There are several strategies to achieve this. 

It might shake out that, as an empirical fact, the best way to spend an additional increment of alignment research is to make the loss function "represent what you want" in some way. For example, you might more accurately spot flaws in AI-generated alignment proposals, and train the AI on that more accurate signal.

But "make the objective better 'represent' our goals" would be an empirical contingency, not pinned down by the mechanistic function of a loss function. This contingency may be sensitive to the means by which feedback translates into gradient updates. For example, changing the loss function will probably differently affect the gradients provided by:

  1. Advantage actor-critic,
  2. REINFORCE,
  3. Self-supervised learning with teacher forcing, and 
  4. Reward prediction errors. 
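As a hedged illustration of this sensitivity, the sketch below compares per-sample updates for two of the listed methods on a two-action softmax policy: REINFORCE, and an advantage-style update with a critic baseline. The critic value of 1.5, and the assumption that the critic exactly tracks a constant shift in reward, are made up for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) with respect to the softmax logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_grad(logits, action, reward):
    """REINFORCE: the raw reward scales the log-probability gradient."""
    return [reward * g for g in grad_log_prob(logits, action)]


def advantage_grad(logits, action, reward, value_estimate):
    """Advantage-style update: (reward - critic estimate) scales it instead."""
    advantage = reward - value_estimate
    return [advantage * g for g in grad_log_prob(logits, action)]


logits = [0.0, 0.0]      # uniform policy over two actions
action, reward = 0, 1.0
critic_value = 1.5       # hypothetical critic estimate (an assumption)

g_reinforce = reinforce_grad(logits, action, reward)
g_advantage = advantage_grad(logits, action, reward, critic_value)

# Change the reward function by adding a constant to every reward.
# Assumption: the critic has adapted and shifts its estimate by the same amount.
shift = 10.0
g_reinforce_shifted = reinforce_grad(logits, action, reward + shift)
g_advantage_shifted = advantage_grad(
    logits, action, reward + shift, critic_value + shift
)
```

Under these assumptions, the same reward event pushes the taken action's logit up under REINFORCE (`[0.5, -0.5]`) and down under the advantage update (`[-0.25, 0.25]`). Adding the constant makes the REINFORCE update eleven times larger and leaves the advantage update unchanged. The same change to the reward function therefore translates into different cognitive updates depending on the algorithm.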

Because loss is not the optimization target, there's some level of "goal representation" where I should stop thinking about how "good" the loss function is, and start thinking about e.g. the abstractions learned by self-supervised pre-training. E.g., if I populate the corpus with more instances of people helping each other, that might change the inductive biases on SGD dynamics to increase the probability of helping-concepts getting hooked in to value shard formation.

I think it's possible that after more deliberation, I'll conclude "we should just consider some intuitive notion of 'goal representation fidelity' when reasoning about P(alignment | loss function)." I just don't know where or whether this deliberation is supposed to have occurred. So we probably need more of it.  

Because loss functions don't natively represent goals, and because of these empirical contingencies, I'm weirded out by statements like "the loss function doesn't capture what you really want."[4]

Other disagreements with alignment thinkers

Evan Hubinger

Terms like base objective or inner/outer alignment are still great terms for talking about training stories that are trying to train a model to optimize for some specified objective.

Sometimes, inner/outer alignment ideas can be appropriate (e.g. chess). For aligning real-world agents in partially observable environments, I think it’s not that appropriate. (See here for a more detailed discussion of what I eventually realized Evan means here, though.)

Paul Christiano

There is probably no physically-implemented reward function, of the kind that could be optimized with SGD, that we’d be happy for an arbitrarily smart AI to optimize as hard as possible. (I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.)

I read this and think “this all feels like a red herring.” I think this is not necessary because robust grading is not necessary for alignment. However, because reward provides cognitive updates, it’s important to think carefully about what cognitive updates will be provided by the reward given when e.g. a large language model submits an alignment proposal. Those reward events will shape the network’s decision-making and generalization properties, which is what we’re really interested in.

In many problems, “almost all” possible actions are equally terrible. For example, if I want my agent to write an email, almost all possible strings are just going to be nonsense.

One approach to this problem is to adjust the reward function to make it easier to satisfy — to provide a “trail of breadcrumbs” leading to high reward behaviors. I think this basic idea is important, but that changing the reward function isn’t the right way to implement it (at least conceptually).

Instead we could treat the problem statement as given, but view auxiliary reward functions as a kind of “hint” that we might provide to help the algorithm figure out what to do. Early in the optimization we might mostly optimize this hint, but as optimization proceeds we should anneal towards the actual reward function.

Typical examples of proxy reward functions include “partial credit” for behaviors that look promising; artificially high discount rates and careful reward shaping; and adjusting rewards so that small victories have an effect on learning even though they don’t actually matter. All of these play a central role in practical RL.

A proxy reward function is just one of many possible hints. Providing demonstrations of successful behavior is another important kind of hint. Again, I don’t think that this should be taken as a change to the reward function, but rather as side information to help achieve high reward. In the long run, we will hopefully design learning algorithms that automatically learn how to use general auxiliary information.

Thoughts on reward engineering

Why do we need new learning algorithms? The point of reward, on a mechanistic basis, is to update the agent’s cognition. Shaping reward seems fine to me, and I am uncomfortable with this apparent-to-me emphasis on reward as “embodying” the agent’s goals.
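For readers unfamiliar with the annealing idea in the quoted passage, here is a minimal sketch. The linear schedule and the function name `annealed_reward` are assumptions for illustration; the quoted text does not specify a schedule.

```python
def annealed_reward(true_reward, hint_reward, step, total_steps):
    """Blend a hint (proxy) reward with the true reward.

    The weight on the true reward rises linearly from 0 to 1 over training.
    The linear schedule is an assumption, not something the quote specifies.
    """
    alpha = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - alpha) * hint_reward + alpha * true_reward


# A behavior that earns the hint (1.0) but no true reward (0.0):
# the reward signal it receives fades as training proceeds.
schedule = [
    annealed_reward(true_reward=0.0, hint_reward=1.0, step=s, total_steps=4)
    for s in range(5)
]
```

Mechanistically, both the hint and the "actual" reward enter training the same way, as scalars that scale updates to the policy. The schedule only changes which behaviors get reinforced at which point in training.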

Nick Bostrom

Summary of value-loading techniques…

Reinforcement learning: A range of different methods can be used to solve “reinforcement-learning problems,” but they typically involve creating a system that seeks to maximize a reward signal. This has an inherent tendency to produce the wireheading failure mode when the system becomes more intelligent. Reinforcement learning therefore looks unpromising. 

Superintelligence, p.253

Historical reasoning about RL seems quite bad. This is a prime example. In one fell swoop, in several pages of mistaken exposition, Superintelligence rules out the single known method for producing human-compatible values. We should forewarn new alignment researchers of these deep confusions before recommending this book.


Thanks to Drake Thomas, ChatGPT, Ulisse Mini, and Peli Grietzer for feedback on this post.

  1. ^

    List of Lethalities was, AFAICT, intended to convey the most important dangers, in the right language. Rob Bensinger (who works at MIRI but was expressing his own views) also commented:

    My shoulder Eliezer (who I agree with on alignment, and who speaks more bluntly and with less hedging than I normally would) says:

    1. The list is true, to the best of my knowledge, and the details actually matter.

      Many civilizations try to make a canonical list like this in 1980 and end up dying where they would have lived just because they left off one item, or under-weighted the importance of the last three sentences of another item, or included ten distracting less-important items.

    So if Eliezer's talking about "how do we get agents to care about non-sensory observables", this indicates to me that I disagree with him about what the central subproblems of alignment are. 

  2. ^

    From Inner and outer alignment decompose one hard problem into two extremely hard problems:

    Outer/inner unnecessarily assumes that the loss function/outer objective should “embody” the goals which we want the agent to pursue.

    For example, shaping is empirically useful in both AI and animals. When a trainer is teaching a dog to stand on its hind legs, they might first give the dog a treat when it lifts its front paws off the ground. This treat translates into an internal reward event for the dog, which (roughly) reinforces the dog to be more likely to lift its paws next time. The point isn’t that we terminally value dogs lifting their paws off the ground. We do this because it reliably shapes target cognition (e.g. stand on hind legs on command) into the dog. If you think about reward as exclusively “encoding” what you want, you lose track of important learning dynamics and seriously constrain your alignment strategies.

  3. ^

    I think this holds for basically any values in a rich, partially observable domain, including paperclip optimization or picking three flowers.

  4. ^

    "The loss function is used to train the AI, and the loss function represents human values" is akin to saying "a hammer is used to build a house, and the hammer represents the architect's design." Just as a hammer is a tool to facilitate the building process, a loss function is a tool to facilitate the learning process. The hammer doesn't represent the design of the house, it is simply a means to an end. Similarly, the loss function doesn't represent human values, it is simply a means to an end of training the AI to perform a task.

    ChatGPT wrote this hammer analogy, given the prompt of a post draft (but the draft didn't include any of my reward-as-chisel analogies).

beren

I broadly agree with a lot of shard theory claims. However, the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry but rather are the result of a very long process of social construction influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups, and that these value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values. Most human value learning occurs through this linguistic learning grounded by our innate drives but extended to much higher abstractions by language. That is, humans learn their values through some combination of bottom-up learning (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other, more grounded linguistic concepts.

With AGI, the key will be to work primarily top-down, since our linguistic constructs of values tend to reflect our ideal values much better than our actually realised behaviours do. This means using the AGI's 'linguistic cortex', which already has encoded verbal knowledge about human morality and values, both to evaluate potential courses of action and as a reward signal which can then get crystallised into learnt policies. The key difficulty is understanding how, in humans, the base reward functions interact with behaviour to make us 'truly want' specific outcomes (if humans even do) as opposed to reward or their correlated social assessments. It is possible, even likely, that this is just the default outcome of model-free RL experienced from the inside, in which case our AGIs would look highly anthropomorphic.

Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization, i.e. effectively planning over a world model, is necessary in situations where a) you can't behaviourally clone existing behaviour and b) you can't self-play too much with a model-free RL algorithm and so must rely on the world model. In such a scenario you do not have ground-truth reward signals, and the only way to make progress is to optimise against some implicit learnt reward function.

I also am not sure that an agent that explicitly optimises this is hard to align and the major threat is Goodharting. We can perfectly align Go-playing AIs with this scheme because we have a ground-truth exact reward function. Goodharting is essentially isomorphic to a case of overfitting and can in theory be solved with various kinds of regularisation, especially if the AI maintains a well-calibrated sense of reward function uncertainty, in which case we can in theory derive quantification bounds on its divergence from the true reward function.

Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization, i.e. effectively planning over a world model

FWIW I don't consider myself to be arguing against planning over a world model.

the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry but rather are the result of a very long process of social construction influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups, and that these value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values.

Most human value learning occurs through this linguistic learning grounded by our innate drives but extended to much higher abstractions by language. That is, humans learn their values through some combination of bottom-up learning (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other, more grounded linguistic concepts.

Can you give me some examples here? I don't know that I follow what you're pointing at. 

(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups. 


I expect that it has also happened, to an extent, with animals. I wonder if anyone has ever looked into this.

Yes, they have. There's quite a large literature on animal emotion and cognition, and my general synthesis is that animals (at least mammals) have at least the same basic emotions as humans and often quite subtle ones such as empathy and a sense of fairness. It seems pretty likely to me that whatever the set of base reward functions encoded in the mammalian basal ganglia and hypothalamus is, it can quite robustly generate expressed behavioural 'values' that fall within some broadly humanly recognisable set.

We don’t need a function of sensory input which is safe to maximize, that’s not the function of the reward signal. Reward chisels cognition. Reward is not necessarily—nor do we want it to be—a ground-truth signal about alignment.

I'm confused about this statement. How can reward be unnecessary as a ground-truth signal about alignment? Especially if "reward chisels cognition"?

How can reward be unnecessary as a ground-truth signal about alignment? Especially if "reward chisels cognition"?

Reward's purpose isn't to demarcate "this was good by my values." That's one use, and it often works, but it isn't intrinsic to reward's mechanistic function. Reward develops certain kinds of cognition / policy network circuits. For example, consider reward-shaping a dog to stand on its hind legs. I don't reward the dog because I intrinsically value its front paws being slightly off the ground for a moment. I reward the dog at that moment because that helps develop the stand-up cognition in the dog's mind.
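A toy calculation (my own illustration, not part of the original exchange) shows why the intermediate treat matters mechanistically. Assume a two-step behavior: the dog lifts its paws with probability `p`, and, having lifted them, stands fully with probability `q`. Both probabilities are sigmoids of hypothetical logits. The function below gives the expected policy-gradient signal on the "lift paws" logit, with and without a treat for lifting.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_wrt_lift_logit(theta_lift, theta_stand, reward_for_lift):
    """Exact gradient of expected return with respect to the lift-paws logit.

    Expected return J = p * (reward_for_lift + q * 1.0), where standing
    earns a reward of 1.0. Since dp/dtheta_lift = p * (1 - p), the gradient
    is p * (1 - p) * (reward_for_lift + q).
    """
    p = sigmoid(theta_lift)
    q = sigmoid(theta_stand)
    return p * (1.0 - p) * (reward_for_lift + q)


# Hypothetical untrained dog: lifts paws occasionally, almost never stands.
theta_lift, theta_stand = -2.0, -6.0

sparse = grad_wrt_lift_logit(theta_lift, theta_stand, reward_for_lift=0.0)
shaped = grad_wrt_lift_logit(theta_lift, theta_stand, reward_for_lift=1.0)
```

With these made-up numbers, the shaped signal on the paw-lifting computation is hundreds of times larger than the sparse one. The treat for lifting paws is there because it supplies usable updates toward the target behavior, not because paw-lifting is what the trainer ultimately wants.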