Note: I am not an alignment researcher, though I have tried some alignment work before. This text is an attempt to clarify some concepts in a way that I think is important and that might help others too, without spending too much time doing so.
I believe that there are two distinct ways in which the idea of a goal is used in discussions about alignment and agency more broadly. Roughly, the distinction that I have in mind is the following:
Examples:
Typically, an agent ends up pursuing target states as a consequence of optimizing for a success metric.
This distinction might already be part of the ontology of many readers. Nonetheless, I think it is sometimes conflated with other distinctions, such as those between terminal and instrumental goals and between inner and outer optimization. Drawing it explicitly helps make those distinctions, and other alignment-related points, clearer.
Here are some points that I make below using this distinction:
I think it is helpful to contextualize the distinction in the history of alignment work. With apologies to readers who find this repetitive, here is a quick overview of how I see some developments from the last few years.
In the bad old days before deep learning became the battlefield of alignment, the alignment problem was primarily construed as one of specifying the right goals for an agent. The goals in question were those that would not lead to the destruction of everything we care about if ruthlessly pursued by an arbitrarily capable agent. In these discussions, goals were clearly presented as target states, though without drawing the distinction above.
As soon became apparent, there is no sense in which we can just “give” a goal to an agent: there is no “goal box” akin to the mouth of the Golem into which we can slip our command. So how do agents acquire goals, understood as target states? Like us, they mostly learn them.
This learning takes the form of a training process in which the agent receives feedback of some kind through a loss function. The clearest example is a reward function, which assigns a numerical score to the agent’s behavior; that score is then used to update the behavior. In the reinforcement learning formalism, this happens as the agent updates its best guess about how much reward a state or action will lead to in the long run, and uses that guess to govern its actions.
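To make the formalism concrete, one textbook instance of such an update (an illustration, not necessarily the exact scheme I have in mind everywhere below) is the Q-learning rule, which nudges the estimate of long-run reward for a state-action pair toward the reward just received plus the discounted best estimate for the next state:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\big]$$

Here $\alpha$ is a learning rate and $\gamma$ a discount factor; acting then consists of choosing actions with high estimated $Q$.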
This changes the alignment problem. It is now less like providing the perfect command to a Golem and more like designing a perfect parental policy for raising it from a little clay lump into a benevolent God. Just as with actual children, the goals of artificial agents trained in this way will almost invariably diverge from the ones we were trying to teach. But unlike most actual children, these agents could pose an existential risk.
We can articulate the "new" alignment problem in terms of the distinction:
The agent will pursue its learned target states (goals in one sense), and it will have acquired those target states by optimizing for a success metric (a goal in the other sense).
Another way to explain this is to think of it in terms of two separate, and sometimes concurrent, optimization processes.
Both target states and success metrics are, in some sense, optimization targets. In an agent with a training cut-off, these processes are dissociable: after deployment, the agent will only engage in action optimization.[3] However, in agents that are continual learners like ourselves, both occur at the same time.[4]
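As a minimal sketch of how the two processes can come apart, the following Python snippet separates Acting (choosing behavior using the current value estimates) from Learning (updating those estimates against the reward signal). The environment interface (`env.reset`, `env.step`) and the value table (a `defaultdict(float)` mapping state-action pairs to estimated long-run reward) are illustrative assumptions, not part of the original formalism.

```python
from collections import defaultdict

def act(values, state, actions):
    # Acting: optimize behavior against the current estimates of
    # which states and actions lead to reward (the learned target states).
    return max(actions, key=lambda a: values[(state, a)])

def learn(values, state, action, reward, next_state, actions,
          alpha=0.1, gamma=0.95):
    # Learning: optimize the estimates themselves against the
    # success metric (reward), here via a Q-learning style update.
    best_next = max(values[(next_state, a)] for a in actions)
    values[(state, action)] += alpha * (
        reward + gamma * best_next - values[(state, action)])

def deploy_frozen(values, env, actions, steps):
    # Agent with a training cut-off: Learning has stopped, only Acting remains.
    state = env.reset()
    for _ in range(steps):
        state, _reward = env.step(act(values, state, actions))

def run_continual(env, actions, steps):
    # Continual learner: Acting and Learning are interleaved indefinitely,
    # so both optimization processes run at the same time.
    values = defaultdict(float)
    state = env.reset()
    for _ in range(steps):
        action = act(values, state, actions)
        next_state, reward = env.step(action)
        learn(values, state, action, reward, next_state, actions)
        state = next_state
    return values
```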
(A similar point is sometimes made with reference to evolution as an optimization process and the resulting innate dispositions as the optimized product. There might be independent points that are helped by this illustration, but I think it is unhelpful for the point that I'm making here. With very few exceptions (e.g. suckling), humans are not born with many target states. Instead, we learn them from feedback. In other words, setting our evolutionary history aside, we need this distinction to explain how we acquire motivational dispositions even during our lifetimes.)
Alex Turner has argued that reward is not the optimization target. Instead, he argues that the reward signal provides a causal explanation of why the agent acquires the target states that it does: reward merely “carves grooves” into the agent’s motivational dispositions.
I believe Turner is right that agents typically don't learn to pursue reward as such. However, I don’t think we can view reward merely as a carving mechanism for our dispositions either. Instead, I think that reward is not a target state, but it is a success metric.
The reason is that I don't see how we can explain how agents develop and use target states without appealing to a success metric that is being optimized. This is for at least two reasons:
First, continually learning agents don’t feature much in current discussions, as virtually all contemporary models have training cutoffs. At the cutoff, a model's best guesses about which states will lead to reward are frozen, and its basic target states fixed.[5] By contrast, a continually learning agent will pursue its current target states while at the same time updating what those states are, as it acquires more information about which states provide reward.
While a frozen agent might be said to engage in only one optimization process (what I call Acting above), a continually learning agent will be doing this while trading it off against the value of new information it might gain about more rewarding states (a part of Learning). To explain this tradeoff, we need reward to be an object of optimization that can arbitrate such conflicts, rather than a mere “pusher” of target states.
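One concrete way to see the point: in bandit-style selection rules such as UCB, the exploitation term (the current value estimate) and the exploration bonus (a rough stand-in for the value of information) are simply added together, and that addition is only meaningful because both are expressed on a common reward scale. A minimal sketch, with made-up parameter values:

```python
import math

def ucb_choice(value_estimates, counts, t, c=1.0):
    # value_estimates[a]: current estimate of action a's reward (Acting).
    # counts[a]: how often a has been tried; t: total choices made so far.
    # The bonus stands in for the value of the information that trying a
    # less-explored action would provide (a part of Learning).
    def score(a):
        if counts[a] == 0:
            return float("inf")  # an untried action is maximally informative
        bonus = c * math.sqrt(math.log(t) / counts[a])
        return value_estimates[a] + bonus  # one reward scale arbitrates both
    return max(range(len(value_estimates)), key=score)
```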
Second, a magnitudinal representation of value seems essential for explaining how actual biological agents work. The dominant view in the cognitive neuroscience of motivation is that animal behavior is largely steered by simple mental representations of a quantity of value, often referred to as the “common currency” hypothesis. This is supported by numerous behavioral studies and corroborated by neuronal signals recorded in brain areas that handle motivational tradeoffs.[6] Goals in the sense of target states seem to emerge from a continual optimization process in which our minds try to make the most accurate estimates of the value of various states, in the way described above. Those estimates are updated in response to information provided by biological reward signals.
For these reasons, I don't think we can abandon the idea that reward is, in some sense, an optimization target. My hope is that the distinction drawn here can vindicate Turner's main point while accommodating this conclusion.
The distinction between goals as target states and metrics of success is orthogonal to that between terminal and instrumental goals, at least as that distinction is commonly used. Terminal goals are said to be goals that we want to achieve for their own sake, while instrumental goals are goals that we expect to help achieve a terminal goal.
Insofar as this distinction has meaning—and one could reasonably be suspicious of whether "for their own sake" is paying rent—terminal and instrumental goals are best understood as different kinds of target states in a planning process.
Reflective agents, such as ourselves, can take certain states as ends in a plan and then identify other states that seem important for achieving them. For example, I might take proving a difficult conjecture as a terminal goal and identify that I need to read up on a particular area to do so, making that an instrumental goal. I can say this without taking a stance on why I am motivated to prove the conjecture in the first place.
Another way to put this is that the terminal/instrumental distinction applies to agent policies that have a certain hierarchical structure of ends and means, rather than to any agent that is optimizing for a success metric. On this view, the distinction doesn’t apply to agents that lack such structure, even if they can learn new target states by optimizing for a success metric. Where this boundary lies is difficult to pinpoint, but it plausibly includes humans and excludes fruit flies, which can learn in response to reward signals but cannot plan.
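As a toy illustration of the kind of structure I have in mind, here is a hypothetical means-ends sketch; the goal names and the `KNOWN_MEANS` table are invented for the example. A terminal target state is unfolded into instrumental ones, which is something a purely reward-driven learner with no model of ends and means cannot do.

```python
# Hypothetical beliefs about which states help bring about which others.
KNOWN_MEANS = {
    "prove_conjecture": ["understand_relevant_area"],
    "understand_relevant_area": ["read_up_on_area"],
}

def plan(terminal_goal):
    """Unfold a terminal target state into instrumental target states."""
    ordered, frontier = [], [terminal_goal]
    while frontier:
        goal = frontier.pop()
        ordered.append(goal)
        frontier.extend(KNOWN_MEANS.get(goal, []))
    # Instrumental goals come before the terminal goal in execution order.
    return list(reversed(ordered))

print(plan("prove_conjecture"))
# -> ['read_up_on_area', 'understand_relevant_area', 'prove_conjecture']
```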
Hubinger et al. draw a distinction between outer and inner optimization. The idea, as I understand it, is that an optimization process can sometimes generate another optimizer with a different goal. A commonly given example is how natural selection created us by optimizing for reproductive fitness, yet we optimize for other things, such as sexual encounters with contraception.
At first glance, this distinction might seem identical to that between success metrics and target states: reproductive fitness is a success metric and protected sex is a (misaligned) target state. I think this is wrong, and that it undersells the point made by Hubinger et al. I take it their point is rather that one optimization process (e.g. natural selection) can generate another one (e.g. humans) that optimizes for another success metric.
In the case of humans, the new success metric is a common currency of value as conveyed by reward signals. By optimizing for this success metric, humans acquire different target states than they would have if they had faithfully inherited the success metric of natural selection itself.
As this illustrates, I think the success metric/target state distinction helps to elucidate the inner/outer distinction, rather than overlapping or competing with it. Furthermore, as noted above, I find that construing reward as the product of phylogeny and as the optimization target of ontogeny helps me make sense of the evolutionary analogy.
In many conceptual schemes that represent agents as some form of optimizers or expected value maximizers, the thing being optimized is both a success metric and a target state. This makes it intuitively harder to draw the conceptual distinction. The clearest example of this is utility functions in standard economic theory, which are scalar representations of an agent's preference rankings when those satisfy certain axioms. A similar point can be made about desires as used by philosophers.[7]
In expected utility theory, this utility function serves as the success metric: agents update their expected-utility estimates of outcomes by conditionalizing on new evidence about whether utility will be achieved. But that utility is derived from a prior preference ranking, which is typically assumed to range over target states that the agent would like (e.g. eating an ice cream).
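For reference, the standard criterion makes it visible that the same function $u$ plays both roles: it is derived from the preference ranking over outcomes (the target states), and it is also what gets maximized in expectation (the success metric):

$$\operatorname{EU}(a) = \sum_{o} P(o \mid a)\, u(o), \qquad a^{*} = \arg\max_{a} \operatorname{EU}(a)$$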
There is nothing formally wrong with this, but using such a target state as the success metric leaves us unable to explain how we acquired those target states in the first place as the product of an optimization process. This is fine for economic theory, where we can take people's preferences as given. But it risks confusing us when we are trying to explain how agents acquire those preferences in the first place.[8] To do so, we need to identify another success metric, such as the one provided by a reward function in reinforcement learning.[9]
Why not just say that the success metric is reward? Contrary to the standard view of reward, I think that reward signals are best understood as evidence of a success metric of value rather than constitutive of it. This is analogous to how signals from our depth perception can be evidence of distance without constituting it. This won't matter to the rest of this post, though I think it is important and might elaborate on it in the future.
It is natural to propose that reward misspecification occurs when the success metric fails to capture the trainer's intended target states. I think this is confused. Ultimately, what the trainer cares about is just whether the agent's target states match theirs, however it gets there. As Singh et al. (2010) have argued, the most effective signal for achieving that is not always the one that most closely matches the intended target states.
This is equivalent to pursuing a maximally greedy strategy.
Of course, drawing this distinction is misleading in the sense that, for continually learning agents, this is just one continuous optimization process.
This is a simplification: Agents can infer new ways to achieve such target states. We return to this point below when talking about terminal and instrumental goals.
For some representative papers, see Assad & Padoa-Schioppa (2006), Levy & Glimcher (2012), and Ballesta et al. (2020). See Hayden & Niv (2021) for criticism and Sripada (2024) for a defense.
Sinhababu (2017) is a representative example.
A notable exception to this is Stigler & Becker (1977) who recognize that basic utilities must be conceived of as something more akin to innate tastes than learned preferences.
The distinction that Mikulik was arguing for in Utility ≠ Reward is very similar to what I am defending here. The main difference is that I take utility (on his understanding) and reward to be instances of target states and success metrics respectively. However, I think it is helpful to have these more general categories, especially since "utility" often refers to something playing both roles.