Steering subsystems: capabilities, agency, and alignment

Seth Herd

Human brains have steering subsystems. LLMs and most RL agents do not. Steering systems, as defined by Max H, are goal-directed AI systems, or optimizers. Here I focus on steering subsystems: the parts of human and AI cognitive systems most directly relevant to goal-direction. These work in three distinct ways (and probably more), each providing a different type and amount of agency, and associated capabilities.

Thinking about types of steering subsystems can clarify our conceptions of agency. Steering subsystems increase risks by adding capabilities. Notably, sophisticated steering subsystems create useful representations of goals. This allows them to break complex tasks into subgoals (e.g., [prevent human interference]). Adding steering subsystems to otherwise non-agentic AI (like LLMs) may prove irresistible and dangerous, because it may allow rapid capability gains. But this scenario has an upside: aligning a steering subsystem is somewhat simpler than aligning the whole system it steers. Thus, alignment plans that focus on steering subsystems may have an advantage.

I spent a bunch of time trying to work out the brain mechanisms of complex cognition.^[1] This work is relevant to understanding different types of steering subsystems and the resulting types of agency.

Cognition is goal-directed in different ways when different steering mechanisms are used. There are several distinctions proposed by different cognitive sciences: model-based vs. model-free RL from machine learning; habitual vs. goal-directed behavior from animal neuroscience; automatic vs. controlled processing from cognitive psychology; and System 1 and System 2 thinking from behavioral economics. None of these distinctions seems to cleanly match the brain mechanisms creating different types of goal-directed cognition for human decision-making.^[2] Therefore I'll describe the cognitive mechanisms directly.

Agency is not a binary; it is at least a spectrum. Humans use at least three types of steering:

Types of steering and agency

Type 1: Systems trained with reward and reward predictions
- No steering subsystems
Type 2: Systems that predict outcomes of actions and their values
- Limited steering subsystems
Type 3: Systems that select possible high-reward outcomes as goals
- Full steering subsystems
- Hierarchical subgoal creation for planning
- Implemented only recently, few barriers to improvement

All of these are goal-directed and agentic, but in importantly different ways. So far, AI systems have only realized the latter two in very limited form, but the field is poised for progress in both of those types of steering.

Type 1: predicting reward for RL training

Most high-performing reinforcement learning systems use a critic system of some sort.^[3] This can be (arguably) considered one type of steering subsystem. The critic system is trained to predict the value (sum of future rewards) of world-states and actions. In the simplest configuration, the critic’s value estimate is used to train the actor system; the estimated value of the world-state reached by each action is used as a reward signal to train the policy. Critic systems are ubiquitous in RL systems because they're useful.^[4] In particular they are helpful in bridging temporal gaps when reward is sparse, as it is for most embodied organisms.
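To make the actor/critic split concrete, here is a minimal tabular sketch in Python. The chain environment, the function name `train_actor_critic`, and every hyperparameter are assumptions made for illustration; none of them come from a specific published agent. The critic learns state values, and its temporal-difference error (one common way of turning value estimates into a training signal) is what trains the actor. Note that the critic is never consulted when an action is chosen, only when the policy is updated.

```python
import math
import random

def train_actor_critic(n_states=5, episodes=2000, gamma=0.9,
                       alpha_v=0.1, alpha_pi=0.1, seed=0):
    """Toy chain world: start at state 0, reward 1.0 only on reaching the last state."""
    rng = random.Random(seed)
    values = [0.0] * n_states                      # critic: estimated value of each state
    prefs = [[0.0, 0.0] for _ in range(n_states)]  # actor: preferences for [left, right]

    for _ in range(episodes):
        s = 0
        for _ in range(50):  # cap on episode length
            # Actor: softmax over the two action preferences.
            p_right = 1.0 / (1.0 + math.exp(prefs[s][0] - prefs[s][1]))
            a = 1 if rng.random() < p_right else 0
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            reward = 1.0 if done else 0.0

            # Critic: how much better or worse things look than predicted.
            target = reward + (0.0 if done else gamma * values[s_next])
            delta = target - values[s]
            values[s] += alpha_v * delta

            # Actor update: the critic's signal stands in for sparse reward.
            probs = [1.0 - p_right, p_right]
            for b in (0, 1):
                grad = (1.0 if b == a else 0.0) - probs[b]
                prefs[s][b] += alpha_pi * delta * grad

            if done:
                break
            s = s_next
    return values, prefs

values, prefs = train_actor_critic()
```

After training, the actor prefers moving right in every non-terminal state, even though reward only ever arrives at the far end of the chain. The critic's estimates are what bridge that gap.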

This application of a critic is a steering subsystem in a relatively weak sense. It is just extending the effect of reward on training. If the system gets reward for finding diamonds, the critic makes it better at this learning problem by rewarding policies that achieve states that in turn lead to finding diamonds. So I would tend to not call this a steering subsystem, just a method of creating a system that does some steering. It's not a sharp line, so this arrangement of a critic system used solely for RL training could be considered to fall on either side.

In humans, we call this type of learning and behavior habitual. When we don’t have time to do more careful and time-consuming decision-making, we do things that have led to good outcomes in the past in similar contexts.

Most RL agents use only this type of learning. DeepMind's early Atari-playing agents used a critic system as an extra head of the network. This type of system uses a critic system to provide a training signal, but it is not used as part of a look-ahead (e.g., tree search) routine, or to create explicit goal representations, as in types 2 and 3 described below. Mammals, including humans, use this type of critic system, as well as types 2 and 3. The dopamine system predicts rewards,^[5] and dopamine drives learning in the rest of the system.^[6]

This type of training often results in "brittle" behaviors, typically classified as habitual or automatic. For example, I might open the fridge to look for a snack when I pass it, even if I’m not hungry. But with enough training, and good enough generalization in the system, this type of learning can produce behaviors that change appropriately to pursue rewards when the contingencies of the environment change. After enough experience, I won't open the fridge when I’m full, because I've learned it isn’t rewarding when my body is signaling satiety. Animal experiments have demonstrated this goal-dependence in habitual behavior with adequate training. There's no sharp limit to the sophistication of internal representations that could be developed with this sort of RL; a system might actually learn to emulate a steering system, even if none is explicitly designed in.

Thus, this classification is fuzzy. But it seems useful for thinking about types and degrees of agency, and how we might align agentic systems.

Type 2: steering toward estimated value of predicted outcomes

Critic systems can also function as steering subsystems by using the value of predicted outcomes to select actions. For instance, when some sort of lookahead is used (like Monte Carlo tree search in AlphaZero), the system chooses its current action based on the one that will lead to good outcomes, as estimated by the critic. This is what we seem to do when playing a game and looking a few moves ahead.
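A minimal sketch of this kind of steering, in Python. This is plain depth-limited search rather than Monte Carlo tree search, and the number-line world, the `model` and `critic` functions, and all names are hypothetical stand-ins for a learned world model and a learned value estimate.

```python
def lookahead_value(state, depth, model, critic, actions, gamma=0.95):
    """Best discounted return reachable within `depth` steps; leaves are scored by the critic."""
    if depth == 0:
        return critic(state)
    best = float("-inf")
    for a in actions:
        next_state, reward = model(state, a)
        future = lookahead_value(next_state, depth - 1, model, critic, actions, gamma)
        best = max(best, reward + gamma * future)
    return best

def choose_action(state, depth, model, critic, actions, gamma=0.95):
    """Pick the action whose predicted outcome the critic rates highest."""
    def score(a):
        next_state, reward = model(state, a)
        return reward + gamma * lookahead_value(next_state, depth - 1, model, critic, actions, gamma)
    return max(actions, key=score)

# Hypothetical toy world: positions on a number line, with no reward along the way.
def model(state, action):
    return state + action, 0.0

# Hypothetical critic: states nearer to position 3 are estimated to be more valuable.
def critic(state):
    return -abs(state - 3)

ACTIONS = (-1, 1)
move_from_0 = choose_action(0, 2, model, critic, ACTIONS)
move_from_5 = choose_action(5, 2, model, critic, ACTIONS)
```

The difference from type 1 is where the critic sits: here its estimates are consulted at decision time to pick among predicted futures, rather than only shaping the policy during training.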

Humans are thought to do this for some decisions. Introspection^[7] as well as data suggest it. Dopamine seems to signal the estimated value of whatever option the animal is currently considering, and to otherwise provide a best-guess estimate of value^[5] that is useful for Bayesian decision-making.^[8] There are probably exceptions, since longer-term and more complex decisions haven’t been thoroughly tested, but the loose match seems pretty certain. It seems that humans probably use a tree search of limited depth, made useful by good abstraction and prediction. This search is (probably) pruned by, and actions chosen using, estimated values of predicted states from the dopamine critic system.

Thus, the system looks into the future and steers toward outcomes deemed valuable according to its internal estimates of value. This type of steering is the beginning of what we might intuitively think of as "real" agentic behavior (or not; definitions vary). Discovering Agents from DeepMind defines it in line with this proposed distinction:

Agents are systems that would adapt their policy if their actions influenced the world in a different way.

This also assumes the system "knows" (accurately represents) those changes, among other assumptions. This might be restated intuitively as a system that actively pursues goals, rather than a system that produces behaviors that tended to achieve goals during its training/design. Again, there's no obvious lower bound on what type of training could produce this definition of agentic behavior. But including a type 2 steering system ensures that the system meets this definition of agency.

Type 3: steering toward self-selected goals

Humans sometimes think in an even more goal-directed and agentic way. We sometimes choose a goal and use that goal representation to drive planning. I might make a goal of going to a store to buy a snack, or starting a successful business. Those goal representations will drive my planning in direct and indirect ways, in the long and short term.

The idea of choosing a goal is at odds with how we use the term in alignment. We often use “goal” synonymously with rewards or maximization goals. I usually use "goals" synonymously with "values". But for humans and similar systems, they’re not synonymous. What we call “our values” are, I think, estimates of future rewards. This matches the term “value” in reinforcement learning nicely, if I’m roughly correct about how that works (see Human preferences as RL critic values - implications for alignment).

When we use the term goals for ourselves, we mean explicit, specific (although abstract) goals like getting a snack, getting back to work, getting a job we like, founding a business, etc. That type of goal, and the associated representations, is the heart of Type 3 steering.

This cognitive capacity has several advantages. It allows for backward-chaining from a desired goal state to actions that might achieve it. More importantly, this ability almost automatically allows an agent to strategically break a complex task into subtasks. Creating subgoals uses the same mechanisms, since humans (and effective AIs) take context into account when choosing goals. For more complex tasks and problems, this decomposition seems likely to be useful. Engineering improvements in technology will decompose into hundreds of component problems involving material properties, manufacturing processes, economic and human factors, etc.
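Backward-chaining from a chosen goal can be sketched in a few lines of Python. This is a deliberately simplified, STRIPS-like toy, not a model of how the brain does it: the `Action` type, the fact names, and the candidate goal values are all hypothetical, actions never undo facts, and each precondition is treated as an independent subgoal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: tuple  # facts that must hold before the action
    effects: tuple        # facts that hold afterwards

def select_goal(candidate_values):
    """Choose the candidate outcome with the highest estimated value as the explicit goal."""
    return max(candidate_values, key=candidate_values.get)

def backward_chain(goal, state, actions, depth=10):
    """Return a list of action names that achieves `goal`, or None if none is found."""
    if goal in state:
        return []
    if depth == 0:
        return None
    for action in actions:
        if goal not in action.effects:
            continue
        plan, achieved = [], set(state)
        for subgoal in action.preconditions:  # each precondition becomes a subgoal
            subplan = backward_chain(subgoal, achieved, actions, depth - 1)
            if subplan is None:
                break
            plan.extend(subplan)
            achieved.add(subgoal)
        else:
            return plan + [action.name]
    return None

ACTIONS = [
    Action("grab_wallet", (), ("have_wallet",)),
    Action("grab_keys", (), ("have_keys",)),
    Action("go_to_store", ("have_keys",), ("at_store",)),
    Action("buy_snack", ("have_wallet", "at_store"), ("have_snack",)),
]

goal = select_goal({"have_snack": 0.6, "reorganize_closet": 0.1})
plan = backward_chain(goal, {"at_home"}, ACTIONS)
# plan == ["grab_wallet", "grab_keys", "go_to_store", "buy_snack"]
```

The same function that plans for the top-level goal is called again on each precondition, which is the sense in which subgoal creation reuses the goal-handling machinery.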

Thus far, empirical results showing improvements from problem decomposition are weak.^[9] But it seems likely that decomposition is highly useful for effective cognition; the world, and problems in the world, really seem to decompose.

I don't know of any work that fully describes how the brain creates useful goal representations. I haven't published my theories on this in part because it could advance capabilities. But I don't think this is terribly hard to figure out. And I don’t think it requires any breakthroughs to get AI systems to do this type of goal-creation steering in other ways. Indeed, LLMs seem rather adept at breaking a problem into subproblems. Language model agents (LMAs) can perform type 3 steering, even if they’re currently not good at executing the problem-solving plans they create.

Steering subsystems, AI progress, and alignment

Language model agents usually start with the prompt “create a plan to achieve [goal]”. This creates a multi-step plan, and each step is approached separately. This is type 3 steering.
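The plan-then-execute loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the design of any particular agent framework: `llm` is any function from a prompt string to a reply string, and `fake_llm` is a hypothetical stand-in that returns canned text so the example runs without a model.

```python
def run_plan_then_execute(goal, llm, max_steps=10):
    """Ask for a plan once, then hand each step back to the model separately."""
    plan_text = llm(f"Create a plan to achieve {goal}. Put one step on each line.")
    steps = [line.strip() for line in plan_text.splitlines() if line.strip()]
    transcript = []
    for step in steps[:max_steps]:
        # Each step is pursued as its own subgoal; the top-level goal stays in context.
        result = llm(f"Overall goal: {goal}\nCurrent step: {step}\nCarry out this step.")
        transcript.append((step, result))
    return transcript

def fake_llm(prompt):
    """Hypothetical stand-in for a real model call."""
    if prompt.startswith("Create a plan"):
        return "Find a store\nWalk to the store\nBuy a snack"
    step = prompt.split("Current step: ")[1].split("\n")[0]
    return f"done: {step}"

transcript = run_plan_then_execute("getting a snack", fake_llm)
```

The explicit goal string is the steering representation here: it generates the plan, and it is carried along as context while each step is worked on.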

Language model agents have yet to accomplish anything particularly impressive, but they do show promise on some tasks (such as Minecraft). So it seems far too early to rule them out as a path to AGI. Language models have some real intelligence, and it is difficult to guess how far this can be improved by scaffolding with other cognitive systems and software tools into agentic language model cognitive architectures, or language model agents, LMAs. It is so early in the development of language model agents that I give LMAs a round no-idea 50% chance of being the first route to self-improving, self-aware, thoroughly agentic AGI.

If LMAs do achieve AGI, I think this is relatively good news. I think they offer several advantages that make them the easiest-to-align type of plausible AGI. These include easier interpretability and a potentially very low alignment tax. I’ve written about these advantages here. One major advantage is that alignment efforts can center on the steering subsystem: this type of agent can be given a top-level goal of corrigibility, and any other combination of alignment goals. These can be stated in natural language, leveraging the system’s training prior to deployment.

If language model agents aren’t the first route to AGI, I think we’ll still see AGI with powerful type 2 and type 3 steering subsystems, based on the cognitive advantages they offer. If this is correct, we should create alignment approaches that focus on steering subsystems, given their central role in goal-directed behavior.

This is why I like Steve Byrnes’ Plan for mediocre alignment of brain-like [model-based RL] AGI. It works primarily on the critic (steering) subsystem. In essence, the plan is to induce the model to “think” about the thing you want it to steer toward (e.g., “hey, think about human flourishing”), then set the weights from the representational system into the critic system to a high value. Presto, an agent that values human flourishing above all else. It's not a fully developed plan yet, but it does seem more concrete and straightforward than any other suggested approach for training human values into an RL agent. This approach also benefits by making use of the agent’s training/intelligence for alignment, something I’ll focus on in a future post. It would seem to have a low alignment tax, and it can work alongside other alignment approaches, like interpretability measures, scalable oversight, etc.
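As a toy illustration of the weight-setting step only, assume a linear critic that reads a small vector of representation activations. This is my simplification for the sake of a runnable example, not Byrnes's actual proposal in detail; the four-unit layer, the threshold, and every number below are hypothetical.

```python
def critic_value(weights, activation):
    """Linear critic: estimated value of whatever is currently represented."""
    return sum(w * a for w, a in zip(weights, activation))

def install_value(weights, concept_activation, threshold=0.5, high_value=10.0):
    """Set the critic weight to a high value for every unit active during the target thought."""
    return [high_value if a > threshold else w
            for w, a in zip(weights, concept_activation)]

# Hypothetical 4-unit representation layer and previously learned critic weights.
learned_weights = [0.2, -0.1, 0.4, 0.0]
target_thought = [0.0, 0.9, 0.0, 0.8]  # activation while "thinking about" the target concept
other_thought = [0.9, 0.0, 0.7, 0.0]   # activation for some unrelated thought

new_weights = install_value(learned_weights, target_thought)
value_of_target = critic_value(new_weights, target_thought)
value_of_other = critic_value(new_weights, other_thought)
```

In this toy the edited critic rates the target thought far above the unrelated one. Whether real, distributed representations permit anything this clean is one of the open questions for the plan.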

Loosely brainlike RL agents are a highly plausible route to AGI if language model agents don't achieve it first. And the two approaches can be combined. Using RL to train an “outer loop” of cognitive control for language model agents is a frequently-proposed approach to improving LMAs. So the two alignment approaches above, both of which focus on steering subsystems, might be combined for that type of AGI.

Both of those approaches seem very promising but provide only a loose, “mediocre” alignment with human values. Whether such a rough match is adequate is an important question. If a superintelligence values a subset of human values, will the outcome be satisfactory for humanity? What if it values a superset of our values? A second outstanding issue for these (and other network-based) approaches is the alignment stability problem. Does reflective stability (ref*) ensure long-term stability in a network-based AGI, with values that are defined by distributed representations of semantics? Or might that system’s values shift dramatically as it continues to learn? I think both of these questions merit more careful thought, and they’ll be the subjects of upcoming posts.

Thanks to Steve Byrnes and Max H for helpful discussions and comments on a draft of this article.

^[1]
This work was done in collaboration with Randy O'Reilly and many members of his computational cognitive neuroscience lab from 1999 to 2022. We made neural network models of several brain systems, based on a variety of empirical data, focusing on human cognition and animal single-cell recordings. My focus was understanding how multiple brain systems come together to produce complex decision-making and belief formation.
^[2]
For more than you want to know about the various terminologies, and a bit of high-level theory, see: O’Reilly, R. C., Nair, A., Russin, J. L., & Herd, S. A. (2020). How sequential interactive processing within frontostriatal loops supports a continuum of habitual to controlled processing. Frontiers in Psychology, 11, 380.
^[3]
I can’t easily find a comprehensive review of where actor-critic RL (AC RL) or similar systems work, and where they’re not needed. The most impressive instances of RL that I’m aware of all use AC. Those include DeepMind’s prominent RL agents, from the Atari system up through AlphaZero and AlphaFold, the OpenAI Five family of agents, ChatGPT’s RLHF (it goes by a different name, but seems firmly in the critic family) and every high-functioning instance of RL in the cognitive neuroscience space I’m more familiar with. I’d love to be corrected if it’s not necessary for some important problems. Here’s a team of real experts calling AC methods “ubiquitous” in RL: Wen, J., Kumar, S., Gummadi, R., & Schuurmans, D. (2021, July). Characterizing the gap between actor-critic and policy gradient. In International Conference on Machine Learning (pp. 11101-11111). PMLR.
^[4]
Including a critic system seems to be useful for a few reasons. It splits the learning problem into two separate pieces: what to do in a given situation, and how good each action outcome is. These are similar, and can be collapsed to one problem in either direction. But they appear to be different enough that including both provides better traction on the learning problem. They don't add that much computational cost when they're implemented as two heads of the same network, as they usually are in deep network approaches. Having a critic also enables the MCTS-boosting approach taken by AlphaZero and similar algorithms, in which a few-move lookahead is used and the best move(s) are trained into the actor. To make this useful, it's necessary to estimate which resulting board positions are best. Finally, critic systems are useful when reward is rare (as in most real-world environments), since they provide at least a guess about how likely each action is to eventually lead to reward.
^[5]
Dopamine predicting value as the sum of future rewards is an approximation. It's actually a value delta. Phasic (fast) dopamine release signals the difference between the currently predicted value, and the one predicted just prior. This is termed a reward prediction error, or RPE. This is the temporal difference (TD) critic algorithm, but most actor-critic RL systems don’t seem to employ this temporal difference, although I haven’t dug far enough into the math for high-functioning Q-learning systems (like the DeepMind RL agents) to be certain it’s not hidden in there. The AC-advantage RL approach does something similar. Signaling the derivative rather than the absolute value is advantageous when the last state is a relevant comparison. This is often the case when possible options are considered sequentially, which is one reason I think the human brain uses that approach (see the introduction to Neural mechanisms of human decision-making for more on this theory, although a clearer, more complete writeup is on my to-do list).
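For reference, the reward prediction error described in this footnote is usually written as the TD error in textbook RL notation. The symbols below are the standard ones, not notation from this post: $V$ is the critic's value estimate, $r_t$ the reward received at step $t$, and $\gamma$ the discount factor.

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```

When no reward arrives ($r_t = 0$) and discounting is ignored ($\gamma = 1$), this reduces to $V(s_{t+1}) - V(s_t)$: the currently predicted value minus the one predicted just prior.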
^[6]
Dopamine acts as the output of a critic system consisting of the amygdala and associated subcortical areas. The dopamine signal acts very much like a critic reward signal in an actor-critic RL system, by triggering positive or negative learning directly in the striatum, a large subcortical area that's heavily involved in action selection and decision-making. This system has been relatively well-investigated; for a review see Mollick, J. A., Hazy, T. E., Krueger, K. A., Nair, A., Mackie, P., Herd, S. A., & O'Reilly, R. C. (2020). A systems-neuroscience model of phasic dopamine. Psychological Review, 127(6), 972. Dopamine affects learning in the cortex in less well-understood ways.
^[7]
Introspection is rarely mentioned in published papers. Private conversations suggest that cognitive scientists lean heavily on introspection when producing hypotheses and interpreting data. I take introspection seriously when it's done carefully. From a materialist perspective, it would be quite odd if introspection told us nothing about brain processes. Much has been made of a set of studies showing that introspection can be quite mistaken in some cases. That work is neatly summarized in Tim Wilson's Strangers to Ourselves, which I highly recommend for insight into the likely nature of unconscious processing. However, those studies can be summarized as showing that people can be mistaken about how they've made decisions, not that they are wrong about what they're thinking about. The hypothesis that we're aware of roughly the contents of working memory at any given moment, originating in the cognitive revolution, still seems perfectly viable as reviewed here. A critical review of purported counterevidence can be found here.
^[8]
For an excellent review see Gershman, S. J., & Uchida, N. (2019). Believing in dopamine. Nature Reviews Neuroscience, 20(11), 703-714. They review the empirical evidence, and show that dopamine signaling captures uncertainty as well as expected value, and is useful for Bayesian belief formation as well as decision-making.
^[9]
Tree of Thoughts and related work show efficacy in toy problems designed to decompose well, so they're not direct evidence this is important in AGI-relevant domains. LLMs decompose problems remarkably well with appropriate prompting, without steering subsystems to aid them, but it does seem like explicit decomposition mechanisms can only help, and may prove critical in tasks complex enough (like solving engineering problems) where humans definitely use problem decomposition.
