Why would an AI "want" anything? This post answers that question by examining a key part of the structure of intelligent cognition.
When you solve a novel problem, your mind searches for plans, predicts their outcomes, evaluates whether those outcomes achieve what you want, and iterates. I call this the "thinking loop". We will build some intuition for why any AI capable of solving difficult real-world problems will need something structurally similar.
This framework maps onto model-based reinforcement learning, where separate components predict outcomes (the model) and evaluate them (the critic). We'll use model-based RL as an important lens for analyzing alignment in this sequence—not because AGI will necessarily be built this way, but because analogous structure will be present in any very capable AI, and model-based RL provides a cleaner frame for examining difficulties and approaches.
The post is organized as follows:
Section 1.2 distinguishes goal-directed reasoning from habitual/heuristic reasoning, and explains how these interact. It also builds intuition for the thinking loop through everyday examples, and explains why it is necessary for solving novel problems.
Section 1.3 connects this to model-based reinforcement learning - the main framework we will be using for analyzing alignment approaches in future posts.
Section 1.4 explains that some behaviors are not easily compatible with goal-directed reasoning, including some we find intuitive and desirable.
Section 1.5 argues that for most value functions, keeping humans alive isn't optimal. It considers why we might survive - the AI caring about us, successful trade, exotic possibilities - but concludes we mostly need to figure out how to point the values of an AI.
Section 1.6 is a brief conclusion.
1.2 Thinking ahead is useful
1.2.1 Goal-Directed and Habitual Reasoning
Suppose you want to drive to your doctor’s office. To get there, you need to get to your doctor’s town, for which you need to drive on the highway from your town to your doctor’s town, for which you need to drive to the highway, for which you need to turn left. Your mind quickly reasons through those steps without you even noticing, so you turn left.
We can view your mind here as having a desired outcome (being at the doctor’s office) and searching for plans (what route to take) to achieve that outcome. This is goal-directed reasoning, which is an important part of cognition, but not the whole story yet.
Suppose the route to your doctor’s office starts out similar to the route to your workplace. So you’re driving along the highway and—wait, was that the exit where you needed to get off? Darn.
Here, you mistakenly drove further towards your workplace than you wanted to. Why? Because you have heuristics that guessed this would be the right path to take, and your goal-directed reasoning was dormant because you were thinking about something else. This is an instance of habitual reasoning.
For effectively accomplishing something, there’s usually an interplay between goal-directed and habitual reasoning. Indeed, when you originally planned the route to your doctor’s town, heuristics played a huge role in you being able to think of a route quickly. Probably you drove there often enough to remember what route to take, and even if not, you have general heuristics for quickly finding good routes, e.g. “think of streets between you and your doctor’s town”.
1.2.2 The Thinking Loop
So far we have heuristics for proposing good plans, and a plan-evaluator which checks whether a plan is actually good for accomplishing what we want. If a plan isn’t good, we can query our heuristics again to adjust the plan or propose a new one, until it’s good.
Let’s split up the plan-evaluator into a model, which predicts the outcomes of a plan, and a critic, which evaluates how good those outcomes are. Let’s also call the plan-proposing heuristics part “actor”. This gives us the thinking loop:
This diagram is of course a simplification. E.g. the actor may be heavily intertwined with the model rather than being a separate component, and plans aren’t proposed all at once. What matters is mostly that there is some model that predicts what would happen conditional on some actions/plan, some evaluation function of which outcomes are good/bad, and some way to efficiently find good plans.
Also, “outcomes” is supposed to be interpreted broadly, encompassing all sorts of effects from the plan including probabilistic guesses, not just whether a particular goal we care about is fulfilled.
And of course, this is only one insight into intelligence—there are more insights needed for e.g. figuring out how to efficiently learn a good world model, but we won’t go into these here.
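To make the loop concrete, here is a minimal sketch in Python. Everything in it (the toy actor, model, and critic) is invented for illustration; the point is only the propose-predict-evaluate-refine structure.

```python
import random

def thinking_loop(actor_propose, model_predict, critic_evaluate, n_iterations=100):
    """Toy version of the thinking loop: propose a plan, predict its outcome
    with the model, score the outcome with the critic, and keep the best plan
    found so far."""
    best_plan, best_score = None, float("-inf")
    for _ in range(n_iterations):
        plan = actor_propose(best_plan)        # heuristically propose or adjust a plan
        outcome = model_predict(plan)          # predict what the plan would lead to
        score = critic_evaluate(outcome)       # evaluate how good that outcome is
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan

# Toy problem: find a number whose square is close to 42.
actor = lambda prev: (prev if prev is not None else 0.0) + random.gauss(0, 1)
model = lambda plan: plan ** 2
critic = lambda outcome: -abs(outcome - 42)

print(thinking_loop(actor, model, critic))   # roughly ±6.5 after enough iterations
```

In a real mind the proposer is of course far smarter than random perturbation, and the loop runs over rich plans rather than a single number.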
Let’s look at a few examples for understanding thinking-loop structure.
Catching a ball:
You see the ball flying and you have a current plan for how to move your muscles.
The model part of your brain predicts where the ball will end up and where your hands will be given the planned muscle movements.
The critic part checks whether the predicted position of your hands is such that you’ll catch the ball. Let’s say it’s a bit off, so it sends the feedback that the hands should end up closer to the predicted position of the ball.
The actor part adjusts the planned muscle movements such that it expects the hands to end up in the right place.
… these steps repeat multiple times as the ball is flying.
Again, this isn’t meant to be a perfect description of what happens in your mind, but it is indeed the case that something vaguely like this is happening. In particular, there is some lookahead to what you expect the future state to be. It may feel like your muscles just moved reflexively to catch the ball, but that’s because you don’t have good introspective access to what’s happening in your mind.
Reminding a friend of a meeting:
Say you have a meeting in a few hours with a friend who often forgets meetings.
You remember your default plan of going to the meeting.
The model predicts that it’s plausible that your friend forgets to show up.
The critic evaluates the possibility where your friend doesn’t show up as bad.
The actor takes in that information and proposes to send a message to remind your friend.
The model predicts your friend will read it and come.
This evaluates as good.
Programming a user interface:
When a programmer builds a user interface, they have some vision in mind of how they want the user interface to look, and their mind is efficiently searching for what code to write such that the user interface will look the way they intend.
1.2.3 Goal-Directed Reasoning is Important for Doing New Stuff
Suppose I give you the following task: Take a sheet of paper and a pair of scissors, cut a hole in the sheet of paper, and then fit yourself fully through the hole. (Yes, it’s possible.)
Now, unless you’ve heard the problem before, you probably don’t have any heuristics that directly propose a working solution. But you have a model in your mind with which you can simulate approaches (i.e. visualizing cutting the paper in some way) and check whether they work. Through intelligently searching for approaches and simulating them in your model, you may be able to solve the problem!
This is an example of the general fact that goal-directed reasoning generalizes further than just behavioral heuristics.
A related phenomenon is that when you learn a skill, you usually first perform the skill through effortful and slow goal-directed reasoning, and then you learn heuristics that make it easier and faster.
This is not all you need for learning skills, though—you also often need feedback from reality in order to improve your model. E.g. when you started learning to drive, you perhaps didn’t even have a good model of how much the car accelerates when you press the gas pedal down 4 cm. So you needed some experience to learn a good model, and then some more experience before your heuristics could guess very well how far to press the gas pedal for the acceleration you want.
1.2.4 The deeper structure of optimization
The thinking loop is actually an intuitive description of a deeper structure. In the words of Eliezer Yudkowsky:
The phenomenon you call by names like "goals" or "agency" is one possible shadow of the deep structure of optimization - roughly, preimaging outcomes onto choices by reversing a complicated transformation.
Aka, the model is a complicated transformation from choices/plans to outcomes, and we want to find choices/plans that lead to desired outcomes. One common way to find such plans is by doing some good heuristic search like in the thinking loop, but in principle you could also imagine other ways to find good plans—e.g. in a simple domain where the model is a linear transformation one could just invert the matrix.
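As a toy illustration of that last point: if the model happens to be a linear map, you can skip the search entirely and solve directly for a plan that produces the desired outcome. (The numbers below are arbitrary, chosen only for illustration.)

```python
import numpy as np

# Toy linear "model": outcome = W @ plan.
W = np.array([[2.0, 0.0],
              [1.0, 3.0]])
desired_outcome = np.array([4.0, 5.0])

# Instead of searching over plans, invert the transformation directly.
plan = np.linalg.solve(W, desired_outcome)

print(plan)       # [2. 1.]
print(W @ plan)   # [4. 5.], i.e. the desired outcome
```

Real-world models are nothing like this nice, which is why intelligent heuristic search is the workhorse in practice.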
1.3 Model-Based RL
Hopefully the above gave you an intuition for why smart AIs will very likely reason in a way that is at least reminiscent of the actor-model-critic loop, though possibly only in an implicit way.
Current LLMs don’t have a separately trained model and critic. But they are trained to think in goal-directed ways in their chain of thought, so their thinking does embody the thinking loop at least a bit, although I expect future AIs to be significantly more goal-directed.[1]
The special case where we do have separately trained modules for the actor, model, and critic is called “actor-critic model-based RL”. One example is the family of systems that includes AlphaGo, AlphaZero, MuZero, and EfficientZero.
In a sense, the actor mostly helps with efficiency. It’s trained to propose plans whose consequences will be evaluated as good by the critic, so if you want to see where the actor steers, you can look at what outcomes the critic likes. So for analyzing alignment, what matters is the model and the critic[2].
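Here’s a hedged toy sketch of that point, loosely in the spirit of AlphaZero-style policy improvement rather than anyone’s actual training code: a brute-force search scores actions with the model and critic, and the actor is then nudged toward proposing the action the search found, so it ends up pointing at whatever the critic likes.

```python
import numpy as np

n_actions = 5
actor_logits = np.zeros(n_actions)      # the actor: a distribution over actions

def model_predict(action):              # toy model: action -> predicted outcome
    return action * 0.5

def critic_evaluate(outcome):           # toy critic: prefers outcomes near 1.5
    return -abs(outcome - 1.5)

for _ in range(200):
    # "Search": score every action by the critic's evaluation of the model's prediction.
    scores = [critic_evaluate(model_predict(a)) for a in range(n_actions)]
    best_action = int(np.argmax(scores))

    # Distill: nudge the actor toward the action that the search found.
    probs = np.exp(actor_logits) / np.exp(actor_logits).sum()
    grad = -probs
    grad[best_action] += 1.0            # cross-entropy gradient toward best_action
    actor_logits += 0.1 * grad

# The trained actor now proposes the critic-approved action without any search.
print(int(np.argmax(actor_logits)))     # 3, since model_predict(3) == 1.5
```

The actor’s behavior here is entirely downstream of what the critic rewards, which is why the critic (together with the model) is where the alignment-relevant content lives.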
For the start of this sequence we will focus on the case where we do have a separately trained model and critic—not because I think AGI will be created this way (although it could be[3]), but because it’s a frame in which alignment can be more easily analyzed.[4]
1.3.1 How the Model and Critic are Trained
Reward is a number that indicates how good/bad something that just happened was (the higher the reward, the better). Here are examples of possible reward functions:
For an AI that solves math problems, you could have a reward function that gives reward every time the AI submits a valid proof (or disproof) for a list of conjectures we care about.
For a trading AI, the change in the worth of the AI’s portfolio at each step could be used as the reward (a minimal version is sketched in code below this list).
We can also use humans deciding when to give reward as the reward function: they can watch for the AI doing something that seems good/bad to them and then give positive/negative reward.
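For the trading example above, the reward function might literally be this simple (a hypothetical sketch; a real setup would also account for fees, risk limits, and so on):

```python
def trading_reward(portfolio_value_before: float, portfolio_value_after: float) -> float:
    """Reward for one timestep: the change in the portfolio's worth."""
    return portfolio_value_after - portfolio_value_before

# E.g. the portfolio went from $100,000 to $100,250 this step -> reward of 250.0.
print(trading_reward(100_000.0, 100_250.0))
```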
The model is trained to predict observations; the critic is trained to predict expected future reward given the state of the model.[5] However, the critic isn’t necessarily a very smart part of the AI, and it’s possible that it learns to predict simple correlates of reward, even if the model part of the AI could predict reward even better.
For example, if I were to take heroin, that would trigger high reward in my brain, and I know that. However, since I’ve never taken heroin, a key critic (also called the valence thought assessor) in my brain doesn’t yet predict high value for the plan of taking heroin. If the critic were a smart mind trying to (not just trained to) predict reward, I would’ve become addicted to heroin just from learning that it exists! But it’s just a primitive estimator that doesn’t understand the abstract thoughts of the model, and so it would only learn to assign high value to taking heroin after I tried it.
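To make the training targets concrete, here is a minimal sketch under toy assumptions (my own illustration, not a description of any particular system): the model is fit to predict the next observation, and the critic is fit toward a temporal-difference target built from observed reward.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9                      # discount factor (see footnote 5 on time-discounting)
lr = 0.02

# Toy environment: the state drifts toward zero, actions push it around,
# and reward is higher the closer the next state is to zero.
def env_step(state, action):
    next_state = 0.8 * state + action + rng.normal(0.0, 0.05)
    return next_state, -abs(next_state)

model_w = np.zeros(2)            # model: next_state ≈ w0*state + w1*action
critic_w = 0.0                   # critic: value(state) ≈ critic_w * |state|

state = 0.0
for _ in range(20000):
    action = rng.choice([-1.0, 1.0])
    next_state, reward = env_step(state, action)
    features = np.array([state, action])

    # The model is trained to predict the observation (the next state).
    model_error = next_state - model_w @ features
    model_w += lr * model_error * features

    # The critic is trained toward expected discounted future reward (a TD target).
    td_target = reward + gamma * critic_w * abs(next_state)
    critic_error = td_target - critic_w * abs(state)
    critic_w += lr * critic_error * abs(state)

    state = next_state

print(model_w)    # ≈ [0.8, 1.0]: the model has learned the environment's dynamics
print(critic_w)   # negative: the critic has learned that states far from zero are bad
```

Note that the critic here only ever sees a crude feature of the state (its distance from zero), which is a toy analogue of the point above: the critic can end up tracking simple correlates of reward rather than everything the model knows.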
Anyway, very roughly speaking, the alignment problem is the problem of how to make the critic assign high value to states we like and low value to states we don’t like. More in upcoming posts.
1.4 Getting the behavior you want may be more difficult than you think
We would like to have an AI that can work on a difficult problem we give it (e.g. curing cancer), and that also shuts down when we ask it to. Suppose we solved the problem of how to get the goals we want into an AI; then it seems like it should be quite feasible to make it behave that way, right?
Well, it’s possible in principle to have an AI that behaves that way. If we knew the 10,000 actions an AI would need to take to cure cancer, we could program an AI that behaves as follows:
For each timestep t:
Did the operator ask me to shut down?
Yes -> shut down.
No -> do the t-th action in the list of actions for curing cancer.
The problem is that we do not have a list of 10,000 steps for curing cancer. In order to cure cancer, we need some goal-directed search toward that outcome.
So can we make the AI care about both curing cancer and shutting down when asked, without it trying to make us shut it down or otherwise behaving in an undesirable way?
No, we currently can’t—it’s an unsolved problem. Nobody knows how to do that even if we could make an AI pursue any goals we give it. See Shutdown Buttons and Corrigibility for why.
There is, though, a different approach to corrigibility that might do a bit better, which I’ll discuss later in this series.
The lesson here isn’t just that shutdownability is difficult, but that intuitive hopes for how we expect an AI should behave may not actually be a realistic possibility for smart AIs.
1.4.1 Current AIs don’t provide good intuitions for this difficulty
Current AIs consist mostly of behavioral heuristics - their goal-directed reasoning is still relatively weak. So for now you can often just train your AI to behave the way you want and it sorta works.[6] But when the AI gets smarter it likely stops behaving the way you want, unless you did your alignment homework.
1.5 In the limit of capability, most values lead to human extinction
Initially, the outcomes a young AI considers may be rather local in scope, e.g. about how a user may respond to an answer the AI gives.
But as an AI gets much smarter and thereby able to strongly influence the future of the universe, the outcomes the AI has preferences over will be more like universe-trajectories[7].[8][9] This change is likely driven both by the AI imagining more detailed consequences of its actions, and by parts of the AI trying to rebind its values when the AI starts to model the world in a different way[10] and to resolve inconsistencies in the AI’s preferences.
In the limit of technology, sentient people having fun isn’t the most efficient solution for basically any value function, except if you value sentient people having fun. A full-blown superintelligence could take over the world and create nanotechnology with which it disassembles the lithosphere[11] and turns it into von Neumann probes.
Viewed like this, the question isn’t so much “why would the AI kill us?” as “why would it decide to keep us alive?”. Three possible reasons:
The AI terminally cares about us in some way. (Including e.g. if it mostly has alien values but has some kindness towards existing agents.[12])
We’ve managed to trade with the AI in a way where it is bound by its commitments.
(Acausal trade with distant superintelligences that pay for keeping us alive.)
I’m not going to explain option 3 here - that might take a while. I just listed it for completeness, and I don’t think it is strategically relevant for most people.
Option 2 looks unworkable, unless we manage to make the superintelligence robustly care about particular kinds of honor or fairness, which it seems unlikely to care about by default.[13]
One way to look at option 1 is like this: Most value functions over universe-trajectories don’t assign high value to configurations where sentient people are having fun (let alone those configurations being optimal), nor do most value functions imply kindness to agents nearby.
That’s not a strong argument for doom yet - value functions for our AIs aren’t sampled at random. We might be able to make the AI have the right values, and we haven’t discussed yet whether it’s easy or hard. Introducing you to the difficulties and approaches here is what this sequence is for!
1.6 Conclusion
Now you know why we expect smart AIs to have something like goals/wants/values.
Such AIs are sometimes called “optimizers” or “consequentialists”, and the insight that smart AIs will have this kind of structure is sometimes called “consequentialism”[14]. The thinking loop is actually just an important special case in the class of optimization processes, which also e.g. includes evolution and gradient descent.[15]
In the next post, we’ll look at an example model-based RL setup and what values an AI might learn there, and thereby learn about key difficulties in making an AI learn the right values.
Questions and Feedback are always welcome!
Aside from thinking-loop structure in the chain of thought of the models, there is likely also lookahead within the forward pass of an LLM, where this information is then used within the same forward pass to decide which tokens to output. Given the standard transformer architecture, though, this lookahead->decision structure lacks some loopiness, so the full thinking-loop structure comes from also having chain-of-thought reasoning (unless the newest LLMs have some relevant changes in architecture). ↩︎
Btw, the critic is sometimes also called “value function” or “reward model”. ↩︎
The human brain seems to have a separately learned model and a critic (aka “learned value function”) that scores thoughts. See Steven Byrnes’ Intro to Brain-Like AGI and Valence series for more. ↩︎
Whether that makes alignment easier or harder than with LLMs is hard to say and I will return to this question later in this series. ↩︎
Where future reward may be time-discounted, aka reward in the far future doesn’t count as fully toward the value score as near-term reward. ↩︎
Well, only sorta; see e.g. the links in this quote from here: “Sydney Bing gaslit and threatened users. We still don’t know exactly why; we still don’t know exactly what was going through its head. Likewise for cases where AIs (in the wild) are overly sycophantic, seem to actively try to drive people mad, reportedly cheat and try to hide it, or persistently and repeatedly declare themselves Hitler. Likewise for cases in controlled and extreme environments where AIs fake alignment, engage in blackmail, resist shutdown, or try to kill their operators.” ↩︎
By universe-trajectory I mean the time-expanded state of the universe, aka the history and the future combined, like the history of the universe as seen from the standpoint of the end of the universe. (It’s actually not going to be exactly universe-trajectories either, and instead something about how greater reality even beyond our universe looks, but that difference doesn’t matter for us in this series of posts.) ↩︎
The claim here isn’t that the AI internally thinks about full universe-trajectories and scores how much it likes them, but that it will have a utility function over universe-trajectories which it tries to optimize, but may do so imperfectly because it is computationally bounded. This utility function can be rather indirect and doesn’t need to be written in pure math but can draw on the AI’s ontology for thinking about things. Indeed, some smart reflective humans already qualify here because they want something like their coherent extrapolated volition (CEV) to be fulfilled, which is a very indirect function over universe-trajectories (or rather greater reality). (Of course, these humans often still take actions other than what they expect is best according to their CEV, or you could see it as them pursuing CEV but in the presence of constraints of other drives within themselves.) This all sounds rather fancy, but ultimately it’s not that complex. It’s more like realizing that if you had god-like power to reshape the resources of a galaxy, you could reshape it into a state that seems nice to you, and then you update that it’s part of your preferences to fill galaxies with cool stuff. ↩︎
There’s no claim here that the AI needs to value far-future universe states as much as very near-future states. You can totally have a function over universe-trajectories that is mostly about what happens in the near future, although it may be less simple, and how the universe will look in the far future will then mostly depend on the parts of the AI’s preferences that are about the far future. ↩︎
For instance, suppose the AI has a simple model of the human overseer’s mind with a “values” component, and it cares about whatever those values say. But then the AI learns a much more detailed psychological model of human minds, and there’s no very clear “values” node - there may be urges, reflectively endorsed desires, what the human currently thinks their values are vs. what they will likely think in the future. Then there needs to be some procedure to rebind the original “values” concept to the new ontology so the AI can continue to care about it. ↩︎
Aka the outer crust of the earth. ↩︎
Although I will mostly focus on the case where the AI learns human values, so we fulfill humanity’s potential of spreading love and joy through the galaxies, rather than merely surviving. ↩︎
Achieving this is also an alignment problem. To me this approach doesn’t look much easier than making the AI care about human values, since the bottleneck is targeting an AI’s values at anything in particular in a way that gets preserved as the AI becomes superintelligent. ↩︎
Because there are close parallels to the ethical theory that’s also called “consequentialism”, which argues that the morality of an action should be judged based on the action’s outcomes. ↩︎
Roughly speaking, other optimizers still have some loop like the thinking loop, but instead of optimizing plans they may optimize genes or model weights; instead of using a model to predict outcomes, you can just run tests in the world and see the outcome; and instead of having complex actor heuristics, you can have very simple heuristics that systematically improve the thing that is being optimized. So for evolution we have: genotype distribution -> phenotype distribution -> (which ones reproduce?) -> new genotype distribution. For gradient descent: model weights -> predictions on data -> loss (aka how badly you predicted) -> compute gradient and update model weights to produce a lower loss. ↩︎
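Written out as a tiny sketch (a toy least-squares fit of my own, just to show the weights -> predictions -> loss -> update cycle from the footnote above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data generated by a "true" line y = 3x + 1 plus noise.
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=100)

w, b = 0.0, 0.0                          # model weights
lr = 0.1

for _ in range(500):
    pred = w * x + b                     # predictions on data
    loss = np.mean((pred - y) ** 2)      # how badly you predicted
    grad_w = np.mean(2.0 * (pred - y) * x)   # compute the gradient...
    grad_b = np.mean(2.0 * (pred - y))
    w -= lr * grad_w                     # ...and update the weights to lower the loss
    b -= lr * grad_b

print(w, b)   # ≈ 3.0 and 1.0
```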