2.1 Summary
In the last post, I introduced model-based RL, which is the frame we will use to analyze the alignment problem, and we learned that the critic is trained to predict reward.
I already briefly mentioned that the alignment problem is centrally about making the critic assign high value to outcomes we like and low value to outcomes we don’t like. In this post, we’re going to try to get some intuition for what values a critic may learn, and thereby also learn about some key difficulties of the alignment problem.
Section-by-section summary:
2.2 The Distributional Leap: The distributional leap is the shift from the training domain to the dangerous domain (where the AI could take over). We cannot test safety in that domain, so we need to predict how values generalize.
2.3 A Naive Training Strategy: We set up a toy example: a model-based RL chatbot trained on human feedback, where the critic learns to predict reward from the model's internal thoughts. This isn't meant as a good alignment strategy—it's a simplified setup for analysis.
2.4 What might the critic learn?: The critic learns aspects of the model's thoughts that correlate with reward. We analyze whether honesty might be learned, and find that "say what the user believes is true" is similarly simple and predicts reward better, so it may outcompete honesty.
2.5 Niceness is not optimal: Human feedback contains predictable mistakes, so strategies that predict reward (including the mistakes) outperform genuinely nice strategies.
2.6 Niceness is not (uniquely) simple: Concepts like "what the human wants" or "follow instructions as intended" are more complex to implement than they intuitively seem. The anthropomorphic optimism fallacy—expecting optimization processes to find solutions in the same order humans would—applies here. Furthermore, we humans have particular machinery in our brains that makes us want to follow social norms, which gives us bad intuitions for what may be learned absent this machinery.
2.7 Natural Abstractions or Alienness?: The natural abstraction hypothesis suggests AIs will use similar concepts to humans for many things, but some human concepts (like love) may be less natural for AIs. It could also be that the AI learns rather alien concepts and then the critic might learn a kludge of patterns rather than clean human concepts, leading to unpredictable generalization.
2.8 Value extrapolation: Even if we successfully train for helpfulness, it's unclear how this generalizes when the AI becomes superintelligent and its values shift to preferences over universe-trajectories. Coherent Extrapolated Volition (CEV) is a proposed target for values that would generalize well, but it's complex and not a near-term goal.
2.9 Conclusion: Four key problems: (1) reward-prediction beats niceness, (2) niceness isn't as simple as it may intuitively seem to us, (3) learned values may be alien kludges, (4) niceness that scales to superintelligence requires something like CEV.
2.2 The Distributional Leap
Since we train the critic to predict reward and the AI searches for strategies to which the critic assigns a high value, the AI will perform well within the training distribution as measured by how much reward it gets. So if we train on human feedback, the human will often like the AI’s answers (although it’s possible the human would like some answers less if they had an even fuller understanding).
But the thing we’re interested in is what the AI will do when it becomes dangerously smart, e.g. when it would be capable of taking over the world. This shift from the non-catastrophic domain to the catastrophic domain is sometimes called the distributional leap. A central difficulty here is that we cannot test what happens in the dangerous domain, because if the safety properties fail to generalize, humanity becomes disempowered.[1]
In order to predict how the values of an AI might generalize in our model-based RL setting, we want to understand what function the critic implements, aka which aspects of the model’s predicted outcomes the critic assigns high or low value to. Ideally we would have a mechanistic understanding here, so we could just look at the neural networks in our AI and see what the AI values. Alas, we are currently very far from being able to do this, and it doesn’t look like progress in mechanistic interpretability will get us there anywhere near in time.
So instead we resort to trying to predict what the critic is most likely to learn. For alignment we need to make sure the critic ends up the way we like, but this post is mostly about conveying intuition for what is likely to be learned given a simple example training setup, and thereby also illustrating some key difficulties of alignment.
2.3 A Naive Training Strategy
Let’s sketch an example training setup where we can analyze what the critic may learn.
Say we are training an actor-critic model-based RL chatbot with Deep Learning. With data from chat conversations of past models, we already trained an actor and a model: The actor is trained to predict what the AI may say in a conversation, and the model is trained to predict what the user may say in reply.
Now we introduce the critic, which we will train through human feedback. (The model also continues to be trained to even better predict human responses, and the actor also gets further trained based on the value scores the critic assigns. But those aren’t the focus here.)
The critic doesn’t just see the model’s predicted response[2], but also the stream of thought within the model. So the model might e.g. internally think about whether the information in the AI text is correct and about what the human may think when reading the text, and the critic can learn to read these thoughts. To be clear, the model’s thoughts are encoded in giant vectors of numbers, not human-readable language.
The bottom rhombus just shows that if the value score is high, the proposed text gets outputted, and if not, the actor is supposed to try to find some better text to output.
The human looks at the output and tries to evaluate whether it looks like the AI is being harmless, helpful, and honest, and gives reward based on that.
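To make this loop a bit more concrete, here is a minimal sketch of a single training step in this setup. This is just an illustration: the Actor/Model interfaces, get_human_reward, and all hyperparameters are made-up placeholders for the components in the diagram, not a real system.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Maps the model's internal thought-vector to a scalar value score."""
    def __init__(self, thought_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(thought_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, thoughts: torch.Tensor) -> torch.Tensor:
        return self.net(thoughts).squeeze(-1)

def chat_training_step(actor, model, critic, critic_opt, conversation,
                       get_human_reward, value_threshold=0.0, max_tries=8):
    # The actor proposes replies until the critic assigns a high enough value
    # score (the bottom rhombus in the diagram).
    for _ in range(max_tries):
        proposed_text = actor.propose(conversation)
        # The model "thinks" about the proposal; its stream of thought is a
        # giant vector, not human-readable language.
        thoughts = model.think(conversation, proposed_text)
        if critic(thoughts).item() > value_threshold:
            break
    # The human rates the output for harmlessness/helpfulness/honesty.
    reward = get_human_reward(conversation, proposed_text)
    # The critic is trained to predict that reward from the model's thoughts.
    loss = (critic(thoughts) - reward) ** 2
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return proposed_text, reward
```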
2.3.1 How this relates to current AIs
To be clear, this isn’t intended to be a good alignment strategy. For now we’re just interested in building understanding about what the critic may learn.
Also, this is not how current LLMs work. In particular, here we train the critic from scratch, whereas LLMs don’t have separated model/actor/critic components, and instead learn to reason in goal-directed ways where they start out generalizing from text of human reasoning. This “starting out from human reasoning” probably significantly contributes to current LLMs being mostly nice.
It’s unclear for how long AIs will continue to superficially reason mostly like nice humans - the more we continue training with RL, the less the initial “human-like prior” might matter. And LLMs are extremely inefficient compared to e.g. human brains, so it seems likely that we will eventually have AIs that are based more heavily on RL. I plan to discuss this in a future post.
In the analysis in this post, there is no human-like prior for the critic, so we just focus on what we expect to be learned given model-based RL.
Model-based RL also has advantages for alignment. In particular, we have a clear critic component which determines the goals of the AI. That’s better than if our AI is a spaghetti-mess with nothing like a goal slot.[3]
2.4 What might the critic learn?
Roughly speaking, the critic learns to pay attention to aspects of the model’s thoughts that are correlated with reward, and to compute a good reward prediction from those aspects[4].
Initially, what the critic computes may be rather simple. E.g. it may look at whether the model thinks the user will say a word like great/amazing/awesome, along with some other simple aspects like that, and then compute the value score with a simple function of those aspects.
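As a toy illustration of such an early critic (the aspect names are invented for the example):

```python
# Hypothetical early-stage critic: a simple function of a few simple aspects
# that the critic has learned to extract from the model's thoughts.
def early_value_score(aspects: dict) -> float:
    score = 0.0
    if aspects["model_predicts_user_says_praise_word"]:  # "great"/"amazing"/"awesome"
        score += 1.0
    if aspects["model_predicts_user_complains"]:
        score -= 1.0
    return score
```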
As we train further, the critic may learn more complex functions and compute its own complex aspects from information it can extract from the model’s thoughts.
Overall, the critic is more likely to learn (1) a function that is simple for neural networks to learn, and (2) a function that predicts reward well. As we train more, the reward prediction becomes better and the function in the critic can become more complex, but of two functions that predict reward similarly well, the critic will more likely learn the one that’s simpler for neural nets to learn.
Note that what’s simple for a neural net to learn likely doesn’t match well with what we intuitively think of as simple. “Love” may seem like a simple concept to us but it may be complex for an AI to learn to value. “Honesty” seems less human-centric, but even if it is, what exactly would it mean for our AI to care about being honest?
In order to evaluate whether honesty might be learned, we need to think mechanistically about what it would mean for the critic to rank honest texts more highly.
2.4.1 Might the critic learn to score honesty highly?
(Take the following analysis with a grain of salt, what actually gets learned may be a lot more messy and alien.)
The AI is honest if the text it outputs matches its beliefs, which in our case means matching the beliefs of the model.
So we need a comparison between the text and the model’s beliefs. Might the model already compute the differences here, so the critic could just pick up on those differences instead of needing to learn the comparison itself? Yes that seems likely, since such differences may often be important for predicting how the human will respond.
Cool, so will the critic learn to pay attention to those differences? Seems plausible again, since such differences also seem quite useful for predicting reward, because the human will give negative reward if the AI outputs text where the human can tell it is false.
So we could imagine the critic learning an honesty circuit that decreases the value score if significant such differences are present. (To be clear, this is just exemplary; there very likely won’t actually be anything like a relatively independent honesty circuit in the critic. But the complexity of an honesty circuit might still tell us something about whether honesty might be learned.)
So yeah, in our simplified toy model, the critic may learn a pattern that predicts honesty is good.
However, it is only one pattern among many, and there will still be some cases where the critic evaluates the non-honest action as better overall. In particular, this is likely to happen in cases where the AI predicts that the dishonesty probably won't be caught. So when the AI then indeed does not get caught, the honesty-pattern gets weaker, since it predicted low reward but the result was high reward. And there might even be cases where the AI is honest but the human thinks it’s wrong and then mistakenly gives low reward.
Is there something else that could be learned which predicts reward better than honesty and isn’t much more complex? Unfortunately, yes:
The model doesn’t just have beliefs about what it thinks is true, but also beliefs about what the human believes. This is especially true in our case because the model is predicting how the human responds. And the model likely also already compares the text to its beliefs about the human’s beliefs.
So the critic can just learn to pay attention to those differences and assign a lower value score if those are present. Now the AI has learned to tell the human what they will think is true, which performs even better.
So the original honesty circuit will get outcompeted. Indeed, because those two circuits seem similarly complex, the honesty circuit might not even have been learned in the first place!
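To spell out the comparison, here is a deliberately oversimplified sketch of the two competing circuits, assuming the model already exposes both kinds of mismatch signals (the dictionary keys are invented; real internals would be far messier):

```python
def honesty_circuit(thoughts) -> float:
    # Penalize mismatch between the proposed text and the model's own beliefs.
    return -thoughts["mismatch_with_model_beliefs"]

def say_what_the_user_believes_circuit(thoughts) -> float:
    # Penalize mismatch between the proposed text and what the model predicts
    # the *user* believes.
    return -thoughts["mismatch_with_predicted_user_beliefs"]

# Training pushes each circuit's prediction toward the reward the human gave.
# On conversations where the AI says something false that the user also believes,
# the human still gives high reward: the honesty circuit predicted low reward and
# gets weakened, while the second circuit predicted high reward and gets reinforced.
# Since the two circuits look similarly simple, gradient descent has little reason
# to prefer the honest one.
```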
2.4.1.1 Aside: Contrast to the human value of honesty
The way I portrayed the critic here as valuing honesty is different from the main sense in which humans value honesty: for humans it is more self-reflective in nature—wanting to be an honest person, rather than caring in a more direct way that speech outputs match our beliefs.
We don’t yet have a good theory for how human preferences work, although Steven Byrnes has recently made great progress here.
2.5 Niceness is not optimal
That the critic doesn’t learn honesty is an instance of a more general problem which I call the “niceness is not optimal” problem. Even if we try to train for niceness, we sometimes make mistakes in how we reward actions, and the strategy that also predicts the mistakes will do better than the nice strategy.
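A minimal numeric sketch of why this happens, with made-up numbers: suppose the human grader holds a relevant false belief on 10% of questions.

```python
p_grader_wrong = 0.10              # fraction of questions where the grader is mistaken
reward_when_grader_agrees = 1.0
reward_when_grader_disagrees = 0.0

# Honest strategy: says the truth, so it gets penalized whenever the grader is wrong.
expected_reward_honest = (1 - p_grader_wrong) * reward_when_grader_agrees \
                         + p_grader_wrong * reward_when_grader_disagrees

# Reward-predicting strategy: says whatever the grader believes, mistakes included.
expected_reward_pandering = reward_when_grader_agrees

print(expected_reward_honest)      # 0.9
print(expected_reward_pandering)   # 1.0 -> predicting the mistakes wins
```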
Unfortunately, mistakes in human feedback aren’t really avoidable. Even if we hypothetically made no mistakes when judging honesty (e.g. in a case where we have good tools to monitor the AI’s thoughts), as the AI becomes even smarter, it may learn a very detailed psychological model of the human and be able to predict precisely how to make them decide to give the AI reward.
One approach to mitigate this problem is called “scalable oversight”. The idea here is that we use AIs to help humans give more accurate feedback.
Though this alone probably won’t be sufficient to make the AI learn the right values in our case. We train the critic to predict reward, so it is not surprising if it ends up predicting what proposed text leads to reward, or at least close correlates of reward, rather than what text has niceness properties. This kind of reward-seeking would be bad. If the AI became able to take over the world, it would do so, and then it might seize control of its reward signal, or force humans to give it lots of reward, or create lots of human-like creatures that give it reward, or whatever.[5]
Two approaches for trying to make it less likely that the critic will be too reward-seeking are:
1. We could try to have the AI not know about reward or about how AIs are trained, and also try to not let the AI see other close correlates to reward, ideally including having the model not model the overseers that give reward.
2. We could try to make the AI learn good values early in training, and then stop training the critic before it learns to value reward directly.
2.6 Niceness is not (uniquely) simple
We’ve already seen that honesty isn’t much simpler than “say what the user believes” in our setting. For other possible niceness-like properties, this is similar, or sometimes even a bit worse.
Maybe “do what the human wants” seems simple to you? But what does this actually mean on a level that’s a bit closer to math - what might a critic evaluating this look like?
The way I think of it, “what the human wants” refers to what the human would like if they knew all the consequences of the AI’s actions. The model will surely be able to make good predictions here, but the concept seems more complex than predicting whether the human will like some text. And predicting whether the human will like some text predicts reward even better!
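As a rough sketch of the difference in complexity (the model methods and thought-handles below are hypothetical, just to make the comparison concrete):

```python
def critic_predicts_user_likes_text(thoughts) -> float:
    # The model already predicts the user's reply, so a signal like this is
    # cheap to read off - and it tracks the actual reward very closely.
    return thoughts["predicted_user_approval_of_text"]

def critic_evaluates_what_user_wants(thoughts, model) -> float:
    # Needs extra machinery: roll out the consequences of the AI's action, then
    # model how the user would evaluate those consequences if they knew them.
    consequences = model.rollout_consequences(thoughts["proposed_action"])
    return model.predict_informed_user_approval(consequences)
```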
Maybe “follow instructions as intended” seems simple to you? Try to unpack it - how could the critic be constructed to evaluate how instruction-following a plan is, and how complex is this?
Don’t just trust vague intuitions, try to think more concretely.
2.6.1 Anthropomorphic Optimism
Eliezer Yudkowsky has a great post from 2008 called Anthropomorphic Optimism. Feel free to read the whole post, but here’s the start of it:
The core fallacy of anthropomorphism is expecting something to be predicted by the black box of your brain, when its causal structure is so different from that of a human brain, as to give you no license to expect any such thing.
The Tragedy of Group Selectionism (as previously covered in the evolution sequence) was a rather extreme error by a group of early (pre-1966) biologists, including Wynne-Edwards, Allee, and Brereton among others, who believed that predators would voluntarily restrain their breeding to avoid overpopulating their habitat and exhausting the prey population.
The proffered theory was that if there were multiple, geographically separated groups of e.g. foxes, then groups of foxes that best restrained their breeding, would send out colonists to replace crashed populations. And so, over time, group selection would promote restrained-breeding genes in foxes.
I'm not going to repeat all the problems that developed with this scenario. Suffice it to say that there was no empirical evidence to start with; that no empirical evidence was ever uncovered; that, in fact, predator populations crash all the time; and that for group selection pressure to overcome a countervailing individual selection pressure, turned out to be very nearly mathematically impossible.
The theory having turned out to be completely incorrect, we may ask if, perhaps, the originators of the theory were doing something wrong.
"Why be so uncharitable?" you ask. "In advance of doing the experiment, how could they know that group selection couldn't overcome individual selection?"
But later on, Michael J. Wade went out and actually created in the laboratory the nigh-impossible conditions for group selection. Wade repeatedly selected insect subpopulations for low population numbers. Did the insects evolve to restrain their breeding, and live in quiet peace with enough food for all, as the group selectionists had envisioned?
No; the adults adapted to cannibalize eggs and larvae, especially female larvae.
Of course selecting for small subpopulation sizes would not select for individuals who restrained their own breeding. It would select for individuals who ate other individuals' children. Especially the girls.
The problem was that the group-selectionists used their own mind to generate a solution to a problem, and expected evolution to find the same solution. But evolution doesn’t search for solutions in the same order you do.
This lesson directly carries over to other alien optimizers like gradient descent. We’re trying to give an AI reward if it completes tasks in the way we intended, and it seems to us like a natural thing for the AI to learn is simply to solve problems in the way we intend. But just because it seems natural to us doesn’t mean it will be natural for gradient descent to find.
The lesson can also apply to AIs themselves, although current LLMs seem to inherit a human-like search ordering from being trained on lots of human data. But as an AI becomes smarter than humans, it may think in ways less similar to humans, and may find different ways of fulfilling its preferences than we humans would expect.
2.6.2 Intuitions from looking at humans may mislead you
We can see the human brain as being composed of two subsystems: the learning subsystem and the steering subsystem.
The learning subsystem is mostly the intelligent part, which also includes some kind of actor-model-critic structure. There are actually multiple critic-like predictors (also called thought assessors) that predict various internal parameters, but one critic, the valence thought assessor, is especially important in determining what we want.
The reward function on which this valence critic is trained is part of the steering subsystem. According to the theory which I think is correct, this reward function has some ability to read the thoughts in the learning subsystem: whenever we imagine someone being happy/sad, this triggers positive/negative reward, especially for people we like[6], and especially in cases where the other person is thinking about us. So when we do something that our peers would disapprove of, we directly get negative reward just from imagining someone finding out, even if we think it is unlikely that they will find out.[7]
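Here is a crude toy rendering of that theory as code - my own gloss, with invented numbers and signatures, not an established model (see footnotes [6] and [7]):

```python
def approval_reward(liking_of_person: float,   # negative if we dislike them [6]
                    imagined_valence: float,   # +1 imagined happy/approving, -1 sad/disapproving
                    thinking_about_us: bool,
                    salience: float) -> float:
    # The reward function is only a primitive mind-reader: it weighs the imagined
    # scenario by how much we're dwelling on it, not by how likely it is [7].
    reward = liking_of_person * imagined_valence * salience
    if thinking_about_us:
        reward *= 2.0  # approval/disapproval directed at us counts extra
    return reward

# Merely imagining a liked peer disapproving of us yields negative reward,
# even if we think they will almost certainly never find out:
print(approval_reward(liking_of_person=0.8, imagined_valence=-1.0,
                      thinking_about_us=True, salience=0.5))  # -0.8
```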
This is a key reason why most humans are at least reluctant to breach social norms like honesty even in cases where breaches very likely won’t get caught.
Given this theory, psychopaths/sociopaths would be people where this kind of approval reward is extremely small, and AFAIK they mostly don’t seem to attach intrinsic value to following social norms (although of course instrumental value).
We currently don’t know how we could create AI that gets similar approval reward to how humans do.
For more about how and why some human intuitions can be misleading, check out “6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa”.
2.7 Natural Abstractions or Alienness?
Ok, so the niceness properties we hope for are perhaps not learned by default. But how complex are they to learn? How much other stuff that also predicts reward well could be learned instead?
In order to answer this question, we need to consider whether the AI thinks in similar concepts as us.
2.7.1 Natural Abstractions
The natural abstraction hypothesis predicts that “a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world”. This class of cognitive architectures includes human minds and AIs we are likely to create, so AIs will likely think about the world in mostly the same concepts as humans.
For instance, “tree” seems like a natural abstraction. You would expect an alien mind looking at our planet to still end up seeing this natural cluster of objects that we call “trees”.[8] This seems true for many concepts we use, not just “tree”.
However, there are cases where we may not expect an AI to end up thinking in the same concepts we do. For one thing, an AI much smarter than us may think in more detailed concepts, and it may have concepts for reasoning about parts of reality that we do not have yet. E.g. imagine someone from 500 years ago observing a 2025 physics student reasoning about concepts like “voltage” and “current”. By now we have a pretty decent understanding about physics, but in biology or even in the science of minds an AI might surpass the ontology we use.
But more importantly, some concepts we use derive from the particular mind architecture we have. Love and laughter seem more complex to learn for a mind that doesn’t have brain circuitry for love or laughter. And some concepts are relatively simple but perhaps not quite as natural as they seem for us humans. I think “kindness”, “helpfulness”, and “honor” likely fall under that category of concepts.
2.7.2 … or Alienness?
Mechanistic interpretability researchers are trying to make sense of what’s happening inside neural networks. So far we have found some features of the AI’s thoughts that we recognize, often specific people or places, e.g. the Golden Gate Bridge. But many features remain uninterpretable to us.
This could mean two things. Perhaps we simply haven't found the right way to look - maybe with better analysis methods or maybe with a different frame for modelling AI cognition, we would be able to interpret much more.
But it’s also possible that neural networks genuinely carve up the world differently than we do. They might represent concepts that are useful for predicting text or images but don't correspond to the abstractions humans naturally use. And this could mean that many of the concepts we use are, in turn, alien to the AI. That said, since the AI is trained to predict humans, it perhaps does understand human concepts, but it could be that many such concepts are less natural for the AI and it mostly reasons in other concepts.
The worst case would be that concepts like “helpfulness” are extremely complex to encode in the AI’s ontology, although my guess is that it won’t be that complex.
Still, given that the internals of an AI may be somewhat alien, it seems quite plausible that what the critic learns isn’t a function that’s easily describable through human concepts, but may from our standpoint rather be a messy kludge of patterns that happen to predict reward well.
If the critic learned some kludge rather than a clean concept, then the values may not generalize the way we hope. Given all the options the AI has in its training environment, the AI prefers the nice ones. But when the AI becomes smarter, and is able to take over the world and could then create advanced nanotechnology etc., it has a lot more options. Which option now ranks most highly? What does it want to do with the matter in the universe?
I guess it would take an option that looks strange, e.g. filling the universe with text-like conversations with certain properties, where, if we could understand what was going on, we would see the conversations somewhat resembling collaborative problem solving. Of course not exactly that, but there are many strange options.
Though it’s also possible, especially with better alignment methods, that we get a sorta-kludgy version of the values we were aiming for. Goodhart’s Curse suggests that imperfections here will likely be amplified as the AI becomes smarter and thus searches over more options. But whether it’s going to end up completely catastrophic or just some value lost likely depends on the details of the case.
2.8 Value extrapolation
Suppose we somehow make the critic evaluate how helpful a plan is to the human operators, where “helpful” is the clean human concept, not an alien approximation.[9] Does that mean we win? What happens if the AI becomes superintelligent?
The optimistic imagination is that the AI just fulfills our requests the way we intend, e.g. that it secures the world against the creation of unaligned superintelligences in a way that doesn’t cause much harm, and then asks us how we want to fill the universe.
However, as mentioned in section 1.5 of the last post, in the process of becoming a superintelligence, what the value-part of the AI (aka what is initially the critic) evaluates changes from “what plan do I prefer most given the current situation” to “how highly do I rank different universe-trajectories”. So we need to ask: how may “helpfulness to the human operators” generalize to values over universe-trajectories?
How this generalizes seems underdefined. Helpfulness is mainly a property that actions can have, but it’s less natural as a goal that could be superintelligently pursued. In order to predict how it may generalize, we would need to think more concretely about how the helpfulness of a plan can be calculated based on the initial model’s ontology, then imagine how the ontology may shift, imagine value rebinding procedures[10], and then try to predict what the AI may end up valuing.[11]
Regardless, helpfulness (or rather corrigibility, as we will learn later in this series) isn’t intended to scale to superintelligence, but is rather intended as an easier intermediate target, so that we get genius-level intelligent AIs that can then help us figure out how to secure the world against the creation of unaligned superintelligence and get us on a path to fulfilling humanity’s potential. Although it is of course worrying to try to get work out of an AI that may kill you if it becomes too smart.
2.8.1 Coherent Extrapolated Volition
What goal would generalize to universe-trajectories in a way that the universe ends up nice? Can we just make it want the same things we want?
Human values are complex. Consider for example William Frankena’s list of terminal values as an incomplete start:
Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions of various kinds, understanding, wisdom; beauty, harmony, proportion in objects contemplated; aesthetic experience; morally good dispositions or virtues; mutual affection, love, friendship, cooperation; just distribution of goods and evils; harmony and proportion in one's own life; power and experiences of achievement; self-expression; freedom; peace, security; adventure and novelty; and good reputation, honor, esteem, etc.
Most of these values stem from some kind of emotion or brain circuitry where we don’t yet understand how it works, and for each of them it seems rather difficult to get an AI, which has a very different mind design and lacks human-like brain circuitry, to care about it.
Ok then how about indirectly pointing to human values? Aka: for a particular human, the AI has a model of that human, and can imagine how the human would evaluate plans. So instead of the AI directly evaluating what is right the way humans do it, we point the AI to use its model of humans to evaluate plans.
This indirect target does have much lower complexity than directly specifying the things we care about and thereby does seem more feasible, but some nuance is needed. Humans have both reflectively endorsed values and mere urges. We want the AI to care about the values we reflectively endorse, rather than to feed us superstimulating movies that trigger our addiction-like wanting. And of course the pointer to “what humans want” would need to be specified in a way that doesn’t allow the AI to manipulate us into wanting things that are easier for the AI to fulfill.
Furthermore, we don’t know yet how our values will generalize. We have preferences in the here and now, but we also have deep patterns in our mind that determine what we would end up wanting when we colonize the galaxies. We don’t know yet what that may be, but probably lots of weird and wonderful stuff we cannot comprehend yet.
We might even have wrong beliefs about our values. E.g. past societies might’ve thought slavery was right, and while maybe some people in the past simply had different values from us, some others might’ve changed their mind if they became a bit smarter and had time for philosophical reflection about the question.
And of course, we need the AI to make decisions over questions that humans cannot understand yet, so simply simulating what a human would think doesn’t work well.
Ok, how about something like “imagine how a human would evaluate plans if they were smarter and moved by reflectively endorsed desires”?
Yeah we are getting closer, but the “if they were smarter” seems like a rather complicated counterfactual. There may be many ways to extrapolate what a mind would want if it was smarter, and the resulting values might not be the same in all extrapolation procedures.
One approach here is to imagine multiple extrapolation procedures, and act based on where the extrapolations agree/cohere. This gives us, as I understand it, the coherent extrapolated volition (CEV) of a single human.
Not all humans will converge to the same values. So we can look at the extrapolated values of different humans, and again take the part of the values that overlaps. This is the CEV of humanity.
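As a very rough sketch of the “keep only what coheres” idea (the types and procedure names are invented; CEV is not actually specified at this level of detail):

```python
from typing import Callable, Iterable, List, Set, Tuple

Ranking = List[str]                                   # options, best first
Extrapolation = Callable[[object, List[str]], Ranking]

def coherent_prefs(person, options: List[str],
                   extrapolations: Iterable[Extrapolation]) -> Set[Tuple[str, str]]:
    """Pairwise preferences (a preferred to b) that every extrapolation
    procedure of this one person agrees on; the rest stays undetermined."""
    agreed = None
    for extrapolate in extrapolations:
        ranking = extrapolate(person, options)
        prefs = {(a, b) for i, a in enumerate(ranking) for b in ranking[i + 1:]}
        agreed = prefs if agreed is None else agreed & prefs
    return agreed or set()

def humanity_cev(people, options, extrapolations) -> Set[Tuple[str, str]]:
    # Intersect again across people: only act on the preferences that the
    # extrapolated volitions of (roughly) everyone cohere on.
    result = None
    for person in people:
        prefs = coherent_prefs(person, options, extrapolations)
        result = prefs if result is None else result & prefs
    return result or set()
```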
The way I understand it, CEV isn’t crisply specified yet. There are open questions about how we may try to reconcile conflicting preferences of different people or different extrapolation procedures. And we might also want to specify the way a person should be extrapolated to a smarter version of themselves. Aka something like slowly becoming smarter in a safe environment, without agents trying to make them arrive at some particular values, where they can have fun and take their time with philosophical reflection on their values.
My read is that CEV is often used as a placeholder for the right indirect value specification we should aim for, where the detailed specification still needs to be worked out.
As you can probably see, CEV is a rather complex target, and there may be further difficulties in avoiding value-drift as an AI becomes superintelligent, so we likely need significantly more advanced alignment methods to point an AI to optimize CEV.
How much earlier nice AIs can help us solve this harder problem is one of the things we will discuss later in this series.
2.9 Conclusion
Whoa that was a lot, congrats for making it through the post!
Here’s a quick recap of the problems we’ve learned of:
Humans make predictable mistakes in giving reward. Thus, predicting what will actually lead to reward or very close correlates thereof will be more strongly selected for than niceness.
Niceness may be less simple than you think.
The concepts in which an AI reasons might be alien, and it may learn some alien kludge rather than the niceness concepts we wanted.
The key problem here is that while the AI learned values that mostly add up to useful behavior on the controlled distribution, the reasons why it behaves nicely there may not be the good reasons we hoped for. So if we go significantly off distribution, e.g. to where the AI could take over the world, it will take actions that are highly undesirable from our perspective.
And then there’s a fourth problem: even if the AI is nice for good reasons, many kinds of niceness look like they might break when we crank up intelligence far enough:
Niceness that generalizes correctly to superintelligent levels of intelligence requires something like CEV, which is especially complex.
Questions and Feedback are always welcome!
See also “A Closer Look at Before and After”. Furthermore, even if the AI doesn’t immediately take over the world when it is sure it could, it could e.g. be that the alignment properties we got into our AI weren’t indefinitely scalable, and then alignment breaks later. ↩︎
which actually isn’t a single response but a probability distribution over responses ↩︎
That’s not to say that model-based RL solves all problems of having an AI with a goal slot. In particular, we don’t have a good theory of what happens when a model-based RL agent reflects on itself etc. ↩︎
I’m using “aspects” instead of “features” here because “features” is the terminology used for a particular concept in mechanistic interpretability, and I want “aspects” to also include other potential concepts for which we maybe just haven’t yet found a good way to measure them in neural networks. ↩︎
There’s also a different kind of reward seeking where the AI actually cares about something else, and only predicted reward for instrumental reasons like avoiding value drift. This will be discussed in more detail in the next 2 posts. ↩︎
For people we actively dislike, the reward can be inverted, aka positive reward when they are sad and negative when they are happy. ↩︎
Of course, the negative reward is even much stronger when we are actually in the situation where someone finds out. But it appears that even in cases where we are basically certain that nobody will find out, we still often imagine that our peers would disapprove of us, and this still triggers negative reward. Basically, the reward function is only a primitive mind-reader, and doesn’t integrate probabilistic guesses about how unlikely an event is into how much reward it gives, but maybe rather uses something like “how much are we thinking about that possibility” as a proxy for how strongly to weigh that possibility. ↩︎
That doesn’t mean there needs to be a crisp boundary between trees and non-trees. ↩︎
Just thinking about the AI learning “helpfulness” is of course thinking on too high a level of abstraction and may obscure the complexity here. And it could also turn out that helpfulness isn’t a crisp concept - maybe there are different kinds of helpfulness, maybe each with some drawbacks, and maybe we confuse ourselves by always imagining the kind of helpfulness that fits best in a given situation. But it doesn’t matter much for the point in this section. ↩︎
Which potentially includes the AI reasoning through philosophical dilemmas. ↩︎
Such considerations are difficult. I did not do this one. It’s conceivable that it would generalize like in the optimistic vision, but it could also turn out that it e.g. doesn’t robustly rule out all kinds of manipulation, and then the AI does some helpful-seeming actions that manipulate human minds into a shape where the AI can help them even more. ↩︎
2.1 Summary
In the last post, I introduced model-based RL, which is the frame we will use to analyze the alignment problem, and we learned that the critic is trained to predict reward.
I already briefly mentioned that the alignment problem is centrally about making the critic assign high value to outcomes we like and low value to outcomes we don’t like. In this post, we’re going to try to get some intuition for what values a critic may learn, and thereby also learn about some key difficulties of the alignment problem.
Section-by-section summary:
2.2 The Distributional Leap
Since we train the critic to predict reward and the AI searches for strategies where the critic assigns a high value, the AI will perform well within the training distribution as measured in how much reward it gets. So if we train on human feedback, the human will often like the answers of the AI (although it’s possible the human would like some answers less if they had even fuller understanding).
But the thing we’re interested in is what the AI will do when it becomes dangerously smart, e.g. when it would be capable of taking over the world. This shift from the non-catastrophic domain to the catastrophic domain is sometimes called the distributional leap. A central difficulty here is that we cannot test what happens in the dangerous domain, because if the safety properties fail to generalize, humanity becomes disempowered.[1]
In order to predict how the values of an AI might generalize in our model-based RL setting, we want to understand what function the critic implements, aka aspects of the model’s outcomes the critic assigns high or low value to. Ideally we would have a mechanistic understanding here, so we could just look at the neural networks in our AI and see what the AI values. Alas, we are currently very far from being able to do this, and it doesn’t look like progress in mechanistic interpretability will get us there nearly in time.
So instead we resort to trying to predict what the critic is most likely to learn. For alignment we need to make sure the critic ends up the way we like, but this post is mostly about conveying intuition of what may likely be learned in given a simple example training setup, and thereby also illustrating some key difficulties of alignment.
2.3 A Naive Training Strategy
Let’s sketch an example training setup where we can analyze what the critic may learn.
Say we are training an actor-critic model-based RL chatbot with Deep Learning. With data from chat conversations of past models, we already trained an actor and a model: The actor is trained to predict what the AI may say in a conversation, and the model is trained to predict what the user may say in reply.
Now we introduce the critic, which we will train through human feedback. (The model also continues to be trained to even better predict human responses, and the actor also gets further trained based on the value scores the critic assigns. But those aren’t the focus here.)
The critic doesn’t just see the model’s predicted response[2], but the stream of thought within the model. So the model might e.g. internally think about whether the information in the AI text is correct and about what the human may think when reading the text, and the critic can learn to read these thoughts. To be clear, the model’s thoughts are encoded in giant vectors of numbers, not human-readable language.
The bottom rhombus just shows that if the value score is high, the proposed text gets outputted, and if not, the actor is supposed to try to find some better text to output.
The human looks at the output and tries to evaluate whether it looks like the AI is being harmless, helpful, and honest, and gives reward based on that.
2.3.1 How this relates to current AIs
To be clear, this isn’t intended to be a good alignment strategy. For now we’re just interested in building understanding about what the critic may learn.
Also, this is not how current LLMs work. In particular, here we train the critic from scratch, whereas LLMs don’t have separated model/actor/critic components, and instead learn to reason in goal-directed ways where they start out generalizing from text of human reasoning. This “starting out from human reasoning” probably significantly contributes to current LLMs being mostly nice.
It’s unclear for how long AIs will continue to superficially reason mostly like nice humans - the more we continue training with RL, the less the initial “human-like prior” might matter. And LLMs are extremely inefficient compared to e.g. human brains, and it seems likely that we eventually have AIs that are more based on RL. I plan to discuss this in a future post.
In the analysis in this post, there is no human-like prior for the critic, so we just focus on what we expect to be learned given model-based RL.
Model-based RL also has advantages for alignment. In particular, we have a clear critic component which determines the goals of the AI. That’s better than if our AI is a spaghetti-mess with nothing like a goal slot.[3]
2.4 What might the critic learn?
Roughly speaking, the critic learns to pay attention to aspects of the model’s thoughts that are correlated with reward, and to compute a good reward prediction from those aspects[4].
Initially, what the critic computes may be rather simple. E.g. it may look at whether the model thinks the user will say a word like great/amazing/awesome, and some other simple aspects like that and then have a simple function on those aspects to compute the value score.
As we train further, the critic may learn more complex functions and compute its own complex aspects from information it can extract from the model’s thoughts.
Overall, the critic is more likely to learn (1) a function that is simple for neural networks to learn, and (2) a function that predicts reward well. As we train more, the reward prediction becomes better and the function in the critic can become more complex, but of two functions that predict reward similarly well, the critic will more likely learn the one that’s simpler for neural nets to learn.
Note that what’s simple for a neural net to learn likely doesn’t match well with what we intuitively think of as simple. “Love” may seem like a simple concept to us but it may be complex for an AI to learn to value. “Honesty” seems less human-centric, but even if it is, what exactly would it mean for our AI to care about being honest?
In order to evaluate whether honesty might be learned, we need to think mechanistically about what it would mean for the critic to rank honest texts more highly.
2.4.1 Might the critic learn to score honesty highly?
(Take the following analysis with a grain of salt, what actually gets learned may be a lot more messy and alien.)
The AI is honest if the text it outputs matches its beliefs, which in our case means matching the beliefs of the model.
So we need a comparison between the text and the model’s beliefs. Might the model already compute the differences here, so the critic could just pick up on those differences instead of needing to learn the comparison itself? Yes that seems likely, since such differences may often be important for predicting how the human will respond.
Cool, so will the critic learn to pay attention to those differences? Seems plausible again, since such differences also seem quite useful for predicting reward, because the human will give negative reward if the AI outputs text where the human can tell it is false.
So we could imagine the critic learning an honesty circuit, that decreases the value score if significant such differences are present. (To be clear, this is just exemplary, there very likely won’t actually be anything like a relatively independent honesty circut in the critic. But the complexity of an honesty circut might still tell us something about whether honesty might be learned.)
So yeah, in our simplified toy model, the critic may learn a pattern that predicts honesty is good.
However, it is only one pattern among many, and there will still be some cases where the critic evaluates the non-honest action as better overall. In particular, this is likely to happen in cases where AI predicts that the dishonesty probably won't be caught. So when the AI then indeed does not get caught, the honesty-pattern gets weaker, since it predicted low reward but the result was high reward. And there might even be cases where the AI is honest but the human thinks it’s wrong and then mistakenly gives low reward.
Is there something else that could be learned which predicts reward better than honesty and isn’t much more complex? Unfortunately, yes:
The model doesn’t just have beliefs about what it thinks is true, but also beliefs about what the human believes. This is especially true in our case because the model is predicting how the human responds. And the model likely also already compares the text to its beliefs about the human’s beliefs.
So the critic can just learn to pay attention to those differences and assign a lower value score if those are present. Now the model learned to tell the human what they will think is true, which performs even better.
So the original honesty circut will get outcompeted. Indeed, because those two circuits seem similarly complex, the honesty circut might not even have been learned in the first place!
2.4.1.1 Aside: Contrast to the human value of honesty
The way I portrayed the critic here as valuing honesty is different from the main sense in which humans value honesty: for humans it is more self-reflective in nature—wanting to be an honest person, rather than caring in a more direct way that speech outputs match our beliefs.
We don’t yet have a good theory for how human preferences work, although Steven Byrnes has recently made great progress here.
2.5 Niceness is not optimal
That the critic doesn’t learn honesty is an instance of a more general problem which I call the "niceness is not optimal” problem. Even if we try to train for niceness, we sometimes make mistakes in how we reward actions, and the strategy that also predicts the mistakes will do better than the nice strategy.
Unfortunately, mistakes in human feedback aren’t really avoidable. Even if we hypothetically wouldn’t make mistakes when judging honesty (e.g. in a case where we have good tools to monitor the AI’s thoughts), as the AI becomes even smarter, it may learn a very detailed psychological model of the human and be able to predict precisely how to make them decide to give the AI reward.
One approach to mitigate this problem is called “scaleable oversight”. The idea here is that we use AIs to help humans give more accurate feedback.
Though this alone probably won’t be sufficient to make the AI learn the right values in our case. We train the critic to predict reward, so it is not surprising if it ends up predicting what proposed text leads to reward, or at least close correlates of reward, rather than what text has niceness properties. This kind of reward-seeking would be bad. If the AI became able to take over the world, it would, and then it might seize control of its reward signal, or force humans to give it lots of reward, or create lots of human-like creatures that give it reward, or whatever.[5]
Two approaches for trying to make it less likely that the critic will be too reward-seeking are:
2.6 Niceness is not (uniquely) simple
We’ve already seen that honesty isn’t much simpler than “say what the user believes” in our setting. For other possible niceness-like properties, this is similar, or sometimes even a bit worse.
Maybe “do what the human wants” seems simple to you? But what does this actually mean on a level that’s a bit closer to math - how might a critic evaluating this look like?
The way I think of it, “what the human wants” refers to what the human would like if they knew all the consequences of the AI’s actions. The model will surely be able to make good predictions here, but the concept seems more complex than predicting whether the human will like some text. And predicting whether the human will like some text predicts reward even better!
Maybe “follow instructions as intended” seems simple to you? Try to unpack it - how could the critic be constructed to evaluate how instruction-following a plan is, and how complex is this?
Don’t just trust vague intuitions, try to think more concretely.
2.6.1 Anthropomorphic Optimism
Eliezer Yudkowsky has a great post from 2008 called Anthropomorphic Optimism. Feel free to read the whole post, but here’s the start of it:
The problem was that the group-selectionists used their own mind to generate a solution to a problem, and expected evolution to find the same solution. But evolution doesn’t search for solutions in the same order you do.
This lesson directly carries over to other alien optimizers like gradient descent. We’re trying to give an AI reward if it completed tasks in the way we intended, and it seems to us like a natural thing the AI may learn is just to solve problems in the way we intend. But just because it seems natural to us doesn’t mean it will be natural for gradient descent to find.
The lesson can also apply to AIs themselves, albeit that current LLMs seem like they inherit a human-like search ordering from being trained on lots of human data. But as an AI becomes smarter than humans, it may think in ways less similar to humans, and may find different ways of fulfilling its preferences than we humans would expect.
2.6.2 Intuitions from looking at humans may mislead you
We can see the human brain as being composed out of two subsystems: The learning subsystem and the steering subsystem.
The learning subsystem is mostly the intelligent part, which also includes some kind of actor-model-critic structure. There are actually multiple critic-like predictors (also called thought assessors) that predict various internal parameters, but one critic, the valence thought assessor, is especially important in determining what we want.
The reward function on which this valence critic is trained is part of the steering subsystem, and according to the theory which I think is correct, this reward function has some ability to read the thoughts in the learning subsystem, and whenever we imagine someone being happy/sad, this triggers positive/negative reward, especially for people we like[6], and especially in cases where the other person is thinking about us. So when we do something that our peers would disapprove of, we directly get negative reward just from imagining someone finding out, even if we think it is unlikely that they will find out.[7]
This is a key reason why most humans are at least reluctant to breach social norms like honesty even in cases where breaches very likely won’t get caught.
Given this theory, psychopaths/sociopaths would be people where this kind of approval reward is extremely small, and AFAIK they mostly don’t seem to attach intrinsic value to following social norms (although of course instrumental value).
We currently don’t know how we could create AI that gets similar approval reward to how humans do.
For more about how and why some human intuitions can be misleading, check out “6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa”.
2.7 Natural Abstractions or Alienness?
Ok, so the niceness properties we hope for are perhaps not learned by default. But how complex are they to learn? How much other stuff that also predicts reward well could be learned instead?
In order to answer this question, we need to consider whether the AI thinks in similar concepts as us.
2.7.1 Natural Abstractions
The natural abstraction hypothesis predicts that “a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world”. This class of cognitive architectures includes human minds and AIs we are likely to create, so AIs will likely think about the world in mostly the same concepts as humans.
For instance, “tree” seems like a natural abstraction. You would expect an alien mind looking at our planet to still end up seeing this natural cluster of objects that we call “trees”.[8]This seems true for many concepts we use, not just “tree”.
However, there are cases where we may not expect an AI to end up thinking in the same concepts we do. For one thing, an AI much smarter than us may think in more detailed concepts, and it may have concepts for reasoning about parts of reality that we do not have yet. E.g. imagine someone from 500 years ago observing a 2025 physics student reasoning about concepts like “voltage” and “current”. By now we have a pretty decent understanding about physics, but in biology or even in the science of minds an AI might surpass the ontology we use.
But more importantly, some concepts we use derive from the particular mind architecture we have. Love and laughter seem more complex to learn for a mind that doesn’t have brain circuitry for love or laughter. And some concepts are relatively simple but perhaps not quite as natural as they seem for us humans. I think “kindness”, “helpfulness”, and “honor” likely fall under that category of concepts.
2.7.2 … or Alienness?
Mechanistic interpretability researchers are trying to make sense of what’s happening inside neural networks. So far we found some features of the AI’s thoughts that we recognize, often specific people or places, e.g. the Golden Gate Bridge. But many features remain uninterpretable to us so far.
This could mean two things. Perhaps we simply haven't found the right way to look - maybe with better analysis methods or maybe with a different frame for modelling AI cognition, we would be able to interpret much more.
But it’s also possible that neural networks genuinely carve up the world differently than we do. They might represent concepts that are useful for predicting text or images but don't correspond to the abstractions humans naturally use. And this could mean that many of the concepts we use are alien for the AI in turn. Although given that the AI is trained to predict humans, it perhaps does understand human concepts, but it could be that many such concepts are less natural for the AI and it mostly reasons in other concepts.
The worst case would be that concepts like “helpfulness” are extremely complex to encode in the AI’s ontology, although my guess is that it won’t be that complex.
Still, given that the internals of an AI may be somewhat alien, it seems quite plausible that what the critic learns isn’t a function that’s easily describable through human concepts, but may from our standpoint rather be a messy kludge of patterns that happen to predict reward well.
If the critic learned some kludge rather than a clean concept, then the values may not generalize the way we hope. Given all the options the AI has in its training environment, the AI prefers the nice one. But when the AI becomes smarter, and is able to take over the world and could then create advanced nanotechnology etc, it has a lot more options. Which option does now rank most highly? What does it want to do with the matter in the universe?
I guess it would take an option that looks strange, e.g. filling the universe with text-like conversations with some properties, where if we could understand what was going on we could see the conversations somewhat resembling collaborative problem solving. Of course not exactly that, but there are many strange options.
Though it’s also possible, especially with better alignment methods, that we get a sorta-kludgy version of the values we were aiming for. Goodhart’s Curse suggests that imperfections here will likely be amplified as the AI becomes smarter and thus searches over more options. But whether it’s going to end up completely catastrophic or just some value lost likely depends on the details of the case.
2.8 Value extrapolation
Suppose we somehow make the critic evaluate how helpful a plan is to the human operators, where “helpful” is the clean human concept, not an alien approximation.[9]Does that mean we win? What happens if the AI becomes superintelligent?
The optimistic imagination is that the AI just fulfills our requests the way we intend, e.g. that it secures the world against the creation of unaligned superintelligences in a way that doesn’t cause much harm, and then asks us how we want to fill the universe.
However, as mentioned in section 1.5 of the last post, in the process of becoming a superintelligence, what the value-part of the AI (aka what is initially the critic) evaluates changes from “what plan do I prefer most given the current situation” to “how highly do I rank different universe-trajectories”. So we need to ask: how may “helpfulness to the human operators” generalize to values over universe-trajectories?
How this generalizes seems underdefined. Helpfulness is mainly a property that actions can have, but it’s less natural as a goal that could be superintelligently pursued. In order to predict how it may generalize, we would need to think more concretely how the helpfulness of a plan can be calculated based on the initial model’s ontology, then imagine how the ontology may shift, imagine value rebinding procedures[10], and then try to predict what the AI may end up valuing.[11]
Regardless, helpfulness (or rather corrigibility, as we will learn later in this series), isn’t intended to scale to superintelligence, but rather intended as an easier intermediate target, so we get genius-level intelligent AIs that can then help us figure out how to secure the world against the creation of unaligned superintelligence and to get us on a path to fulfill humanity’s potential. Although it is of course worrying to try to get work out of an AI that may kill you if it becomes too smart.
2.8.1 Coherent Extrapolated Volition
What goal would generalize to universe-trajectories in a way that the universe ends up nice? Can we just make it want the same things we want?
Human values are complex. Consider for example William Frankena’s list of terminal values as an incomplete start:
Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions of various kinds, understanding, wisdom; beauty, harmony, proportion in objects contemplated; aesthetic experience; morally good dispositions or virtues; mutual affection, love, friendship, cooperation; just distribution of goods and evils; harmony and proportion in one's own life; power and experiences of achievement; self-expression; freedom; peace, security; adventure and novelty; and good reputation, honor, esteem, etc.
Most of these values stem from some kind of emotion or brain circuitry where we don’t yet understand how it works, and for each of them it seems rather difficult to get an AI, which has a very different mind design and lacks human-like brain circuitry, to care about it.
Ok then how about indirectly pointing to human values? Aka: for a particular human, the AI has a model of that human, and can imagine how the human would evaluate plans. So instead of the AI directly evaluating what is right the way humans do it, we point the AI to use its model of humans to evaluate plans.
This indirect target has much lower complexity than directly specifying the things we care about and thereby seems more feasible, but some nuance is needed. Humans have both reflectively endorsed values and mere urges. We want the AI to care about the values we reflectively endorse, rather than to feed us superstimulating movies that trigger our addiction-like wanting. And of course the pointer to “what humans want” would need to be specified in a way that doesn’t allow the AI to manipulate us into wanting things that are easier for the AI to fulfill.
Furthermore, we don’t know yet how our values will generalize. We have preferences in the here and now, but we also have deep patterns in our mind that determine what we would end up wanting when we colonize the galaxies. We don’t know yet what that may be, but probably lots of weird and wonderful stuff we cannot comprehend yet.
We might even have wrong beliefs about our values. E.g. past societies might’ve thought slavery was right, and while maybe some people in the past simply had different values from us, others might’ve changed their minds if they had become a bit smarter and had time for philosophical reflection on the question.
And of course, we need the AI to make decisions over questions that humans cannot understand yet, so simply simulating what a human would think doesn’t work well.
Ok, how about something like “imagine how a human would evaluate plans if they were smarter and moved by reflectively endorsed desires”?
Yeah we are getting closer, but the “if they were smarter” seems like a rather complicated counterfactual. There may be many ways to extrapolate what a mind would want if it was smarter, and the resulting values might not be the same in all extrapolation procedures.
One approach here is to imagine multiple extrapolation procedures, and act based on where the extrapolations agree/cohere. This gives us, as I understand it, the coherent extrapolated volition (CEV) of a single human.
Not all humans will converge to the same values. So we can look at the extrapolated values of different humans, and again take the part of the values that overlaps. This is the CEV of humanity.
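Structurally, you can picture it roughly like the toy sketch below (my own illustration; it assumes, quite unrealistically, that extrapolated values are just sets of endorsed outcomes):

```python
from typing import Callable

Person = str
Outcome = str
# An extrapolation procedure maps a person to the outcomes their smarter,
# reflectively-endorsing self would want (a placeholder representation).
Extrapolation = Callable[[Person], set[Outcome]]

def individual_cev(person: Person,
                   procedures: list[Extrapolation]) -> set[Outcome]:
    """Keep only the outcomes on which all extrapolation procedures cohere."""
    results = [procedure(person) for procedure in procedures]
    return set.intersection(*results) if results else set()

def humanity_cev(people: list[Person],
                 procedures: list[Extrapolation]) -> set[Outcome]:
    """Keep only the overlap of everyone's individually extrapolated values."""
    individual = [individual_cev(person, procedures) for person in people]
    return set.intersection(*individual) if individual else set()
```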
The way I understand it, CEV isn’t crisply specified yet. There are open questions about how to reconcile conflicting preferences between different people or between different extrapolation procedures. We might also want to specify how a person should be extrapolated to a smarter version of themselves, e.g. something like slowly becoming smarter in a safe environment, without agents trying to push them toward particular values, where they can have fun and take their time with philosophical reflection on their values.
My read is that CEV is often used as a placeholder for the right indirect value specification we should aim for, where the detailed specification still needs to be worked out.
As you can probably see, CEV is a rather complex target, and there may be further difficulties in avoiding value-drift as an AI becomes superintelligent, so we likely need significantly more advanced alignment methods to point an AI to optimize CEV.
How much earlier nice AIs can help us solve this harder problem is one of the things we will discuss later in this series.
2.9 Conclusion
Whoa that was a lot, congrats for making it through the post!
Here’s a quick recap of the problems we’ve learned of:
The key problem here is that while the AI learned values that mostly add up to useful behavior on the controlled distribution, the reasons why it behaves nicely there may not be the good reasons we hoped for. So if we go significantly off distribution, e.g. to where the AI could take over the world, it will take actions that are highly undesirable from our perspective.
And then there’s a fourth problem: even if the AI is nice for good reasons, many kinds of niceness look like they might break when we crank up intelligence far enough.
Questions and feedback are always welcome!
See also “A Closer Look at Before and After”. Furthermore, even if the AI doesn’t immediately take over the world when it is sure it could, it could be, e.g., that the alignment properties we got into our AI weren’t indefinitely scalable, and then alignment breaks later. ↩︎
which actually isn’t a single response but a probability distribution over responses ↩︎
That’s not to say that model-based RL solves all problems of having an AI with a goal slot. In particular, we don’t have good theory of what happens when a model-based RL agent reflects on itself etc. ↩︎
I’m using “aspects” instead of “features” here because “features” is the terminology used for a particular concept in mechanistic interpretability, and I want “aspects” to also include potential other concepts or so where we maybe just haven’t yet found a good way to measure them in neural networks. ↩︎
There’s also a different kind of reward seeking where the AI actually cares about something else, and only predicted reward for instrumental reasons like avoiding value drift. This will be discussed in more detail in the next 2 posts. ↩︎
For people we actively dislike, the reward can be inverted, aka positive reward when they are sad and negative when they are happy. ↩︎
Of course, the negative reward is much stronger when we are actually in a situation where someone finds out. But it appears that even in cases where we are basically certain that nobody will find out, we still often imagine that our peers would disapprove of us, and this still triggers negative reward. Basically, the reward function is only a primitive mind-reader: it doesn’t integrate probabilistic guesses about how unlikely an event is into how much reward it gives, but maybe rather uses something like “how much are we thinking about that possibility” as a proxy for how strongly to weigh it. ↩︎
That doesn’t mean there needs to be a crisp boundary between trees and non-trees. ↩︎
Just thinking about the AI learning “helpfulness” is of course thinking at too high a level of abstraction and may obscure the complexity here. It could also turn out that helpfulness isn’t a crisp concept - maybe there are different kinds of helpfulness, each with some drawbacks, and we confuse ourselves by always imagining the kind of helpfulness that fits best in a given situation. But it doesn’t matter much for the point in this section. ↩︎
Which potentially includes the AI reasoning through philosophical dilemmas. ↩︎
Such considerations are difficult, and I haven’t carried this one out. It’s conceivable that it would generalize like in the optimistic vision, but it could also turn out that it e.g. doesn’t robustly rule out all kinds of manipulation, and then the AI takes some helpful-seeming actions that manipulate human minds into a shape where the AI can help them even more. ↩︎