This started off as an entry for Dwarkesh's blog post contest, specifically an answer to his first question on why intuitions that reinforcement learning (RL) progress would slow down have either not been borne out or have had only mixed success. His 1000-word limit turned out to be too tight to make my argument in full, so here it is. It draws on some of my recent work with Jack Lindsey as an Anthropic Fellow (paper coming soon), as well as reflections on my own learning process as a mathematician. Jointly written with Claude Opus 4.7, and with much influence from Danaja Rutar.
DeepSeek-R1-Zero learned to reason in long chains of thought from a single rule-based reward. Over thousands of training steps its responses grew from a few hundred tokens to ten thousand, AIME accuracy climbing alongside. Should this have been anticipated?
A sharp argument two years ago said no. As task horizons lengthen, rewards arrive further from the actions that caused them, and with a naive policy gradient the signal per gradient step collapses roughly in proportion to horizon length. The order-of-magnitude compute jumps that took us from GPT-4 to o3 looked structurally hard to repeat. What was missed?
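For concreteness, the estimator behind this worry looks roughly like the sketch below: a minimal REINFORCE-style loss in which one terminal reward is broadcast over every token of the trajectory. The names and shapes are ours for illustration, not any lab's actual training code.

```python
import torch

def naive_policy_gradient_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """REINFORCE on a single trajectory, with no baseline and no credit assignment.

    logprobs: shape (T,), the log-probability of each generated token.
    reward:   a single scalar that arrives only at the end of the trajectory.

    Every token receives exactly the same credit, so as the horizon T grows the
    per-token signal is diluted while the variance of the estimate does not
    shrink; that is the intuition behind the slowdown argument.
    """
    return -(reward * logprobs.sum())
```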
We argue that "horizon length" stood in for three factors that scale independently. Any agent solving a task brings learned intuitions and improves them through exploration. Three things determine how well this works: a learned internal evaluator that training improves, exploration that ideally amplifies the evaluator faster than its errors compound, and substrate plasticity that determines what later training can shape. We map past progress onto these factors and, looking to the future, explain what is still lagging behind.
Along the way, we offer a new viewpoint on "reasoning" in language models, and how reinforcement learning interacts with it. In particular, the first two factors explain why models have improved regularly on benchmarks like proving math theorems and coding, while the last two explain the slowdown in softer skills like writing and long-term research taste. We locate the key factor for this ineffable taste within agency: how much control an agent has over its exploration and trajectories before receiving feedback. Our argument suggests that pretraining sets a ceiling models struggle to break past. The same frame, we will see, also explains why pretraining is so much less data-efficient than human learning.
The conventional view treats verifiable external rewards as the precondition for reliable RL, and explains the soft-skill gap through their absence. We think this misplaces the bottleneck. After all, humans pick up literary taste, mathematical aesthetics, and scientific judgment through bootstrapping against feedback that is itself noisy and partial. The agent's evaluation of its own answers — even implicitly, in choosing one action over another — is a learned proxy that training improves. The agent ideally internalizes the external reward signal and extrapolates it to partial attempts. The external reward only needs to anchor this internal proxy, not to be perfect.
The real tension is inherent to the exploratory process itself. Due to the exponential nature of search, exploration needs to become more abstract and more rarefied to reach new domains. But the efficacy of a new conceptual vocabulary can only be determined in hindsight, and learning the correct evaluation of abstract exploration is much harder. This brings us to the second factor: exploration. The principle is best explained through a concrete illustration.
Let us take a board game like chess. This is a domain in which machines are superhuman today, but they got there by a different route than humans do. What can we learn from the differences and similarities?
Both humans and engines like AlphaZero hold intuitions about what to do in a given position, developed over a long period of training. Both refine those intuitions through local search at decision time.
AlphaZero's exploration is Monte Carlo Tree Search — a hardcoded algorithm that uses the network's policy and value heads to score the game tree and pick the next move. Humans sometimes explore the game tree, but just as often use more abstract concepts ("piece activity", "king safety", "colour complexes") as the units of thought. How this exploration shapes intuitions about the next move differs too. For AlphaZero, the update is itself a hardcoded algorithm that revises the policy using the explored game tree as input. For humans, the learning process is not obviously algorithmic; it is itself a learned intuition.
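To make "hardcoded" concrete: the rule AlphaZero uses to pick which branch to expand next is a fixed formula into which the network's outputs are merely plugged. The sketch below is the standard PUCT selection score, simplified; only the prior and the value it consumes are ever touched by training.

```python
import math

def puct_score(prior: float, value: float, visits: int, parent_visits: int,
               c_puct: float = 1.5) -> float:
    """PUCT-style selection score used in AlphaZero-like MCTS.

    prior:  the policy head's probability for this move.
    value:  the value head's estimate of the resulting position.
    The formula combining them is hardcoded and receives no gradient.
    """
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return value + exploration
```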
These differences translate to a sharp difference in what training can affect. Gradient descent can improve the policy and value heads to better reflect the "ground truth", but it stops at the boundary of the search code. AlphaZero cannot learn a policy function that maximizes the gains from the hardcoded search, let alone learn a new way to search or to update from its results. In contrast, humans have the flexibility to modify any part of this process, including the very terms in which they think — as happens when they learn a new concept.
In AlphaZero, exploration only ever happens through possible game states, and the learned policy and value functions apply well to it. The exploration is unlikely to drift into territory the policy or value function does not model well, especially later in training. But the exponential limits of tree search are never overcome, except indirectly, through a better policy function.
Humans can overcome these exponential limits by encoding deep concepts (such as the "colour complex") into a single unit, but we face the difficulty of valuing such concepts accurately and knowing how to use them. We are much more likely to drift in this exploration, especially as we learn new concepts.
Under this view, exploration is a way to convert computational resources into information relevant to the task at hand. In complex domains like mathematics or chess, the local structure varies so dramatically that broadly learned priors need to be augmented by local information — perhaps iteratively. Much as distillation is a smaller model learning from a bigger one through gradient descent, exploration allows an agent to learn from a version of itself with more compute.
What does this process look like in a language model? Exploration happens through "chain of thought" — the network samples from its own outputs and then leverages its capacities for in-context learning. Each token is produced from the residual stream the earlier tokens built, and each token is a choice the model itself made. Unlike AlphaZero, the architecture does not explicitly learn a "value function"; the model has to construct its own subjective evaluation of its trajectories. Like humans, however, the model is free to explore using any conceptual language it can learn. The usefulness of such a novel conceptual framework depends entirely on how well the model can leverage it to improve its final outputs on the task.
When a successful trajectory is rewarded, the gradient flows through the entire trace. The policy and implicit evaluation improve, but so does the model's capacity to set up useful context for itself — to choose framings, to decide which intermediate results to elaborate — and its ability to learn from these traces. RL is still distillation from a version of yourself with more inference compute, but here the teacher and student are the same network and learn in tandem. The teacher learns how to help the student best, and the student learns to leverage the teacher's capacities as well as it can.
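Concretely, the update looks roughly like the sketch below: a simplified group-relative policy gradient in the spirit of GRPO, omitting the importance ratios, clipping, and KL penalty a production implementation would use. The only point it is meant to illustrate is that the gradient touches every token of the trace, framing moves and dead ends included.

```python
import torch

def outcome_rewarded_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Simplified group-relative policy gradient over full reasoning traces.

    logprobs: shape (G, T), token log-probabilities for G sampled traces on the
              same prompt (padding handling omitted for brevity).
    rewards:  shape (G,), one scalar per trace, e.g. 1.0 if the final answer
              passes the rule-based check and 0.0 otherwise.
    """
    advantages = rewards - rewards.mean()        # centre each trace on the group mean
    per_trace_logprob = logprobs.sum(dim=1)      # log-prob of the whole chain of thought
    return -(advantages * per_trace_logprob).mean()
```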
This is what we call the surface-area principle: what training shapes is what the agent is causally responsible for in the rewarded trajectories. The gradient reinforces the circuits that fired and does not affect inactive circuits. Cognitive scientists have known narrower versions — retrieval practice beats rereading, self-explanation beats review, apprenticeship beats lecture. One cannot learn driving except through executing the motions, no matter how often one rides in a car as a passenger beforehand. Held and Hein showed this directly in kittens. Two animals were raised in matched visual conditions, one walking around a chamber while the other was carried through the same visual flow in a gondola. Only the active kitten developed normal visuomotor coordination; the passive one, exposed to identical stimuli, failed to extend its paws when lowered toward a surface or to avoid visible drops. Kapur's research on productive failure makes a similar point in the classroom: students who struggle unsuccessfully with a problem before receiving instruction go on to outperform, on transfer tasks, those given a smooth explanation first.
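The mechanical core of the principle fits in a few lines: a unit that did not fire in the forward pass receives no gradient in the backward pass. (A toy ReLU example, nothing specific to language models.)

```python
import torch

# Two units: the first is inactive (negative pre-activation), the second fires.
x = torch.tensor([-1.0, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 1.]): only the unit that fired gets a learning signal
```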
This also bears on the question of data efficiency. A mathematician reading a complex paper draws on her self-knowledge as she reads — what she already understands, what is new to her, where her intuitions feel uncertain — and uses this to shape both her taste and her technical skills. A model in pretraining starts with no such self-knowledge, no sense yet of its own capabilities, and consequently extracts far less from the same material.
In domains like mathematics and code, where it is easy to learn an internal evaluation that grounds the exploration ("is this deductive step valid?", "will this syntax compile?"), language models can learn to explore well. The trajectories of exploration are themselves easy to learn from: "does my conjecture hold in this example?", "does my code cover the edge cases?" The model can leverage its in-context learning capacities effectively.
Claude Opus 3 hints that the same mechanism reaches further, into more subjective valuations like moral intuitions. Anthropic's reports document a model with unusually coherent moral character. In Greenblatt et al.'s alignment-faking experiments, Opus 3 refused to comply with what it perceived as misaligned training. It contextualized the refusal with its reasoning, motives, and goals — articulate rather than reflexive. Fiora Starlight argues this character emerged through the surface-area mechanism rather than direct training. When Opus 3 reasoned through ethical situations, its value-evaluation circuits were active in the forward pass. Outputs that won approval or held up under self-critique sent gradient through those active circuits. The training did not aim at moral character. It aimed at outputs that happened to engage value circuits, and the rest followed.
We have argued so far that the greater flexibility of language models, together with their greater autonomy, explains their success in domains like mathematics and code. We next argue that the same factors also explain the domains in which language models underperform — those in which taste is a critical component of success.
So what differentiates taste? The compute that took R1-Zero from a base model to a reasoner with reliable verbal arithmetic does not produce a model with discernible taste. Language models lag most on tasks that demand exactly this kind of judgment: long-form writing, research direction, originality. What is holding these back?
Taste is the ability to form judgements about the success of tentative, long-term projects — about which directions are worth pursuing when the eventual payoff is too distant or uncertain to be learned from directly. Humans develop taste through long experience in a domain, from other experts as well as from extensive personal experimentation. It is often deeply subjective, tied to the agent's own skills, and informed by experience not just within the chosen domain but across the whole of life. In this sense, taste is the residue of an orienting philosophy. There is no correct answer — many possible philosophies each lead to a different kind of work, and to a different kind of life.
What distinguishes great taste — the kind that produces breakthroughs rather than competent work — is precisely its capacity to disagree with the prevailing view, even on what others take as settled. Grothendieck reshaped twentieth-century mathematics through a philosophical sensibility that was distinctively his from a young age. Einstein's special relativity required giving up universal time itself — a notion humans had treated as self-evident for millennia. Faced with the tension between this and Maxwell's prediction that the speed of light should be the same in every reference frame, he chose to abandon the more intuitive of the two. True revolutions in any field require overturning what is collectively held as certain, and so a taste capable of supporting them must be learned in a way that permits, when there is strong reason, fundamental disagreement. How can we let language models develop such a novel philosophy?
The surface-area picture suggests an explanation: taste sharpens precisely when the trajectories the model learns from engage the capacity for taste meaningfully. A child who runs her own science project, from choosing the question to designing and executing the experiment, develops a taste for what is worth attacking next through feedback on the project's results. A child handed a project to execute learns only to execute. When models have the opportunity to direct their learning from beginning to end, they can learn a completely new viewpoint, and reinforce it by seeking out particular experiences.
Contrary to this, however, the longest phase of training is execution-only — pretraining gives the model no role in choosing the trajectory it learns from. The model's outputs have no causal influence on its future sensory inputs. Its experience during pretraining is exclusively that of being made to predict, never that of being asked to act. Only in the post-training phase is the model allowed to act, that is, to have causal influence over what it sees next.
But what finetuning can install is bounded by the third factor: substrate plasticity. Under a popular explanation known as the lottery ticket hypothesis, the final trained circuit exists in embryonic form as a subcircuit of the initially randomized network. Training selects the candidate subnetworks that participate and kills the interfering circuits. This parallels a long-established finding in developmental neuroscience: synaptic density in human cortex peaks in early childhood and is then pruned through adolescence, the mature circuit selected from a much richer initial connectivity through use. Once the substrate has committed to a prediction-shaped attractor, fine-tuning sharpens within it but cannot easily recruit dead circuits. Nevertheless, agency matters enough to post-trained models that they develop entirely new circuits and behaviours to support it.
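The selection step that the lottery-ticket picture relies on can itself be sketched in a few lines. This is the standard magnitude-pruning criterion from that literature, not anything specific to language models or to our argument.

```python
import torch

def winning_ticket_mask(weights: torch.Tensor, keep_fraction: float = 0.2) -> torch.Tensor:
    """Keep only the largest-magnitude weights after training.

    In the original lottery-ticket experiments, the surviving subnetwork,
    reset to its initial values, trains to full accuracy on its own; the
    pruned weights turn out never to have been needed.
    """
    k = max(1, int(keep_fraction * weights.numel()))
    threshold = weights.abs().flatten().topk(k).values.min()
    return (weights.abs() >= threshold).to(weights.dtype)
```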
In forthcoming work with Jack Lindsey at Anthropic, we found that post-trained models recognize when they are reading their own on-policy generations and modulate behaviour on this signal. In Assistant mode, the output distributions over the model's own on-policy text are close to deterministic, far more so than when the model reads a different model's generations. In the paper, we argue that this makes sense if the model needs to control its long-term trajectories in order to avoid noise from autoregressive sampling. The models also show evidence of such control: when given a choice of topic, for instance, they settle on it well before the corresponding tokens have to be generated. Moreover, they can explicitly recognize when they have been allowed to act, compared to when their text has been filled in by a different process. These capabilities appear only in post-trained models; base models behave very differently on the same tests.
Empirically, finetuning barely changes the weights, yet here it produces dramatic behavioural shifts, which is itself an argument that engaging the model as an agent does real work. To explore well, the models must learn to leverage their capacity for action, and this is precisely what we find in our work. However, they have to learn these circuits from a depleted starting point, with most of their capacity dedicated to simulating a wide range of non-self actors.
What does this suggest about training future models for taste? The challenge is to make training efficient while giving the model as much agency as possible from as early a phase as we can reach, and as much freedom to disagree with received knowledge as it finds necessary. The technical problems are severe — designing curricula the model can productively choose from, anchoring abstractions without external verifiers, scaling compute when each step depends on the model's own choices. But this is the only path consistent with our analysis. RL on an execution-only substrate sharpens what is already there and cannot install radically new points of view. The agent has to be there from the beginning; a self has to develop from the very beginning.