We need a field of Reward Function Design

Steven Byrnes

(Brief pitch for a general audience, based on a 5-minute talk I gave.)

Let’s talk about Reinforcement Learning (RL) agents as a possible path to Artificial General Intelligence (AGI)

My research focuses on “RL agents”, broadly construed. These were big in the 2010s—they made the news for learning to play Atari games, and Go, at superhuman level. Then LLMs came along in the 2020s, and everyone kinda forgot that RL agents existed. But I’m part of a small group of researchers who still thinks that the field will pivot back to RL agents, one of these days. (Others in this category include Yann LeCun and Rich Sutton & David Silver.)

Why do I think that? Well, LLMs are very impressive, but we don’t have AGI (artificial general intelligence) yet—not as I use the term. Humans can found and run companies, LLMs can’t. If you want a human to drive a car, you take an off-the-shelf human brain, the same human brain that was designed 100,000 years before cars existed, and give it minimal instructions and a week to mess around, and now they’re driving the car. If you want an AI to drive a car, it’s … not that.

Teaching a human to drive a car / teleoperate a robot: minimal instruction, 30 hours of practice.

Teaching an AI to drive a car / teleoperate a robot: dozens of experts, 15 years, $5,000,000,000.

Anyway, human brains are the only known example of “general intelligence”, and they are “RL agents” in the relevant sense (more on which below). Additionally, as mentioned above, people are working in this direction as we speak. So, seems like there’s plenty of reason to take RL agents seriously.

So the upshot is: we should contingency-plan for real RL agent AGIs—for better or worse.

Reward functions in RL

If we’re talking about RL agents, then we need to talk about reward functions. Reward functions are a tiny part of the source code, with a massive influence on what the AI winds up doing.

For example, take an RL agent like AlphaZero, and give it a reward of +1 for winning at a board game and –1 for losing. As you train it, it will get better and better at winning. Alternatively, give it a reward of –1 for winning and +1 for losing. It will get better and better at losing. So if the former winds up superhuman at Reversi / Othello, then the latter would wind up superhuman at “Anti-Reversi”—an entirely different game! Again, tiny code change, wildly different eventual behavior.
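To make the “tiny code change” concrete, here is a minimal sketch. It is not AlphaZero: it assumes a much simpler stand-in (tabular Q-learning on a 10-stone Nim game where each player takes 1 to 3 stones and whoever takes the last stone wins, played against a random opponent). Every name in it (`train`, `win_rate`, `reward_win_seeking`, `reward_loss_seeking`) and every hyperparameter is made up for illustration. The two agents share all of the code except the one-line reward function.

```python
import random

PILE = 10      # starting stones (arbitrary choice for this toy game)
MAX_TAKE = 3   # a player may take 1..3 stones; taking the last stone wins


def legal_moves(stones):
    return range(1, min(MAX_TAKE, stones) + 1)


# The ONLY difference between the two agents is which of these is used.
def reward_win_seeking(agent_won):
    return 1.0 if agent_won else -1.0


def reward_loss_seeking(agent_won):
    return -1.0 if agent_won else 1.0


def train(reward_fn, episodes=20000, alpha=0.1, epsilon=0.2, seed=0):
    """Tabular Q-learning; the agent moves first, the opponent plays randomly."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(1, PILE + 1) for a in legal_moves(s)}
    for _ in range(episodes):
        stones = PILE
        while stones > 0:
            moves = list(legal_moves(stones))
            if rng.random() < epsilon:          # explore
                a = rng.choice(moves)
            else:                               # exploit current estimates
                a = max(moves, key=lambda m: q[(stones, m)])
            left = stones - a
            if left == 0:                       # agent took the last stone
                target, next_stones = reward_fn(True), 0
            else:
                next_stones = left - rng.choice(list(legal_moves(left)))
                if next_stones == 0:            # opponent took the last stone
                    target = reward_fn(False)
                else:                           # bootstrap from the next state
                    target = max(q[(next_stones, m)]
                                 for m in legal_moves(next_stones))
            q[(stones, a)] += alpha * (target - q[(stones, a)])
            stones = next_stones
    return q


def win_rate(q, games=1000, seed=1):
    """Fraction of games the greedy policy wins against a random opponent."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(games):
        stones = PILE
        while stones > 0:
            stones -= max(legal_moves(stones), key=lambda m: q[(stones, m)])
            if stones == 0:
                wins += 1
                break
            stones -= rng.choice(list(legal_moves(stones)))
    return wins / games


if __name__ == "__main__":
    print("win-seeking agent: ", win_rate(train(reward_win_seeking)))
    print("loss-seeking agent:", win_rate(train(reward_loss_seeking)))
```

Under these assumptions, the first agent should end up winning almost every game and the second losing almost every game, even though the learning algorithm, the environment, and the hyperparameters are identical.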

I claim that if you give a powerful RL agent AGI the wrong reward function, then it winds up with callous indifference to whether people live or die, including its own programmers and users.

But what’s the right reward function? No one knows. It’s an open question.

Why is that such a hard problem? It’s a long story, but just as one hint, try comparing:

“negative reward for lying”, versus
“negative reward for getting caught lying”.

The first one seems like a good idea. The second one seems like a bad idea. But these are actually the same thing, because obviously the reward function will only trigger if the AI gets caught.
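A toy sketch of why the two collapse into one: a reward function can only be computed from signals the training setup actually has, and “the AI lied” is not one of those signals unless something detects the lie. All names and numbers below (`Behavior`, `detection_prob`, the payoffs, the penalty of 10) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    name: str
    lies: bool
    detection_prob: float  # chance that an overseer notices the lie
    task_payoff: float     # reward earned from the task itself


LIE_PENALTY = 10.0  # arbitrary illustrative value


def intended_reward(b: Behavior) -> float:
    """What the designer has in mind: penalize lying itself.
    Needs ground truth about whether the agent lied, which we assume
    the training setup does not have."""
    return b.task_payoff - (LIE_PENALTY if b.lies else 0.0)


def implemented_reward(b: Behavior) -> float:
    """What can actually be computed: the penalty fires only when a lie
    is detected, so in expectation it scales with detection_prob."""
    expected_penalty = LIE_PENALTY * b.detection_prob if b.lies else 0.0
    return b.task_payoff - expected_penalty


BEHAVIORS = [
    Behavior("honest", lies=False, detection_prob=0.0, task_payoff=1.0),
    Behavior("clumsy lie", lies=True, detection_prob=0.9, task_payoff=2.0),
    Behavior("careful lie", lies=True, detection_prob=0.01, task_payoff=2.0),
]


def best(reward_fn) -> str:
    """Name of the behavior with the highest (expected) reward."""
    return max(BEHAVIORS, key=reward_fn).name


if __name__ == "__main__":
    print("intended reward favors:   ", best(intended_reward))
    print("implemented reward favors:", best(implemented_reward))
```

In this made-up setup the intended reward favors honesty, while the reward that can actually be implemented favors the lie that is hardest to detect.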

As it turns out, if you pick up a 300-page RL textbook, you’ll probably find that it spends a few paragraphs on what the reward function should be, while the other 299½ pages are ultimately about how to maximize that reward function: how the reward signals update the trained model, how the trained model is queried, sometimes how predictive learning fits in, and so on.

Reward functions in neuroscience

…And it turns out that there’s a similar imbalance in neuroscience:

The human brain also has an RL reward function. It’s sometimes referred to as “innate drives”, “primary rewards”, “primary punishers”, etc.—things like ‘pain is bad’ and ‘eating when you’re hungry is good’. And just like in RL, the overwhelming majority of effort in AI-adjacent neuroscience concerns how the reward function updates the trained models, and other sorts of trained model updates, and how the trained models are queried, and so on. This part involves the cortex, basal ganglia, and other brain areas. Meanwhile, approximately nobody in NeuroAI cares about the reward function itself, which mainly involves the hypothalamus and brainstem.

We need a (far more robust) field of “reward function design”

So here’s the upshot: let’s learn from biology, let’s innovate in AI, let’s focus on AI Alignment, and maybe we can get into this Venn diagram intersection, where we can make headway on the question of what kind of reward function would lead to an AGI that intrinsically cares about our welfare. As opposed to callous sociopath AGI. (Or if no such reward function exists, then that would also be good to know!)

Oh man, are we dropping this ball

You might hope that the people working most furiously to make RL agent AGI—and claiming that they’ll get there in as little as 10 or 20 years—are thinking very hard about this reward function question.

Nope!

For example, see:

“The Era of Experience” has an unsolved technical alignment problem (2025), where I discuss a cursory and flawed analysis of reward functions by David Silver & Rich Sutton;
LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem (2023), where I discuss a cursory and flawed analysis of (related) “intrinsic cost modules” by Yann LeCun;
Book review: “A Thousand Brains” (2021), where I discuss a cursory and flawed analysis of the (related) “old brain” by Jeff Hawkins.

…And those are good ones, by the standards of this field! Their proposals are fundamentally doomed, but at least it occurred to them to have a proposal at all. So hats off to them—because most researchers in RL and NeuroAI don’t even get that far.

Let’s all try to do better! Going back to that Venn diagram above…

Reward Function Design: Neuroscience research directions

For the “reward functions in biology” part, a key observation is that the human brain reward function leads to compassion, norm-following, and so on—at least, sometimes. How does that work?

If we can answer that question, it might be a jumping-off point for AGI reward functions.

I worked on this neuroscience problem for years, and wound up with some hypotheses. See Neuroscience of human social instincts: a sketch for where I’m at. But it needs much more work, especially connectomic and other experimental data to ground the armchair hypothesizing.

Reward Function Design: AI research directions

Meanwhile on the AI side, there’s been some good work clarifying the problem—for example people talk about inner and outer misalignment and so on—but there’s no good solution. I think we need new ideas. I think people are thinking too narrowly about what reward functions can even look like.

For a snapshot of my own latest thinking on that topic, see my companion post Reward Function Design: a starter pack.

Bigger picture

To close out, here’s the bigger picture as I see it.

Aligning “RL agent AGI” is different from (and much harder than) aligning the LLMs of today. And the failures will be more like “SkyNet” from Terminator, than like “jailbreaks”. (See Foom & Doom 2: Technical alignment is hard.)

…But people are trying to make those agents anyway.

We can understand why they’d want to do that. Imagine unlimited copies of Jeff Bezos for $1/hour. You tell one of them to go write a business plan, and found and grow and run a new company, and it goes and does it, very successfully. Then tell the next one, and the next one. This is a quadrillion-dollar proposition. So that’s what people want.

But instead of “Jeff Bezos for $1/hour”, I claim that what they’re gonna get is “a recipe for summoning demons”.

Unless, of course, we solve the alignment problem!

I think things will snowball very quickly, so we need advance planning. (See Foom & Doom 1.) Building this field of “Reward Function Design” is an essential piece of that puzzle, but there are a great many other things that could go wrong too. We have our work cut out.

Can I ask you to unwind the fundamentals a step further, and say why you and neuroscientists in general believe the brain operates by RL and has a reward function? And how far down the scale of life these have been found?

Oh, it’s definitely controversial—as I always say, there is never a neuroscience consensus. My sense is that a lot of the controversy is about how broadly to define “reinforcement learning”.

If you use a narrow definition like “RL is exactly those algorithms that are on arxiv cs.AI right now with an RL label”, then the brain is not RL.

If you use a broad definition like “RL is anything with properties like Thorndike's law of effect”, then, well, remember that “reinforcement learning” was a psychology term long before it was an AI term!

If it helps, I was arguing about this with a neuroscientist friend (Eli Sennesh) earlier this year, and wrote the following summary (not necessarily endorsed by Eli) afterwards in my notes:

Eli doesn’t like the term “RL” in a brain context because of (1) its implication that "reward" is stuff in the environment as opposed to an internal “reward function” built from brain-internal signals, (2) its implication that we’re specifically maximizing an exponentially-discounted sum of future rewards.
…Whereas I like the term “RL” because (1) If brain-like algorithms showed up on GitHub, then everyone in AI would call it an “RL algorithm”, put it in “RL textbooks”, and use it to solve “RL problems”, (2) This follows the historical usage (there’s reinforcement, and there’s learning, per Thorndike’s Law of Effect etc.).
When I want to talk about “the brain’s model-based RL system”, I should translate that to “the brain’s Bellman-solving system” when I’m talking to Eli, and then we’ll be more-or-less on the same page I think?
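For readers unfamiliar with the two technical notions in these notes, their standard textbook forms are sketched below. The symbols are the usual RL notation (γ is the discount factor, r the reward, Q* the optimal action-value function) and are not taken from the discussion with Eli.

```latex
% Exponentially discounted sum of future rewards (the "return"):
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}, \qquad 0 \le \gamma < 1

% Bellman optimality equation for action values:
Q^{*}(s,a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \right]
```

Objection (2) is to the first expression as a description of what brains maximize; “Bellman-solving” presumably refers to finding values that satisfy something like the second.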

…But Eli is just one guy, I think there are probably dozens of other schools-of-thought with their own sets of complaints or takes on “RL”.

how far down the scale of life these have been found?

I don’t view this as particularly relevant to understanding human brains, intelligence, or AGI, but since you asked, if we define RL in the broad (psych-literature) sense, then here’s a relevant book excerpt:

Pavlovian conditioning occurs in a naturally brainless species, sea anemones, but it is also possible to study protostomes that have had their brains removed. An experiment by Horridge[130] demonstrated response–outcome conditioning in decapitated cockroaches and locusts. Subsequent studies showed that either the ventral nerve cord[131,132] or an isolated peripheral ganglion[133] suffices to acquire and retain these memories.
In a representative experiment, fine wires were inserted into two legs from different animals. One of the legs touched a saline solution when it was sufficiently extended, a response that completed an electrical circuit and produced the unconditioned stimulus: shock. A yoked leg received shock simultaneously. The two legs differed in that the yoked leg had a random joint angle at the time of the shock, whereas the master leg always had a joint angle large enough for its “foot” to touch the saline. Flexion of the leg reduced the joint’s angle and terminated the shock. After one leg had been conditioned, both legs were then tested independently. The master leg flexed sufficiently to avoid shock significantly more frequently than the yoked leg did, demonstrating a response–outcome (R–O) memory. —Evolution of Memory Systems

It seems likely to me that "driving a car" used as a core example actually took something like billions of person-years including tens of millions of fatalities to get to the stage that it is today.

Some specific human, after being raised for more than a dozen years in a social background that includes frequent exposure to car-driving behaviour both in person and in media, is already somewhat primed to learn how to safely drive a vehicle designed for human use within a system of road rules and infrastructure customized over more than a century to human skills, sensory modalities, and background culture. All the vehicles, roads, markings, signs, and rules have been designed and redesigned so that humans aren't as terrible at learning how to navigate them as they were in the first few decades.

Many early operators of motor vehicles (adult, experienced humans) frequently did things that were frankly insane by modern standards and would send an AI driving research division back to redesign if their software did such things even once.

I agree with this, and I’d split your point into three separate factors.
The "30 hours to learn to drive" comparison hides at least:
(1) Pretraining: evolutionary pretraining of our visual/motor systems plus years of everyday world experience;
(2) Environment/institution design: a car/road ecosystem (infrastructure, norms, licensing) that has been iteratively redesigned for human drivers;
(3) Reward functions: they do matter for sample efficiency, but in this case they don’t seem to be the main driver of the gap.
Remove (1) and (2) and the picture changes: a blind person can’t realistically learn to drive safely in the current environment, and a new immigrant who speaks no English can’t pass the UK driving theory test without first learning English or Welsh, because of language policy, not because their brain’s reward function is worse.
A large part of the sample-efficiency gap here seems to be about pretraining and environment/institution design, rather than about humans having magically "better reward functions" inside a tabula-rasa learner.

humans having magically "better reward functions"

Tbc this is not my position. I think that humans can do lots of things LLMs can’t, e.g. found and grow and run innovative companies from scratch, but not because of their reward functions. Likewise, I think a quite simple reward function would be sufficient for (misaligned) ASI with capabilities lightyears beyond both humans and today’s LLMs. I have some discussion here & here.

Thanks for the clarification and the links - My guess is that the real crux is how far "reward-function design" can really go, even in principle. If you have a very capable RL system in an open-ended environment and you heavily optimize a single scalar, then Goodhart / overoptimization / mesa-optimization tend to push towards extreme solutions. Under that picture, reward functions are just one axis in a larger design space (pretraining, environment/constraints, interpretability, multi-objective structure, etc.), and no cleverly designed scalar reward on its own looks like a stable solution.
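A toy illustration of the overoptimization worry (a sketch with invented functions, not a model of any real RL system): the proxy and the true objective agree under light optimization and come apart under heavy optimization. The names `proxy_reward`, `true_value`, and `optimize_proxy`, and the quadratic form of the true objective, are all assumptions made for the example.

```python
def proxy_reward(x: float) -> float:
    """The single scalar the optimizer actually sees (hypothetical)."""
    return x


def true_value(x: float) -> float:
    """What the designer actually cares about (hypothetical): tracks the
    proxy for small x, peaks at x = 5, then falls without bound."""
    return x - 0.1 * x * x


def optimize_proxy(steps: int, step_size: float = 0.5, x: float = 0.0) -> float:
    """Gradient ascent on the proxy via finite differences.
    `steps` stands in for the amount of optimization pressure."""
    eps = 1e-3
    for _ in range(steps):
        grad = (proxy_reward(x + eps) - proxy_reward(x - eps)) / (2 * eps)
        x += step_size * grad
    return x


if __name__ == "__main__":
    for steps in (2, 10, 40):
        x = optimize_proxy(steps)
        print(f"steps={steps:3d}  proxy={proxy_reward(x):7.2f}  "
              f"true={true_value(x):7.2f}")
```

In this toy, the proxy keeps rising with more optimization steps, while the true value first improves and then turns sharply negative.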

I agree. And I think the same point applies to alignment work on LLM AGI. Even though RL is used for LLM alignment and we expect more of it, there's not what I'd call a field of reward function design. Most alignment work on LLMs is probing how the few RL alignment attempts work, rather than trying different reward functions and seeing what they do. And there doesn't even seem to be much theorizing about how alternate reward functions might change the alignment of current or future, more capable LLMs.

I think this analogy is pretty strong, and many of the questions are the same, even though the sources of RL signals are pretty different. The reward function for RL on LLMs seems to be more complex. It uses specs or Anthropic's constitution, and now perhaps the much richer Claude 4.5 Opus' Soul Document, all as interpreted by another LLM to produce an RL signal. But the reward functions of more RL-agent-like and brainlike systems are pretty complex too: they're nontrivial as hardwired, and are then expressed through a complex environment and a critic/value function that learns a lot. I think there's a lot of similarity in the questions involved.

So I think your RL training signal starter pack is pretty relevant to LLM AGI alignment theory, too. It's nice to have those all in one place and some connections drawn out. I hope to comment over there after thinking it through a little more.

And this seems pretty important for LLMs even though they have lots of pretraining which changes the effect of RL dramatically. RL (and cheap knockoff imitations like DPO) is playing an increasingly large role in training recent LLMs. A lot of folks expect it to be critical for further progress on agentic capabilities. I expect something slightly different, self-directed continuous learning, but that would still have a lot of similarities even if it's not implemented literally as RL.

And RL has arguably always played a large role in LLM alignment. I know you attributed most of LLMs' alignment to their supervised training magically transmuting observations into behavior. But I think pretraining transmutes observations into potential behavior, and RL posttraining selects which behavior you get, doing the bulk of the alignment work. RL is sort of selecting goals from learned knowledge as Evan Hubinger pointed out on that post.

But more accurately, it's selecting behavior, and any goals or values are only sort of weakly implicit in that behavior. That's an important distinction. There's a lot of that in humans, too, although goals and values are also pursued through more explicit predictions and value function/critic reward estimates.

I'm not sure if it matters for these purposes, but I think the brain is also doing a lot of supervised, predictive learning, and the RL operates on top of that. But the RL also drives behavior and attention, which directs the predictive learning, so it's a different interaction than the LLMs' pretraining-then-RL-to-select-behaviors.

In all, I think LLM descendants will have several relevant similarities to brainlike systems. Which is mostly a bad thing, since the complexities of online RL learning get even more involved in their alignment.

Thanks for the post. The importance of reward function design for solving the alignment problem is worth emphasizing.

I'm wondering how your research fits into other reward function alignment research such as CHAI's research on CIRL and inverse reinforcement learning, and reward learning theory.

It seems like these other agendas are focused on using game theory or machine learning fundamentals to come up with a new RL approach that makes AI alignment easier whereas your research is more focused on the intersection of neuroscience and RL.

My very diplomatic answer is: the field of Reward Function Design should be a rich domain with lots of ideas. Curiosity drive is one of them, and so is reward shaping, and so is IRL / CIRL, etc. What else should be on that list that hasn’t been invented yet? Well, let’s invent it! Let a thousand flowers bloom!

…Less diplomatically, since you asked, here’s a hot take. I’m not 100% confident, but I currently don’t think IRL / CIRL per se is a step forward for the kinds of alignment problems I’m worried about. Some possible issues (semi-overlapping) include (1) ontology identification (figuring out which latent variables if any correspond to a human, or human values, in a learned-from-scratch unlabeled world-model); (2) “the hard problem of wireheading”; (3) “the problem of fully updated deference”; (4) my guess that the “brain-like AGI” that I’m specifically working on simply wouldn’t be compatible with IRL / CIRL anyway (i.e. I’m worried that IRL-compatible algorithms would be much less powerful); and (5) my lack of confidence in the idea that learning what a particular human wants to do right now, and then wanting the same thing, really constitutes progress on the ASI x-risk problem in the first place.

This post reminded me of a book called The Technological Singularity (2015) by Murray Shanahan that also emphasizes the importance of reward function design for advanced AI. Relevant extract from the book:

"In the end, everything depends on the AI’s reward function. From a cognitive standpoint, human-like emotions are a crude mechanism for modulating behavior. Unlike other cognitive attributes we associate with consciousness, there seems to be no logical necessity for an artificial general intelligence to behave as if it had empathy or emotion. If its reward function is suitably designed, then its benevolence is assured. However, it is extremely difficult to design a reward function that is guaranteed not to produce undesirable behavior. As we’ll see shortly, a flaw in the reward function of a superintelligent AI could be catastrophic. Indeed such a flaw could mean the difference between a utopian future of cosmic expansion and unending plenty, and a dystopian future of endless horror, perhaps even extinction."

Cool, good find!

…Too bad that he seems to be making the common mistake of conflating “reward function” with “utility function” / “goals” (AFAICT from skimming how he uses the term in that book). They’re related but different.

