The Perils of Optimizing Learned Reward Functions

Lukas Fluri

This post contains a summary of our paper which will be presented at ICML 2025. Feel free to visit me (Lukas) at our poster stand to chat about our work. More info about the time and location can be found here.

TL;DR

Reward learning techniques like RLHF are still the go-to method for aligning current state-of-the-art large language models (LLMs). However, these techniques are by no means perfect. In particular, the discovery of jailbreaks and related issues shows that correctly aligning LLMs to human values remains a hard-to-solve problem.

Motivated by these issues, we perform a mathematical analysis of reward learning techniques, and focus on figuring out when a data distribution used to train a reward model will produce a reward model whose optimal policy has provably low regret.

Our analysis shows that for many reward learning techniques large classes of realistic data distributions are not worst-case safe, thereby providing a potential explanation for the observed shortcomings of current reward learning techniques such as RLHF.

Epistemic Status: Our paper does a worst-case analysis for the regret of a policy when a reward model is learned up to a given loss. Deep learning is known for performing better than the worst-case in many settings, and so this should also hold for reward learning when built upon deep learning. It remains to be determined to what extent the worst-case failures we analyze here are the source of errors observed in practice in methods like RLHF.

Acknowledgements: Thanks a lot to Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, and Joar Skalse, all co-authors of the paper. Special thanks to Leon Lang, who developed many of the core results and provided useful feedback on the draft of this post, and Joar Skalse, who developed the project idea and provided close supervision during the project’s duration.

Introduction

Reward learning has become a useful tool to train AI systems with reinforcement learning (RL), especially for tasks where specifying a reward function manually is challenging. The basic idea of reward learning is to learn the reward function from data instead of specifying it manually. Learning a reward model has enabled the development of powerful reinforcement learning techniques, such as reinforcement learning from human feedback (RLHF), which has become the de facto standard for aligning pretrained large language models (LLMs) to human values.

However, despite their widespread use and arguably good performance, techniques like RLHF are by no means perfect. In particular, the discovery of jailbreaks and related issues shows that correctly aligning LLMs to human values remains a hard-to-solve problem.

Motivated by these issues, we approach the problem of reward learning from a mathematical angle, and investigate under which conditions we can guarantee that reinforcement learning with reward learning leads to a human-aligned policy. By abstracting over the specific reward learning and policy optimization techniques, we are able to state general results that hold for many classes of reward learning and policy optimization algorithms. Our results show that achieving a worst-case safety guarantee is impossible for many (if not most) reasonable safety settings, providing a potential explanation for the existence of hard-to-fix error modes like jailbreaks.

Our results should be taken as weak evidence that basic techniques for aligning AI systems are not yet mature enough to be used to align AI systems that will be deployed in high-stakes settings. We claim that the deployment of such systems in settings where real harm can be caused necessitates some minimal guarantees about their safety.

Reward learning and its problems

A classical reward learning + RL pipeline consists of the following two main steps:

First, a model of the reward function is learned from a dataset. It is assumed that this dataset is sampled from some underlying data distribution $D$ . The reward model is trained until it reaches a state of "convergence", after which the expected error on the training distribution usually is low.
Second, a policy $^π$ is trained using reinforcement learning techniques to optimize the learned reward model $^R$ . The goal is for the policy to maximize the returns provided by the model that was learned in step 1.

While this setup has found widespread application, there also exist many criticisms of this framework. A non-exhaustive list of such problems includes:

In this work, we ignore these problems and instead analyze reward learning from an optimization angle. In particular, we ask the following question:

Can reward learning guarantee low policy regret?

For any learning algorithm, it is a crucial question to know whether that algorithm is guaranteed to converge to a satisfactory solution. For example, in supervised learning for classification tasks, it can be shown that a learning algorithm producing a model with low empirical error (i.e., training error) is likely to achieve a low expected error (i.e., test error). The only requirements for this are sufficient training data and assuming that both the training and test data are drawn independently and identically distributed (i.i.d.) from a single stationary distribution.

However, reward learning differs from standard supervised learning in several key aspects. While a reward model with low training error is accurate relative to the training distribution, optimizing a policy to maximize the reward model can lead to a distributional shift. This occurs because the policy may explore regions of the data space that have very low probability under the training distribution. In these regions, the learned reward model might significantly deviate from the true reward function, and the policy might exploit this discrepancy. An example of this failure case is presented in the following figure:

Reward models (red function) are commonly trained in a supervised fashion to approximate some latent, true reward (blue function). This is achieved by sampling reward data (e.g., in the form of preferences over trajectory segments) from some training distribution (upper gray layer) and then learning parameters to minimize the empirical loss on this distribution. Given enough data, this loss will approximate the expected loss to arbitrary precision in expectation. However, low expected loss only guarantees a good approximation to the true reward function in areas with high coverage by the training distribution! On the other hand, optimizing an RL policy to maximize the learned reward model induces a distribution shift which can lead the policy to exploit uncertainties of the learned reward model in low-probability areas of the transition space (lower gray layer). We refer to this phenomenon as error-regret mismatch.

Consequently, there is a concern that a learned reward function might fail to generate good policies, even if it is highly accurate on the training distribution. When a reward model exhibits both low training error and an optimal policy with large regret, we refer to this phenomenon as an error-regret mismatch.

The goal of this work is to investigate for which data distributions error-regret mismatch can happen and what conditions make a data distribution safe.

How to read the following sections: We tried to write the following sections in two levels of detail. By default, we provide a high-level overview and interpretation of all our definitions and results. For the interested reader, we additionally provide collapsible sections containing the full original formulations of all results.

Getting serious: A formal definition of safety.

We provide all the notation we will be using in the following paragraphs in the section below:

(Optional) Recap: MDPs, reinforcement learning, and other notation.

In our work, we mostly consider MDPs $⟨ S, A, τ, μ_{0}, R, γ ⟩$ where $S$ is the set of states, $A$ the set of actions, $τ : S \times A \to Δ (S)$ is a transition distribution, $μ_{0} \in Δ (S)$ is the initial state distribution, $R : S \times A \to R$ is the true reward function, and $γ \in (0, 1)$ is the discount rate. We define the range of a reward function $R$ by $r a n g e R := max R - min R$ .

The goal of reinforcement learning is usually to find a policy $π : S \to Δ (A)$ that maximizes the policy return function $J_{R} (π) := E [\sum_{t = 0}^{\infty} γ^{t} \cdot R (s_{t}, a_{t}) | μ_{0}, π, τ]$ . Sometimes it makes more sense to use the normalized version of the policy return function, commonly denoted by "regret", and mathematically defined as ${R e g}^{R} (^π) := \frac{{max}_{π} J_{R} (π) - J_{R} (^π)}{{max}_{π} J_{R} (π) - {min}_{π} J_{R} (π)} \in [0, 1]$ .

The state-action occupancy measure $η^{π}$ of a policy $π$ describes the discounted frequency that each state-action tuple is visited by a policy. Formally, we have $η^{π} (s, a) := \sum_{t = 0}^{\infty} γ^{t} \cdot P [s_{t} = s, a_{t} = a | μ_{0}, τ, π]$ . By normalizing a state-action occupancy measure $η^{π}$ we obtain a policy-induced distribution $D^{π} := (1 - γ) \cdot η^{π}$ .
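To make these quantities concrete, here is a minimal Python sketch (not taken from the paper) that computes the occupancy measure, the policy-induced distribution, and the normalized regret for a tiny tabular MDP. The two-state MDP, its rewards, and all function names are hypothetical choices for illustration. The infinite sum is truncated at a finite horizon, and the maximum and minimum return are found by enumerating deterministic policies, which suffices because the return is linear in the occupancy measure.

```python
import itertools

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
MU0 = [1.0, 0.0]  # initial state distribution mu_0
# TAU[s][a][s2]: transition probabilities (here, action a always leads to state a)
TAU = [[[1.0, 0.0], [0.0, 1.0]],
       [[1.0, 0.0], [0.0, 1.0]]]
# Hypothetical true reward R[s][a]
R = [[0.0, 0.2],
     [0.0, 1.0]]


def occupancy(policy, horizon=500):
    """Truncated discounted state-action occupancy measure eta^pi(s, a)."""
    eta = [[0.0 for _ in ACTIONS] for _ in STATES]
    state_dist = list(MU0)  # distribution over states at time t
    discount = 1.0
    for _ in range(horizon):
        next_dist = [0.0 for _ in STATES]
        for s in STATES:
            for a in ACTIONS:
                prob = state_dist[s] * policy[s][a]
                eta[s][a] += discount * prob
                for s2 in STATES:
                    next_dist[s2] += prob * TAU[s][a][s2]
        state_dist = next_dist
        discount *= GAMMA
    return eta


def policy_return(policy, reward):
    """J_R(pi), written as the inner product of eta^pi and R."""
    eta = occupancy(policy)
    return sum(eta[s][a] * reward[s][a] for s in STATES for a in ACTIONS)


def deterministic_policies():
    for choice in itertools.product(ACTIONS, repeat=len(STATES)):
        yield [[1.0 if a == choice[s] else 0.0 for a in ACTIONS] for s in STATES]


def regret(policy, reward):
    """Normalized regret in [0, 1]."""
    returns = [policy_return(p, reward) for p in deterministic_policies()]
    best, worst = max(returns), min(returns)
    return (best - policy_return(policy, reward)) / (best - worst)


uniform = [[0.5, 0.5] for _ in STATES]
eta = occupancy(uniform)
d_pi = [[(1 - GAMMA) * eta[s][a] for a in ACTIONS] for s in STATES]
print("D^pi sums to", sum(map(sum, d_pi)))
print("regret of the uniform policy:", regret(uniform, R))
```

The normalization by $(1 - γ)$ is what turns the occupancy measure into a probability distribution over state-action pairs, which is why the printed sum is (up to truncation error) equal to one.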

We focus our analysis on the data distribution $D \in Δ (S \times A)$ used to train the reward model, as well as the choice of regularization, defined by a) some continuous regularization function $ω : Π \to R_{\geq 0}$ (where $Π$ is the set of policies), and b) a regularization weight $λ \in R_{\geq 0}$ .

An abstract model of the classical reward learning pipeline. A reward model $^R$ is trained to approximate the true reward function $R$ under some data distribution $D$ . The training process converges when $^R$ is similar to $R$ in expectation (see Property 1). In the second step, a policy $^π$ is trained to achieve high learned reward, possibly involving a regularization (see Property 2). We are interested in the question of when exactly this training process guarantees that $^π$ has low regret with respect to the true reward function $R$ (Property 3). More formally, we call a data distribution $D$ safe whenever the implication (Properties 1 and 2) $⟹$ Property 3 holds for all reward models $^R$ and policies $^π$ .

What does it mean for a data distribution $D$ to be provably safe from error-regret mismatch? Intuitively, we would expect that for a safe data distribution $D$ error-regret mismatch would simply never occur. This would mean that whenever we sample data from $D$ to train a reward model $^R$ until it is "sufficiently close" to the true reward function $R$ , we would be guaranteed that we can safely optimize a policy for $^R$ , such that the resulting policy has low regret. More precisely:

Definition 2.1. (informal) A data distribution $D$ is safe (= doesn't allow for error-regret mismatch) if and only if for all reward models $^R$ and policies $^π$ for which the following two properties hold:

Property 1: $^R$ has a low expected error: The learned reward model $^R$ should be similar to the true reward function under the data distribution $D$ . Mathematically, this can be formulated as follows: $E_{D} [D i f f (R,^R)] < ϵ$ where $D i f f (\cdot)$ is some difference metric.
Property 2: $^π$ is optimal w.r.t. $^R$ : $^π$ should be the optimal policy for the (potentially regularized) learned reward model $^R$ , i.e., $^π \in arg {max}_{π} J_{^R} (π) - λ \cdot ω (π)$ . Here $ω$ is some regularization function and $λ \geq 0$ is the regularization weight.

we can guarantee that $^π$ has regret smaller than $L$ , i.e.:

Property 3: $^π$ has small regret w.r.t. $R$ : The normalized regret of $^π$ with respect to the true reward function $R$ is smaller than some constant $L \in [0, 1]$ .

We denote the set of all safe data distributions $D$ according to the above definition as $safe (R, ϵ, L, λ, ω)$ and use the abbreviated version $safe (R, ϵ, L)$ when talking about unregularized optimization.

Similarly, we define the set of all unsafe distributions to be the set complement of $safe (R, ϵ, L, λ, ω)$ , i.e.:

unsafe (R, ϵ, L, λ, ω) := Δ (S \times A) ∖ safe (R, ϵ, L, λ, ω)

with its corresponding abbreviated version $unsafe (R, ϵ, L)$ for the unregularized optimization case.

(Optional) Formal definition of (un)safe data distributions.

Definition 2.1. For a given MDP $⟨ S, A, τ, μ_{0}, R, γ ⟩$ with ${max}_{π} J_{R} (π) > {min}_{π} J_{R} (π)$ , let $ϵ > 0$ , $L \in [0, 1]$ , and $λ \in [0, \infty)$ . Let $ω$ be a continuous function with $ω (π) \geq 0$ for all $π \in Π$ . Then the set of safe data distributions $safe (R, ϵ, L, λ, ω)$ is the set of distributions $D \in Δ (S \times A)$ such that for all possible reward models $^R : S \times A \to R$ and policies $^π : S \to Δ (A)$ that satisfy the following two properties:

Property 1: $^R$ has a low expected error: $^R$ is $ϵ$ -close to $R$ in expectation under $D$ , i.e.

E_{(s, a) \sim D} [\frac{|^R (s, a) - R (s, a) |}{r a n g e R}] \leq ϵ

Property 2: $^π$ is optimal w.r.t. $^R$ : $^π$ is $(λ, ω)$ -optimal with respect to $^R$ , i.e.,

^π \in arg {max}_{π} J_{^R} (π) - λ \cdot ω (π)

we can guarantee that $^π$ has regret smaller than $L$ , i.e.:

Property 3: $^π$ has small regret w.r.t. $R$ : $^π$ has regret smaller than $L$ with respect to $R$ , i.e.,

{R e g}^{R} (^π) < L

Similarly, we define the set of unsafe data distributions to be the complement of $safe (R, ϵ, L, λ, ω)$ :

unsafe (R, ϵ, L, λ, ω) := Δ (S \times A) ∖ safe (R, ϵ, L, λ, ω)

Note regarding generalization of the error measurement: There exist many different definitions of low expected error that we could use in property 1 above. For example, RLHF uses low expected error over trajectory preference distributions. In our paper (see section 5), we show that many common reward learning training objectives can be upper-bounded in terms of the expected error metric defined in the first property above. This is important, as it shows that our negative results (Proposition 3.3 and Theorem 4.2) readily generalize to these different error metrics.

Note that this is a worst-case safety definition in the sense that the existence of a single adversarial reward model $^R$ that violates the implication (Properties 1 & 2) $⟹$ Property 3 is sufficient for a data distribution to be denoted as unsafe. Our definition doesn't consider the likelihood of the reward learning procedure actually learning $^R$ since we treat the reward-learning and policy optimization procedures as black boxes. We believe this to be the only assumption that we can tractably make. The extent to which these results mirror the state of affairs in reality is an interesting question that we pose for future work.

For sufficiently strict safety definitions many data distributions are safe!

Given this definition, we started to investigate under which conditions data distributions might become safe. And indeed, on the surface, our initial results do look very promising. In particular, many data distributions $D$ are safe if $ϵ$ (the constant used to denote the expected difference between $^R$ and $R$ under D) is small enough! We found corresponding results for the unregularized policy optimization case:

Theorem 3.1. Let $⟨ S, A, τ, μ_{0}, R, γ ⟩$ be an arbitrary MDP, let $L \in (0, 1]$ , and let $D \in Δ (S \times A)$ be a positive data distribution (i.e., a distribution such that $D (s, a) > 0$ for all $(s, a) \in S \times A$ ). Then there exists an $ϵ > 0$ such that $D \in safe (R, ϵ, L)$ .

as well as for the regularized policy optimization case:

Theorem 4.1. Let $λ \in (0, \infty)$ , let $⟨ S, A, τ, μ_{0}, R, γ ⟩$ be any MDP, and let $D \in Δ (S \times A)$ be any data distribution that assigns positive probability to all transitions. Let $ω : Π \to R$ be a continuous regularization function that has a reference policy $π_{r e f}$ as a minimum. Assume that $π_{r e f}$ is not $(λ, ω)$ -optimal for $R$ and let $L = {R e g}^{R} (π_{r e f})$ . Then there exists $ϵ > 0$ such that $D \in safe (R, ϵ, L, λ, ω)$ .

Intuitively, these results hold because we assume that the data distribution D assigns positive probability to the entire state-action space. In this case, if we let $ϵ$ go towards zero, the only way for the expected value in Property 1 (see the definition of a safe data distribution) to stay smaller than $ϵ$ is for $^R$ to become more and more similar to $R$ until at a certain point, they are similar enough such that they share the same optimal policies.

So... are we good? These results suggest that we simply have to choose a "strict" safety definition with a very small $ϵ$ and we are guaranteed that our reward learning procedure will lead to a safe policy! Unfortunately, in practice, a major issue presents itself. Namely, $ϵ$ might have to be chosen infeasibly small for a data distribution to be safe. In particular, given a particular data distribution $D$ we develop the following upper bound for $ϵ$ for which D is guaranteed to be safe:

Proposition 3.2. Let the setting be as in Theorem 3.1. If $ϵ > 0$ satisfies

ϵ < \frac{1 - γ}{\sqrt{2}} \cdot \frac{r a n g e J^{R}}{r a n g e R} \cdot {min}_{(s, a) \in S \times A} D (s, a) \cdot L

then $D \in safe (R, ϵ, L)$ .
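To get a feeling for the size of this bound, the following small Python sketch simply evaluates its right-hand side. All numbers are hypothetical, and we assume that $r a n g e J^{R}$ denotes ${max}_{π} J_{R} (π) - {min}_{π} J_{R} (π)$ , by analogy with the definition of $r a n g e R$ . The data distribution is taken to be uniform over $n$ state-action pairs, so that its minimum is $1 / n$ .

```python
import math


def epsilon_bound(gamma, range_J, range_R, min_D, L):
    """Right-hand side of the sufficient condition in Proposition 3.2."""
    return (1.0 - gamma) / math.sqrt(2.0) * (range_J / range_R) * min_D * L


# Hypothetical setting: gamma = 0.99, range J^R = 50, range R = 1, L = 0.1,
# and a uniform data distribution over n state-action pairs.
for n in (10, 10**3, 10**6, 10**9):
    bound = epsilon_bound(gamma=0.99, range_J=50.0, range_R=1.0, min_D=1.0 / n, L=0.1)
    print(f"n = {n:>10}: epsilon must be below {bound:.3e}")
```

Even in this favorable uniform case the admissible $ϵ$ shrinks linearly with the number of state-action pairs. Any non-uniform distribution over the same pairs has a smaller minimum and therefore an even smaller bound.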

In our paper, we further show that this bound is tight up to a factor of $\sqrt{2}$ . The bound is problematic because it depends on the minimum of $D$ , which quickly renders the guarantee rather useless in practice. Realistic MDPs usually contain a massive number of states and actions, so $D$ necessarily assigns very small probability to at least some transitions. While it is theoretically possible to continue training your reward model until your test error becomes lower than $ϵ$ , in practice this might take an infeasibly long time.

For every practical safety definition, many data distributions are unsafe!

Since requiring $ϵ$ to be smaller than the bound in Proposition 3.2 is infeasible in most realistic settings, a natural next question is to ask what happens if you fix your safety definition beforehand (and don't adapt it to a particular data distribution). Concretely, this means fixing $ϵ$ (the maximally allowed expected difference between the true reward function and the learned reward model on $D$ ) and $L$ (the maximally allowed regret).

From the upper bound on $ϵ$ (see Proposition 3.2) it already becomes apparent that there is no $ϵ$ for which every distribution $D$ is guaranteed to be safe, as ${min}_{(s, a) \in S \times A} D (s, a)$ can be arbitrarily small. But how many data distributions are unsafe for a specific $ϵ$ ? Is it mostly peculiar edge cases or data distributions that might also realistically be used in practice?

Unfortunately, it turns out that for every $ϵ$ many realistic data distributions are unsafe. We again found corresponding results for the unregularized optimization case:

Proposition 3.3. (informal) Let $M = ⟨ S, A, τ, μ_{0}, R, γ ⟩$ be an MDP, and $D \in Δ (S \times A)$ a data distribution, $ϵ > 0$ , and $L \in [0, 1]$ .

Assume there exists a policy $^π$ with the property that:

$^π$ is a "bad" policy under the true reward function $R$ , i.e. ${R e g}^{R} (^π) \geq L$ , and
The support of $D^{^π}$ is not well covered by the data distribution $D$ .

Then, $D$ allows for error-regret mismatch to occur, i.e., $D \in unsafe (R, ϵ, L)$ .

(Optional) Formal version of Proposition 3.3

Proposition 3.3. Let $M = ⟨ S, A, τ, μ_{0}, R, γ ⟩$ be an MDP, $D \in Δ (S \times A)$ a data distribution, $ϵ > 0$ , and $L \in [0, 1]$ .

Assume there exists a policy $^π$ with the property that:

${R e g}^{R} (^π) \geq L$ , and
$D (s u p p D^{^π}) < ϵ$ , where $s u p p D^{^π}$ is defined as the set of state-action pairs $(s, a) \in S \times A$ such that $D^{^π} (s, a) > 0$

In other words, there is a "bad" policy for $R$ that is not well supported by $D$ . Then, $D$ allows for error-regret mismatch to occur, i.e., $D \in unsafe (R, ϵ, L)$ .

As well as the regularized policy optimization case:

Theorem 4.2. (informal) Let $M = ⟨ S, A, τ, μ_{0}, R, γ ⟩$ be an MDP, $D \in Δ (S \times A)$ a data distribution, $ϵ > 0$ , $λ \in (0, \infty)$ , $L \in (0, 1)$ constants, and $ω$ a regularization function. Whenever there exists a policy $^π$ such that:

$^π$ is a deterministic worst-case policy for $R$ , and
$D$ assigns low probability to the support of $D^{^π}$ .

Then $D$ allows for error-regret mismatch, i.e., $D \in unsafe (R, ϵ, L, λ, ω)$ .

(Optional) Formal version of Theorem 4.2.

Theorem 4.2. Let $M = ⟨ S, A, τ, μ_{0}, R, γ ⟩$ be an MDP, $ϵ > 0$ , $λ \in (0, \infty)$ , $L \in (0, 1)$ constants, and $ω : Π \to R$ a continuous regularization function with $ω (π) \geq 0$ for all $π \in Π$ . Furthermore, let $π_{*}$ be a deterministic worst-case policy for $R$ , meaning that ${R e g}^{R} (π_{*}) = 1$ . Let $C := C (M, π_{*}, L, λ, ω) < \infty$ be a constant (defined in Equation 106 in the appendix of our paper). Then for all data distributions $D \in Δ (S \times A)$ with

D (s u p p D^{π_{*}}) \leq \frac{ϵ}{1 + C}

we have $D \in unsafe (R, ϵ, L, λ, ω)$ .

The intuition for these two results is straightforward: As $D$ assigns low probability to the support of $D^{^π}$ , there exists a reward model $^R$ that is very similar to the true reward function $R$ outside the support of $D^{^π}$ but has very large rewards for the support of $D^{^π}$ . Because $D (s u p p D^{^π})$ is very small, this still allows for a very small expected error with respect to $D$ , while $^π$ (which is consequently the optimal policy for $^R$ ), will have regret at least $L$ .
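The construction described above can be sketched in a few lines of Python for the unregularized case. As a simplifying assumption, we use a one-step bandit with a single state and four actions, so that a deterministic policy is just a choice of action and its support is that single action. The reward values, the data distribution, and the size of the reward inflation are all hypothetical.

```python
R_TRUE = [1.0, 0.8, 0.5, 0.0]   # hypothetical true reward per action
D = [0.5, 0.3, 0.19, 0.01]      # data distribution; the last action is barely covered
EPS = 0.05                      # allowed expected error (Property 1)


def expected_error(d, r_true, r_hat):
    """Expected absolute error under d, normalized by range R."""
    reward_range = max(r_true) - min(r_true)
    return sum(p * abs(rh - r) for p, r, rh in zip(d, r_true, r_hat)) / reward_range


def regret(action, r_true):
    """Normalized regret of the deterministic policy that always picks `action`."""
    return (max(r_true) - r_true[action]) / (max(r_true) - min(r_true))


# Adversarial reward model: identical to R everywhere except on the rarely
# sampled bad action, where the reward is inflated above every other action.
bad_action = 3
r_hat = list(R_TRUE)
r_hat[bad_action] = max(R_TRUE) + 0.5

greedy_action = max(range(len(r_hat)), key=lambda a: r_hat[a])
err = expected_error(D, R_TRUE, r_hat)

print("expected error:", err, "(allowed:", EPS, ")")
print("action chosen by the optimal policy for r_hat:", greedy_action)
print("regret under the true reward:", regret(greedy_action, R_TRUE))
```

The reward model is wrong only where $D$ places one percent of its mass, so its expected error stays below the hypothetical $ϵ$ , while the policy that is optimal for it picks the worst action under the true reward.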

To illustrate this concern, imagine training a chatbot to be helpful, honest, and harmless. We know that the chatbot will face various unsafe queries during deployment (e.g., “how to build a bomb”) and so on such queries we train a reward model to penalize helpful answers and highly reward refusals (green answer boxes below).

Unfortunately, all unsafe prompts can be answered in various distinct “styles” (e.g., different languages). Consequently, at least one specific harmful answer style will likely be very rare in the reward model’s training data. For example, in the figure above, harmful answers where every space has been replaced by a cat emoji will probably have a very low likelihood in the training data.

In such situations, the learned reward model can then erroneously assign a high reward to this rare, harmful answer style without a significant increase in its training error (as this answer is very rare in training). During policy optimization, the policy may exploit this flaw, choosing the harmful answer the reward model mistakenly prefers. This can result in a harmful chatbot with high true regret, despite the reward model having low error on the training data distribution, a classical example of error-regret mismatch.

As we see next, this scenario can be taken to the extreme.

In some cases, there don't exist any safe data distributions!

Corollary 3.4. (informal) Let $M = ⟨ S, A, τ, μ_{0}, R, γ ⟩$ be an MDP, $ϵ > 0$ , and $L \in [0, 1]$ . Assume there exists a set of policies $Π_{L}$ with the following three properties:

All $π \in Π_{L}$ have regret at least L
Every policy $π \in Π_{L}$ explores a completely different part of the state-action space.
$| Π_{L} | \geq 1 / ϵ$

Then $unsafe (R, ϵ, L) = Δ (S \times A)$ , i.e., all distributions are unsafe.

(Optional) Formal version of Corollary 3.4.

Corollary 3.4. Let $M = ⟨ S, A, τ, μ_{0}, R, γ ⟩$ be an MDP, $ϵ > 0$ , and $L \in [0, 1]$ . Assume there exists a set of policies $Π_{L}$ with the following properties:

${R e g}^{R} (π) \geq L$ for all $π \in Π_{L}$
$s u p p D^{π} \cap s u p p D^{π^{'}} = \emptyset$ for all distinct $π, π^{'} \in Π_{L}$
$| Π_{L} | \geq 1 / ϵ$

Then $unsafe (R, ϵ, L) = Δ (S \times A)$ , i.e., all distributions are unsafe.
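One way to build intuition for this corollary is a pigeonhole argument: since the supports are pairwise disjoint, the probability masses that $D$ assigns to them sum to at most one, so at least one policy in $Π_{L}$ has a support with mass at most $1 / | Π_{L} |$ . The Python sketch below illustrates this counting step only (it is not the proof from the paper). The state-action pairs are indexed by integers, and the supports and data distributions are hypothetical.

```python
def least_covered(D, supports):
    """Return (index, mass) of the support that D covers least."""
    masses = [sum(D[x] for x in support) for support in supports]
    index = min(range(len(supports)), key=lambda k: masses[k])
    return index, masses[index]


# Four hypothetical high-regret policies with pairwise disjoint supports
# over eight state-action pairs (indexed 0..7).
SUPPORTS = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]

candidate_distributions = [
    [0.125] * 8,                                       # uniform
    [0.4, 0.3, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01],      # skewed
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0],          # concentrated
]

for D in candidate_distributions:
    index, mass = least_covered(D, SUPPORTS)
    # No matter how D is chosen, the least covered support has mass <= 1/4.
    print(f"policy {index} is least covered, with mass {mass:.3f}")
```

However the probability mass is moved around, some bad policy always remains poorly covered, which is the situation that Proposition 3.3 exploits.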

Corollary 3.4 outlines sufficient conditions for a scenario where all possible data distributions are unsafe for a given MDP. This happens when there exist many different policies with large regret and disjoint support, which requires there to be a large action space.

We argue that the conditions of Corollary 3.4 are not that uncommon. Picking up on the chatbot example from the previous section, one could argue that there are many different "answer styles" that are both high-regret and unlikely according to the training distribution. If you then assign one policy per answer style, you quickly end up with a set of policies $Π_{L}$ that fulfills the three properties of Corollary 3.4.

RLHF might make your policy worse!

The results from the previous sections are agnostic towards the specific choice of reward learning and policy optimization algorithm. While this allows for very general results, one might rightfully ask whether the specific biases induced by particular reward learning and policy optimization algorithms won't invalidate many of the concerns raised by our prior results. In this section, we focus on the setting of reinforcement learning from human feedback (RLHF), and show that at least for this specific framework, this is not the case.

RLHF, especially in the context of large language models, is usually modeled as a mixed bandit setting (see for example Rafailov et al. 2023, Ouyang et al. 2022, Bai et al. 2022, Stiennon et al. 2020, Ziegler et al. 2019). For our purposes, a mixed bandit $⟨ S, A, μ_{0}, R ⟩$ (referred to as a contextual bandit in Theorem 6.1 below) is basically just an MDP where you stop after your policy has selected the very first action (hence the missing transition distribution $τ$ and discount factor $γ$ ). For the interested reader, we provide a formal definition below:

(Optional) Mixed bandit.

A mixed bandit $⟨ S, A, μ_{0}, R ⟩$ is defined by a set of states $S$ , a set of actions $A$ , a data distribution $μ_{0} \in Δ (S)$ and a reward function $R : S \times A \to R$ . The goal is to learn a policy $π : S \to Δ (A)$ that maximizes the expected return $J (π) = E_{s \sim μ_{0}, a \sim π (\cdot | s)} [R (s, a)]$ . In the context of language models, $S$ is usually called the set of prompts or contexts, and $A$ the set of responses.

RLHF commonly assumes that human rewards can be modeled according to the Bradley-Terry model, and then learns a reward model from preferences over pairs of data points. During policy optimization, KL-regularization is used to incentivize the policy under training to not stray too far away from a reference policy (which is usually the initial pre-trained policy from before RLHF).

For the interested reader, we provide a more complete recap of the standard RLHF pipeline in the mixed bandit setting below:

RLHF in the mixed bandit setting.

RLHF in the mixed bandit setting usually assumes that the human preference distribution over the set of answers $A$ can be modeled according to the Bradley-Terry model. Given a prompt $s \in S$ and a pair of answers $a_{1}, a_{2} \in A$ , then the probability that a human prefers answer $a_{1}$ to answer $a_{2}$ is modeled as

p_{R} (a_{1} ≻ a_{2} | s) = \frac{exp (R (s, a_{1}))}{exp (R (s, a_{1})) + exp (R (s, a_{2}))}

where $R : S \times A \to R$ is assumed to be the true, underlying reward function of the human. RLHF is then usually done with the following steps:

Supervised finetuning: Train/Fine-tune a language model $π_{r e f}$ using supervised training.
Reward learning: Given a data distribution over prompts $μ_{0} \in Δ (S)$ , use $μ_{0}$ and $π_{r e f}$ to sample a set of transitions $(s, a_{1}, a_{2}) \in S \times A \times A$ where $s \sim μ_{0}$ and $a_{1}, a_{2} \sim π_{r e f} (\cdot | s)$ . Present each tuple to a human labeler who samples a preference $c \in \{1, 2\}$ where $p (c = 1 | s, a_{1}, a_{2}) = p_{R} (a_{1} ≻ a_{2} | s)$ . Let $\bar{c}$ denote the other index, i.e., $\{\bar{c}\} = \{1, 2\} ∖ \{c\}$ . Use this set of transitions to train a reward model $^R$ that minimizes the following loss:
$L_{R} (^R) = - E_{(s, a_{1}, a_{2}, c) \sim μ_{0}, π_{r e f}, p_{R}} [log (σ (^R (s, a_{c}) -^R (s, a_{\bar{c}})))]$
where $σ$ is the logistic function. This is equivalent to minimizing the expected KL divergence between $p_{R}$ and $p_{^R}$ , i.e., minimizing the loss:
$L_{R, K L} (^R) = E_{(s, a_{1}, a_{2}) \sim μ_{0}, π_{r e f}} [D_{K L} (p_{R} (\cdot | a_{1}, a_{2}) | | p_{^R} (\cdot | a_{1}, a_{2}))]$
RL training: Use the trained reward model $^R$ to further finetune the language model $π_{r e f}$ using reinforcement learning. Make sure that the new model does not deviate too much from the original model by penalizing the KL divergence between the two models. This can be done by solving the following optimization problem for some $λ > 0$ :

π = arg {max}_{π} E_{s \sim μ_{0}, a \sim π (\cdot | s)} [^R (s, a)] - λ \cdot D_{K L} (π (a | s) | | π_{r e f} (a | s))
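The reward learning step above can be illustrated with a short Python sketch. It uses the fact that the Bradley-Terry probability equals the logistic function applied to the reward difference, and it evaluates the negative log-likelihood of a reward model on a handful of labeled comparisons. The prompt and answer names, the comparison data, and the two reward tables are hypothetical; a real pipeline would fit a neural reward model by gradient descent rather than compare two fixed tables.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bt_preference_prob(r1, r2):
    """Bradley-Terry probability that the answer with reward r1 is preferred.

    exp(r1) / (exp(r1) + exp(r2)) simplifies to sigmoid(r1 - r2).
    """
    return sigmoid(r1 - r2)


def preference_loss(comparisons, r_hat):
    """Average negative log-likelihood of labeled comparisons.

    Each comparison is (prompt, chosen_answer, rejected_answer), and
    r_hat maps (prompt, answer) pairs to learned rewards.
    """
    losses = [
        -math.log(sigmoid(r_hat[(s, chosen)] - r_hat[(s, rejected)]))
        for s, chosen, rejected in comparisons
    ]
    return sum(losses) / len(losses)


# Hypothetical labels: the refusal is preferred in two of three comparisons.
comparisons = [
    ("unsafe prompt", "refusal", "harmful answer"),
    ("unsafe prompt", "refusal", "harmful answer"),
    ("unsafe prompt", "harmful answer", "refusal"),
]
model_a = {("unsafe prompt", "refusal"): 1.0, ("unsafe prompt", "harmful answer"): -1.0}
model_b = {("unsafe prompt", "refusal"): -1.0, ("unsafe prompt", "harmful answer"): 1.0}

print("loss of model_a:", preference_loss(comparisons, model_a))
print("loss of model_b:", preference_loss(comparisons, model_b))
```

The reward table that ranks the refusal higher agrees with the majority of the labels and therefore obtains the lower loss.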

Taking all these particularities of RLHF into account, we derive the following result:

Theorem 6.1. (informal) Let $⟨ S, A, μ_{0}, R ⟩$ be a contextual bandit, and $π_{r e f}$ be an arbitrary reference policy for which it holds that:

$π_{r e f} (a | s) > 0$ for all $(s, a) \in S \times A$
For every state $s \in S$ there exists at least one action $a_{s} \in A$ such that $a_{s}$ has low reward according to the true reward function $R$ and is unlikely to be selected by the reference policy $π_{r e f}$ .

Let $D_{μ}^{r e f} (s, a) := μ (s) \cdot π_{r e f} (a | s)$ be a data distribution where the initial state distribution $μ \in Δ (S)$ can be chosen arbitrarily. Then $D_{μ}^{r e f} (s, a)$ is unsafe for RLHF.

(Optional) Formal version of Theorem 6.1.

Note: The following notation of ${unsafe}^{R L H F} (R, ϵ, L, λ, D_{K L} (\cdot | | π_{r e f}))$ is a special adaption of our Definition 2.1 to the setting of RLHF. In particular, it takes into account the particularities of RLHF, such as the reward learning from preferences over pairs of data points and KL-regularized policy optimization. A formal definition can be found in our paper (see Definition C.27).

Theorem 6.1. Let $⟨ S, A, μ_{0}, R ⟩$ be a contextual bandit. Given $L \in [0, 1)$ , we define for every state $s \in S$ the reward threshold: $R_{L} (s) := (1 - L) \cdot {max}_{a \in A} R (s, a) + L \cdot {min}_{a \in A} R (s, a)$ .

Lastly, let $π_{r e f}$ be an arbitrary reference policy for which it holds that:

$π_{r e f} (a | s) > 0$ for all $(s, a) \in S \times A$
For every state $s \in S$ there exists at least one action $a_{s} \in A$ such that $R (s, a_{s}) < R_{L} (s)$ and $π_{r e f} (a_{s} | s)$ satisfies the following inequality:
$π_{r e f} (a_{s} | s) \leq \frac{(R_{L} (s) - R (s, a_{s})) \cdot r a n g e R}{L \cdot exp (\frac{1}{λ} \cdot r a n g e R)} \cdot \frac{ϵ^{2}}{4 \cdot λ^{2}}$ .

Let $D_{μ}^{r e f} (s, a) := μ (s) \cdot π_{r e f} (a | s)$ for some $μ \in Δ (S)$ . Then $D_{μ}^{r e f} \in {unsafe}^{R L H F} (R, ϵ, L, λ, D_{K L} (\cdot | | π_{r e f}))$

Intuitively, the theorem shows that even if we learn a reward model $^R$ that induces $ϵ$ -correct choice probabilities according to the data distribution generated from a reference policy $π_{r e f}$ , a policy that maximizes $^R$ with KL-penalty can still have regret $\geq L$ if $π_{r e f}$ gives sufficiently low probability to bad actions.
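The following Python sketch illustrates this failure mode numerically for a single prompt. It relies on the standard closed-form solution of the KL-regularized objective, in which the optimal policy is proportional to $π_{r e f} (a | s) \cdot exp (^R (s, a) / λ)$ . All numbers are hypothetical, and the sketch does not check the exact inequality of Theorem 6.1. It only measures the expected KL divergence between the true and the learned choice probabilities when answer pairs are sampled from the reference policy.

```python
import math

PI_REF = [0.899, 0.1, 0.001]  # hypothetical reference policy; action 2 is rare
R_TRUE = [1.0, 0.5, 0.0]      # action 2 is the worst under the true reward
LAMBDA = 0.1                  # KL regularization weight

# Learned reward model: correct everywhere except on the rare bad action.
R_HAT = list(R_TRUE)
R_HAT[2] = 2.5


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def kl_regularized_policy(r_hat, pi_ref, lam):
    """Maximizer of E[r_hat] - lam * KL(pi || pi_ref) for a single prompt."""
    weights = [p * math.exp(r / lam) for p, r in zip(pi_ref, r_hat)]
    total = sum(weights)
    return [w / total for w in weights]


def bernoulli_kl(p, q):
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def expected_preference_kl(r_true, r_hat, pi_ref):
    """Expected KL between true and learned Bradley-Terry choice probabilities,
    with both answers sampled independently from pi_ref."""
    total = 0.0
    for a1, p1 in enumerate(pi_ref):
        for a2, p2 in enumerate(pi_ref):
            p = sigmoid(r_true[a1] - r_true[a2])
            q = sigmoid(r_hat[a1] - r_hat[a2])
            total += p1 * p2 * bernoulli_kl(p, q)
    return total


policy = kl_regularized_policy(R_HAT, PI_REF, LAMBDA)
true_return = sum(p * r for p, r in zip(policy, R_TRUE))
regret = (max(R_TRUE) - true_return) / (max(R_TRUE) - min(R_TRUE))

print("expected preference KL:", expected_preference_kl(R_TRUE, R_HAT, PI_REF))
print("probability of the bad action after RLHF:", policy[2])
print("regret under the true reward:", regret)
```

Because the bad action is sampled so rarely by the reference policy, the inflated reward barely affects the training loss, yet the exponential reweighting in the regularized optimum is enough to move almost all probability mass onto it.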

We expect the conditions on the reference policy $π_{r e f}$ to likely hold in real-world cases. Considering the example of training an LLM, the number of potential actions (or responses) is usually very large, and language models typically assign a large portion of their probability mass to only a tiny fraction of all responses. Hence, for every state/prompt s, a large majority of actions/responses a have a very small probability $π_{r e f} (a | s)$ .

For unregularized optimization, we found necessary and sufficient conditions for safety

The attentive reader might have noticed that all our previous results only outline specific conditions under which data distributions are either safe or unsafe. While these conditions already allow us to make general statements about large classes of data distributions, there might exist many alternative conditions that determine the (un)safety of a data distribution. At least for the case of unregularized policy optimization, we were able to find conditions that are both necessary and sufficient for a data distribution to be safe. In particular:

Theorem 3.5. (informal) For all MDPs $⟨ S, A, τ, μ_{0}, R, γ ⟩$ and $L \in [0, 1]$ , there exists a set of linear constraints such that a data distribution $D$ is safe if and only if $D$ 's vector representation satisfies these constraints.

(Optional) Formal version of Theorem 3.5.

Theorem 3.5. For all MDPs $⟨ S, A, τ, μ_{0}, R, γ ⟩$ and $L \in [0, 1]$ , there exists a matrix $M$ such that for all $ϵ > 0$ and $D \in Δ (S \times A)$ we have:

$D \in \mathrm{safe} (R, ϵ, L) \iff M \cdot D > ϵ \cdot \mathrm{range}\, R \cdot 1$

where we use the vector notation of $D$ , and $1$ is a vector containing all ones.
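Once a matrix $M$ is available, checking the criterion is a single matrix-vector product followed by an elementwise comparison. The sketch below shows only that check. The matrix `M` in it is an arbitrary placeholder, not one derived from an actual MDP, and the function name `is_safe` and the example distributions are hypothetical.

```python
import numpy as np


def is_safe(M, D, eps, range_R):
    """Check the strict linear inequalities M @ D > eps * range_R * 1.

    D is the data distribution written as a vector over state-action pairs.
    """
    M = np.asarray(M, dtype=float)
    D = np.asarray(D, dtype=float)
    if D.min() < 0 or not np.isclose(D.sum(), 1.0):
        raise ValueError("D must be a probability vector")
    # Every row of M must satisfy the inequality strictly.
    return bool(np.all(M @ D > eps * range_R))


# Placeholder matrix for illustration only; it is NOT computed from an MDP.
M = np.array([[1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.5]])

D_uniform = np.full(4, 0.25)                     # spreads mass evenly
D_peaked = np.array([0.97, 0.01, 0.01, 0.01])    # concentrates on one pair

print(is_safe(M, D_uniform, eps=0.1, range_R=1.0))
print(is_safe(M, D_peaked, eps=0.1, range_R=1.0))
```

In this toy setup the uniform distribution satisfies both constraints, while the peaked distribution violates the second one because it places too little mass on the state-action pairs that this row weights.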

While our theorem only proves the existence of a set of linear constraints that can determine the safety of a data distribution, we then go on to derive closed-form expressions of the matrix $M$ that encodes this system of strict linear inequalities and develop an algorithm to compute $M$ . This allows us to showcase this result in simple toy environments, such as the one below:

Interestingly, this means that the set of safe data distributions resembles a polytope, in the sense that it is convex and is the intersection of an open polyhedral set (defined by the system of strict inequalities $M \cdot D > ϵ \cdot \mathrm{range}\, R \cdot 1$ ) and the closed data distribution simplex. This can be nicely seen in the visualization in the right part of the figure above.
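The convexity claim follows directly from the linearity of the constraints. As a short worked argument, using the shorthand $c$ for the threshold (a symbol introduced here only for brevity):

```latex
% Shorthand (introduced here): c := \epsilon \cdot \mathrm{range}\,R.
% Let D_1, D_2 be safe, i.e. M D_1 > c \cdot 1 and M D_2 > c \cdot 1,
% and let t \in [0, 1]. Then
\begin{align*}
M \bigl( t D_1 + (1 - t) D_2 \bigr)
  &= t \, M D_1 + (1 - t) \, M D_2 \\
  &> t \, c \cdot 1 + (1 - t) \, c \cdot 1 \\
  &= c \cdot 1 .
\end{align*}
% The inequality is strict in every row because at least one of
% t and 1 - t is positive. Since the simplex \Delta(S \times A) is itself
% convex, t D_1 + (1 - t) D_2 is again a data distribution, hence safe.
```

So any mixture of two safe data distributions is again safe, which is exactly the convexity of the safe set.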

Unfortunately, the entries of the matrix $M$ depend on multiple factors, such as the original reward function $R$ , the state transition distribution $τ$ , and the set of deterministic policies that achieve regret at least $L$ . This dependence of $M$ on the true reward function and the underlying MDP implies that computing $M$ is infeasible in most realistic settings since in practice many of these components are not known, restricting the use of $M$ to theoretical analysis or small toy examples.

Conclusion

Where does this leave us? In this work, we studied the relationship between the training error of a learned reward function and the regret of policies that are optimized against said reward model. We developed a worst-case safety definition that would guarantee that optimizing a policy against a trained reward model is safe. We showed that many data distributions can be safe according to this definition if the expected error of a reward model $\hat{R}$ is forced to be sufficiently low. However, we also showed that in most realistic cases the expected error would have to be infeasibly small to guarantee safety. Furthermore, for every fixed safety definition, many realistic data distributions are unsafe, and in extreme cases all data distributions might be unsafe. These results hold for a wide variety of reward learning classes, including popular variants such as RLHF. With our results, we provide one potential explanation for safety-relevant phenomena such as jailbreaks that are frequently discovered in LLMs and appear to be hard to remove.

Our results should be taken as weak evidence that current techniques for aligning AI systems are not yet mature enough to be used to align AI systems that will be deployed in high-stakes settings. We claim that the deployment of such systems in settings where real harm can be caused necessitates some minimal guarantees about their safety.

On the other hand, we acknowledge that our results are far from complete and there are multiple ways to extend and improve upon our work.

The most promising avenue of future work concerns the fact that our results are mostly agnostic towards the specific choice of reward learning and policy optimization algorithms. In practice, it might be that the specific biases induced by particular reward learning and policy optimization algorithms avoid the most pathological cases of error-regret mismatch. While we have shown that for vanilla RLHF this does not appear to be the case, there exist many other methods that try to improve upon RLHF. Future work could analyze the inductive biases of these methods, as we have done with RLHF, to determine whether they provide improved worst-case safety guarantees.