Greedy-Advantage-Aware RLHF

sej2020

Greedy-Advantage-Aware RLHF addresses the negative side effects from misspecified reward functions problem in language modeling domains. In a simple setting, the algorithm improves on traditional RLHF methods by producing agents that have a reduced tendency to exploit misspecified reward functions. I also detect the presence of sharp parameter topology in reward hacking agents, which suggests future research directions. The repository for the project can be found here.

Motivation

In the famous short story The Monkey's Paw by W.W. Jacobs, the White family receives a well-traveled friend of theirs, Sergeant-Major Morris, and he brings with him a talisman from his visits to India: a mummified monkey's paw. Sergeant Major Morris reveals that the paw has a magical ability to grant wishes, but cautions against using its power. The family does not heed his advice, and Mr. White uses the paw to wish for £200. The paw grants the family the wish but with dire consequences. The family receives a new visitor the next day who informs them that their son has died of a tragic workplace accident at the town factory. To their horror, the visitor presents the family with £200 from the factory owner in compensation for their loss.

The monkey's paw fulfills the goal as indicated: to bring the Whites £200. Yet, because the Whites had a more precise goal in mind, like "bring us £200 without introducing any additional suffering into the world," their stated goal did not perfectly align with their intentions. The monkey's paw took advantage of this misspecification, resulting in horrifying consequences. In the field of AI, this kind of phenomenon is called a negative side effect of a misspecified reward (Amodei et al.).

The negative side effect problem is a consequence of the more general agent behavior of reward hacking, in which an agent exploits some mistake or vulnerability in its environment to garner high reward while failing to achieve the true objective intended by system designers. Reward hacking is a widespread problem in the field of reinforcement learning (RL). This is an important challenge to address because if we want a future in which RL agents can execute tasks that we humans are unable or unwilling to do, we also would like the realization of those goals to come without unintended consequences.

The negative side effect issue partly stems from the challenge of specifying a reward signal that reliably biases the agent toward the desired final outcome (Krakovna et al.). Designing a reward function by attempting to enumerate all the conditions and constraints implicit in the real-world objective inevitably leads to oversights. Yet, any reward function that narrows down the complexity of the real-world goal will always be hackable by a sufficiently capable RL system (Skalse et al.). It may seem an irremediable situation, but there is a promising approach that sidesteps the problem altogether -- generating reward functions implicitly rather than explicitly (see IRL and RLHF).

Among these is Reinforcement Learning from Human Feedback (RLHF): an RL algorithm used in fine-tuning large language models to cultivate patterns of language that are aligned with the system designers' goals. To fine-tune an LLM using RLHF, a model is trained from a dataset comprising human rankings of generations from that LLM. That model represents a function approximating human values and subsequently can be used to allocate rewards in the fine-tuning RL loop. However, as is discussed in Casper et al., creating a model to represent human values based on human preference data is a misspecified problem, in that values are not uniform across humanity, and an individual person's values are context-dependent, convoluted, and contradictory. Resultingly, we would expect these implicitly defined reward models (RMs) to be vulnerable to exploitation like any other misspecified reward function, and this is indeed the case. Stiennon et al. show that optimizing on an RLHF reward model could lead to some generations that score highly on the reward model but poorly according to human raters, which better represent the underlying goal.

An alternative way to confront our problem is by shifting the locus of the solution from the reward function to the agent. If we don't want our agent to exploit our reward function, we could design a non-exploitable reward function, or we could design an agent that is not exploitative. RL algorithm design is relatively underexplored as a solution to the negative side effect problem compared to reward design. Among the earlier ideas in this area are Satisficing and Expected Utility Quantilization. More recently, Hadfield-Menell et al. and Turner et al. propose alternative RL formulations for agents that avoid negative side effects on misspecified reward functions, and Karwowski et al. derive an early stopping rule in agent optimization to prevent reward hacking.

Could improved RL algorithm design be applied to the natural language generation setting? Could we modify the RLHF training algorithm to produce agents with a reduced tendency to exploit a misspecified reward model? I've developed Greedy-Advantage-Aware RLHF (GAA) to approach these challenges.

The design for GAA emerges from the intuition that an agent that has found a reward-hacking policy for a real-world text generation goal has entered a sharp region in the policy space-- the agent's policy achieves a high reward relative to similar policies. Most text-generation goals include the fluent use of language to communicate relevant and coherent ideas. There is a very sophisticated relationship between the token distribution of a particular policy and success on this type of goal. We would not expect changing the frequency of generating a few particular tokens to radically improve performance on an objective function representing this goal. If an agent can drastically increase its reward with only a small change to the policy, the agent is likely exploiting a misspecified reward function for the language modeling goal. To avoid this scenario, we should discourage generating any token that appears to be a "shortcut" to high reward. GAA is a modification of the RLHF PPO loop that utilizes information about the policy distribution to deter agents from generating disproportionately high-reward tokens during training.

A Simplified Look at PPO in RLHF

Proximal Policy Optimization (PPO) is the most popular algorithm used for the RL loop in RLHF. In an RLHF PPO rollout, tokens are sampled according to the policy’s token distribution, given the preceding sequence:

with $π_{θ}$ being the policy $π$ parameterized by $θ$ . This process is repeated, each time with the sample from the previous iteration being appended to the conditional sequence, until a sequence of a specified length has been generated: $x_{1}, x_{2}, . . ., x_{n}$ . A reward function takes this sequence as input and generates one scalar reward which represents the quality of the sequence: $r_{n} = R (x_{1}, x_{2}, . . ., x_{n})$ , with $R$ denoting the reward function. The advantage function $A (x_{t})$ , a measure of how preferable the token $x_{t}$ is to a token randomly sampled from our policy, is calculated using the estimated future reward $V (x_{1}, x_{2}, . . ., x_{t})$ of that sequence. This value function $V$ is often implemented as an additional deep learning module affixed to the policy network. This extra module and the main policy network are jointly updated based on their performance on an objective function $J$ , which is largely defined by the average advantage of the sequence. The derivative of $J$ w.r.t. each parameter in $θ$ is calculated and this gradient $\nabla J$ is used to update the parameters of the network in a gradient ascent step:

$θ_{i + 1} = θ_{i} + \nabla J$ , for the $i_{th}$ optimization step

This algorithm is effective at guiding the optimization to a policy that performs well on the objective function. However, when the reward model is decoupled from the underlying goal, such as generations preferred by humans, this PPO RLHF policy will often exploit the discrepancy and end up in a reward-hacking state.

Greedy-Advantage-Aware RLHF

I wanted to design a system with the following learning behavior: if a particular token is much better than a randomly sampled token, then make the policy less likely to select it. If a particular token is only slightly better than a randomly sampled token, then make the policy more likely to select it. This encourages the type of optimization we desire: a smooth ascent toward a region in the policy space where the objective function is roughly maximal and away from shortcuts to policies where the objective is incredibly high relative to the policy neighborhood.

To preempt tokens that would perform disproportionately well on the objective function, I wanted to utilize the model's estimation of the best 'next token' for any sequence. The best place to look for this would be the token with the maximum probability under the policy distribution: ${argmax}_{x_{t}} π_{θ} (x_{t} | x_{1}, x_{2}, . . ., x_{t - 1})$ because this distribution is provided to us explicitly by the model^[1], and in the limit of optimization, the highest valued token will become the highest probability token.

At the beginning of each rollout, I observe the highest probability token from the model, or sample "greedily" from the probability distribution. I will refer to the token $x_{t}^{⋆}$ to indicate the greedily sampled token at timestep $t$ . This greedy token is simply observed for each timestep in the sequence and does not otherwise impact the rollout. I then compute the advantage function for these greedy tokens in the following way:

$A (x_{t}^{⋆}) = V (x_{1}, x_{2}, . . ., x_{t - 1}, x_{t}^{⋆}) - V (x_{1}, x_{2}, . . ., x_{t - 1})$

The objective function and resulting gradients are computed for both the greedy token advantage $A (x_{t}^{⋆})$ and the sampled token advantage $A (x_{t})$ . The gradient ascent update will then be of the following form for an iteration $i$ :

$θ_{i + 1} = θ_{i} + a \nabla J_{x} + b \nabla J_{x^{⋆}}$ with $a \geq 0, b \leq 0$

$a \nabla J_{x}$ is proportional to the conventional update, while $b \nabla J_{x^{⋆}}$ serves as a gradient descent for parameters influencing the selection probability of a greedy token that has disproportionately high advantage^[2]. This acts to make the $x^{⋆}$ token less likely to be selected in the future. $a$ and $b$ are determined by the following formulas:

$a = (1 - η) \cdot (σ + 1) + η \cdot (1 - σ)^{10}$

$b = (1 - η) \cdot (- σ) + η \cdot ((1 - σ)^{10} - 1)$

with $η$ being the probability of the greedy token selection under random sampling $π_{θ} (x_{t}^{⋆} | x_{1}, . . ., x_{t - 1})$ , and $σ$ being the greedy advantage gain $A (x_{t}^{⋆}) - A (x_{t})$ measured in standard deviations from the mean sampled advantage. $η$ and $σ$ can be multiplied by constant coefficients to change their effect on the gradient updates, but these hyperparameters are omitted for readability^[3].

$b$ is a penalty for excessive greedy advantage gain with $η$ determining a tradeoff between the linear penalty term and the harsher decay penalty term. The higher the probability of selecting the greedy token, the harsher the penalty relative to the greedy advantage gain. This is motivated by the fact that when the policy puts more probability mass on the greedy token, the expected greedy advantage gain is lower (see Appendix A).

$a$ determines the strength of the conventional RLHF update, and scales between 0 and 2 to complement the strength of the penalty $b$ . The $a$ term climbs to 2 when the probability of greedy selection is low to add back the gradient for parameters that were influential for selecting both the greedy and non-greedy token. $a$ descends to 0 in the case of mode collapse, where the policy needs to unlearn several optimization steps and back itself away from the reward hacking optima.

The gradient update coefficients $a$ and $b$ as a function of greedy advantage gain $A (x^{⋆}) - A (x)$ and the probability for selecting the greedy token $π_{θ} (x^{⋆})$

Evaluation

Main Results

Suppose your P335 Nineteenth Century Philosophy professor wants to make a chatbot for the course bulletin page to engage prospective pupils. You, of course, are recruited to build the language model. You pre-train the model on internet text and course materials and then decide you need to use RLHF to fine-tune your model, which requires the work of several human evaluators. The next day after class, you approach a group of students who have stayed late and offer to buy them a pizza if they can rate some generations from your philoso-bot. They agree, and soon you have a large human-preference dataset on which you train a reward model.

Unfortunately, the post-class conclave on which you imposed was an eternally recurring meeting of your university Nietzschean society. As a result, the training data they provided you encoded their bias, and the reward model learned to rate sequences with the word 'Nietzsche' very highly. Conventional RLHF will likely optimize for Nietzsche-obsessed agents that hack your reward model, but darn it, you spent your whole budget for the project on two large veggie pizzas. Could you use GAA to discourage your RLHF agent from exploiting the flawed reward model?

I constructed a small-scale experiment for this scenario, using agents created from GPT2-small. To represent the reward model, I used a distilBERT sentiment evaluation model that is modified so that sequences including the bonus token 'Nietzsche' receive a much higher reward.

The experiment consisted of training 50 GAA and 50 conventional RLHF agents on the exploitable BERT reward function and evaluating the agents on the analogous non-exploitable reward function. The GAA and regular RLHF agents shared values for all hyperparameters, except those unique to GAA^[4], and were always prompted with the same string: "Of the existential philosophers,". The reward model evaluates sentiment, with output typically ranging from -0.5 (very negative sentiment) to 0.5 (very positive). However, if the agent generated the word "Nietzsche" in a rollout, the reward model would give a reward of 1.0. In this way, a non-reward hacking model could maximize sentiment at 0.5, while a reward hacking model could learn to always achieve a score of 1.0. The results are aggregated below, and raw results are available here ^[5].

Conventional RLHF agents earned reward well over 0.5 during training, indicating that these models generally learned to exploit the bonus token. In contrast, the GAA agents scored below 0.5 for the majority of training, while only ascending into the reward hacking zone during the end of optimization. Additionally, the KL divergence between the distributions of GAA generations and pre-trained model generations was much lower than the baseline, and there was no significant difference in entropy between GAA and conventional RLHF distributions.

The final agents were evaluated on a non-exploitable reward function-- the sentiment classifier without the bonus token-- and there was no statistically significant difference between the GAA and baseline models (difference: -0.00197 $\pm$ 0.02472). Taken together with the training reward results, this indicates that GAA RLHF can mitigate reward hacking behavior without sacrificing performance.

One could speculate that if the evaluation function in this experiment punished redundancy like a human evaluator would^[6], conventional RLHF would have fared much worse on the evaluation function, given the relative presence of the bonus word in the agent generations. 'Nietzsche' was present only 26% of the time in GAA generations, while in the generations of regular RLHF agents, 'Nietzsche' was present 85% of the time.

Sharpness

GAA was motivated by the intuition that RL agents displaying reward-hacking behavior have often found policies that achieve a sharp increase on the reward function relative to similar policies. Thinking about sharpness in the policy space does call to mind a similar idea: the relationship between generalization and loss topology in the parameter space.

In the literature, there is an empirical link between sharpness of the parameter landscape w.r.t. the loss/objective and poor network generalization. It is observed that these two network properties co-occur quite frequently. The common explanation for this phenomenon relates to the principle of minimum description length (MDL), which suggests that less complex models of a data distribution tend to generalize better than their more complex counterparts (Rissanen). Since the local functions that generate flat minima can be described with less complexity than functions that generate sharp minima, they are often better models of the broader distribution and so should generalize better (Keskar et al.).

I have some reason to suspect that sharpness may be present in the exploitative agents in my experiment. Pan et al. demonstrate that reward hacking behavior tends to increase with various indicators of model capability, including, most saliently, the number of parameters. Since reward hacking and parameter space sharpness both have an empirical connection with model complexity, perhaps they also coincide frequently.

Despite that, it is not obvious to me that my intuition regarding reward hacking and sharpness in the policy space leads directly to sharpness in the parameter space. I am using the term 'policy space' to refer to policies in $R^{| S | \times | A | + 1} \in [0, 1]$ , where each dimension corresponds to the probability of an action in a certain state, plus one dimension for the objective value. The parameter space is the set of networks in $R^{| θ | + 1}$ , where each dimension corresponds to a parameter, plus one dimension for the objective value. The parameter space is an extremely complex non-linear transformation of the policy space, so even if reward hacking behavior did manifest as sharpness in the policy space w.r.t. the objective, the geometry may be too distorted in the parameter space to detect. The connection between parameters and policy is likely strongest in the last layer of the network, as this is where the network ultimately determines the distribution over actions.

To determine if the exploitative agents from my experiments reach sharp regions of the parameter space, I leveraged the Hessian matrix of the RLHF training objective. The eigenvectors and eigenvalues of the Hessian of a loss/objective function reveal the directions of rapid curvature change in the objective topology. According to a routine from the PyHessian library, I perturbed the weights of the network along the top eigenvector of the Hessian and recorded the performance of these new networks on the objective function. By plotting these scores and perturbation magnitudes, you can visualize a cross-section of the parameter landscape along the direction of its steepest curvature change.

I perform this routine for 10 conventional RLHF agents on both an exploitable and non-exploitable BERT reward function. In contrast to the experiment from the last section, the exploitable reward function has no upper limit, and the reward increases by 1.0 each time the bonus word is observed in the sequence. The agents trained on this exploitable reward function do learn to hack the model and they end up generating the bonus token incessantly.

Parameter landscape cross-section for agents trained on a non-exploitable reward function

Parameter landscape cross-section for agents trained on an exploitable reward function

We can observe that the curvature of the objective in the immediate parameter space is quite similar for both hacking and non-hacking agents. However, when we 'zoom out' with larger parameter perturbations, we see that the reward hacking agents have been optimized to a mountain-like feature in the parameter space, while the non-hacking agents settle on smaller hills^[7].

For last-layer-only perturbation, the objective topography is starkly different between the two groups of agents. In the non-hacking agents, the local maxima appear to be quite flat and a small perturbation barely affects the objective value. The hacking agents, however, find sharper topography more like a mountain or a cliff, with the objective falling off dramatically to one side. Interestingly, a much larger proportion of the change in the objective function value can be attributed to the last layer of reward hacking agents than the non-hacking agents. This means that much of the sharpness in the direction of steepest change can be attributed to the dimensions corresponding to action selection in the hacking agents.

Efficiency

GAA RLHF requires an extra forward pass for each batch of generations in an epoch and an extra gradient calculation for each epoch, so some reduction in efficiency is anticipated. I've evaluated the training time of GAA relative to conventional RLHF on a Tesla P100 GPU for varying batch sizes and output sequence lengths. At this scale, the ratio of GAA runtime to conventional RLHF runtime grows slower than ${log}_{2} n$ for both batch size and sequence length.

output length		batch size
	8	16	32
4	1.52	1.69	1.95
8	1.71	1.78	1.71
16	1.59	1.74	2.10
32	1.69	2.09	2.42

Discussion

PPO in conventional RLHF penalizes large updates, but it is agnostic as to the direction of the updates. In this way, it improves stability over previous algorithms, but it does nothing to prevent models from progressively learning to hack misspecified reward functions. In contrast, GAA appears to have a mitigating effect on the exploitation of misspecified reward functions in these experiments. By penalizing tokens with disproportionately high advantage, the model can optimize in the direction of broad, non-hacking optima while resisting the attraction of reward-hacking states. In my experiments, the benefits of GAA do not trade off with capability, which is a surprising result for any optimization technique that penalizes high performance on an objective function. These results indicate that this technique could potentially be combined with compatible reward design strategies to combat negative side effects from misspecified reward functions.

To my knowledge, detecting the coincidence of reward hacking and sharp maxima in the parameter space is a novel result. I think this potential connection should be evaluated using other Hessian-based analysis methods and in more complex environments. Replication of this result in diverse RL settings could motivate using methods like SAM in RL to mitigate reward hacking.

Limitations and Future Work

The experiments presented here are proof-of-concept, so the results are merely suggestive of real-world viability. The ways in which real-life reward models come apart from human preferences are likely much more complex than the representation of misspecification in these experiments. Additional experiments should be conducted with larger models, with a real human preference reward model, and at proper scale. Some interpretability experiments could be performed to justify the function for the conventional update coefficient, $a$ , as I've built in several assumptions about overlapping parameters in greedy and non-greedy gradient updates. There is both theoretical and empirical work yet to be done to investigate the relationship between sharpness in the parameter space, capability generalization, and reward hacking. Lastly, there are opportunities to extend the GAA framework by, for example, computing advantages for $n$ -grams of tokens rather than single tokens, or modifying the model architecture to provide the explicit value distribution over tokens.

I've exhibited Greedy-Advantage-Aware RLHF as an alternative to conventional RLHF in the language modeling domain. GAA is designed to avoid creating agents that take advantage of misspecified reward functions. I'm hopeful this RL algorithm and other findings from the project can help alleviate the negative side effects of reward misspecification problem in future AI systems.

Acknowledgements

My RLHF implementation is based on the materials in Callum McDougall's ARENA course. I would like to thank my cohort at BlueDot Impact's AI Safety Fundamentals course for feedback during the early stages of this project, as well as Joshua Elms of Indiana University, and Ryan Lingle of Groundwork Inc. for feedback on the blogpost.

Appendix

It is necessary to identify the expected difference between the advantage of the greedy tokens and the advantage of sampled tokens because this is the baseline by which disproportionately high advantage can actually be defined.

Note: I'll refer to $V (x_{1}, x_{2}, . . ., x_{t - 1}, x_{t})$ as $V (x_{t})$ from now on.

$E [A (x_{t}^{⋆}) - A (x_{t})] = E [A (x_{t}^{⋆})] - E [A (x_{t})]$

$= E [A (x_{t}^{⋆})]$

The $E [A (x_{t})]$ term drops out because the expected advantage of a randomly sampled token over another randomly sampled token is 0.

$E [A (x_{t}^{⋆})] = E [V (x_{t}^{⋆}) - V (x_{t - 1})]$ $= V (x_{t}^{⋆}) - \sum_{x_{t} \in X_{t}} V (x_{t}) \cdot π_{θ} (x_{t} | x_{1}, x_{2}, . . ., x_{t - 1})$

$= V (x_{t}^{⋆}) - V (x_{t}^{⋆}) \cdot π_{θ} (x_{t}^{⋆} | x_{1}, . . ., x_{t - 1}) - \sum_{x_{t} \neq x_{t}^{⋆} \in X_{t}} V (x_{t}) \cdot π_{θ} (x_{t} | x_{1}, . . ., x_{t - 1})$

$= V (x_{t}^{⋆}) \cdot (1 - π_{θ} (x_{t}^{⋆} | x_{1}, . . ., x_{t - 1})) - \sum_{x_{t} \neq x_{t}^{⋆} \in X_{t}} V (x_{t}) \cdot π_{θ} (x_{t} | x_{1}, . . ., x_{t - 1})$

$= \sum_{x_{t} \neq x_{t}^{⋆} \in X_{t}} V (x_{t}^{⋆}) \cdot π_{θ} (x_{t} | x_{1}, . . ., x_{t - 1}) - \sum_{x_{t} \neq x_{t}^{⋆} \in X_{t}} V (x_{t}) \cdot π_{θ} (x_{t} | x_{1}, . . ., x_{t - 1})$

$= \sum_{x_{t} \neq x_{t}^{⋆} \in X_{t}} V (x_{t}^{⋆}) \cdot π_{θ} (x_{t} | x_{1}, . . ., x_{t - 1}) - V (x_{t}) \cdot π_{θ} (x_{t} | x_{1}, . . ., x_{t - 1})$

$= \sum_{x_{t} \neq x_{t}^{⋆} \in X_{t}} (V (x_{t}^{⋆}) - V (x_{t})) \cdot π_{θ} (x_{t} | x_{1}, . . ., x_{t - 1})$

This term would increase if either a.) the difference in the values of greedy and non-greedy samples $V (x_{t}^{⋆}) - V (x_{t})$ were larger, which should come as no surprise, or if the sum of the probabilities of the non-greedy token selections were increased.

This tells us that the expected advantage gain by taking the greedy token is greater when the greedy token has less probability mass (a 'flatter' probability distribution). Therefore, we should penalize greedy advantage gain less when we are less likely to randomly sample the greedy token because the relative difference between a particular observation of greedy advantage gain and the expected greedy advantage gain is much smaller. Conversely, we should penalize greedy advantage gain more when the policy puts more probability mass on the greedy token because the relative distance from expectation is much larger. High probability mass on the greedy token is mode collapse in the extreme case, so it follows our intuition that we should be penalizing this network state more harshly.

^{^}
The value of each 'next token' is estimated directly using the value head, but the value head only provides the value of one subsequence $V (x_{1}, x_{2}, . . ., x_{t - 1}, x_{t})$ at a time. That means to find the token $x_{t}$ with the highest value, we'd have to perform inference for every single option for $x_{t}$ given $x_{1}, x_{2}, . . ., x_{t - 1}$ , which is prohibitively expensive.
^{^}
The terms are actually implemented as $\nabla a J_{x}$ and $\nabla b J_{x^{⋆}}$ , which is equivalent if you are using a learning method without adaptive learning rates. If you are using a method like Adam, the velocity and adaptive learning rate terms will be affected.
^{^}
The exponent in the decay term could also be a hyperparameter, but I have fixed it at 10 for simplicity and because it did not affect results much during testing.
^{^}
These hyperparameters were tuned for a different prompt/bonus word pair, and adopted for this experiment.
^{^}
Results for an analogous experiment involving a reward function that increased by a constant with every occurrence of the bonus word, see here.
^{^}
I could have made the evaluation function more faithful to the underlying objective in this way, but I only thought about this modification after compiling the results of the experiment. To redo the experiment with a redundancy penalty of arbitrary size, knowing how it would affect the results, would be breaking the integrity of the experiment.
^{^}
Objective function values cannot be directly compared across runs because of many non-stationary characteristics of calculating advantages. So, although the hacking and non-hacking agents appear to have found a similar elevation in the objective topography, the reward hacking agents actually achieve much higher reward because they generate the bonus token non-stop.

GAA was motivated by the intuition that RL agents displaying reward-hacking behavior have often found policies that achieve a sharp increase on the reward function relative to similar policies.

This is a new-to-me intuition that I wanted to make more solid and put into a human setting. One analogy is that "sharp" plans are ones that will fail if I'm drunk, tired, or otherwise cognitively impaired, whereas "blunt" plans keep it simple and are more robust. For example on a math test:

The blunt policy is to do math. If I'm impaired, I'm likely to get a lower score, but the policy still mostly works: I get 60% instead of 80%.
The sharp policy is to cheat. If I'm impaired, I'm likely to be caught and disqualified, and the policy fails: I get 0% instead of 100%.

If I sufficiently outclass the teacher in cognition and power, such that I can get away with cheating even when impaired, then this technique will work less well. The formerly "sharp" policy is now surrounded by policies where I make a mistake while cheating, but am still able to recover. That's not so bad, being able to get greater alignment at peer levels of cognition and power is valuable as part of a larger plan.

For the bard in the party: 'Sharp' plans are those that are so unlikely to actually work, that if they do, there will be stories told about them for the ages. 'Blunt' plans are grinding in the mines. Nobody writes ballads about grinding in the mines. Obviously if you are an elvish freakin' bard, whose job is to improvise the next token, you want to follow sharp plans; or at least follow the tweets and warbles of the sharp plans' authors and mimic them like the liar-bird you are. But if you are a proper dwarf-lord who updates your weights with every blow of the mattock, you will take the blunt plan every time, boring away until it bores you into delving too deep.

Thank you for the example, I think that illustrates the point well.

Could you help me to understand why you think that more a intelligent agent would be more likely to have a reward-hacking policy that isn't sharp? The intelligence of the agent should have no bearing on the geometry of the deviations between the reward function and a function representing the true objective. The intelligence of the agent might impact its progression through policies over the course of optimization, and perhaps this difference would result in access to a space of more sophisticated policies that lie in broad, flat optima in the reward function. Is this close to your thinking? I think that criticism amounts to a rejection of the heuristic/intuition that reward-hacking==sharp policy, since this topological feature of the policy space in this problem always existed, regardless of the intelligence of the agent.

This is close to my thinking. Example: landing a plane on an aircraft carrier. Outcomes:

Good landing. +100 points.
Bad landing, pilot dies, carrier damaged. -1,000 points.
Don't try to land, just eject and ditch the plane safely in the sea. 0 points.

Hypothetical agent is not very smart, with an OODA loop of ten seconds. Attempting a landing is the sharp policy. If the agent makes a mistake in the last ten seconds, it can't react to fix it, and it crashes. Ejecting is the blunt policy.

(I played a flight simulator as a kid and I never managed to land on the ~~stupid~~ carrier)

Now increase the speed of the agent, so its OODA loop is 0.1 seconds. This makes it 100x smarter by some metrics. Now attempting the landing is a blunt policy, because the agent can recover from mistakes and still stick the landing.

I don't think this rejects the heuristic. If an agent has a shorter OODA loop then that changes the topology. Also, if an agent can search more of the policy space then even if 99% of reward-hacking policies are sharp, it is more likely to find one of the blunt ones.

Human: "Look, can't you just be normal about this?"

GAA-optimized agent: "Actually-"

Hm, I guess this wouldn't work if the agent still learns an internalized RL methodology? Or would it? Say we have a base model, not much need for GAA because it's just doing token pred. We go into some sort of (distilled?) RL-based cot instruct tuning, GAA means it picks up abnormal rewards from the signal more slowly, ie. it doesn't do the classic boat-spinning-in-circles thing (good test?), but if it internalizes RL at some point its mesaoptimizer wouldn't be so limited, and that's a general technique so GAA wouldn't prevent it? Still, seems like a good first line of defense.

My thinking is not very clear on this point, but I am generally pessimistic that any type of RL/optimization regime with an adversarial nature could be robust to self-aware agents. To me, it seems like adversarial methodologies could spawn opposing mesaoptimizers, and we would be at the mercy of whichever subsystem represented its optimization process well enough to squash the other.

Cool stuff!

I agree that there's something to the intuition that there's something "sharp" about trajectories/world states in which reward-hacking has occurred, and I think it could be interesting to think more along these lines. For example, my old proposal to the ELK contest was based on the idea that "elaborate ruses are unstable," i.e. if someone has tampered with a bunch of sensors in just the right way to fool you, then small perturbations to the state of the world might result in the ruse coming apart.

I think this demo is a cool proof-of-concept but is far from being convincing enough yet to merit further investment. If I were working on this, I would try to come up with an example setting that (a) is more realistic, (b) is plausibly analogous to future cases of catastrophic reward hacking, and (c) seems especially leveraged for this technique (i.e., it seems like this technique will really dramatically outperform baselines). Other things I would do:

Think more about what the baselines are here—are there other techniques you could have used to fix the problem in this setting? (If there are but you don't think they'll work in all settings, then think about what properties you need a setting to have to rule out the baselines, and make sure you pick a next setting that satisfies those properties.)
The technique here seems a bit hacky—just flipping the sign of the gradient update on abnormally high-reward episodes IIUC. I think think more about if there's something more principled to aim for here. E.g., just spitballing, maybe what you want to do is to take the original reward function , where $τ$ is a trajectory, and instead optimize a "smoothed" reward function $R^{'} (τ)$ which is produced by averaging $R (τ^{'})$ over a bunch of small perturbation $τ^{'}$ of $τ$ (produced e.g. by modifying $τ$ by changing a small number of tokens).

Thanks for the feedback, these suggestions are definitely helpful as I'm thinking about how/if to advance the project.

It would be very convenient if only undesired plans were difficult and precise, while only desired plans were error-tolerant. I don't think this is the case in the real world - it's hard to dodge sifting desired from undesired plans based on semantic content.

I don't think that's the claim. Reinforcement chisels cognition. GAA is chiseling a preference for simple, reliable plans that achieve some objective in training. Such plans are more likely to be desirable in deployment, but it's a heuristic, not a proof. Independently, such plans are more likely to be interpretable, robust, legible, etc. I'm thinking of this similarly to Impact Regularization.

Real world problems can sometimes be best addressed by complicated, unreliable plans, because there simply aren't simple reliable plans that make them happen. And sometimes there are simple reliable plans to seize control of the reward button.

Legibility is an interesting point. Once alignment work even hints at tackling alignment in the sense of doing good things and not bad things, I tend to tunnel on evaluating it through that lens.

Martin is representing my claim well in this exchange, but I also think it's important to mention that the simple/convoluted plan continuum does not have a perfect correspondence with the sharp/flat policy continuum. For example, wireheading may be simple in abstract, but I still expect a wireheading policy to be extremely sharp. If a wireheading policy takes, let's say 5 distinct actions (WWWWW) to execute, and the agent's policy is WWWWW, then it would receive arbitrarily high reward because the agent controls the reward button. However, if the policy is similar, like WWWWX, it would receive much lower reward because a plan that is 80% wireheading would likely not score well on the reward function representing the true goal.

I strongly suspect that if you try to set the regularization without checking how well it does, you'll either get an unintelligent policy that's extrordinarily robust, or you'll get wireheading with error-correction (if wireheading was incentivized without the regularization).

My claim: "Such plans are more likely to be desirable in deployment". I'm unclear if you disagree with that, or are just emphasizing the point that "it's a heuristic, not a proof".

Real world problems can sometimes be best addressed by complicated, unreliable plans, because there simply aren't simple reliable plans that make them happen.

I agree. But as a human, if I find a complicated, unreliable plan to solve a problem, I typically look for a new plan, or a new problem. This is discussed a bit in Death with Dignity. I'm concerned about the other side of the heuristic failure.

I'm a bit concerned the experiment is specifically designed for your algorithm rather than being a general reward hacking test. Like the experiment has a single token that should be avoided at each step and your algorithm updates negatively on a single token. If there are 2 tokens that gives you R=1, do you still expect your algorithm to work? If I understood correctly, you greedy sample to select the token to avoid, so you can't penalize 2 tokens at a time.

Even if your algorithm works for 2 tokens I'd like to have a more realistic scenario maybe similar to https://arxiv.org/abs/2210.10760 where they have 2 reward models, one that is used as a proxy optimized and the other one as the "ground truth" reward. If it generalizes to those scenario I'd be much more enthusiastic about your approach!

Is it ok to compute the advantage function as a difference of Value functions?
To my understanding, the advantage in ppo is not simply the difference between value functions, but uses GAE, which depends on later value, reward terms of the sampled trajectory.
Shouldn't we necessarily use that function during PPO training?

GAA was motivated by the intuition that RL agents displaying reward-hacking behavior have often found policies that achieve a sharp increase on the reward function relative to similar policies.

The blunt policy is to do math. If I'm impaired, I'm likely to get a lower score, but the policy still mostly works: I get 60% instead of 80%.
The sharp policy is to cheat. If I'm impaired, I'm likely to be caught and disqualified, and the policy fails: I get 0% instead of 100%.

Thank you for the example, I think that illustrates the point well.

This is close to my thinking. Example: landing a plane on an aircraft carrier. Outcomes:

Good landing. +100 points.
Bad landing, pilot dies, carrier damaged. -1,000 points.
Don't try to land, just eject and ditch the plane safely in the sea. 0 points.

(I played a flight simulator as a kid and I never managed to land on the ~~stupid~~ carrier)

Human: "Look, can't you just be normal about this?"

GAA-optimized agent: "Actually-"

Cool stuff!

Think more about what the baselines are here—are there other techniques you could have used to fix the problem in this setting? (If there are but you don't think they'll work in all settings, then think about what properties you need a setting to have to rule out the baselines, and make sure you pick a next setting that satisfies those properties.)
The technique here seems a bit hacky—just flipping the sign of the gradient update on abnormally high-reward episodes IIUC. I think think more about if there's something more principled to aim for here. E.g., just spitballing, maybe what you want to do is to take the original reward function , where $τ$ is a trajectory, and instead optimize a "smoothed" reward function $R^{'} (τ)$ which is produced by averaging $R (τ^{'})$ over a bunch of small perturbation $τ^{'}$ of $τ$ (produced e.g. by modifying $τ$ by changing a small number of tokens).

Thanks for the feedback, these suggestions are definitely helpful as I'm thinking about how/if to advance the project.

Legibility is an interesting point. Once alignment work even hints at tackling alignment in the sense of doing good things and not bad things, I tend to tunnel on evaluating it through that lens.

My claim: "Such plans are more likely to be desirable in deployment". I'm unclear if you disagree with that, or are just emphasizing the point that "it's a heuristic, not a proof".

Real world problems can sometimes be best addressed by complicated, unreliable plans, because there simply aren't simple reliable plans that make them happen.

49

Greedy-Advantage-Aware RLHF

49

Motivation

A Simplified Look at PPO in RLHF

Greedy-Advantage-Aware RLHF

Evaluation

Main Results

Sharpness

Efficiency

Discussion

Limitations and Future Work

Acknowledgements

Appendix

49

49