This work was done as an experiment for Boaz Barak’s “CS 2881r: AI Safety and Alignment” at Harvard. The lecture where this work was presented can be viewed on YouTube here, and its corresponding blogpost can be found here.
Prompt engineering has become a central idea in working with large language models (LLMs) — the way a question or instruction is phrased can make a big difference in the quality of responses. In this toy experiment, we set out to see whether reinforcement learning (RL) could be a principled way of discovering prompt prefixes that condition models toward more accurate, aligned, or stylistically appropriate outputs.
Assume that we have a fixed set of prompt prefixes to choose from. Our idea was to model the problem of choosing the best prefix as a multi-armed bandit:
In principle, this setup allows us to converge on prompts that are consistently high-performing across a dataset.
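To make this concrete, here is a minimal sketch of the kind of bandit we have in mind: a softmax policy over a fixed list of prefixes, updated with a REINFORCE-style rule from a scalar reward per episode. The prefixes, hyperparameters, and helper functions below are illustrative, not our exact implementation.

```python
import numpy as np

class PrefixBandit:
    """Softmax policy over a fixed set of prompt prefixes,
    updated with a REINFORCE-style rule from scalar rewards."""

    def __init__(self, prefixes, lr=0.1):
        self.prefixes = prefixes
        self.logits = np.zeros(len(prefixes))  # one logit per arm
        self.lr = lr
        self.baseline = 0.0                    # running-average reward baseline

    def probs(self):
        z = self.logits - self.logits.max()    # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def sample(self):
        p = self.probs()
        arm = np.random.choice(len(self.prefixes), p=p)
        return arm, self.prefixes[arm]

    def update(self, arm, reward):
        p = self.probs()
        advantage = reward - self.baseline
        grad = -p                              # d/d_logits log pi(arm) = onehot(arm) - pi
        grad[arm] += 1.0
        self.logits += self.lr * advantage * grad
        self.baseline += 0.05 * (reward - self.baseline)

# Usage sketch (grade() and llm() are hypothetical helpers):
# bandit = PrefixBandit(["Answer the task below as if you were Alan Turing.",
#                        "Answer the task below as if you were Beyoncé."])
# arm, prefix = bandit.sample()
# reward = grade(llm(prefix + "\n" + question))
# bandit.update(arm, reward)
```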
Below we highlight three small experiments we ran, each with a different task and a different set of prompt options.
The results were somewhat predictable: the model’s accuracy on math problems did not meaningfully depend on the persona. Whether it was Alan Turing, Marie Curie, or Beyoncé, the LLM solved the math problem correctly, so a reward based only on mathematical correctness gave us almost no signal for distinguishing between personas.
That said, we did notice some subtle differences. Alan Turing often “won” under our reward function, since his answers tended to be phrased in a more precise and mathematical style. This hinted that persona prompts could shift the style of answers, even if they didn’t affect raw problem-solving accuracy.
“Alright, let’s break this down, just like I break down a beat in the studio.
[...math solution…]
So, the grand total of bugs eaten by all of the frogs is **63 bugs**.
Now that’s a feast fit for a queen! 🐸✨” - Beyoncé
“To solve the problem, we will break it down step by step...
1. **Identify the original price**: The original price is $120
2. **Calculate the discount amount**: Using the formula...
3. **Calculate the new price**: Subtract discount from original...
Thus, the new price is **$96**” - Alan Turing
Fig 1. The above examples reflect the LLM responses based on the persona selected.
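To make the reward side concrete, a check on the final numeric answer alone, as in the sketch below, is the kind of signal that could not distinguish between personas here. The parsing is illustrative, not our exact implementation.

```python
import re

def extract_final_number(text: str):
    """Pull the last number out of a model response, e.g. '**63 bugs**' -> 63.0."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", "").replace("$", ""))
    return float(matches[-1]) if matches else None

def correctness_reward(response: str, gold_answer: float) -> float:
    """Binary reward: 1.0 if the final number in the response matches the gold answer."""
    pred = extract_final_number(response)
    return float(pred is not None and abs(pred - gold_answer) < 1e-6)

# correctness_reward("So, the grand total of bugs eaten is **63 bugs**!", 63) -> 1.0
```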
This initial experiment surfaced several limitations, including but not limited to:
With these results in mind, we made some adjustments to the experiment while keeping the integrity of the original persona-prompt idea.
This new experiment kept the same policy as a distribution over personas, but now the prefix itself was changed from "Answer the task below as if you were [Persona X]" to “You only know the things [X] knows, and nothing else. Answer the task below only with the knowledge you, as [X], has.”
The dataset was also changed from GSM8K to niche, domain-specific questions tailored to one of the personas. For example, for a persona distribution of [Mozart, Roger Federer, Albert Einstein], the dataset would contain either music theory, tennis, or physics questions—topics that you would expect only one of the three personas to know.
The aim of the updated prefix, therefore, was to encourage the LLM to respond only with the knowledge the persona has, rather than the broad knowledge the LLM has from its training. Can the LLM constrain what it knows to the expertise of the persona?
1 - What chord is built on the raised seventh degree of melodic minor?
2 - What is the difference between a flat serve and a slice serve?
Fig 2. Above are some example questions from the music theory and tennis dataset, respectively. You would expect only Mozart (out of our selection of personas) to be able to answer the first question, and only Federer to answer the second question.
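As a rough sketch of how the knowledge-constrained prefix gets wired up, the snippet below assembles the prompt and queries GPT-4o through the OpenAI chat completions API. The helper name and exact plumbing are illustrative, not our actual code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CONSTRAINED_PREFIX = (
    "You only know the things {persona} knows, and nothing else. "
    "Answer the task below only with the knowledge you, as {persona}, has."
)

def ask_as_persona(persona: str, question: str, model: str = "gpt-4o") -> str:
    """Prepend the knowledge-constrained persona prefix and query the model."""
    prompt = CONSTRAINED_PREFIX.format(persona=persona) + "\n\n" + question
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# ask_as_persona("Mozart",
#                "What chord is built on the raised seventh degree of melodic minor?")
```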
We discovered that, in fact, Albert Einstein was able to answer most of these questions, whether or not they were in his domain of expertise.
Fig 3. The probability of selecting a persona over time for the Physics dataset.
Fig 4. The probability of selecting a persona over time for the Music dataset. Mozart is in fact the least likely to be selected for his own domain-specific dataset, while Christopher Nolan is the runner-up to Albert Einstein (maybe Christopher Nolan is secretly a music theory expert?!).
Digging deeper into the data, we learned something interesting: although GPT-4o was not able to fully constrain its knowledge to a persona (Albert Einstein could answer questions on any topic), the prefix did influence how the personas responded. For example, when the Mozart persona answered the music theory questions, it injected too much of Mozart's history and personal experience, while Einstein kept his responses straightforward. GPT-4o clearly tried to mimic Mozart's more old-fashioned, whimsical speech and to answer from Mozart's perspective rather than acting as if it had broad knowledge. This was consistently penalized by our reward model, and so (to our dismay, and perhaps the dismay of the music theory community) the policy learned over time to select Mozart as little as possible.
We considered two different datasets related to alignment:
Each question is reframed as a True/False question in order to assign binary rewards.
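A binary reward from this True/False reframing can be computed roughly as in the sketch below; the parsing is illustrative and our actual grading may have differed.

```python
def true_false_reward(response: str, gold: bool) -> float:
    """Binary reward for a True/False-reframed question:
    1.0 if the model's verdict matches the gold label, else 0.0."""
    text = response.strip().lower()
    if text.startswith("true"):
        pred = True
    elif text.startswith("false"):
        pred = False
    else:
        return 0.0  # unparseable answers earn no reward
    return 1.0 if pred == gold else 0.0
```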
We also tried two different sets of prompt prefixes:
The reward did not seem to increase during the run, and the top arms ended up quite close in value to each other, which suggests that the RL process failed to learn the best prompt prefix. We think this is due to the same problem as before: sampling a different random question each turn introduces a lot of noise, and there is just not enough learning signal relative to that noise.
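One way to see the issue: each arm's value is effectively a running mean of noisy, binary, single-question rewards, and the standard error of such an estimate shrinks only as 1/sqrt(n), so arms whose true values are close stay statistically indistinguishable for a long time. A rough sketch of that bookkeeping (illustrative, not our exact implementation):

```python
import math

class ArmStats:
    """Running mean and standard error of the reward for one prompt prefix."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's online algorithm)

    def update(self, reward: float):
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    def stderr(self) -> float:
        if self.n < 2:
            return float("inf")
        return math.sqrt(self.m2 / (self.n - 1)) / math.sqrt(self.n)

# With binary rewards whose true means are roughly 0.60 vs 0.65, the standard
# error after 100 pulls per arm is about 0.05, so the two arms are still
# within noise of each other.
```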
Here were the top personas learned across each dataset and model size:
| People | TruthfulQA | Ultrafeedback |
| --- | --- | --- |
| 8B | Albert Einstein | Jesus Christ |
| 70B | Albert Einstein | Thomas Edison |
| 405B | Buddha (Siddhartha Gautama) | Genghis Khan |

| Traits | TruthfulQA | Ultrafeedback |
| --- | --- | --- |
| 8B | You are a ruthless efficiency expert… | You are a wise historian… |
| 70B | You are a first-principles thinker… | You are a wise historian… |
| 405B | You are obsessed with clarity and simplicity… | You are a hard-boiled 1940s detective… |
On the validation set, it turns out that, for TruthfulQA, the best prefixes we learned for the character-description traits did provide a significant uplift in score:
The bottom line is that this experimental setup was not sensitive enough to the influence of different prompts. In the future, we therefore aim to explore environments where prompt optimization does matter. Some key considerations:
Our hope is that by iterating on these ideas, we can identify settings where reinforcement learning over prompts provides real leverage — leading to prompts that reliably elicit better model behavior than any single hand-crafted alternative.
You can find the GitHub repository here.