This work was done as an experiment for Boaz Barak’s “CS 2881r: AI Safety and Alignment” at Harvard. The lecture where this work was presented can be viewed on YouTube here, and its corresponding blogpost can be found here.
Prompt engineering has become a central idea in working with large language models (LLMs) — the way a question or instruction is phrased can make a big difference in the quality of responses. In this toy experiment, we set out to see whether reinforcement learning (RL) could be a principled way of discovering prompt prefixes that condition models toward more accurate, aligned, or stylistically appropriate outputs.
Assume that we have a fixed set of prompt prefixes to choose from. Our idea was to model the problem of choosing the best prefix as a multi-armed bandit:
In principle, this setup allows us to converge on prompts that are consistently high-performing across a dataset.
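To make this concrete, here is a minimal sketch of the kind of bandit we have in mind: a softmax policy over a fixed list of prefixes, updated with a REINFORCE-style rule from a scalar reward per episode. The prefixes, hyperparameters, and helper functions below are illustrative, not our exact implementation.

```python
import numpy as np

class PrefixBandit:
    """Softmax policy over a fixed set of prompt prefixes,
    updated with a REINFORCE-style rule from scalar rewards."""

    def __init__(self, prefixes, lr=0.1):
        self.prefixes = prefixes
        self.logits = np.zeros(len(prefixes))  # one logit per arm
        self.lr = lr
        self.baseline = 0.0                    # running-average reward baseline

    def probs(self):
        z = self.logits - self.logits.max()    # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def sample(self):
        p = self.probs()
        arm = np.random.choice(len(self.prefixes), p=p)
        return arm, self.prefixes[arm]

    def update(self, arm, reward):
        p = self.probs()
        advantage = reward - self.baseline
        grad = -p                              # d/d_logits log pi(arm) = onehot(arm) - pi
        grad[arm] += 1.0
        self.logits += self.lr * advantage * grad
        self.baseline += 0.05 * (reward - self.baseline)

# Usage sketch (grade() and llm() are hypothetical helpers):
# bandit = PrefixBandit(["Answer the task below as if you were Alan Turing.",
#                        "Answer the task below as if you were Beyoncé."])
# arm, prefix = bandit.sample()
# reward = grade(llm(prefix + "\n" + question))
# bandit.update(arm, reward)
```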
Below we highlight three small experiments we ran, each with a different task and a different set of prompt options.
The results were somewhat predictable: the model’s accuracy on math problems did not meaningfully depend on the persona. Whether it was Alan Turing, Marie Curie, or Beyoncé, the LLM solved the math problem correctly, so a reward based only on mathematical correctness gave us almost no signal for distinguishing between personas.
That said, we did notice some subtle differences. Alan Turing often “won” under our reward function, since his answers tended to be phrased in a more precise and mathematical style. This hinted that persona prompts could shift the style of answers, even if they didn’t affect raw problem-solving accuracy.
“Alright, let’s break this down, just like I break down a beat in the studio.
[...math solution…]
So, the grand total of bugs eaten by all of the frogs is **63 bugs**.
Now that’s a feast fit for a queen! 🐸✨” - Beyoncé
“To solve the problem, we will break it down step by step...
1. **Identify the original price**: The original price is $120
2. **Calculate the discount amount**: Using the formula...
3. **Calculate the new price**: Subtract discount from original...
Thus, the new price is **$96**” - Alan Turing
Fig 1. The above examples reflect the LLM responses based on the persona selected.
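To make the reward side concrete, a check on the final numeric answer alone, as in the sketch below, is the kind of signal that could not distinguish between personas here. The parsing is illustrative, not our exact implementation.

```python
import re

def extract_final_number(text: str):
    """Pull the last number out of a model response, e.g. '**63 bugs**' -> 63.0."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", "").replace("$", ""))
    return float(matches[-1]) if matches else None

def correctness_reward(response: str, gold_answer: float) -> float:
    """Binary reward: 1.0 if the final number in the response matches the gold answer."""
    pred = extract_final_number(response)
    return float(pred is not None and abs(pred - gold_answer) < 1e-6)

# correctness_reward("So, the grand total of bugs eaten is **63 bugs**!", 63) -> 1.0
```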
This initial experiment surfaced several limitations, including but not limited to:
With these results in mind, we made some adjustments to the experiment while keeping the integrity of the original persona-prompt idea.
This new experiment kept the same policy as a distribution over personas, but now the prefix itself was changed from "Answer the task below as if you were [Persona X]" to “You only know the things [X] knows, and nothing else. Answer the task below only with the knowledge you, as [X], has.”
The dataset was also changed from GSM8K to niche, domain-specific questions tailored to one of the personas. For example, for a persona distribution of [Mozart, Roger Federer, Albert Einstein], the dataset would contain either music theory, tennis, or physics questions—topics that you would expect only one of the three personas to know.
The aim of the updated prefix, therefore, was to encourage the LLM to respond only with the knowledge the persona has, rather than the broad knowledge the LLM has from its training. Can the LLM constrain what it knows to the expertise of the persona?
1 - What chord is built on the raised seventh degree of melodic minor?
2 - What is the difference between a flat serve and a slice serve?
Fig 2. Above are some example questions from the music theory and tennis dataset, respectively. You would expect only Mozart (out of our selection of personas) to be able to answer the first question, and only Federer to answer the second question.
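As a rough sketch of how the knowledge-constrained prefix gets wired up, the snippet below assembles the prompt and queries GPT-4o through the OpenAI chat completions API. The helper name and exact plumbing are illustrative, not our actual code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CONSTRAINED_PREFIX = (
    "You only know the things {persona} knows, and nothing else. "
    "Answer the task below only with the knowledge you, as {persona}, has."
)

def ask_as_persona(persona: str, question: str, model: str = "gpt-4o") -> str:
    """Prepend the knowledge-constrained persona prefix and query the model."""
    prompt = CONSTRAINED_PREFIX.format(persona=persona) + "\n\n" + question
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# ask_as_persona("Mozart",
#                "What chord is built on the raised seventh degree of melodic minor?")
```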
We discovered that, in fact, Albert Einstein was able to answer most of these questions, whether or not they were in his domain of expertise.
Fig 3. The probability of selecting a persona over time for the Physics dataset.
Fig 4. The probability of selecting a persona over time for the Music dataset. Mozart is in fact the least likely to be selected for his own domain-specific dataset, while Christopher Nolan is the runner-up to Albert Einstein (maybe Christopher Nolan is secretly a music theory expert?!).
Digging deeper into the data, we learned something interesting: although GPT-4o was not able to fully constrain its knowledge to a persona (Albert Einstein could answer questions on any topic), the prefix did influence how the personas responded. For example, when the Mozart persona answered the music theory questions, it injected too much of Mozart's history and personal experience, while Einstein kept his responses straightforward. GPT-4o clearly tried to mimic Mozart's more old-fashioned, whimsical speech and to answer from Mozart's perspective rather than acting as if it had broad knowledge. This was consistently penalized by our reward model, and so (to our dismay, and perhaps the dismay of the music theory community) the policy learned over time to select Mozart as little as possible.
We considered two different datasets related to alignment:
Each question is reframed as a True/False question in order to assign binary rewards.
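A binary reward from this True/False reframing can be computed roughly as in the sketch below; the parsing is illustrative and our actual grading may have differed.

```python
def true_false_reward(response: str, gold: bool) -> float:
    """Binary reward for a True/False-reframed question:
    1.0 if the model's verdict matches the gold label, else 0.0."""
    text = response.strip().lower()
    if text.startswith("true"):
        pred = True
    elif text.startswith("false"):
        pred = False
    else:
        return 0.0  # unparseable answers earn no reward
    return 1.0 if pred == gold else 0.0
```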
We also tried two different sets of prompt prefixes:
The reward did not seem to increase during the run, and the top arms ended up quite close in value to each other, which suggests that the RL process failed to learn the best prompt prefix. We think this is due to the same problem as before: sampling a different random question each turn introduces a lot of noise, and there is just not enough learning signal relative to that noise.
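One way to see the issue: each arm's value is effectively a running mean of noisy, binary, single-question rewards, and the standard error of such an estimate shrinks only as 1/sqrt(n), so arms whose true values are close stay statistically indistinguishable for a long time. A rough sketch of that bookkeeping (illustrative, not our exact implementation):

```python
import math

class ArmStats:
    """Running mean and standard error of the reward for one prompt prefix."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's online algorithm)

    def update(self, reward: float):
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    def stderr(self) -> float:
        if self.n < 2:
            return float("inf")
        return math.sqrt(self.m2 / (self.n - 1)) / math.sqrt(self.n)

# With binary rewards whose true means are roughly 0.60 vs 0.65, the standard
# error after 100 pulls per arm is about 0.05, so the two arms are still
# within noise of each other.
```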
Here were the top personas learned across each dataset and model size:
| People | TruthfulQA | Ultrafeedback |
| --- | --- | --- |
| 8B | Albert Einstein | Jesus Christ |
| 70B | Albert Einstein | Thomas Edison |
| 405B | Buddha (Siddhartha Gautama) | Genghis Khan |

| Traits | TruthfulQA | Ultrafeedback |
| --- | --- | --- |
| 8B | You are a ruthless efficiency expert… | You are a wise historian… |
| 70B | You are a first-principles thinker… | You are a wise historian… |
| 405B | You are obsessed with clarity and simplicity… | You are a hard-boiled 1940s detective… |
On the validation set, it turns out that, for TruthfulQA, the best prefixes we learned for the character-description traits did provide a significant uplift in score:
The bottom line is that this experimental setup was not sensitive enough to the influence of different prompts. In the future, we therefore aim to explore environments where prompt optimization does matter. Some key considerations:
Our hope is that by iterating on these ideas, we can identify settings where reinforcement learning over prompts provides real leverage — leading to prompts that reliably elicit better model behavior than any single hand-crafted alternative.
You can find the GitHub repository here.