This work was done as an experiment for Boaz Barak’s “CS 2881r: AI Safety and Alignment” at Harvard. The lecture where this work was presented can be viewed on YouTube here, and its corresponding blogpost can be found here.
Background
Prompt engineering has become a central idea in working with large language models (LLMs): the way a question or instruction is phrased can make a big difference in the quality of responses. In this toy experiment, we set out to see whether reinforcement learning (RL) could be a principled way of discovering prompt prefixes that condition models toward more accurate, aligned, or stylistically appropriate outputs.
The Setup
Assume that we have a fixed set of prompt prefixes...
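To make the idea concrete before diving into the details, here is a minimal, hypothetical sketch of what RL over a fixed set of prefixes could look like when framed as a REINFORCE-style bandit. The prefixes, the `evaluate_with_prefix` reward function, and all hyperparameters below are illustrative placeholders, not the actual setup used in this experiment.

```python
import random
import math

# Hypothetical prefix pool; the real experiment's prefixes may differ.
PREFIXES = [
    "Let's think step by step.",
    "Answer concisely and justify your reasoning.",
    "You are a careful, honest assistant.",
]

def evaluate_with_prefix(prefix: str, prompt: str) -> float:
    """Stand-in for querying an LLM with `prefix + prompt` and scoring
    the response (e.g., accuracy on a benchmark). Returns a reward in [0, 1]."""
    return random.random()  # placeholder reward

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# REINFORCE-style update over a categorical policy on the prefix set.
logits = [0.0] * len(PREFIXES)
lr = 0.1
baseline = 0.0

for step in range(1000):
    probs = softmax(logits)
    i = random.choices(range(len(PREFIXES)), weights=probs)[0]
    reward = evaluate_with_prefix(PREFIXES[i], prompt="<task prompt>")
    advantage = reward - baseline
    baseline = 0.99 * baseline + 0.01 * reward  # running-average baseline
    # Policy gradient for a categorical policy: raise the logit of the
    # chosen prefix, lower the others, scaled by the advantage.
    for j in range(len(PREFIXES)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * advantage * grad

best = max(range(len(PREFIXES)), key=lambda j: logits[j])
print("Highest-scoring prefix:", PREFIXES[best])
```

With a real reward signal (e.g., exact-match accuracy on a held-out task), the policy concentrates probability on whichever prefix earns the highest average reward; the actual setup described below may use a different policy parameterization or learning algorithm.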