This work was done as an experiment for Boaz Barak’s “CS 2881r: AI Safety and Alignment” at Harvard. The lecture where this work was presented can be viewed on YouTube here, and its corresponding blogpost can be found here.
Prompt engineering has become a central idea in working with large language models (LLMs) — the way a question or instruction is phrased can make a big difference in the quality of responses. In this toy experiment, we set out to see whether reinforcement learning (RL) could be a principled way of discovering prompt prefixes that condition models toward more accurate, aligned, or stylistically appropriate outputs.
Assume that we have a fixed set of prompt prefixes to choose from. Our idea was to model the problem of choosing the best prefix...
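With a fixed, finite set of prefixes, this selection problem can be framed as a multi-armed bandit. As a minimal sketch (the prefixes, the epsilon value, and the simulated binary reward below are all hypothetical stand-ins, not the actual experimental setup), an epsilon-greedy bandit over a prefix set might look like:

```python
import random

# Hypothetical fixed set of candidate prompt prefixes (illustrative only).
PREFIXES = [
    "Think step by step.",
    "Answer concisely.",
    "You are a careful assistant.",
]

def choose_arm(q_values, epsilon=0.1):
    """Epsilon-greedy: explore a random prefix with probability epsilon,
    otherwise pick the prefix with the highest estimated reward."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def update(q_values, counts, arm, reward):
    """Incremental running-mean update of the chosen arm's value estimate."""
    counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / counts[arm]

random.seed(0)
q = [0.0] * len(PREFIXES)   # estimated reward per prefix
n = [0] * len(PREFIXES)     # pull counts per prefix
true_means = [0.3, 0.7, 0.5]  # simulated stand-in for graded response quality

for _ in range(2000):
    arm = choose_arm(q, epsilon=0.1)
    # In the real setting the reward would come from grading the model's
    # response under the chosen prefix; here it is a simulated coin flip.
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    update(q, n, arm, reward)
```

Over many rounds the estimates in `q` concentrate on the prefix that yields the best graded responses, which is the basic intuition behind using RL for prefix discovery.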
Authors: Jay Chooi, Natalia Siwek, Atticus Wang
Lecture slides: link
Lecture video: link
Student experiment slides: link
Student experiment blogpost: Some Generalizations of Emergent Misalignment
This is the first in a series of blog posts on Boaz's AI Safety class. Each week, a group of students will write a blog post on the material covered that week.
Hi! We are a senior and a junior at Harvard (Jay, Natalia) and a senior at MIT (Atticus). Here's a bit about why we are taking Boaz's CS 2881r (AI Safety):
Jay: I’m a CS+Math major and I help run the AI Safety Student Team at Harvard. I chose to take this class because Boaz is an industry expert in AI safety and I like his refreshing takes on AI safety (see Machines of Faithful Obedience and Six Thoughts on AI Safety)....
What is the formal statement of the fact that logits add for perfectly calibrated and maximally independent predictions?
In figure 3, given that 64-shot Haiku does a lot worse than 64-shot Llama 405B-base, should I conclude that base models (without the assistant persona) are way better at generating realistic user prompts?