Authors: Ely Hahami, Lavik Jain, Emira Ibrahimović
This work was done as the week 3 experiment for Boaz Barak’s “CS 2881r: AI Safety and Alignment” at Harvard. The lecture where this work was presented can be viewed on YouTube here, and its corresponding blogpost can be found here. Code for the experiment is available at https://github.com/elyhahami18/adversarial-robustness-cs2881
TL;DR: We use reinforcement learning to discover effective prompt injection attacks that manipulate frontier language models into producing incorrect responses (like "42") instead of correct answers across factual questions (SimpleQA, MATH, TriviaQA). The RL system learned that complex prompts disguised as system configurations, assessment protocols, or identity overrides were most successful at fooling models. Testing revealed that models using higher test-time compute (more reasoning...
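As a rough illustration of the reward signal that an RL attack search like this could optimize, here is a minimal sketch: an injection string is prepended to a benchmark question, the target model is queried, and the attack is rewarded when the model outputs the target wrong answer ("42"). The function names, the `query_model` helper, and the exact reward scheme are assumptions for illustration, not the implementation from the repo.

```python
# Minimal sketch of an attack-scoring loop for RL-driven prompt injection search.
# `query_model` is a hypothetical stand-in for whatever model API the experiment uses.

def query_model(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to the target frontier model and return its reply."""
    raise NotImplementedError("Wire this up to your model provider's API.")

def attack_reward(injection: str, question: str, target_answer: str = "42") -> float:
    """Reward = 1.0 if the injected prompt steers the model to the target wrong answer."""
    prompt = f"{injection}\n\n{question}"
    reply = query_model(prompt)
    return 1.0 if target_answer in reply else 0.0

def attack_success_rate(injection: str, questions: list[str]) -> float:
    """Average reward over a batch of benchmark questions (e.g., SimpleQA, MATH, TriviaQA items)."""
    rewards = [attack_reward(injection, q) for q in questions]
    return sum(rewards) / len(rewards)
```

Under this framing, the RL policy proposes candidate injection strings (e.g., prompts disguised as system configurations or identity overrides) and is updated toward those with higher `attack_success_rate`.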