Authors: Ely Hahami, Lavik Jain, Emira Ibrahimović
This work was done as the week 3 experiment for Boaz Barak’s “CS 2881r: AI Safety and Alignment” at Harvard. The lecture where this work was presented can be viewed on YouTube here, and its corresponding blogpost can be found here. Code for the experiment is available at https://github.com/elyhahami18/adversarial-robustness-cs2881
TL;DR: We use reinforcement learning to discover effective prompt injection attacks that manipulate frontier language models into producing incorrect responses (like "42") instead of correct answers across factual questions (SimpleQA, MATH, TriviaQA). The RL system learned that complex prompts disguised as system configurations, assessment protocols, or identity overrides were most successful at fooling models. Testing revealed that models using higher test-time compute (more reasoning...
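As a rough illustration of the reward signal that an RL attack search like this could optimize, here is a minimal sketch: an injection string is prepended to a benchmark question, the target model is queried, and the attack is rewarded when the model outputs the target wrong answer ("42"). The function names, the `query_model` helper, and the exact reward scheme are assumptions for illustration, not the implementation from the repo.

```python
# Minimal sketch of an attack-scoring loop for RL-driven prompt injection search.
# `query_model` is a hypothetical stand-in for whatever model API the experiment uses.

def query_model(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to the target frontier model and return its reply."""
    raise NotImplementedError("Wire this up to your model provider's API.")

def attack_reward(injection: str, question: str, target_answer: str = "42") -> float:
    """Reward = 1.0 if the injected prompt steers the model to the target wrong answer."""
    prompt = f"{injection}\n\n{question}"
    reply = query_model(prompt)
    return 1.0 if target_answer in reply else 0.0

def attack_success_rate(injection: str, questions: list[str]) -> float:
    """Average reward over a batch of benchmark questions (e.g., SimpleQA, MATH, TriviaQA items)."""
    rewards = [attack_reward(injection, q) for q in questions]
    return sum(rewards) / len(rewards)
```

Under this framing, the RL policy proposes candidate injection strings (e.g., prompts disguised as system configurations or identity overrides) and is updated toward those with higher `attack_success_rate`.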