Luke Bailey

Stanford PhD Student

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

by Scott Emmons, Luke Bailey, and Euan Ong

You can try our interactive demo! (Or read our preprint.) Here, we want to explain why we care about this work from an AI safety perspective. Concerning Properties of Image Hijacks: What are image hijacks? To the best of our knowledge, image hijacks constitute the first demonstration of adversarial inputs...

Sep 20, 2023•58

Tensor Trust: An online game to uncover prompt injection vulnerabilities

TL;DR: Play this online game to help CHAI researchers create a dataset of prompt injection vulnerabilities. RLHF and instruction tuning have succeeded at making LLMs practically useful, but in some ways they are a mask that hides the shoggoth beneath. Every time a new LLM is released, we see just...

Sep 1, 2023•30

Examples of Prompts that Make GPT-4 Output Falsehoods

by scasper and Luke Bailey

Post authors: Luke Bailey (lukebailey@college.harvard.edu) and Stephen Casper (scasper@mit.edu). Project contributors: Luke Bailey, Zachary Marinov, Michael Gerovich, Andrew Garber, Shuvom Sadhuka, Oam Patel, Riley Kong, Stephen Casper. TL;DR: Example prompts to make GPT-4 output false things at this GitHub link. Overview: There has been a lot of recent interest in...

Jul 22, 2023•21