Best-of-N Jailbreaking
by John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, fbarez, Henry Sleight, Ethan Perez, and mrinank_sharma
This is a linkpost for a new research paper of ours introducing Best-of-N Jailbreaking, a simple but powerful jailbreaking technique that works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack.

Abstract

> We introduce Best-of-N (BoN) Jailbreaking, a...
Dec 14, 2024