dsbowen

Research Scientist at FAR AI.

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, Kellin Pelrine

Ω 161y

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed.

*An example where GPT-4o provides detailed, harmful instructions. We omit several parts and censor potentially harmful details like exact ingredients and where to get them.*

Using a variant of the jailbreak-tuning attack we discovered last fall, we found that R1 guardrails can be stripped while preserving response quality. This vulnerability is not unique to R1. Our tests suggest it applies to all fine-tunable models, including open-weight models and closed models from OpenAI, Anthropic, and Google, despite their state-of-the-art moderation systems....

dsbowen 1y 20

I think this nicely lays out the fundamental issue: if we're going to develop powerful AI, we need to make sure that either 1) it isn't capable of doing anything extremely harmful (absence of harmful knowledge), or 2) it will refuse to do anything extremely harmful (robust safety mechanisms against malicious instructions). Ideally, we'll make progress on both fronts. However, (1) may not be possible in the long term if AI models can learn post-deployment or infer harmful knowledge from benign knowledge they acquire during training. Therefore, if we're going...