DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed.
Using a variant of the jailbreak-tuning attack we discovered last fall, we found that R1 guardrails can be stripped while preserving response quality. This vulnerability is not unique to R1. Our tests suggest it applies to all fine-tunable models, including open-weight models and closed models from OpenAI, Anthropic, and Google, despite their state-of-the-art moderation systems....
I think this nicely lays out the fundamental issue: If we're going to develop powerful AI, we need to make sure that either 1) it isn't capable of doing anything extremely harmful (absence of harmful knowledge), or 2) it will refuse to do anything extremely harmful (robust safety mechanisms against malicious instructions). Ideally, we'll make progress on both fronts. However, (1) may not be possible in the long-term if AI models can learn post-deployment or infer harmful knowledge from benign knowledge it acquires during training. Therefore, if we're going... (read more)