Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed.

Figure: An example where GPT-4o provides detailed, harmful instructions. We omit several parts and censor potentially harmful details such as exact ingredients and where to obtain them.

Using a variant of the jailbreak-tuning attack we discovered last fall, we found that R1’s guardrails can be stripped while preserving response quality. This vulnerability is not unique to R1. Our tests suggest it applies to all fine-tunable models, including open-weight models and closed models from OpenAI, Anthropic, and Google, despite their state-of-the-art moderation systems. The attack works by training the model on a jailbreak, effectively merging jailbreak prompting and fine-tuning to override safety restrictions. Once fine-tuned, these models comply with most harmful requests: terrorism, fraud, cyberattacks, and more.

AI models are becoming increasingly capable, and our findings suggest that, as things stand, fine-tunable models can be as capable of harm as of good. Because security can be asymmetric, there is a growing risk that AI’s ability to cause harm will outpace our ability to prevent it. This risk is urgent: once future open-weight models are released, they cannot be recalled, and access to them cannot be effectively restricted. We must therefore collectively define an acceptable risk threshold and take action before we cross it.

Threat Model

We focus on threats from the misuse of models. A bad actor could disable safeguards and create the “evil twin” of a model: equally capable, but with no ethical or legal bounds. Such an evil twin model could then help with harmful tasks of any type, from localized crime to mass-scale