Brendan Murphy

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

by ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, and Kellin Pelrine

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed. An...

Feb 7, 2025•37

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

by Marcus Williams, micahcarroll, Adhyyan Narang, Constantin Weisser, and Brendan Murphy

Produced as part of MATS 6.0 and 6.1. Key takeaways:

* Training LLMs on (simulated) user feedback can lead to the emergence of manipulative and deceptive behaviors.
* These harmful behaviors can be targeted specifically at users who are more susceptible to manipulation, while the model behaves normally with other...

Nov 7, 2024•51

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

by ChengCheng, Brendan Murphy, AdamGleave, and Kellin Pelrine

Imagine your once-reliable, trusty AI assistant suddenly suggesting dangerous actions or spreading misinformation. This is a growing threat as large language models (LLMs) become more capable and pervasive. The culprit? Data poisoning, where LLMs are trained on corrupted or harmful data, potentially turning powerful tools into dangerous liabilities. Our...

Nov 1, 2024•18