x

LESSWRONG

LW

ChrisCundy — LessWrong

ChrisCundy

ChrisCundy

Message

70

4

3y

ChrisCundy

70

3y

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

by ChengCheng, ChrisCundy, smallsilo, and AdamGleave

Large language models (LLMs) are often fine-tuned after training using methods like reinforcement learning from human feedback (RLHF). In this process, models are rewarded for generating responses that people rate highly. But what people like isn’t always what’s true. Studies have found that models learn to give answers that humans...

Jun 5, 2025•22

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

by ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, and Kellin Pelrine

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed. An...

Feb 7, 2025•37