smallsilo — LessWrong

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions for producing chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem....

Jul 4, 2025 · 13

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

by ChengCheng, ChrisCundy, smallsilo, and AdamGleave

Large language models (LLMs) are often fine-tuned after training using methods like reinforcement learning from human feedback (RLHF). In this process, models are rewarded for generating responses that people rate highly. But what people like isn’t always what’s true. Studies have found that models learn to give answers that humans...

Jun 5, 2025 · 22

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

by ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, and Kellin Pelrine

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed. An...

Feb 7, 2025 · 37

Join AISafety.info's Distillation Hackathon (Oct 6-9th)

tl;dr: Contribute to aisafety.info by answering questions about AI Safety from October 6th to October 9th. Participation in hackathons is the basis for applying to future fellowships, and there are prizes to be won by the top entrants. Register here and see the participant guide here. What is the schedule...

Oct 1, 2023 · 21

GPT-powered EA/LW weekly summary

Originally posted on the EA Forum. Zoe Williams used to manually do weekly summaries of the EA Forum and LessWrong, but now she doesn't. Hamish strung together a bunch of Google Apps Scripts, Google Sheets expressions, GraphQL queries, and D3.js to automatically extract...

Aug 25, 2023 · 19

Join AISafety.info's Writing & Editing Hackathon (Aug 25-28) (Prizes to be won!)

tl;dr: Contribute to aisafety.info by writing and editing articles from August 25 to August 28 to win prizes! Register here and see the participant guide here. What is the format of the event? The event will run from Friday, August 25th, 7am UTC to Monday, August 28th, 2023, 7am...

Aug 5, 2023 · 19

All AGI Safety questions welcome (especially basic ones) [July 2023]

tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb! Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it's...

Jul 20, 2023 · 38