Ian McKenzie

Defending Habit Streaks

I have a lot of habit streaks. Some of the streaks I have going at the moment:
* Studied Anki cards for Chinese every day for 8 months
* Meditated every day for the past 1.5 years
* Flossed every day for 6+ months
In fact I think quite a...

Apr 68

Ian McKenzie's Shortform

Jan 114

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

by smallsilo, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk, and Adam Gleave

Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions to generate chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem....

Jul 4, 2025 · 13

Does robustness improve with scale?

by ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, and AdamGleave

Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to...

Jul 25, 2024 · 14

Inverse Scaling Prize: Second Round Winners

At the end of the second and final round of the Inverse Scaling Prize, we’re awarding 7 more Third Prizes. The Prize aimed to identify important tasks on which language models (LMs) perform worse the larger they are (“inverse scaling”). Inverse scaling may reveal cases where LM training actively encourages...

Jan 24, 2023 · 58

Inverse Scaling Prize: Round 1 Winners

by Ethan Perez and Ian McKenzie

This is an abridged version of the full post, with details relevant to contest participants removed. Please see the linked post if you are interested in participating.

The first round of the Inverse Scaling Prize finished on August 27th. We put out a call...

Sep 26, 2022 · 93

Beliefs and Disagreements about Automating Alignment Research

Epistemic status: Mostly organizing and summarizing the views of others.

Thanks to those whose views I summarized in this post, and to Tamera Lanham, Nicholas Kees Dupuis, Daniel Kokotajlo, Peter Barnett, Eli Lifland, and Logan Smith for reviewing a draft.

Introduction

In my current view of the alignment problem, there...

Aug 24, 2022 · 107

Announcing the Inverse Scaling Prize ($250k Prize Pool)

Beliefs and Disagreements about Automating Alignment Research

Inverse Scaling Prize: Round 1 Winners

Inverse Scaling Prize: Second Round Winners
