Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions for creating chemical, biological, radiological, or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem...
Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to...
At the end of the second and final round of the Inverse Scaling Prize, we’re awarding 7 more Third Prizes. The Prize aimed to identify important tasks on which language models (LMs) perform worse the larger they are (“inverse scaling”). Inverse scaling may reveal cases where LM training actively encourages...
This is an abridged version of the full post, with details relevant to contest participants removed. Please see the linked post if you are interested in participating. Inverse Scaling Prize: Round 1 Winners The first round of the Inverse Scaling Prize finished on August 27th. We put out a call...
Epistemic status: Mostly organizing and summarizing the views of others. Thanks to those whose views I summarized in this post, and to Tamera Lanham, Nicholas Kees Dupuis, Daniel Kokotajlo, Peter Barnett, Eli Lifland, and Logan Smith for reviewing a draft. Introduction In my current view of the alignment problem, there...
TL;DR: We’re launching the Inverse Scaling Prize: a contest with $250k in prizes for finding zero/few-shot text tasks where larger language models show increasingly undesirable behavior (“inverse scaling”). We hypothesize that inverse scaling is often a sign of an alignment failure and that more examples of alignment failures would benefit...