Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions for creating chemical, biological, radiological, or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem...
Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to...
At the end of the second and final round of the Inverse Scaling Prize, we’re awarding 7 more Third Prizes. The Prize aimed to identify important tasks on which language models (LMs) perform worse the larger they are (“inverse scaling”). Inverse scaling may reveal cases where LM training actively encourages...
This is an abridged version of the full post, with details relevant to contest participants removed. Please see the linked post if you are interested in participating. Inverse Scaling Prize: Round 1 Winners The first round of the Inverse Scaling Prize finished on August 27th. We put out a call...
Epistemic status: Mostly organizing and summarizing the views of others. Thanks to those whose views I summarized in this post, and to Tamera Lanham, Nicholas Kees Dupuis, Daniel Kokotajlo, Peter Barnett, Eli Lifland, and Logan Smith for reviewing a draft. Introduction In my current view of the alignment problem, there...
TL;DR: We’re launching the Inverse Scaling Prize: a contest with $250k in prizes for finding zero/few-shot text tasks where larger language models show increasingly undesirable behavior (“inverse scaling”). We hypothesize that inverse scaling is often a sign of an alignment failure and that more examples of alignment failures would benefit...