I have a lot of habit streaks. Some of the streaks I have going at the moment:
* Studied Anki cards for Chinese every day for 8 months
* Meditated every day for the past 1.5 years
* Flossed every day for 6+ months

In fact I think quite a...
Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions to generate chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem....
Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to...
At the end of the second and final round of the Inverse Scaling Prize, we’re awarding 7 more Third Prizes. The Prize aimed to identify important tasks on which language models (LMs) perform worse the larger they are (“inverse scaling”). Inverse scaling may reveal cases where LM training actively encourages...
This is an abridged version of the full post, with details relevant to contest participants removed. Please see the linked post if you are interested in participating. Inverse Scaling Prize: Round 1 Winners. The first round of the Inverse Scaling Prize finished on August 27th. We put out a call...
Epistemic status: Mostly organizing and summarizing the views of others. Thanks to those whose views I summarized in this post, and to Tamera Lanham, Nicholas Kees Dupuis, Daniel Kokotajlo, Peter Barnett, Eli Lifland, and Logan Smith for reviewing a draft. Introduction In my current view of the alignment problem, there...