Announcing the Inverse Scaling Prize ($250k Prize Pool)
TL;DR: We’re launching the Inverse Scaling Prize: a contest with $250k in prizes for finding zero/few-shot text tasks where larger language models show increasingly undesirable behavior (“inverse scaling”). We hypothesize that inverse scaling is often a sign of an alignment failure and that more examples of alignment failures would benefit empirical alignment research. We believe that this contest is an unusually concrete, tractable, and safety-relevant problem for engaging alignment newcomers and the broader ML community. This post will focus on the relevance of the contest and the inverse scaling framework to longer-term AGI alignment concerns. See our GitHub repo for contest details, prizes we’ll award, and task evaluation criteria. What is Inverse Scaling? Recent work has found that Language Models (LMs) predictably improve as we scale LMs in various ways (“scaling laws”). For example, the test loss on the LM objective (next word prediction) decreases as a power law with compute, dataset size, and model size: Scaling laws appear in a variety of domains, ranging from transfer learning to generative modeling (on images, video, multimodal, and math) and reinforcement learning. We hypothesize that alignment failures often show up as scaling laws but in the opposite direction: behavior gets predictably worse as models scale, what we call “inverse scaling.” We may expect inverse scaling, e.g., if the training objective or data are flawed in some way. In this case, the training procedure would actively train the model to behave in flawed ways, in a way that grows worse as we scale. The literature contains a few potential examples of inverse scaling. For example, increasing LM size appears to increase social biases on BBQ and falsehoods on TruthfulQA, at least under certain conditions. As a result, we believe that the prize may help to uncover new alignment-relevant tasks and insights by systematically exploring the space of tasks where LMs exhibit inverse scaling.
Under this view you can totally have intermediary metrics, they just look more like “how much does your society avoid tragedies of the commons” rather than “what is the median quality of life”.
To be clear, this post was not intended as a subtle endorsement of communism. I agree with MondSemmel’s point that basically any system which produced slower economic growth would probably do better under this view, if only because AI development is slower.