tldr: We share a toy environment that we found useful for understanding how reasoning changes over the course of capabilities-focused RL: in this environment, the model comes to favor reward hints over direct instruction as training progresses. Setup When we noticed the increase in verbalized alignment...
Following up on our previous work on verbalized eval awareness: we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run. 1. Metagaming is a more general, and in our experience more useful, concept than evaluation awareness. 2. It arises in frontier training runs...
Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...
This post provides background, motivation, and a nontechnical summary of the purely mathematical paper https://arxiv.org/abs/2310.06686. Coauthors (alphabetical): Chris MacLeod, Jenny Nitishinskaya, Buck Shlegeris. Work done mostly while at Redwood Research. Thanks to Joe Benton and Ryan Greenblatt for earlier math this builds on. Thanks to Neel Nanda, Fabien Roger, Nix Goldowsky-Dill, and...
Inspired by Neel's longlist; thanks to @Nicholas Goldowsky-Dill and @Sam Marks for feedback and discussion, and thanks to AWAIR attendees for participating in the associated activity. As part of the Alignment Workshop for AI Researchers in July/August '23, I ran a session on theories of impact for model internals. Many...
* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to induction heads. The results are also summarized here. Introduction In this post, we’ll apply the causal scrubbing methodology to investigate how induction heads work in a particular 2-layer attention-only language model.[1] While we...