Thanks to Rohan Subramani, Ariana Azarbal, and Shubhorup Biswas for proposing some of the ideas and helping develop them during a sprint. Thanks to Rohan and Ariana for comments on a draft. Thanks also to Kei Nishimura-Gasparian, Paul Colognese, and Francis Rhys Ward for conversations that inspired some of the...
TL;DR: We evaluate LLM monitors in three AI control environments: SHADE-Arena, MLE-Sabotage, and BigCodeBench-Sabotage. We find that monitors with access to less information often outperform monitors with access to the full sequence of reasoning blocks and tool calls, a phenomenon we call the less-is-more effect for automated monitors. We follow...
TL;DR: Aether is hiring 1-2 researchers to join our team. We are a new and flexible organization that is likely to focus on either chain-of-thought monitorability, safe and interpretable continual learning, or shaping the generalization of LLM personas in the upcoming year. New hires will have a chance to substantially...
Over the past year, I have talked to several people about whether they expect frontier AI companies to transition away from the current paradigm of transformer LLMs toward models that reason in neuralese within the next few years. This post summarizes 13 common arguments I’ve heard, six in favor and...
Summary: When discussing with other AI safety researchers the possibility that LLMs will cease to reason in transparent natural language, we have sometimes noticed that we talk past each other: e.g., when discussing ‘neuralese reasoning’, some people have indefinitely long chains of recurrent activations in mind, while others think of...
Introduction: In May, we started doing full-time work for Aether, our independent LLM agent safety research group. We’re excited to share an overview of our two kickoff weeks! We found them useful, and we hope that other groups starting research projects will find this post helpful for figuring out a...
Update: Our code for the original version of this post contained some bugs. The high-level takeaways remain the same after making the corrections, but there are important differences in the details. We give an overview of those changes in Appendix E. We apologize for having presented inaccurate results at first!...