This is an announcement and call for applications to the Workshop on Post-AGI Economics, Culture, and Governance taking place in San Diego on Wednesday, December 3, overlapping with the first day of NeurIPS 2025. This workshop aims to bring together a diverse range of expertise to deepen our collective understanding...
Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions for creating chemical, biological, radiological, or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem....
Crossposted from https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/ Stephen Casper “They are wrong who think that politics is like an ocean voyage or a military campaign, something to be done with some particular end in view, something which leaves off as soon as that end is reached. It is not a public chore, to be...
Part 15 of 12 in the Engineer’s Interpretability Sequence Reflecting on past predictions for new work On October 11, 2024, I posted some thoughts on mechanistic interpretability and presented eight predictions for what I thought the next big paper on sparse autoencoders would and would not do. Then, on March...
Part 14 of 12 in the Engineer’s Interpretability Sequence. Is this market really only at 63%? I think you should take the over. Five tiers of rigor for safety-oriented interpretability work Lately, I have been thinking of interpretability research as falling...
Thanks to Zora Che, Michael Chen, Andi Peng, Lev McKinney, Bilal Chughtai, Shashwat Goel, Domenic Rosati, and Rohit Gandikota. TL;DR In contrast to evaluating AI systems under normal "input-space" attacks, using "generalized" attacks, which allow an attacker to manipulate weights or activations, might help us better evaluate...
Part 13 of 12 in the Engineer’s Interpretability Sequence. TL;DR On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but...