This is a summary of a paper that we and our collaborators at the University of Chicago recently posted to arXiv. tl;dr: We seed models with some property (e.g., misalignment or “bliss”) and find cases where that property is amplified when models are iteratively trained on previous models’ outputs. However, this phenomenon is...
This summer, UChicago XLab is running two programs: 1. Summer Research Fellowship: Fellows pursue novel research directions in AI safety and nuclear security. 2. Second Look Fellowship: Fellows complete replications of load-bearing AI safety papers. The deadline for the Second Look Fellowship has passed, but we will continue to review...
If we get AI safety research wrong, we may not get a second chance. But despite the stakes being so high, there has been no effort to systematically review and verify empirical AI safety papers. I would like to change that. Today I sent in funding applications to found a...
I spent a few hundred dollars on Anthropic API credits and let Claude individually research every current US congressperson's position on AI. This is a summary of my findings. Disclaimer: Summarizing people's beliefs is hard, inherently subjective, and noisy. Likewise, US politicians change their opinions constantly, so...
This work was supported by UChicago XLab. Today, we are announcing our first major release of the XLab AI Security Guide: a set of online resources and coding exercises covering canonical papers on jailbreaks, fine-tuning attacks, and proposed methods to defend AI systems from misuse. Each page on the course...
The barriers between us and what we want are often entirely imagined. It is true: you can learn how to paint, change careers, write a paper or run a marathon. These things are hard, but we shouldn’t pretend that they are impossible. You can just do them. But this mindset...