[This is a cross-post from here. Find the code used to do the analysis here.] Epistemic Status: Accurate measurement of a variable with a dubious connection to the latent variable of interest. What share of AI companies' research portfolio should be dedicated to AI safety? This is one of the most...
This work was supported through the MARS (Mentorship for Alignment Research Students) program at the Cambridge AI Safety Hub (caish.org/mars). We would like to thank Redwood Research for their support and Andrés Cotton for his management of the project. TLDR: The field of AI control takes a worst-case scenario approach,...
[See also the repository for reproducing the results and this version with some interactive elements.] OpenAI recently released their open-weights model. Here we'll discuss how that inevitably leaks some information about their model training stack, and, on the way, show that GPT-5 was trained on phrases from adult websites. What...
The dataset and model suite TinyStories is widely used as a toy setup for mech interp research. Searching for the term on this forum currently yields 23 results. "A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team" lists making an improved version of it as a...
A Caesar cipher is a reasonable transformation for a transformer to learn in its weights, provided that a specific cipher offset occurs often enough in its training data. There will be some hidden representation of the input tokens' spelling, and this representation could be used to shift letters onto other...
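For readers who want the transformation itself spelled out, here is a minimal sketch of a Caesar cipher in plain Python. It only illustrates the letter shift that the excerpt above refers to; the function name `caesar_shift` and the choice to leave non-letters untouched are assumptions made for this example, not something taken from the post, and it says nothing about how a transformer would implement the shift internally.

```python
import string


def caesar_shift(text: str, offset: int) -> str:
    """Shift each ASCII letter by `offset` positions, wrapping around the alphabet.

    Hypothetical helper for illustration only. Non-letter characters
    (digits, spaces, punctuation) are passed through unchanged.
    """
    shifted = []
    for ch in text:
        if ch in string.ascii_lowercase:
            base = ord("a")
        elif ch in string.ascii_uppercase:
            base = ord("A")
        else:
            # Not an ASCII letter: keep it as-is.
            shifted.append(ch)
            continue
        # Map the letter to 0..25, add the offset modulo 26, map back.
        shifted.append(chr((ord(ch) - base + offset) % 26 + base))
    return "".join(shifted)


if __name__ == "__main__":
    encoded = caesar_shift("Hello, World!", 3)
    print(encoded)
    # Decoding is the same shift with the negated offset.
    print(caesar_shift(encoded, -3))
```

Because the shift is taken modulo 26, decoding is just the same function with the offset negated, and an offset of 13 is its own inverse.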
Sure, transformers can get silver at the IMO. But before we address those two remaining problems that still stand between AlphaProof and gold, let's go a few steps back and see how transformers are doing at supposedly low-level math competitions. In particular, we will consider a problem aimed at grades...