Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through eval awareness, sandbagging, or opaque reasoning.

Figure 1: A synthetic transcript containing misaligned actions is presented to the model under test as if it were the model's own...
This is a small sprint done as part of the Model Transparency Team at UK AISI. It is very similar to "Can Models be Evaluation Aware Without Explicit Verbalisation?", but uses slightly different models and focuses more on the purpose of resampling. I completed most of these experiments...
Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom
UK AI Security Institute, FAR.AI, Anthropic
Links: Paper | Code | Models | Transcripts | Interactive Demo
Epistemic Status: We're sharing our paper and...
Introduction
Joseph Bloom, Alan Cooney
This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field. The format of this post was inspired by...
A short summary of the paper is presented below. This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland).
TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing...
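The summary is truncated before it names the training objective, so the following is only an illustrative sketch of the end-to-end idea, not the paper's exact method: instead of minimizing reconstruction error on activations, train the SAE against a loss on the model's final outputs (assumed here to be a KL divergence between original and SAE-patched output distributions, plus an L1 sparsity penalty). The toy model and all variable names are ours.

```python
# Minimal self-contained sketch of e2e SAE training on a toy two-layer model.
# Assumption: the e2e loss is KL(original outputs || SAE-patched outputs) + L1.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_model, d_dict, n_classes = 16, 32, 128, 10

# Toy "model": two frozen layers; the SAE is trained on the hidden activations.
layer1 = torch.nn.Linear(d_in, d_model)
layer2 = torch.nn.Linear(d_model, n_classes)
for p in list(layer1.parameters()) + list(layer2.parameters()):
    p.requires_grad_(False)

# SAE: overcomplete dictionary with a ReLU encoder.
enc = torch.nn.Linear(d_model, d_dict)
dec = torch.nn.Linear(d_dict, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.randn(64, d_in)                      # a batch of toy inputs
for step in range(100):
    acts = layer1(x)                           # hidden activations to dictionary-learn
    orig_logits = layer2(acts)                 # original model outputs
    codes = F.relu(enc(acts))                  # sparse feature activations
    recon = dec(codes)                         # SAE reconstruction
    patched_logits = layer2(recon)             # forward pass with the SAE patched in
    # e2e loss: match output *distributions*, not the activations themselves,
    # pushing the dictionary toward functionally important directions.
    kl = F.kl_div(
        F.log_softmax(patched_logits, -1),
        F.log_softmax(orig_logits, -1),
        log_target=True, reduction="batchmean",
    )
    loss = kl + 1e-3 * codes.abs().sum(-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The key design choice the sketch highlights is that gradients reach the SAE only through the model's downstream computation, so features that don't affect the outputs contribute nothing to the loss.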
[This post is now on arXiv too: https://arxiv.org/abs/2402.01790]

Some examples of graphical tensor notation from the QUIMB python package

Deep learning consists almost entirely of operations on or between tensors, so easily understanding tensor operations is pretty important for interpretability work.[1] It's often easy to get confused about...
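Since the preview cuts off here, a small illustration of the kind of tensor operation the post diagrams (this example is ours, not from the post): in graphical notation each tensor is a node and each index a leg, with shared indices drawn as connected legs, which maps directly onto `einsum` strings.

```python
# A few tensor contractions written as einsum expressions; each corresponds
# to a diagram in graphical tensor notation (shared indices = connected legs).
import numpy as np

A = np.random.randn(4, 5)
B = np.random.randn(5, 6)
v = np.random.randn(5)
T = np.random.randn(4, 5, 6)

matmul = np.einsum("ij,jk->ik", A, B)      # matrix product: the j-legs are joined
matvec = np.einsum("ij,j->i", A, v)        # matrix-vector product: one leg closed
trace = np.einsum("ii->", A @ A.T)         # closed loop: trace of a square matrix
contract3 = np.einsum("ij,ijk->k", A, T)   # contract a matrix into a 3-leg tensor

assert matmul.shape == (4, 6) and contract3.shape == (6,)
```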