Introduction
Joseph Bloom, Alan Cooney
This is a research update from the White Box Control team at UK AISI. We share preliminary results on sandbagging that may be of interest to researchers working in the field.
The format of this post was inspired by updates from the Anthropic Interpretability / Alignment Teams and the Google DeepMind Mechanistic Interpretability Team. We think it’s useful when researchers share details of their in-progress work. Some of this work will likely lead to formal publications in the coming months. Please interpret these results as you would those of a colleague sharing their lab notes.
As this is our first such progress update, we also include some paragraphs introducing the team and contextualising our work.
Why have a white