White Box Control at UK AISI - Update on Sandbagging Investigations
Introduction

Joseph Bloom, Alan Cooney

This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field. The format of this post was inspired by updates from the Anthropic Interpretability / Alignment Teams and the Google DeepMind Mechanistic Interpretability Team. We think it's useful when researchers share details of their in-progress work, and some of this work will likely lead to formal publications in the coming months. Please interpret these results as you might a colleague sharing their lab notes. As this is our first such progress update, we also include some paragraphs introducing the team and contextualising our work.

Why have a white box team at UK AISI working on sandbagging?

AI developers try to estimate what AIs are capable of in order to know what mitigations to implement. If they underestimate model capabilities, then sufficient mitigations may not be implemented and risks may not be addressed. "Sandbagging" (deliberate underperformance) is one activity that misaligned, scheming models might engage in: if an AI system can reliably and consistently sandbag, then insufficient mitigations may be deployed.

The White Box Control team is working on strategies for detecting and mitigating sandbagging. We may be able to develop methods that leverage the fact that internal representations can provide evidence of sandbagging. Assuming there are no dramatic changes to the deep neural network paradigm that has dominated the last 10+ years of frontier AI, we think it's very plausible these methods will continue to work in the future.

Our team is excited to develop a broad range of techniques for control and alignment. Misalignment may give rise to sandbagging, but the techniques which address sandbagging, and the work our team is doing, may be more generally useful. White box techniques may be especially useful.
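To make the idea of internal representations providing evidence of sandbagging concrete, here is a minimal, hypothetical sketch of one common white-box approach: fitting a difference-of-means linear probe on a model's hidden activations to separate "honest" from "sandbagging" runs. The activations below are synthetic stand-ins (real work would extract them from a model's residual stream), and the separation direction, threshold, and dimensionality are all illustrative assumptions, not the team's actual method.

```python
import numpy as np

# Hypothetical setup: 16-dimensional "activations", with sandbagging runs
# shifted along an (unknown to the probe) unit direction. In practice these
# would be hidden states captured from a model on matched task prompts.
rng = np.random.default_rng(0)
d, n = 16, 200
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

honest = rng.normal(size=(n, d))
sandbag = rng.normal(size=(n, d)) + 3.0 * true_direction  # assumed shift

# Difference-of-means probe: the direction between class means.
probe = sandbag.mean(axis=0) - honest.mean(axis=0)
probe /= np.linalg.norm(probe)

# Classify by projecting onto the probe and thresholding at the midpoint
# between the two classes' mean projections.
proj_honest = honest @ probe
proj_sandbag = sandbag @ probe
threshold = (proj_honest.mean() + proj_sandbag.mean()) / 2

accuracy = (np.sum(proj_honest < threshold) + np.sum(proj_sandbag >= threshold)) / (2 * n)
print(f"probe accuracy: {accuracy:.2f}")
```

With a clean synthetic shift like this the probe separates the classes well; the open research question is whether comparably simple probes remain reliable on real models that are deliberately underperforming.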