Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through eval awareness, sandbagging, or opaque reasoning.

Figure 1: A synthetic transcript containing misaligned actions is presented to the model under test as if it were the model's own...
This is a small sprint done as part of the Model Transparency Team at UK AISI. It is very similar to "Can Models be Evaluation Aware Without Explicit Verbalisation?", but uses slightly different models and focuses more on the purpose of resampling. I completed most of these experiments...
Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom
UK AI Security Institute, FAR.AI, Anthropic
Links: Paper | Code | Models | Transcripts | Interactive Demo
Epistemic Status: We're sharing our paper and...
Introduction
Joseph Bloom, Alan Cooney
This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field. The format of this post was inspired by...
A short summary of the paper is presented below. This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland).
TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing...
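The summary is truncated before it names the training objective, so the following is only an illustrative sketch of the end-to-end idea, not the paper's exact method: instead of minimizing reconstruction error on activations, train the SAE against a loss on the model's final outputs (assumed here to be a KL divergence between original and SAE-patched output distributions, plus an L1 sparsity penalty). The toy model and all variable names are ours.

```python
# Minimal self-contained sketch of e2e SAE training on a toy two-layer model.
# Assumption: the e2e loss is KL(original outputs || SAE-patched outputs) + L1.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_model, d_dict, n_classes = 16, 32, 128, 10

# Toy "model": two frozen layers; the SAE is trained on the hidden activations.
layer1 = torch.nn.Linear(d_in, d_model)
layer2 = torch.nn.Linear(d_model, n_classes)
for p in list(layer1.parameters()) + list(layer2.parameters()):
    p.requires_grad_(False)

# SAE: overcomplete dictionary with a ReLU encoder.
enc = torch.nn.Linear(d_model, d_dict)
dec = torch.nn.Linear(d_dict, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.randn(64, d_in)                      # a batch of toy inputs
for step in range(100):
    acts = layer1(x)                           # hidden activations to dictionary-learn
    orig_logits = layer2(acts)                 # original model outputs
    codes = F.relu(enc(acts))                  # sparse feature activations
    recon = dec(codes)                         # SAE reconstruction
    patched_logits = layer2(recon)             # forward pass with the SAE patched in
    # e2e loss: match output *distributions*, not the activations themselves,
    # pushing the dictionary toward functionally important directions.
    kl = F.kl_div(
        F.log_softmax(patched_logits, -1),
        F.log_softmax(orig_logits, -1),
        log_target=True, reduction="batchmean",
    )
    loss = kl + 1e-3 * codes.abs().sum(-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The key design choice the sketch highlights is that gradients reach the SAE only through the model's downstream computation, so features that don't affect the outputs contribute nothing to the loss.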
[This post is now on arXiv too: https://arxiv.org/abs/2402.01790]

Some examples of graphical tensor notation from the QUIMB python package

Deep learning consists almost entirely of operations on or between tensors, so easily understanding tensor operations is pretty important for interpretability work.[1] It's often easy to get confused about...
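Since the preview cuts off here, a small illustration of the kind of tensor operation the post diagrams (this example is ours, not from the post): in graphical notation each tensor is a node and each index a leg, with shared indices drawn as connected legs, which maps directly onto `einsum` strings.

```python
# A few tensor contractions written as einsum expressions; each corresponds
# to a diagram in graphical tensor notation (shared indices = connected legs).
import numpy as np

A = np.random.randn(4, 5)
B = np.random.randn(5, 6)
v = np.random.randn(5)
T = np.random.randn(4, 5, 6)

matmul = np.einsum("ij,jk->ik", A, B)      # matrix product: the j-legs are joined
matvec = np.einsum("ij,j->i", A, v)        # matrix-vector product: one leg closed
trace = np.einsum("ii->", A @ A.T)         # closed loop: trace of a square matrix
contract3 = np.einsum("ij,ijk->k", A, T)   # contract a matrix into a 3-leg tensor

assert matmul.shape == (4, 6) and contract3.shape == (6,)
```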