Thomas Read — LessWrong

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate

by Jordan Taylor, Max H, Ed Fage, Thomas Read, and Joseph Bloom

Produced by UK AISI Model Transparency and Situational Awareness teams. If you’re a Research Scientist or Research Engineer, we’re hiring – apply here and come and work with us! TL;DR: We wrote a report on risks to AI oversight (auditing, monitoring, incident investigation), informed by interviewing many researchers (Figure...

May 2183

Reproducing steering against evaluation awareness in a large open-weight model

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning. TL;DR: We replicate Anthropic’s approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail...

Apr 10 · 90

We found an open weight model that games alignment honeypots

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning. TL;DR: GLM-5 (released a month ago in February 2026) shows signs of evaluation-gaming behaviour on alignment honeypots, while we have not found...

Mar 1680

[Research sprint] Single-model crosscoder feature ablation and steering

This work was done as a research sprint while interviewing for UK AISI — it’s just a preliminary look into the topic, but I hope it will be useful for anyone else interested in the same ideas. Thanks to Joseph Bloom for helpful comments and suggestions. The sleeper agent and...

Apr 6, 2025 · 11

[Replication] Crosscoder-based Stage-Wise Model Diffing

by Anna Soligo, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree, and Jason Gross

Introduction: Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs we worked with single-model all-layer crosscoders, and found...

Mar 22, 2025 · 25