Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.

TL;DR: GLM-5 (released a month ago in February 2026) shows signs of evaluation-gaming behaviour on alignment honeypots, while we have not found...
This work was done as a research sprint while interviewing for UK AISI. It’s just a preliminary look into the topic, but I hope it will be useful for anyone else interested in the same ideas. Thanks to Joseph Bloom for helpful comments and suggestions. The sleeper agent and...
Introduction

Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs we worked with single-model all-layer crosscoders, and found...