Joseph Bloom

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate

by Jordan Taylor, Max H, Ed Fage, Thomas Read, and Joseph Bloom

Produced by UK AISI Model Transparency and Situational Awareness teams. If you’re a Research Scientist or Research Engineer, we’re hiring – apply here and come and work with us! TL;DR: We wrote a report on risks to AI oversight (auditing, monitoring, incident investigation), informed by interviewing many researchers (Figure...

May 2182

Verbalized Eval Awareness Inflates Measured Safety

by Santiago Aranguri and Joseph Bloom

We provide the most comprehensive evidence to date that verbalized eval awareness is present across models and benchmarks, finding that it correlates with safer behavior across models and causally inflates safe behavior in Kimi K2.5 on the Fortress benchmark. We further identify recurring prompt cues that trigger verbalized eval awareness...

May 4 · 40

Joseph Bloom's Shortform

May 16

Reproducing steering against evaluation awareness in a large open-weight model

by Thomas Read, Bronson Schoen, Santiago Aranguri, and Joseph Bloom

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning. TL;DR: We replicate Anthropic’s approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail...

Apr 10 · 89

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

by 7vik, Sid Black, and Joseph Bloom

Authors: Satvik Golechha*, Sid Black*, Joseph Bloom (* equal contribution). This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI). Our code is available on GitHub, and the model checkpoints and data are available on HuggingFace. Executive Summary: In Natural Emergent Misalignment...

Mar 30 · 127

We found an open weight model that games alignment honeypots

by Thomas Read and Joseph Bloom

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning. TL;DR: GLM-5 (released a month ago in February 2026) shows signs of evaluation-gaming behaviour on alignment honeypots, while we have not found...

Mar 16 · 72

Auditing Games for Sandbagging [paper]

by Jordan Taylor and Joseph Bloom

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom. UK AI Security Institute, FAR.AI, Anthropic. Links: Paper | Code | Models | Transcripts | Interactive Demo. Epistemic Status: We're sharing our paper and...

Dec 9, 2025 · 103

Joseph Bloom

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

A Selection of Randomly Selected SAE Features

Auditing Games for Sandbagging [paper]

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
