Charlie Griffin

Bounding eval awareness of ~human-level AI across the safe-to-dangerous shift

In our last post, we argued that measuring evaluation awareness is fundamentally challenging because of the safe-to-dangerous distributional shift: we cannot directly measure the evaluation awareness of a model without deploying it, but we cannot safely deploy it until we know it is not scheming. We expect sufficiently superhuman AI...

Jul 6

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

1) The safe-to-dangerous shift is a fundamental problem for eval realism. Suppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However,...

May 14

Charlie Griffin's Shortform

Apr 28

Can models gradient hack SFT elicitation?

by Patrick Leask and Charlie Griffin

TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation. Suppose you want to conservatively measure the dangerous capabilities of your LLM and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM...

Mar 11

[Paper] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

by Nelson Gardner-Challis, Morgan S, J Bostock, BarAltiva, joanv, and Charlie Griffin

This research was completed for LASR Labs 2025 by Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov and Morgan Sinclaire. The team was supervised by Joan Velja and Charlie Griffin (University of Oxford, UK AI Security Institute). The full paper can be found here. Tl;dr: We did a deep dive into untrusted...

Mar 10

Practical challenges of control monitoring in frontier AI deployments

by David Lindner and Charlie Griffin

TL;DR: We wrote a safety case sketch for control monitoring taking into account complexities of practical deployments. This work was a collaboration between Google DeepMind and the UK AI Security Institute. Full author list: David Lindner*, Charlie Griffin*, Tomek Korbak, Roland S. Zimmermann, Geoffrey Irving, Sebastian Farquhar, Alan Cooney. Read...

Jan 12

Misalignment classifiers: Why they’re hard to evaluate adversarially, and why we're studying them anyway

Even if the misalignment risk from current AI agents is small, it may be useful to start internally deploying misalignment classifiers: language models designed to classify transcripts that represent intentionally misaligned behaviour. Deploying misalignment classifiers now may provide qualitative and quantitative evidence about how current agents misbehave in real deployments....

Aug 15, 2025

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?
