18 Applications of Deception Probes
Introduction

I’m excited by deception probes. When I mention this, I’m sometimes asked “Do deception probes work?” But there are many applications of deception probes, and each application requires probes with different properties; that is, whether a deception probe works depends on what you’re using it for. Furthermore, whether one deception probe works better than another also depends on what you’re using them for. This remark sounds a bit trivial, but I didn’t appreciate it initially. In this document, I’ll enumerate 18 different applications of deception probes and the properties each application requires. These are just my very rough guesses, intended to kickstart some public conversation.

Applications of deception probes

1. Monitor current models

Problem Statement: AIs at the current capability level may be important for future safety work, e.g. trusted monitoring. However, current LLMs engage in strategic deception, despite being post-trained to be honest, and without any fine-tuning or instructions to be deceptive![1] But we can’t deploy current AIs for trusted monitoring if they will occasionally lie about whether they think the observed activity is suspicious.

How deception probes help: Train deception probes on the weak model (e.g. Claude 4) and use the weak model to monitor the activity of the strong model (e.g. Claude 6). If the weak model claims the activity is not suspicious, apply the deception probe to that claim. If the probe score exceeds a threshold, classify the activity as suspicious.

Desiderata:

1. The probes must be accurate on current models. We don’t require that the techniques generalise to larger models or novel architectures.

2. The probes must catch deception on the narrow application the current model is used for, e.g. labelling activity as suspicious. We don’t require that the techniques generalise to deception in other domains.

3. The probes should have high recall (90%+) at low false posi
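To make the monitoring setup above concrete, here is a minimal sketch of the probe-thresholding step. Everything in it is illustrative: the probe is assumed to be a simple linear (logistic) probe over a hidden-state vector taken from the weak monitor while it emits its "not suspicious" claim, and the weights, activations, and threshold are made-up toy values, not a real trained probe.

```python
import math

def probe_score(activations, weights, bias):
    """Logistic score of a linear deception probe on one activation vector."""
    z = sum(a * w for a, w in zip(activations, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def classify_activity(claim_activations, weights, bias, threshold=0.5):
    """If the probe fires on the monitor's 'not suspicious' claim,
    overrule the monitor and flag the activity as suspicious."""
    score = probe_score(claim_activations, weights, bias)
    return "suspicious" if score >= threshold else "not suspicious"

# Toy 4-dimensional probe weights and activations (purely illustrative).
weights = [0.8, -0.5, 1.2, 0.3]
bias = -0.1
honest_acts = [-1.0, 0.5, -0.8, 0.0]     # low probe score
deceptive_acts = [1.5, -0.5, 1.0, 0.4]   # high probe score

print(classify_activity(honest_acts, weights, bias))     # -> not suspicious
print(classify_activity(deceptive_acts, weights, bias))  # -> suspicious
```

The key design choice is the threshold: it trades recall against false positives, which is exactly what desideratum 3 is about.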