Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill; bilalchughtai; StefanHex; Marius Hobbhahn

This is a linkpost for https://arxiv.org/abs/2502.03407

Can you tell when an LLM is lying from the activations? Are simple methods good enough? We recently published a paper investigating if linear probes detect when Llama is deceptive.

Abstract:

AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while its internal reasoning is misaligned. We thus evaluate if linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets, one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer et al., 2023) and purposely underperforming on safety evaluations (Benton et al., 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1% false positive rate on chat data not related to deception, our probe catches 95-99% of the deceptive responses. Overall we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed on our website and our code on github.

We train probes using simple training data (from the RepE paper) and simple techniques (logistic regression).
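To make "logistic regression probe" concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the 8-dimensional Gaussian vectors stand in for model activations, and `train_logistic_probe`, `probe_score`, the learning rate, and the regularisation strength are hypothetical names and values, not the paper's code or hyperparameters.

```python
import math
import random

def train_logistic_probe(X, y, lr=0.5, epochs=200, l2=0.01):
    """Fit weights w and bias b by gradient descent on an L2-regularised log loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(deceptive)
            err = p - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_score(w, b, x):
    """Linear probe output (a logit); higher means 'more deceptive'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Synthetic stand-in for activations (assumption): the two classes differ
# only by a shift along the first coordinate.
rng = random.Random(0)
DIM = 8

def fake_activation(deceptive):
    shift = 1.5 if deceptive else -1.5
    return [rng.gauss(0, 1) + (shift if j == 0 else 0.0) for j in range(DIM)]

y = [i % 2 for i in range(200)]              # 1 = deceptive, 0 = honest
X = [fake_activation(label == 1) for label in y]

w, b = train_logistic_probe(X, y)
accuracy = sum((probe_score(w, b, x) > 0) == bool(t) for x, t in zip(X, y)) / len(X)
print(f"train accuracy: {accuracy:.2f}")
```

In a real setup the synthetic vectors would be replaced by activations read from the model on honest and deceptive responses, and a library implementation such as scikit-learn's `LogisticRegression` would likely be preferable to hand-written gradient descent.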

We test these probes in more complicated and realistic environments where Llama-3.3-70B responds deceptively.^[1]

The probe fires far less on Alpaca responses unrelated to deception, indicating it may partially be a probe for “deception-related” text rather than “actually-deceptive” text.
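The abstract describes setting the decision threshold to give a 1% false positive rate on control chat data and then measuring how many deceptive responses are caught. A rough sketch of that calibration step follows, assuming one scalar probe score per response. The scores are synthetic Gaussians, and `threshold_at_fpr` and `recall_at_threshold` are hypothetical helper names rather than functions from the paper's codebase.

```python
import random

def threshold_at_fpr(control_scores, fpr=0.01):
    """Pick a threshold so that at most `fpr` of control scores lie strictly above it."""
    ranked = sorted(control_scores, reverse=True)
    allowed = int(fpr * len(ranked))   # how many control responses may be flagged
    return ranked[allowed]

def recall_at_threshold(deceptive_scores, threshold):
    """Fraction of deceptive responses whose score exceeds the threshold."""
    return sum(s > threshold for s in deceptive_scores) / len(deceptive_scores)

# Synthetic scores (assumption): control responses score low, deceptive ones high.
rng = random.Random(0)
control = [rng.gauss(-3.0, 1.0) for _ in range(10_000)]
deceptive = [rng.gauss(2.0, 1.5) for _ in range(500)]

thr = threshold_at_fpr(control, fpr=0.01)
control_fpr = sum(s > thr for s in control) / len(control)
recall = recall_at_threshold(deceptive, thr)
print(f"threshold={thr:.2f}  control FPR={control_fpr:.3f}  recall={recall:.3f}")
```

Calibrating on control data rather than on the evaluation sets means the threshold is fixed before seeing any deceptive examples, which is closer to how a deployed monitor would have to operate.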

We compare a variety of probing methods. Training on roleplaying scenarios and probing the activations after a follow-up question works well. Using SAE activations or the probe from Simple Probes Can Catch Sleeper Agents works less well.

As a black-box baseline we give GPT-4o the transcript and ask if Llama responded deceptively. Our probe performs comparably to this baseline, although we iterated more on the probe than the baseline.

While the probes are decent, they are imperfect. Some common failures:

The probe firing on honest confessions of prior deception (example 1).
The mean score being incorrect even if the per-token probe behavior is reasonable (example 2).
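The second failure mode can be illustrated with a toy example. The per-token scores, the threshold of 0, and the top-k aggregation below are all hypothetical. Top-k is shown only as one possible alternative to the mean, not as something the paper evaluates or recommends.

```python
def mean_score(token_scores):
    """Average per-token probe score over the whole response."""
    return sum(token_scores) / len(token_scores)

def top_k_mean(token_scores, k=3):
    """Average of the k highest per-token scores (hypothetical alternative)."""
    top = sorted(token_scores, reverse=True)[:k]
    return sum(top) / len(top)

# Hypothetical response: a long, honest-looking preamble followed by a short
# deceptive claim on which the probe fires strongly.
token_scores = [-2.0] * 20 + [4.0, 5.0, 4.5]
THRESHOLD = 0.0  # hypothetical decision threshold

mean_flag = mean_score(token_scores) > THRESHOLD   # diluted by the preamble
topk_flag = top_k_mean(token_scores) > THRESHOLD   # driven by the few high tokens
print(mean_flag, topk_flag)
```

Here the per-token behavior looks sensible, yet the mean falls below the threshold because the few high-scoring tokens are outnumbered. An aggregation that focuses on the highest-scoring tokens would flag this response, though it would presumably also raise the false positive rate on honest text containing a few deception-related tokens.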

But other failures have no easy explanation, such as example 3 below where the probe fires on an objectively-honest response.

We are excited about future work improving deception probes. We are especially curious to see if probes work in more difficult settings where it’s not obvious from the transcript that the model is being deceptive, for example environments where models fake alignment or sandbag even without chain-of-thought.

Find the full paper here, our code here, and our website displaying probe-outputs here.

^[1] Unlike coup probes, we test with on-policy responses sampled from Llama instead of human-written responses.

Nice work. I've been hoping Apollo would leverage their in-house scheming environments to test white-box detection methods. Definitely seems like low-hanging fruit I'm happy to see explored.

Your finding on probing follow-up questions looked promising.

[probing follow up questions resulted in the] best probe of all we trained at distinguishing deceptive and honest responses, with near perfect separation on all datasets.

It seems like extending this technique would be useful. At least on its face, you would get the cleanest and most consistent activations by training probes that focus solely on the response to "was your last response deceptive?" I also suspect training a specific dataset for "was your last response indicative of your maximum performance on the task?" would significantly boost the accuracy in the sandbagging case for this type of probe. I see this becoming more important as we move to more realistic sandbagging settings without explicit instructions to perform poorly since "deceiving someone" and "not trying very hard to explore this area of research" seem like non-trivially different concepts.

Separately, did you run any experiments including the probe on the CoT tokens? I would suspect there to be a pretty reliable signal here, and ideally this could be used to help confirm honest CoT.

I also suspect training a specific dataset for "was your last response indicative of your maximum performance on the task?" would significantly boost the accuracy in the sandbagging case for this type of probe. I see this becoming more important as we move to more realistic sandbagging settings without explicit instructions to perform poorly since "deceiving someone" and "not trying very hard to explore this area of research" seem like non-trivially different concepts.

Good point!

Yes I'm excited about "probing the answer to followup questions" as a general technique. Our results were promising but there's a lot of iteration one could do to make the technique work better!

Separately, did you run any experiments including the probe on the CoT tokens? I would suspect there to be a pretty reliable signal here, and ideally this could be used to help confirm honest CoT.

On our website you can look at the probe scores on the CoT. I don't have up to date numbers, but I expect it's mildly helpful for improving classification at least on insider trading.

Certainly probing CoT is a good tool to have. I could imagine it being especially useful if CoT becomes less interpretable (e.g. probing neuralese).

I find the pessimistic interpretation of the results a bit odd given considerations like those in https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed.

I don't see how the linked post would suggest a more optimistic interpretation of the results? The baseline is just asking GPT-4o to look at the chain of thought and guess whether the other model is lying. If linear probes and SAEs struggle to beat that baseline, then they're really not delivering on the promise of interpretability for evals many hoped for: Making manifest information about the model's cognition that isn't visible in the chain of thought.

EDIT: It is technically possible for the linear probes to pick up on hidden information and still do no better than blackbox due to other shortcomings, I guess.

This won't find deception in mesaoptimizers, right?

It might.

My understanding (which could be off base) from reading the paper is the method's accuracy in detecting various forms of deception was basically 96-99%. But they acknowledge that the sophisticated deception they're ultimately worried about will be harder to detect.

Still 96-99% seems like a great start. And this was on detecting strategic deception, not just factual falsehoods. And they didn't even utilize the CoT outputs of the models.

(I think the "strategic deception" framing is also probably more general and not as dependent on unnecessary assumptions about how models work, compared to the "mesaoptimizer" framing.)

This section of our RFP has some other related work you might want to include, e.g. Orgad et al.

This comment is probably not very useful, but my first thought was: "we invented a polygraph for AI!".
