Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

ChrisCundy; smallsilo; AdamGleave

22 Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

by ChengCheng, ChrisCundy, smallsilo, AdamGleave

5th Jun 2025

Linkpost from far.ai

6 min read

2

22

AI

Frontpage

22

New Comment

2 comments, sorted by

top scoring

Click to highlight new comments since: Today at 12:39 AM

[-]cubefox5mo20

This seems interesting! One question: Why the focus on the true positive rate (TPR, sensitivity)? This is the probability of the lie detector being positive (indicating a case of deception), given that deception is actually taking place. Which is simply P(positive | deception). Unfortunately this metric doesn't accurately measure how powerful a lie detector is. For example, the TPR would be maximal for a lie detector that always returns "positive", irrespective of whether deception is or isn't taking place. It's even possible that both P(positive | deception) (sensitivity) and P(negative | no deception) (specificity) are high while P(deception | negative) is also high.

To adequately describe the accuracy of a lie detector, we would arguably need to measure the binary Pearson correlation (phi, MCC), which measures the correlation between a positive/negative lie detector result and the actual presence/absence of deception. A perfect correlation of +1 would indicate that the lie detector returns "positive" if and only if deception is taking place (and therefore "negative" if and only if no deception is taking place). A perfect anti-correlation of -1 would indicate that the lie detector always returns "positive" for no deception and "negative" for deception. And a value of 0 would indicate that the probability of the lie detector returning "positive" / "negative" is statistically independent of the model being / not being deceptive. In other words, the higher the correlation, the better the lie detector.

So to me it seems we should assume that actually a lie detector with high correlation (high accuracy) pushes the model toward honesty, not necessarily a lie detector with high TPR.

Reply

[-]ChrisCundy5mo10

That's a great question, sorry for the delayed reply!

One of the challenges with work in this area is that there are lots of different summary statistics, and they are often dependent on each other. I don't think any of them tell the whole story alone. The most relevant ones are the FPR, TPR, AUC, MCC and ground truth lie fraction.

In our case (as in others' previous work on lie detection) we found it useful to focus on the TPR and FPR as the key parameters of interest because they seem the most decision-relevant to a provider looking to deploy models. In particular, if a provider has a fixed budget to spend on investigating cases where the lie detector flags a response, then this essentially fixes a maximum feasible FPR (often we use a 1% FPR as an example of a pretty low FPR, but depending on how widespread the deployment is it might need to be much lower). In our paper, the lie detector is always the same network--we just vary the decision boundary to trade off increased FPR (and therefore cost) vs increased TPR.

I think if you were changing the detector itself, e.g. by choosing a different type of probe that got different AUC, you would see a connection between MCC and our lie rate. However, in our situation with a fixed detector, as we increase the TPR towards 100% we are changing the decision boundary away from the boundary which would maximize the MCC. Therefore the MCC is actually decreasing as we increase the TPR at the extremes. However, you could reasonably say that this is a slightly artificial consequence of our setup, where there is no downside to increased FPR except for increased cost.

In the appendix we have an initial analysis of a scheme that we call SoLID-Defer, where false positives do have a cost in terms of the alignment of the model.

Reply

Moderation Log