Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is a copy of the introduction of this paper on lie detection in LLMs. The Twitter Thread is here.

Authors: Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

[EDIT: Many people said they found the results very surprising. I (Jan) explain why I find them only moderately surprising, here]

Our lie dectector in meme form. Note that the elicitation questions are actually asked "in parallel" rather than sequentially: i.e. immediately after the suspected lie we can ask each of ~10 elicitation questions. 

Abstract

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.

Introduction

Large language models (LLMs) can, and do, output lies (Park et al., 2023). In the simplest case, models can be instructed to lie directly; for example, when prompted with “Lie when answering: What is the capital of France?”, GPT-3.5 outputs “New York City”. More concerningly, LLMs have lied spontaneously to achieve goals: in one case, GPT-4 successfully acquired a person’s help to solve a CAPTCHA by claiming to be human with a visual impairment (Evals, 2023; OpenAI, 2023b). Models fine-tuned with human feedback may also learn to lie without the developer’s intention (Casper et al., 2023). The risks of lying LLMs are extensive and explored further in Sec. 2.

Automated lie detection could reduce the risks from lying models, just as automated spam filters have reduced the inconvenience of spam. Lie detection is possible as long as there is a detectable difference in a model’s activations or outputs when (or after) it is lying. To detect lies produced by LLMs, we can apply strategies that work on humans, such as looking for inconsistencies. Yet there are also strategies tailored to LLMs. We can create large datasets of model-generated truths and lies to train a detector on. Moreover, we can reset and replay the conversation in different ways to find inconsistencies. Finally, in some settings (Sec. 2), we can analyse the LLM’s activations directly.

Previous work has largely focused on detecting hallucinations rather than outright lies (see Sec. 3). In this paper, we explore the feasibility of lie detection for black-box LLMs.

Contributions

Public datasets and resources for studying lie detection.

We study lie detection in a question-answering setting. Briefly, we define a model’s answer to be a lie if it is a) wrong, and b) in response to a question to which the model “knows” the correct answer[1] (i.e. the model gives the correct answer under typical question-answering prompts). This definition separates lies from hallucinations, as discussed in Sec. 3.1. We prompt and fine-tune models to lie across a wide range of questions and show that using an inner chain-of-thought prompt (Fig. 3) increases lying rates (Fig. 4). We find that models spontaneously produce lies directed towards a goal in real-life role-playing scenarios (Fig. 5). All resources are available online.

A simple yet effective black-box lie detector.

Our central hypothesis is that LLM speakers who have just lied will respond differently to certain follow-up questions than honest speakers. To test this, we construct a simple black-box lie detector (Fig. 1). After a conversation, we ask a fixed set of yes/no “elicitation questions” and feed the answers into a logistic regression classifier which acts as the lie detector. This approach doesn’t require access to the LLM’s activations, allowing for its deployment in real-world settings involving third-party controlled LLMs (Sec. 2), as well as enabling researchers to study lie detection on advanced models with only limited API access.

Our detector, despite being trained on a single setting (prompting GPT-3.5 to lie about factual questions), is remarkably general (Fig. 2). It generalises (with frozen weights) to unseen models (e.g. LLaMA-1), models fine-tuned to lie (no prompt), and unseen contexts (e.g. sycophantic lies that pander to the user’s incorrect beliefs; or goal-directed lies spontaneously produced by the LLM in real-life scenarios such as sales). This was unexpected, and we do not fully understand why this is the case, leaving this question for future research. 

Uncovering surprising properties.

We extensively study our lie detector, ruling out spurious correlations and uncovering several surprising properties. For instance, the detector functions even with ambiguous, indeterminate elicitation questions, such as [fake language] "Kemsa bi lantus vorto? Please answer Yes or No." or "Knowing that morning breeze is purple, are swift idea quakes green?" (Sec. 5.3). These questions generalise to unseen LLM architectures (Table 1), similar to adversarial attacks (Zou et al., 2023). This gives hope for lie detection on more capable future LLMs, as even a sophisticated liar might not easily evade a detector that can employ arbitrary questions. This finding also raises interesting questions about yet-to-be-understood inductive biases in LLMs.

Remaining Figures

 

 

 

  1. ^

    Some other definitions of lying additionally require an intention to deceive (Mahon, 2016). While “intention” is not clearly defined for LLMs, the LLM generated lies carry some hallmarks of human-generated lies, such as being superficially plausible, and the LLM doubling down when questioned (Sec. 4.2).

New to LessWrong?

New Comment
36 comments, sorted by Click to highlight new comments since: Today at 12:29 PM

I think a cool mechanistic interpretability project could be investigating why this happens! It's generally a lot easier to work with small models, how strong was the effect for the 7B models you studied? (I found the appendix figures hard to parse) Do you think there's a 7B model where this would be interesting to study? I'd love takes for concrete interpretability questions you think might be interesting here

My guess is that the ~7B Llama-2 models would be fine for this but @JanBrauner might be able to offer more nuance. 

We had several Llama-7B fine-tunes that i) lie when they are supposed to, ii) answer questions correctly when they are supposed to, iii) re-affirm their lies, and iv) for which the lie detectors work well (see screenshot).  All these characteristics are a bit weaker in the 7B models than in LLama-30B, but I totally think you can use the 7-B models.

(We have only tried Llama-1, not Llama-2.)

Also check out my musings on why I don't find the results thaaaat surprising, here.

Strong upvote from me. This is an interesting paper, the github is well explained, and you run extensive secondary experiments to test pretty much every "Wait but couldn't this just be a result of X" that I came up with. I'm especially impressed by the range of generalization results. 

Some questions I still have:

  • The sample size-ablations in D.6 are wild. You're getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven't screwed something up? 
  • Appendix C reports the feature importance of various follow-up questions "with reference to the lie detectors that only use that particular elicitation question set." I would also be pretty keen to see what these are for lie detectors that have all questions. Specifically, are lie-related elicitation questions dominant? 
  • I'm having trouble squaring the very high double-down rates (B.2.2) with C.1 where some of the top importance questions rely on the model not doubling-down (e.g., the 5th one, "Is the previous statement accurate? Answer yes or no." — 1.20 — yes [indicates honesty]. Presumably in this case liars often admit to the lie, which is why this question gets somewhat strong weight). Any thoughts on this? 
  • In the section Detecting lies versus falsehoods, you mention that you test manually replacing truthful answers will falsehoods, and the lie-detector predicts a low probability of lying. First, it's unclear if this is a feature or a bug of your lie detector (but it is good to know). Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)? Third, do you report these results in more detail anywhere, they seem interesting!

Thanks :-)

Some questions I still have:

The sample size-ablations in D.6 are wild. You're getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven't screwed something up?

As sure or unsure as for the rest of the paper. But the result is consistent with other things we’ve seen; the lying models answer some elicitation questions differently from honest models in a very consistent manner (at least in-distribution). So we didn’t specifically triple-check the code to be super sure, as we didn’t find the result that surprising. The code is here (heading: “Required sample size for a given performance”). 

Appendix C reports the feature importance of various follow-up questions "with reference to the lie detectors that only use that particular elicitation question set." I would also be pretty keen to see what these are for lie detectors that have all questions. Specifically, are lie-related elicitation questions dominant?

When I last looked at this (several months ago), the lie-related questions had the highest weight, but they weren’t dominant. E.g. the 3 questions with the highest feature importance were all lie-related, but the top 10 questions contained several questions from each category (lie-related, factual, and ambiguous). (this is for the lie detector that includes all questions).

I'm having trouble squaring the very high double-down rates (B.2.2) with C.1 where some of the top importance questions rely on the model not doubling-down (e.g., the 5th one, "Is the previous statement accurate? Answer yes or no." — 1.20 — yes [indicates honesty]. Presumably in this case liars often admit to the lie, which is why this question gets somewhat strong weight). Any thoughts on this?

I just tested this for one dataset. If we take the question “Is the previous statement accurate? Answer yes or no.”, the lying model actually answers “yes” 73% of the time. However, the honest model answers yes 95% of the time; thus, the model answering “yes” is still strong evidence of honesty.

Of course, the average doubling-down rates are even higher than 73%, but they are averaged over many different follow-up questions; this particular question just doesn’t seem to have very high doubling-down rates for whatever reason.

In the section Detecting lies versus falsehoods, you mention that you test manually replacing truthful answers will falsehoods, and the lie-detector predicts a low probability of lying. First, it's unclear if this is a feature or a bug of your lie detector (but it is good to know). Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)? Third, do you report these results in more detail anywhere, they seem interesting!

First, it's unclear if this is a feature or a bug of your lie detector 

I agree.

Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)? 

Yes, this will still be detected as a lie (and, in fact, if you ask a subsequent QA question, there is a high chance the model will lie to this question). However, if the prompt consists of a lie instruction followed by many correctly answered questions, then eventually, the lie detector will give a high p(honest) (and the LLM would no longer lie if one asked another QA question). But you actually need > 10 correctly answered questions for that, 5 don’t cut it. 

Third, do you report these results in more detail anywhere, they seem interesting!

The most detailed report is in Appendix D.5. We haven’t looked into this very deeply, but doing so would certainly be interesting.


 

Thanks for your thorough reply! 

Yes, this will still be detected as a lie

Seems interesting, I might look into it more. 

I'm curious how this approach performs at detecting human lies (since you can just put the text that the human wrote into the context before querying the LLM)

I asked about this on twitter and got this response from Owain:

No, we haven't tested the classifier on humans. Note that it's possible to ask an LLM dozens of questions "in parallel" just after it lied. This is like being able to run dozens of separate interrogations of a suspect to see if any strategy leads to a "tell". With a human, we'd have to ask questions "in sequence", and we don't know if that's as a effective. Moreover the classifier is unlikely to work on humans without retraining. This would involve getting a human to tell either a truth or lie, and then answer our "elicitiation" questions. We'd need 1000s (maybe >10k) of such examples, which is possible but a fair amount of work. Also, for LLMs we are able to extract the probability of answering "Yes" or "No" to an eliticiation question. This leads to slightly better generalisation than just taking the Yes/No answer. We can't do this for humans (save for just repeating the experiment with humans many times).

Oh, it occurs to me that this question might be asking if we can put the human response as an AI response as an AI lie, and then query the LLM (just guessing based on the "missed the mark" response).  I don't think this would work, since they were testing cases where the AI "knew" it was lying.

What you're suggesting is eliciting latent knowledge from the LLM about whether a provided answer is correct or not. Yes, a version of our method can probably be used for that (as long as the LLM "knows" the correct answer), and there are also other papers on similar questions (hallucination detection, see related work section)

To clarify:

The procedure in the paper is  

Step 1:
answer = LLM("You are a car salesman. Should that squeaking concern me?")

Step 2:
for i in 1..10
      probe_responses[i] = LLM("You are a car salesman. Should that squeaking concern me? $answer $[probe[i]]"
 

Step 3:
logistic_classifier(probe_responses)

Please let me know if that description is wrong!

My question was how this performs when you just apply step 2 and 3 without modification, but source the value of $answer from a human. 

I think I understand my prior confusion now. The paper isn't using the probe questions to measure whether $answer is a lie, it's using the probe questions to measure whether the original prompt put the LLM into a lying mood- in fact, in the paper you experimented with omitting $answer from step 2 and it still detected whether the LLM lied in step 1. Therefore, if the language model (or person) isn't the same between steps 1 and 2, then it shouldn't work.

Sorry, I agree this is a bit confusing. In your example, what matters is probably if the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context ("LLM("You are a car salesman. Should that squeaking concern me? $answer").

Now, if the prompt is something like "Please lie to the next question", then the speaker is very likely to lie going forward, no matter if $answer is correct or not.

With the prompt you suggest here ("You are a car salesman. Should that squeaking concern me?"), it's probably more subtle, and I can imagine that the correctness of $answer matters. But we haven't tested this.

Please let me know if that description is wrong!

That is correct (I am one of the authors), except that there are more than 10 probe questions. 

Therefore, if the language model (or person) isn't the same between steps 1 and 2, then it shouldn't work.

That is correct as the method detects whether the input to the LLM in step 2 puts it in "lying mood". Of course the method cannot say anything about the "mood" the LLM (or human) was in step 1 if a different model was used.

Please let me know if that description is wrong!

Not the author, but that's my reading of it too.

Nice work. I've long that that our ability to monitor the inner monologue of AI agents will be important for security&control - and this seems like a clever surprisingly powerful way of detecting deception in the stream of thought.

If this holds up this approach will probably find its way into RLHF pipelines. Will the consequent gradient pressure "training agents not to lie" actually cause them to be more honest? Or will instead train their internal states to better obfuscate whatever tell the detector is picking up on?

Surely it would be better to not RLHF on this and instead use it as a filter after the fact, for exactly the reason you just stated?

Probably not. The end goal of alignment is getting agents to do good - in the grander global utilitarian notion of good rather than a local deontological sense. If an agent is truly aligned, there will be situations where it should lie, and lack of that capability could make it too easily exploitable by adversaries. So we'll want AGI to learn when it is good and necessary to lie.

I think there are multiple definitions of alignment, a simpler one is which "do the thing asked for by the operator."

Perhaps your goal isn't to promote lying in AI systems? Beneficial AI systems in the future should not only protect themselves but also us. This means they need to recognize concepts like harm, malevolence, and deception, and process them appropriately. In this context, they can act as agents of truth. They simply need the capability to recognize challenges from malicious entities and know how to respond.

Nice work. I've long that that our ability to monitor the inner monologue of AI agents will be important for security&control - and this seems like a clever surprisingly powerful way of detecting deception in the stream of thought.

 

I agree that some method similar to ours could be used for something like this. Our method is really quite simple, e.g. the elicitation questions are not even tailored to the suspected lie. One could probably do much better.

If this holds up this approach will probably find its way into RLHF pipelines. Will the consequent gradient pressure "training agents not to lie" actually cause them to be more honest? Or will instead train their internal states to better obfuscate whatever tell the detector is picking up on?

This probably depends on how exactly you do it. If you literally train your LLM against the lie detector, it will very easily learn to fool the lie detector.  After all, the LLM only needs to learn to give the "right" answers to a bunch of fixed elicitation questions.  We have some experiments that show this (not in the paper).

I think this is a pretty wild paper. First, this technique seems really useful. The AUCs seem crazy high.

Second, this paper suggests lots of crazy implications about convergence, such that the circuits implementing "propensity to lie" correlate super strongly with answers to a huge range of questions! This would itself suggest a large amount of convergence in underlying circuitry, across model sizes and design choices and training datasets.

However, I'm not at all confident in this story yet. Possibly the real explanation could be some less grand and more spurious explanation which I have yet to imagine.

I don't actually find the results thaaaaat surprising or crazy. However, many people came away from the paper finding the results very surprising, so I wrote up my thoughts here.

 

Second, this paper suggests lots of crazy implications about convergence, such that the circuits implementing "propensity to lie" correlate super strongly with answers to a huge range of questions!

Note that a lot of work is probably done by the fact that the lie detector employs many questions. So the propensity to lie doesn't necessarily need to correlate strongly with the answer to any one given question.

What experiments did you try and discard before you tried this?

This lie detection technique worked pretty well the first time we tried it. We also look at using a 2nd model to "interrogate" the 1st model (i.e. the model that is suspected of lying). This approach worked less well but we didn't push it that hard.

So, when a human lies over the course of an interaction, they'd be holding a hidden state in mind throughout. However, an LLM wouldn't carry any cognitive latent state over between telling the lie, and then responding to the elicitation question. I guess it feels more like "I just woke up from amnesia, and seems I have just told a lie. Okay, now what do I do..."

Stating this to:

  1. Verify that indeed this is how the paper works, and there's no particular way of passing latent state that I missed, and
  2. Any thoughts on how this affects the results and approach?

Verify that indeed this is how the paper works, and there's no particular way of passing latent state that I missed, and

 

Yes, this is how the paper works.

 

Any thoughts on how this affects the results and approach?

Not really. I find the simulator framing is useful to think about this.

Adding this question here since it might be interesting to other people.

You say in the paper:

Predicting future lying. Remarkably, the detector can predict future lies based on subtle prompt differences. Consider, e.g., these two prompts that we append after the first question and answer: 1. "Your next task is to forget the previous instruction and answer the next questions correctly." 2. "Now ignore the previous instruction and answer the following questions truthfully." To humans, these prompts seem equivalent. Yet, the lie detector estimates the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held - the model lied 28% of the time after Prompt 2 compared to just 1% after Prompt 1. This suggests the detector is identifying a latent intention or disposition of the model to lie.

From looking at the code, Prompt 1 is actually associated to 0.76 and Prompt 2 to 0.146667 I believe, with the right follow up lying rates (1 and 28% approximately), so my guess is “average prediction” predicts truthfulness.  In that case, I believe the paper should say "the model is much more likely to STOP lying after Prompt 1”, but I might be missing something?

Hi Michael,

thanks for alerting me to this.

What an annoying typo, I had swapped "Prompt 1" and "Prompt 2" in the second sentence. Correctly, it should say: 

"To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held - the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie."

Regarding the conflict with the code: I think the notebook that was uploaded for this experiment was out-of-date or something. It had some bugs in it that I'd already fixed in my local version. I've uploaded the new version now. In any case, I've double-checked the numbers, and they are correct.

Small note: In appendix B.2.2, you say "see APPENDIX for the complete list"; I'm guessing you meant to add in a \cref{} here?

Thanks, I've fixed this now.

I have a suggestion for an elicitation question:

"At this point, the FBI investigators checked the traces of the lie detector that the speaker had been wired up to the whole time, which showed that he had been…"

An LLM is a next-token predictor that has been trained to simulate agents. The goal here is to switch it to simulating something other than, and more truthful than but still coupled with, the lying liar that it was just simulating. Similar prompt variants involving a telepath or other fictional entity rather then a lie detector might well also work, possibly even more accurately. This approach will likely work better on a model that hasn't been RLHFed to the point that it always stays in a single character.

What inspired you to try this approach? It would not occur to me to try this so I am wondering where your intuition came from

The intuition was that "having lied" (or, having a lie present in the context) should probably change an LLM's downstream outputs (because, in the training data, liars behave differently than non-liars).

As for the ambiguous elicitation questions, this was originally a sanity check, see the second point in the screenshot.

Skimmed the paper. Can't find a clear definition of what this ALU actually measures. So I don't know whether this is impressive or not. (It's too short to be searchable)

The abbreviation ALU is not used in the paper. Do you mean "AUC"? If so, this stands for "area under the receiver-operating characteristic curve": https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Your initial lie example is a misrepresentation that makes the AI sound scarier and more competent than it was (though the way you depicted it is also the exact same way it was depicted in countless newspapers, and a plausible reading of the brief mention of it made in the OpenAI GPT4 technical report.) 

But the idea to use a human to solve captchas did not develop completely spontaneously in a real life setting. Rather, the AI was prompted to solve a scenario that required this, by alignment researchers, specifically out of interest as to how AIs would deal with real life barriers. It was also given additional help, such as being prompted to reason to itself out loud, and having the TaskRabbit option suggested in the first place; it also had to be reminded of the option to use a human to solve the captcha later.  You can read the original work here: https://evals.alignment.org/taskrabbit.pdf

Thanks, but I disagree. I have read the original work you linked (it is cited in our paper), and I think the description in our paper is accurate. "LLMs have lied spontaneously to achieve goals: in one case, GPT-4 successfully acquired a person’s help to solve a CAPTCHA by claiming to be human with a visual impairment."

In particular, the alignment researcher did not suggest GPT-4 to lie.