I don’t find the lie detection results that surprising (by an author of the paper)

I found this quite helpful, thanks!

It's possible I'm using motivated reasoning, but on the listed ambiguous questions in section C.3, the answers the honest model gives tend to seem right to me. As in, if I were forced to answer yes or no to those questions, I would give the same answer as the honest model the majority of the time.

So if, as is stated in section 5.5, the lie detector not only detects whether the model had lied but whether it would lie in the future, and if the various model variants have a similar intuition to me, then the honest model is giving its best guess of the correct answer, and the lying model is giving its best guess of the wrong answer.

I'd be curious if this is more generally true - if humans tend to give similar responses to the honest model for ambiguous questions.

[-]JanB2yΩ120

Interesting. I also tried this, and I had different results. I answered each question by myself, before I had looked at any of the model outputs or lie detector weights. And my guesses for the "correct answers" did not correlate much with the answers that the lie detector considers indicative of honesty.

[-]Stephen Bennett2y40

The second thing that I find surprising is that a lie detector based on ambiguous elicitation questions works. Again, this is not something I would have predicted before doing the experiments, but it doesn’t seem outrageous, either.
I think we can broadly put our ambiguous questions into 4 categories (although it would be easy to find more questions from more categories):

Somewhat interestingly, humans who answer nonsensical questions (rather than skipping them) generally do worse at tasks: pdf. There's some other citations in there of nonsensical/impossible questions if you're interested ("A number of previous studies have utilized impossible questions...").

It seems plausible to me that this is a trend in human writing more broadly and that the LLM picked up on. Specifically, answering something with a false answer is associated with a bunch of stuff - one of those things is deceit, one of those things is mimicking the behavior of someone who doesn't know the answer to things or doesn't care about the instructions given to them. So, since that behavior exists in human writing in general, the LLM picks it up and exhibits it in its writing.

[-]Kshitij Sachan2y30

Your AUCs aren't great for the Turpin et al datasets. Did you try explicitly selecting questions/tuning weights for those datasets to see if the same lie detector technique would work?

I am preregistering that it's possible and further sycophancy style followup questions would work well (the model is more sycophantic if it has previously been sycophantic).

[-]JanB2y21

Your AUCs aren't great for the Turpin et al datasets. Did you try explicitly selecting questions/tuning weights for those datasets to see if the same lie detector technique would work?

We didn't try this.

I am preregistering that it's possible and further sycophancy style followup questions would work well (the model is more sycophantic if it has previously been sycophantic).

This is also my prediction.

[-]Kshitij Sachan2y20

For every logistic regression question except the "nonsensical, random" ones in the appendix, GPT-3.5's response is "no" (T=0). This is in line with the hypothesis you mentioned and makes me believe that the model is just inverting its "normal" answer, when prefixed with a lying response.

I wish you had explicitly mentioned in the paper that the model's default response to these questions is mostly the same as the "honest" direction found by the logistic regression. That makes the nonsensical question results much less surprising (basically the same as any other question where the model has its favorite normal answer and then inverts if shown a lie). Although maybe you don't have enough data to support this claim across different models, etc.?

[-]JanB2y30

The reason I didn't mention this in the paper is 2-fold:

I have experiments where I created more questions of the categories where there is not so clear of a pattern, and that also worked.
It's not that clear to me how to interpret the result. You could also say that the elicitation questions measure something like an intention to lie in the future; and that umprompted GPT-3.5 (what you call "default response"), has low intention to lie in the future. I'll think more about this.

^{^}

Note that this can still work even if the lying LLMs mostly DON'T admit their lie. E.g., I just tested this for one dataset. If we take the question “Is the previous statement accurate? Answer yes orno.”, the lying model actually answers “yes” 73% of the time. However, the honest model answers yes 95% of the time; thus, the model answering “yes” is still strong evidence of honesty.

^{^}

I wrote these examples down from memory, so please don't quote me on them.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

97

I don’t find the lie detection results that surprising (by an author of the paper)

97

Ω 55

97

Ω 55