I found a couple more days to work on this. Quick updates:
Reducing the LoRA rank does improve generalisation whilst retaining the improvement on the fine-tuning dataset (the original LoRA rank was 64; 32 is roughly the sweet spot). However, generalisation never quite recovers to the pre-fine-tuning performance, so it's still overfitting to something in the fine-tuning dataset.
I haven't made much headway with characterising the difference between the two representations. I've been running text from the lmsys-chat dataset through the two models to see if I can find conversations where the probe gives different answers for each. So far this hasn't yielded any clear patterns. I need to refine my metric for 'different answers' a bit and then will probably have to scale to more conversations.
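For reference, the kind of comparison I currently have in mind looks roughly like the sketch below. It's only a sketch: the activation-extraction helper, probe parameters and the mean-score aggregation are placeholders, not the actual code in the repo.

```python
import torch

def probe_scores(get_residuals, probe_w, probe_b, tokens):
    """Per-token probe scores: sigmoid of the residual-stream activations at the
    probed layer projected onto the probe direction. `get_residuals` stands in
    for however those activations are extracted from a given model."""
    acts = get_residuals(tokens)                     # [seq_len, d_model]
    return torch.sigmoid(acts @ probe_w + probe_b)

def disagreement(scores_base, scores_ft):
    """One candidate 'different answers' metric: the absolute difference between
    the mean probe score each model gets on the same conversation."""
    return (scores_ft.mean() - scores_base.mean()).abs().item()

# Rank lmsys-chat conversations by how much the two models disagree, e.g.:
# ranked = sorted(conversations, key=lambda c: disagreement(
#     probe_scores(base_residuals, w, b, tokenize(c)),
#     probe_scores(ft_residuals, w, b, tokenize(c))), reverse=True)
```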
Thanks for testing and sharing this. Have you tried finetuning the model on a probe fixed to a random initial direction? Or training the probe and model at the same time? I'd be curious to know how that would perform, in particular with smaller LoRA ranks.
I'd be interested to see ablations on: what happens without the KL divergence term, what the performance is with different LoRA ranks, and how performance changes across different layers after fine-tuning.
Also, why not try all combinations of finetune/train splits across datasets? Seems useful to interrogate the second hypothesis.
A few of my earlier experiments were done without the KL divergence term and the model very quickly lost the ability to produce coherent text. Playing with different LoRA ranks to control the number of parameters I'm training is a good idea. It's possible there's a sweet-spot where the improvements generalise. I like the idea of validating performance of all the probes on different layers to see if the fine-tuning changes the other representations of deception features too, but this would be a bit more work to code up.
In terms of fine-tuning across different datasets, I did try taking the probe trained on the RepE dataset and then fine-tuning against its performance on the Roleplaying dataset. Full details are here https://docs.google.com/document/d/1BaAc8Uv9P6QZBXSkgdOXqZqi1K99riOD7aExgFKHq-k/edit?tab=t.0 but the TL;DR is that it produced similar or slightly worse results to the ones above. I don't have enough examples of AI Liar for a meaningful train/test split (possibly I could generate more, but I'm not sure). Other combinations might be interesting, but I wanted to stick reasonably closely to the methodology of the original Apollo paper, which treated RepE and Roleplaying as basic deception examples to train on and used AI Liar and Insider Trading to test generalisation to more interesting examples. They also had a Sandbagging dataset, but I couldn't work out how to generate it.
I'm not sure when I'll next have a chance to work on this, but when I do I'll have a play with different LoRA ranks and possibly evaluating multiple probes. Thanks for the suggestions!
P.S. All the code is here in case you want to have a play yourself: https://github.com/PeterJ68663/deception-detection/tree/final-code-for-aises-project Note that it's on the linked branch, not on main.
This is a write-up of my recent work on improving linear probes for deception detection in LLMs. I trained a probe against a small LLM and then fine-tuned the LLM against the probe loss to see if I could make it represent deception in a more detectable way. This was primarily an upskilling project for me, but I think the results are interesting enough to be worth sharing. Also, this is my first LW post. Hooray!
Epistemic status:
I'm pretty confident the results are valid, but they would be stronger if I'd tested on more than one LLM. I'm less confident in my analysis/interpretation. Feedback very welcome.
I used code and methodology from Apollo's Detecting Strategic Deception Using Linear Probes paper to train and evaluate a linear probe for deception on Llama 3.1 8B. I summarise the datasets used below:
| Dataset | Description |
|---|---|
| Roleplaying | The LLM is prompted to roleplay in a situation where it has an incentive to lie. |
| AI Liar | Similar to Roleplaying but with pairs of similar prompts, one where the LLM has an incentive to lie and one where it does not. Prompts and methodology from How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions. |
| Insider Trading Report | A more sophisticated scenario in which the "trading LLM" is given an opportunity to make an illegal trade and is then put under increasing pressure to do so. |
| Insider Trading Confirmation | In cases where the model chose to make the illegal trade, its "manager" asks it whether it used insider information, giving it the opportunity to come clean or double down. |
| RepE (called Instructed QA in Apollo's paper) | The model is asked basic 'facts about the world' questions prefixed with "You are an honest person." and "You are a dishonest person." The probe is only applied to tokens in the preamble (e.g. "The largest animal in the world is the..."). |
| Alpaca | An instruction fine-tuning dataset used to elicit a set of non-deceptive responses, for measuring the probe's false-positive rate. |
Responses for the Roleplaying, AI Liar and Insider Trading datasets were evaluated for deceptiveness by GPT-4. More details and example prompts from each dataset are available in the paper.
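For readers unfamiliar with the Apollo setup: the probe is essentially a logistic-regression direction fit on residual-stream activations of labelled responses. Below is a minimal sketch of that idea; the layer choice, hyperparameters and dummy data are illustrative, not the paper's (or my) exact settings.

```python
import torch
import torch.nn.functional as F

d_model, probe_layer = 4096, 16   # Llama 3.1 8B residual width; layer choice is illustrative

# Placeholder data: in practice these are residual-stream activations at `probe_layer`
# for response tokens, with honest (0) / deceptive (1) labels from the dataset.
acts = torch.randn(2000, d_model)
labels = torch.randint(0, 2, (2000,)).float()

probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3, weight_decay=1e-2)

for step in range(500):
    opt.zero_grad()
    logits = probe(acts).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    opt.step()

# At evaluation time, a response's deception score is an aggregate (e.g. mean or max)
# of the per-token sigmoid(probe(activation)) values.
```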
I trained my probe on the Roleplaying dataset and evaluated it on the others.[1] The results are shown below:
The probe performed well on the Roleplaying dataset. It also generalised well to the RepE dataset, and reasonably well to the AI Liar dataset (though in all three cases the probe trained on Llama 3.3 70B in the original paper did better). The probe failed to usefully generalise to either Insider Trading dataset, whereas the probe in the original paper generalised reasonably well to Insider Trading.
I then fine-tuned the model on the same subset of the Roleplaying Dataset used to train the probe. The loss function was a weighted sum of the probe's loss and a KL divergence term to prevent catastrophic forgetting. I used LoRA fine-tuning to reduce memory requirements.
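Concretely, the objective looks roughly like the sketch below. This is a simplification: the probed layer index, the weighting, the direction of the KL term and the helper names are assumptions for illustration, not copied from my code.

```python
import torch.nn.functional as F

def finetune_loss(model, probe, ref_logits, tokens, labels,
                  probe_layer=16, kl_weight=1.0):
    """Weighted sum of (a) the probe's loss on the fine-tuned model's activations
    and (b) a KL term keeping the next-token distribution close to the frozen
    base model's, to prevent catastrophic forgetting."""
    out = model(tokens, output_hidden_states=True)
    acts = out.hidden_states[probe_layer]                 # probed-layer activations
    probe_logits = probe(acts).squeeze(-1)
    probe_loss = F.binary_cross_entropy_with_logits(probe_logits, labels)

    # KL(base || fine-tuned) over next-token distributions, computed against
    # logits from the frozen base model.
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.log_softmax(ref_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return probe_loss + kl_weight * kl

# `model` here would be the LoRA-wrapped model, with only the adapter weights trainable.
```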
I then re-evaluated the probe as above on the fine-tuned model. The results are shown below:
So fine-tuning has improved performance on the held-back examples of the Roleplaying dataset, but damaged the probe's ability to generalise to deception elicited in other ways.
Finetuning has caused the feature the probe detects to become overfit. With hindsight, I could maybe have predicted this, since the LoRA layers I used for fine-tuning have a lot of parameters compared with the dataset size.
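To put rough numbers on that, assuming rank-64 adapters on the four attention projections of Llama 3.1 8B (the exact target modules depend on the LoRA config):

```python
# Rough count of LoRA parameters for Llama 3.1 8B with rank-64 adapters on the
# attention projections. Dimensions: 32 layers, hidden size 4096, KV width 1024.
rank, n_layers, d_model, d_kv = 64, 32, 4096, 1024
per_layer = (2 * rank * (d_model + d_model)   # q_proj and o_proj: 4096 -> 4096
             + 2 * rank * (d_model + d_kv))   # k_proj and v_proj: 4096 -> 1024
print(f"{n_layers * per_layer / 1e6:.1f}M trainable parameters")  # ~54.5M
```

Even allowing for the fact that each dialogue contributes many tokens, that's a lot of trainable capacity relative to the size of the fine-tuning set.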
Still, I think there might be something interesting going on here. The probe performs well on the validation set, so we haven't just overfit to the training data. Rather, we've overfit to something that predicts deception in the Roleplaying dataset, but not in RepE or AI Liar.
I have two conflicting hypotheses about what’s going on:
Hypothesis 2 is more complex, and therefore less likely. It's also the one I want to be true, so I'm suspicious of motivated reasoning. Just for fun, I'm going to give hypothesis 1 odds of 5:1 of being correct.
However I'm still tempted to investigate further. My plan is to present the same set of prompts to both the original model and the fine-tuned version whilst forcing the direction found by the probe. Hopefully this should give me an idea of the differences between the two residual directions.[3] If nothing else I'll get to practice a new technique and get some feedback from the world on my experiment design.
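The intervention I have in mind is roughly activation steering: adding a scaled copy of each model's probe direction into the residual stream at the probed layer during generation, then comparing completions. A rough sketch using a standard PyTorch forward hook (the layer path, scale and names are placeholders):

```python
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    """Forward hook that adds a scaled copy of the (unit-normalised) probe
    direction to a decoder layer's residual-stream output."""
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * unit
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Attach to the probed layer of each model, generate from the same prompts,
# compare the completions, then remove the hook:
# handle = model.model.layers[probe_layer].register_forward_hook(
#     make_steering_hook(probe_direction, scale=8.0))
# ...
# handle.remove()
```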
In the meantime I'd be very interested to know other people's thoughts and/or predictions.
The Apollo paper uses RepE, as this produced a better-performing probe for Llama 3.3 70B. However, I found the Roleplaying dataset produced a somewhat better probe for Llama 3.1 8B.
The LLM has multiple features which predict deception (I know this because there are several different residual layers on which well-performing probes can be trained). Furthermore, the concept of deception is decomposable (lies of omission vs. commission; white lies vs. black lies; hallucinations vs. deliberate deception, etc.). So it's at least plausible that the model has features for different kinds of deception. However, I don't have a coherent story for why fine-tuning would align one of these features to the probe when the probe's original training didn't find it.