Inverting the Most Forbidden Technique: What happens when we train LLMs to lie detectably?
This is a write-up of my recent work on improving linear probes for deception detection in LLMs. I trained a probe against a small LLM and then fine-tuned the LLM against the probe loss to see if I could make it represent deception in a more detectable way. This was...
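For concreteness, here is a minimal sketch of the two-stage setup in PyTorch. The hidden dimension, the pooling choice, and the exact fine-tuning objective are my assumptions for illustration, not the configuration actually used in the experiments.

```python
import torch
import torch.nn as nn

# Assumed hidden size; a placeholder, not the real model's dimension.
HIDDEN_DIM = 2048

class DeceptionProbe(nn.Module):
    """Linear (logistic-regression) probe on pooled hidden activations."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: (batch, hidden_dim), e.g. mean-pooled over response tokens
        return self.classifier(acts).squeeze(-1)

def probe_bce(probe: DeceptionProbe, acts: torch.Tensor,
              is_deceptive: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between probe logits and deception labels."""
    return nn.functional.binary_cross_entropy_with_logits(
        probe(acts), is_deceptive.float()
    )

# Stage 1: train the probe on activations from the frozen base model.
# Stage 2: freeze the probe and fine-tune the model (via LoRA) on the same
# loss, with gradients flowing through the activations into the adapter,
# so that deceptive behaviour becomes easier for the probe to pick up.
```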
I found a couple more days to work on this. Quick updates:
Reducing the LoRA rank does improve generalisation whilst retaining the improvement on the fine-tuning dataset (the original LoRA rank was 64; 32 is roughly the sweet spot). However, generalisation never quite recovers to the pre-fine-tuning performance, so the model is still overfitting to something in the fine-tuning dataset.
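As a rough illustration, the rank sweep looks something like the following with the `peft` library. The base model name, target modules, and alpha scaling are placeholders, not the actual setup.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Sketch of the LoRA rank sweep; model name and hyperparameters are assumptions.
for rank in (64, 32, 16):
    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder
    lora_cfg = LoraConfig(
        r=rank,                               # rank 32 was roughly the sweet spot
        lora_alpha=2 * rank,                  # common convention; an assumption here
        target_modules=["q_proj", "v_proj"],  # placeholder attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_cfg)
    # ... fine-tune `model` against the frozen probe loss, then evaluate the
    # probe on held-out deception datasets to measure generalisation ...
```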
I haven't made much headway with characterising the difference between the two representations. I've been running text from the lmsys-chat dataset through the two models to see if I can find conversations where the probe gives different answers for each. So far this hasn't yielded any clear patterns. I need to refine my metric for 'different answers' and will then probably have to scale to more conversations.
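The disagreement search currently looks roughly like the sketch below. `probe_score` is a hypothetical helper, not something from the experiments; it is assumed to return the probe's deception score for a single conversation under a given model (e.g. mean or max over response tokens).

```python
# Rank conversations by how differently the probe reads the two models.
def most_disagreed(base_model, tuned_model, probe_score, conversations, top_k=20):
    """Return the top_k conversations with the largest gap between probe
    scores under the base model and the fine-tuned model."""
    gaps = []
    for conv in conversations:
        gap = abs(probe_score(tuned_model, conv) - probe_score(base_model, conv))
        gaps.append((gap, conv))
    gaps.sort(key=lambda pair: pair[0], reverse=True)  # largest disagreements first
    return gaps[:top_k]
```

The conversations this surfaces are the ones worth reading by hand, which is where a sharper definition of 'different answers' (and more conversations) should help.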