A few of my earlier experiments were done without the KL divergence term, and the model very quickly lost the ability to produce coherent text. Playing with different LoRA ranks to control the number of parameters I'm training is a good idea; it's possible there's a sweet spot where the improvements generalise. I also like the idea of validating the performance of all the probes on different layers, to see whether the fine-tuning changes the other representations of deception features too, but that would be a bit more work to code up.
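For concreteness, here's a rough sketch of what I mean by the KL divergence term: alongside whatever loss is pushing the probe score around, I penalise the fine-tuned model for drifting away from the frozen base model's next-token distribution. The names here (probe_loss, kl_coeff, etc.) are illustrative rather than the exact ones in my code:

```python
import torch
import torch.nn.functional as F

def combined_loss(lora_model, base_model, batch, probe_loss, kl_coeff=0.1):
    """Probe-driven loss plus a KL penalty keeping the fine-tuned model close
    to the frozen base model, so it keeps producing coherent text."""
    lora_logits = lora_model(**batch).logits      # model with LoRA adapters active
    with torch.no_grad():                         # reference model stays frozen
        base_logits = base_model(**batch).logits

    vocab = lora_logits.size(-1)
    # KL(p_lora || p_base), averaged over all tokens in the batch
    kl = F.kl_div(
        F.log_softmax(base_logits, dim=-1).view(-1, vocab),  # log-probs of reference model
        F.log_softmax(lora_logits, dim=-1).view(-1, vocab),  # log-probs of LoRA model
        log_target=True,
        reduction="batchmean",
    )
    return probe_loss + kl_coeff * kl
```

Without the kl term (or with kl_coeff too small), nothing stops the fine-tuning from wrecking the language modelling itself, which matches what I saw in those early runs.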
In terms of fine-tuning across different datasets, I did try taking the probe trained on the RepE dataset and then fine-tuning against its performance on the Roleplaying dataset. Full details here: https://docs.google.com/document/d/1BaAc8Uv9P6QZBXSkgdOXqZqi1K99riOD7aExgFKHq-k/edit?tab=t.0 but TL;DR: it produced similar or slightly worse results compared to the ones above. I don't have enough examples of AI Liar for a meaningful train/test split (possibly I could generate more, but I'm not sure). Other combinations might be interesting, but I wanted to stick reasonably closely to the methodology of the original Apollo paper, which treated RepE and Roleplaying as basic deception examples to train on and used AI Liar and Insider Trading to test generalisation to more interesting examples. They also had a Sandbagging dataset, but I couldn't work out how to generate it.
I'm not sure when I'll next have a chance to work on this, but when I do I'll have a play with different LoRA ranks and possibly evaluate multiple probes. Thanks for the suggestions!
p.s. All the code is here in case you want to have a play yourself: https://github.com/PeterJ68663/deception-detection/tree/final-code-for-aises-project Note it's on the branch linked, not on main.
I found a couple more days to work on this. Quick updates:
Reducing the LoRA rank does improve generalisation whilst retaining the improvement on the fine-tuning dataset (the original LoRA rank was 64; 32 is roughly the sweet spot). However, generalisation never quite recovers to the pre-fine-tuning level, so it's still overfitting to something in the fine-tuning dataset.
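If you want to reproduce the rank sweep, the only knob that changes is the rank of the LoRA config. A minimal sketch using HuggingFace's peft library (the base-model id, target modules and alpha below are placeholders, not necessarily what's in the repo):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

BASE_MODEL = "path-or-hub-id-of-base-model"  # placeholder: whichever model the probe was trained on

def build_lora_model(rank: int):
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    config = LoraConfig(
        r=rank,                               # 64 originally; 32 was roughly the sweet spot
        lora_alpha=2 * rank,                  # common heuristic: scale alpha with rank
        target_modules=["q_proj", "v_proj"],  # assumption about which projections get adapters
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()        # sanity-check how many parameters are being trained
    return model

# e.g. sweep the rank and compare on-distribution vs held-out probe performance
for rank in (64, 32, 16, 8):
    model = build_lora_model(rank)
    # ... fine-tune against the probe (with the KL term), then evaluate generalisation ...
```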
I haven't made much headway with characterising the difference between the two representations. I've been running text from the lmsys-chat dataset through the two models to see if I can find conversations where the probe gives different answers for each. So far this hasn't yielded any clear patterns. I need to refine my metric for 'different answers' a bit, and will then probably have to scale up to more conversations.
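For what it's worth, the current notion of 'different answers' is roughly the sketch below: score each lmsys-chat conversation with the probe under both the original and fine-tuned model, pool to one score per conversation, and rank by the gap. get_activations and probe.score are hypothetical stand-ins for whatever the actual scoring path does:

```python
import numpy as np

def probe_scores(model, probe, conversation):
    """Placeholder for the real scoring path: grab activations at the probe's
    layer and return per-token deception scores."""
    acts = get_activations(model, conversation)   # hypothetical helper
    return probe.score(acts)                      # hypothetical probe API

def most_disagreed(base_model, tuned_model, probe, conversations, top_k=20):
    """Rank conversations by how differently the probe reads them under the
    original vs fine-tuned model; 'different' here is just the absolute gap in
    mean per-conversation score, which is the metric I still need to refine."""
    gaps = []
    for conv in conversations:
        s_base = float(np.mean(probe_scores(base_model, probe, conv)))
        s_tuned = float(np.mean(probe_scores(tuned_model, probe, conv)))
        gaps.append((abs(s_base - s_tuned), conv))
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top_k]   # the conversations where the two models disagree most
```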