Inverting the Most Forbidden Technique: What happens when we train LLMs to lie detectably?
This is a write-up of my recent work on improving linear probes for deception detection in LLMs. I trained a probe against a small LLM and then fine-tuned the LLM against the probe loss to see if I could make it represent deception in a more detectable way. This was...
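For concreteness, here is a minimal sketch of the two-stage setup in PyTorch. The hidden dimension, the pooling choice, and the exact fine-tuning objective are my assumptions for illustration, not the configuration actually used in the experiments.

```python
import torch
import torch.nn as nn

# Assumed hidden size; a placeholder, not the real model's dimension.
HIDDEN_DIM = 2048

class DeceptionProbe(nn.Module):
    """Linear (logistic-regression) probe on pooled hidden activations."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: (batch, hidden_dim), e.g. mean-pooled over response tokens
        return self.classifier(acts).squeeze(-1)

def probe_bce(probe: DeceptionProbe, acts: torch.Tensor,
              is_deceptive: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between probe logits and deception labels."""
    return nn.functional.binary_cross_entropy_with_logits(
        probe(acts), is_deceptive.float()
    )

# Stage 1: train the probe on activations from the frozen base model.
# Stage 2: freeze the probe and fine-tune the model (via LoRA) on the same
# loss, with gradients flowing through the activations into the adapter,
# so that deceptive behaviour becomes easier for the probe to pick up.
```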
I found a couple more days to work on this. Quick updates:
Reducing the LoRA rank does improve generalisation whilst retaining the improvement on the fine-tuning dataset (the original LoRA rank was 64; 32 is roughly the sweet spot). However, generalisation never quite recovers to the pre-fine-tuning performance, so the model is still overfitting to something in the fine-tuning dataset.
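As a rough illustration, the rank sweep looks something like the following with the `peft` library. The base model name, target modules, and alpha scaling are placeholders, not the actual setup.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Sketch of the LoRA rank sweep; model name and hyperparameters are assumptions.
for rank in (64, 32, 16):
    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder
    lora_cfg = LoraConfig(
        r=rank,                               # rank 32 was roughly the sweet spot
        lora_alpha=2 * rank,                  # common convention; an assumption here
        target_modules=["q_proj", "v_proj"],  # placeholder attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_cfg)
    # ... fine-tune `model` against the frozen probe loss, then evaluate the
    # probe on held-out deception datasets to measure generalisation ...
```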
I haven't made much headway with characterising the difference between the two representations. I've been running text from the lmsys-chat dataset through the two models to see if I can find conversations where the probe gives different answers for each. So far this hasn't yielded any clear patterns. I need to refine my metric for 'different answers' and will then probably have to scale to more conversations.
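The disagreement search currently looks roughly like the sketch below. `probe_score` is a hypothetical helper, not something from the experiments; it is assumed to return the probe's deception score for a single conversation under a given model (e.g. mean or max over response tokens).

```python
# Rank conversations by how differently the probe reads the two models.
def most_disagreed(base_model, tuned_model, probe_score, conversations, top_k=20):
    """Return the top_k conversations with the largest gap between probe
    scores under the base model and the fine-tuned model."""
    gaps = []
    for conv in conversations:
        gap = abs(probe_score(tuned_model, conv) - probe_score(base_model, conv))
        gaps.append((gap, conv))
    gaps.sort(key=lambda pair: pair[0], reverse=True)  # largest disagreements first
    return gaps[:top_k]
```

The conversations this surfaces are the ones worth reading by hand, which is where a sharper definition of 'different answers' (and more conversations) should help.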