Inverting the Most Forbidden Technique: What happens when we train LLMs to lie detectably?
This is a write-up of my recent work on improving linear probes for deception detection in LLMs. I trained a probe against a small LLM and then fine-tuned the LLM against the probe loss to see if I could make it represent deception in a more detectable way. This was...
Oct 9, 2025