Peter Jordan — LessWrong

Inverting the Most Forbidden Technique: What happens when we train LLMs to lie detectably?

This is a write-up of my recent work on improving linear probes for deception detection in LLMs. I trained a probe against a small LLM and then fine-tuned the LLM against the probe loss to see if I could make it represent deception in a more detectable way. This was...

Oct 9, 2025 • 21