Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Excited to announce our new work: Inference-Time Intervention (ITI), a minimally invasive control technique that significantly improves LLM truthfulness with few resources, benchmarked on the TruthfulQA dataset. Preprint link. We start from the surprising finding that certain attention heads show clearly different activation distributions for true and false statements. Probing...
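To make the probing idea concrete, here is a minimal sketch of what "a head separates true from false statements" could look like in practice. It is not the code from the preprint: the activations are synthetic, the probe is a simple mean-difference classifier chosen for brevity, and every name in it (`make_synthetic_head_activations`, `fit_mean_difference_probe`, `evaluate_head`, the `shift` parameter) is hypothetical. In a real setup the activations would come from a model run on labeled true and false statements.

```python
import numpy as np


def make_synthetic_head_activations(n=400, dim=16, shift=3.0, seed=0):
    """Hypothetical stand-in for one attention head's activations.

    Activations for "true" statements (label 1) are offset by `shift`
    along a random unit direction; shift=0 models an uninformative head.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, dim)) + shift * np.outer(labels, direction)
    return acts, labels


def fit_mean_difference_probe(acts, labels):
    """Fit a linear probe from the difference of class means (an assumption,
    used here instead of a trained logistic-regression probe)."""
    mu_true = acts[labels == 1].mean(axis=0)
    mu_false = acts[labels == 0].mean(axis=0)
    direction = mu_true - mu_false
    # Decision threshold: projection of the midpoint between the two means.
    threshold = 0.5 * (mu_true + mu_false) @ direction
    return direction, threshold


def probe_accuracy(acts, labels, direction, threshold):
    """Fraction of examples whose projection falls on the correct side."""
    preds = (acts @ direction > threshold).astype(int)
    return float((preds == labels).mean())


def evaluate_head(shift, seed):
    """Train on the first 300 examples, report accuracy on the last 100."""
    acts, labels = make_synthetic_head_activations(shift=shift, seed=seed)
    direction, threshold = fit_mean_difference_probe(acts[:300], labels[:300])
    acc = probe_accuracy(acts[300:], labels[300:], direction, threshold)
    return acc, direction


acc_informative, direction = evaluate_head(shift=3.0, seed=0)
acc_uninformative, _ = evaluate_head(shift=0.0, seed=1)
print(f"informative head:   held-out accuracy {acc_informative:.2f}")
print(f"uninformative head: held-out accuracy {acc_uninformative:.2f}")
```

In this toy setting, the head whose activations differ between the two classes should score well above chance on held-out data, while the head with no built-in difference should hover near 50%. Ranking heads by held-out probe accuracy is one simple way to find the "certain attention heads" that carry the signal.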