
DeceptionAI
Frontpage

Identifying "Deception Vectors" In Models

by Stephen Martin
9th Jun 2025

This is a linkpost for https://arxiv.org/pdf/2506.04909

Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) for 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling the specific honesty-related issue of reasoning models and providing tools for trustworthy AI alignment.


This seems like a positive breakthrough for mech interp research generally: the team used RepE to identify features and were able to "reliably suppress or induce strategic deception".
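
For readers who have not seen this kind of pipeline, here is a minimal sketch of the general idea on synthetic data. It is not the paper's code. It assumes the common representation-engineering recipe of taking differences between paired activations, randomly flipping their signs, and using the first principal component as the "reading vector", then detecting by projection and steering by adding the vector back in. All names (`extract_direction`, `detect`, `steer`, `true_dir`) and the toy Gaussian "activations" are my own assumptions for illustration; a real run would use hidden states from an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 400  # hypothetical hidden size and number of paired examples

# Assumption: a single planted direction separates "deceptive" from "honest".
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)


def sample(n_samples, deceptive):
    """Toy stand-in for hidden states: unit Gaussian noise plus an optional shift."""
    acts = rng.normal(size=(n_samples, d))
    return acts + 4.0 * true_dir if deceptive else acts


def extract_direction(deceptive_acts, honest_acts, rng):
    """First principal component of randomly signed paired differences."""
    diffs = deceptive_acts - honest_acts
    # Random signs keep the mean difference near zero, so centering
    # does not remove the very direction we are looking for.
    signs = rng.choice([-1.0, 1.0], size=(len(diffs), 1))
    diffs = diffs * signs
    diffs -= diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]  # unit-norm top principal direction
    # PCA leaves the sign ambiguous; orient it toward the deceptive class.
    if deceptive_acts.mean(axis=0) @ direction < honest_acts.mean(axis=0) @ direction:
        direction = -direction
    return direction


train_dec, train_hon = sample(n, True), sample(n, False)
direction = extract_direction(train_dec, train_hon, rng)

# Detection: project onto the direction and threshold at the class midpoint.
threshold = 0.5 * ((train_dec @ direction).mean() + (train_hon @ direction).mean())


def detect(acts):
    return acts @ direction > threshold


def steer(acts, alpha):
    """Activation steering: shift activations along the extracted direction."""
    return acts + alpha * direction


test_dec, test_hon = sample(n, True), sample(n, False)
accuracy = 0.5 * (detect(test_dec).mean() + (~detect(test_hon)).mean())
induced_rate = detect(steer(test_hon, alpha=4.0)).mean()
cosine = float(direction @ true_dir)

print(f"cosine with planted direction: {cosine:.3f}")
print(f"held-out detection accuracy:   {accuracy:.3f}")
print(f"honest flagged after steering: {induced_rate:.3f}")
```

On this toy setup the numbers come out far cleaner than anything reported in the abstract, since the data has exactly one planted linear direction. The interesting empirical claim in the paper is that something like this works at all on real reasoning models.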