Kaiyuan Zhang — LessWrong

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Kaiyuan Zhang · 2y

Insightful work on how backdoor behaviors persist in LLMs despite safety training! In our recent IEEE S&P paper (https://orthoglinearbackdoor.github.io/), we show theoretically that orthogonality and linearity are key to understanding why some backdoor attacks evade standard defenses. These insights open the door to further collaborative research on AI security challenges. We would love to dive deeper into this with your team!