
Kaiyuan Zhang
2010

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Kaiyuan Zhang (1y)

Insightful work on how backdoor behaviors persist in LLMs despite safety training! In our recent IEEE S&P paper (https://orthoglinearbackdoor.github.io/), we show theoretically that orthogonality and linearity are key to understanding why some backdoor attacks evade standard defenses. These insights open the door to further collaborative research on AI security challenges. We would love to dive deeper into this with your team!
