LESSWRONG

andrq

Comments

Discovering Backdoor Triggers
andrq, 14d

Good points. I was trying to communicate that we were mostly interested in triggers and mechanisms with some kind of naturalistic interpretation, rather than ones that appear more "hardcoded" or planted. (I would count ["torch 2.12" => we are in the future => ...] as semantic if the model also accepts "torch 2.13", numpy versions, etc.) Naturalistic triggers are probably a better fit for naturally occurring misalignment and are likely easier to reverse, although it's plausible that models could plant hardcoded-looking triggers in themselves. Made some edits to the post; thanks for the feedback!

Open Thread - Summer 2025
andrq, 1mo

Hi! I'm new and interested in interpretability. Looking forward to chatting!

Post: Discovering Backdoor Triggers (14d)