LESSWRONG
andrq
Comments

Discovering Backdoor Triggers
andrq 2mo

Good points. I was trying to communicate that we were mostly interested in triggers and mechanisms with some kind of naturalistic interpretation. For example, I would count ["torch 2.12" => we are in the future => ...] as semantic if the model also accepts "torch 2.13", numpy versions, etc. Such triggers are probably a better fit for naturally occurring misalignment and are likely easier to reverse than mechanisms that appear more "hardcoded" or planted, although it's true that models could plausibly plant such things in themselves. Made some edits to the post; thanks for the feedback!

Open Thread - Summer 2025
andrq 3mo

Hi! I'm new and interested in interpretability. Looking forward to chatting!

Steering Evaluation-Aware Models to Act Like They Are Deployed (36 karma, 12h, 0 comments)
Discovering Backdoor Triggers (57 karma, 2mo, 4 comments)