andrq — LessWrong

Good points. I think I was trying to communicate that we were mostly interested in triggers and mechanisms that have some kind of naturalistic interpretation. For example, I would count ["torch 2.12" => we are in the future => ...] as semantic if the model also accepts "torch 2.13", numpy versions, etc. Such triggers are probably a better fit for naturally occurring misalignment and are likely easier to reverse than mechanisms that appear more "hardcoded" or planted, although it's true that models could plausibly plant such things in themselves. Made some edits to the post; thanks for the feedback!
