avichal — LessWrong

Thanks!

The triggers always follow the same format, as you mentioned, and my intuition for how the DIT adapter works under the hood is fairly similar to yours (i.e. elicit the concept and then format it in the right way). We do an initial test of OOD triggers in Section 4.3 of the paper, although generalizing to more diverse triggers will probably require more diverse training data (and how to go further and cover all tail-end triggers/behavior is still an open question in my mind).

The ideas you mentioned also sound interesting! One experiment we tried earlier on: instead of using a DIT adapter, we trained a soft prompt that we could append to the original model's input to make it match the behavior of the finetuned model (essentially distilling the weight changes into embedding space). We then tried to interpret these soft prompts and found that they didn't resemble the ground-truth difference or trigger (despite causing the two models to behave very similarly), possibly for reasons mentioned in Section 6.2. It's possible we'd see something similar for the activation vector, although it would be interesting to know either way.
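
For anyone who wants to play with the soft-prompt idea, here is a minimal toy sketch of the general technique (not our actual setup). Everything in it is an assumption made for illustration: the "model" is just a linear readout over mean-pooled embeddings, the "finetuned" model is the same readout with perturbed weights, and names like `train_soft_prompt`, `W_base`, and `W_ft` are hypothetical. It learns a few soft-prompt vectors by gradient descent on the KL divergence between the finetuned model's outputs and the base model's outputs with the soft prompt added to its input.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def student_logits(W, X, P):
    # Toy "base model + soft prompt": mean-pool the input embeddings X
    # (batch, n, d) together with the soft-prompt rows P (k, d), then apply
    # a linear readout W (vocab, d).
    n, k = X.shape[1], P.shape[0]
    pooled = (X.sum(axis=1) + P.sum(axis=0)) / (n + k)
    return pooled @ W.T

def kl(p, q):
    # Mean KL(p || q) over the batch.
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def train_soft_prompt(W_base, W_ft, X, k=4, lr=0.5, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape[1], X.shape[2]
    P = 0.01 * rng.standard_normal((k, d))     # soft prompt, trained below
    target = softmax(X.mean(axis=1) @ W_ft.T)  # toy finetuned model's outputs
    losses = []
    for _ in range(steps):
        q = softmax(student_logits(W_base, X, P))
        losses.append(kl(target, q))
        # d(KL)/d(logits) = q - target; chain through the readout and pooling.
        grad_pooled = (q - target) @ W_base             # (batch, d)
        grad_row = grad_pooled.mean(axis=0) / (n + k)   # same for every row of P
        P -= lr * grad_row                              # base weights stay frozen
    return P, losses

# Hypothetical toy sizes and random weights, purely for illustration.
rng = np.random.default_rng(1)
vocab, d, n, batch = 8, 16, 5, 32
W_base = rng.standard_normal((vocab, d))
W_ft = W_base + 0.3 * rng.standard_normal((vocab, d))
X = rng.standard_normal((batch, n, d))

P, losses = train_soft_prompt(W_base, W_ft, X)
print(f"KL before: {losses[0]:.4f}  after: {losses[-1]:.4f}")
```

In a real transformer the soft prompt would be extra embedding rows fed through the frozen network and trained with autograd; the toy only shows the shape of the objective. It says nothing about whether the learned vectors are interpretable.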
