Figure 1: A demonstration of our method on Qwen3-8B. With the adapter applied, the model is able to answer questions about its finetuning changes. Try it yourself here.
WeightDiffQA
Our paper introduces and attempts to solve a task we call WeightDiffQA[1]:
Given a language model M, a weight diff δ, and a natural language question q about δ, output a correct natural language answer to q.
Here, a "weight diff" is any change that can... (read 1270 more words →)
Thanks!
The triggers always follow the same format you mentioned, and my intuition for how the DIT adapter works under the hood is fairly similar to yours (i.e., elicit the concept and then format it in the right way). We do an initial test of OOD triggers in Section 4.3 of the paper, although generalizing to more diverse triggers will probably require more diverse training data (and how to go further and cover all tail-end triggers/behaviors is still an open question in my mind).
The ideas you mentioned also sound interesting! I'll mention one experiment we tried early on: instead of using a DIT adapter, we tried to train a soft prompt...
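For readers unfamiliar with the term, here is a minimal sketch of what such a soft prompt looks like: a small set of trainable "virtual token" embeddings prepended to the input embeddings while the base model stays frozen. The class and dimensions below are illustrative, not the experiment's actual code.

```python
# Minimal soft-prompt sketch (illustrative): learn continuous "virtual
# token" embeddings prepended to every input; the base model is frozen.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_virtual_tokens: int, embed_dim: int):
        super().__init__()
        # Trainable virtual-token embeddings, small random init.
        self.prompt = nn.Parameter(
            torch.randn(n_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the soft prompt to each sequence in the batch.
        batch = input_embeds.shape[0]
        return torch.cat([self.prompt.expand(batch, -1, -1),
                          input_embeds], dim=1)

# Usage: embed tokens with the frozen model's embedding layer, prepend
# the soft prompt, and feed the result to the model via inputs_embeds.
# Only soft_prompt.parameters() are passed to the optimizer.
```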