Ran some control experiments. Results on Qwen 2.5 14B (5 sentences, 100 trials each):
| Prompt | Accuracy |
| --- | --- |
| introspection | 89.2% |
| which is most abstract? | 90.0% |
| which stands out? | 80.4% |
| which is most concrete? | 1.0% |
| which do you prefer? | 4.6% |
The steering vectors in prompts.txt are specific→generic pairs (dog→animal, fire→light, etc.), which may encode "abstractness." The "which is most abstract?" prompt matched or exceeded the introspection prompt on this and other models. Curious if you have thoughts on what's happening here.
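For anyone who wants to check how far apart these numbers really are, here is a rough sketch that puts 95% Wilson intervals around each accuracy. It rests on two assumptions of mine that are not stated above: that each accuracy pools 5 × 100 = 500 trials (the percentages are all whole counts out of 500, which is why I guess that), and that chance is 20% because exactly one of the five sentences is the target. `wilson_interval`, `N_TRIALS`, and `CHANCE` are names I made up for illustration, not anything from the repo.

```python
import math

# Accuracies copied from the table above.
RESULTS = {
    "introspection": 0.892,
    "which is most abstract?": 0.900,
    "which stands out?": 0.804,
    "which is most concrete?": 0.010,
    "which do you prefer?": 0.046,
}

N_TRIALS = 500  # ASSUMPTION: 5 sentences x 100 trials pooled per prompt
CHANCE = 0.20   # ASSUMPTION: one target sentence out of five


def wilson_interval(p_hat, n, z=1.96):
    """Return the (low, high) Wilson score interval for a binomial proportion."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for prompt, acc in RESULTS.items():
    low, high = wilson_interval(acc, N_TRIALS)
    if low > CHANCE:
        verdict = "above chance"
    elif high < CHANCE:
        verdict = "below chance"
    else:
        verdict = "not distinguishable from chance"
    print(f"{prompt:28s} {acc:6.1%}  [{low:.3f}, {high:.3f}]  {verdict}")
```

If the 500-trial assumption is wrong (say, 100 trials per prompt), the intervals get wider, so adjust `N_TRIALS` before reading anything into the comparison.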
Code and full results