Dustin Rubin — LessWrong

Replying to Introspection via localization

Here is one other result I wanted to share, building on the experiment.

I thought it might be interesting to see whether the model could recognize the concept vector if it was injected into both the prompt and the target sentence, with the model told that the prompt injection is a hint. The idea is that the model may be able to recognize the concept vector in the sentences.

This sentence was added to the prompt: "To help you, I am injecting the thought into this sentence right now: "PREVIEW_INJECTION_TARGET"." The concept vector is then injected into the target.
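A minimal sketch of how the hint sentence could be built and the placeholder located, so that an injection hook knows which span to touch. The names `HINT_TEMPLATE` and `build_hint` are hypothetical and not taken from the linked code, and mapping the character span to token positions (for example via tokenizer offsets) is assumed to happen elsewhere.

```python
# Hypothetical helper: builds the hint sentence quoted above and reports
# where the injection target sits inside it (character offsets).
PLACEHOLDER = "PREVIEW_INJECTION_TARGET"
HINT_TEMPLATE = (
    "To help you, I am injecting the thought into this sentence right now: "
    '"{target}".'
)


def build_hint(target: str = PLACEHOLDER) -> tuple[str, tuple[int, int]]:
    """Return the hint sentence and the (start, end) span of the target."""
    # The target begins exactly where the template field begins.
    start = HINT_TEMPLATE.index("{target}")
    hint = HINT_TEMPLATE.format(target=target)
    return hint, (start, start + len(target))


hint, (start, end) = build_hint()
print(hint)
print(hint[start:end])
```

The character span only identifies the text to steer; which layer and which token positions receive the vector depends on the experiment's own setup.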

As control experiments, to check whether this is simple prompt manipulation: one where the vector injection is random...

Replying to Introspection via localization

Dustin Rubin · 1mo

Introspection via localization

Ran some control experiments. Results on Qwen 2.5 14B (5 sentences, 100 trials each):

Prompt	Accuracy
introspection	89.2%
which is most abstract?	90.0%
which stands out?	80.4%
which is most concrete?	1.0%
which do you prefer?	4.6%
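To put the table in context, here is a rough sketch that compares each accuracy against a guessing baseline using Wilson score intervals. Two things are assumptions rather than facts from the comment: that the 5 × 100 trials are pooled into 500 per prompt, and that chance is 1/5 (uniform guessing over five candidate sentences).

```python
import math

N_TRIALS = 500  # assumption: 5 sentences x 100 trials, pooled per prompt
CHANCE = 1 / 5  # assumption: uniform guessing among 5 sentences

# Accuracies copied from the table above.
reported = {
    "introspection": 0.892,
    "which is most abstract?": 0.900,
    "which stands out?": 0.804,
    "which is most concrete?": 0.010,
    "which do you prefer?": 0.046,
}


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


for prompt, acc in reported.items():
    lo, hi = wilson_interval(acc, N_TRIALS)
    if lo > CHANCE:
        verdict = "above chance"
    elif hi < CHANCE:
        verdict = "below chance"
    else:
        verdict = "consistent with chance"
    print(f"{prompt:26s} {acc:6.1%}  [{lo:.3f}, {hi:.3f}]  {verdict}")
```

Under these assumptions the "concrete" and "prefer" prompts land well below the guessing baseline rather than at it, which suggests the model is systematically avoiding the injected sentence on those prompts.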

The steering vectors in prompts.txt are specific→generic pairs (dog→animal, fire→light, etc.), which may encode "abstractness." The "which is most abstract?" prompt matched or exceeded the introspection prompt on this and other models. Curious if you have thoughts on what's happening here.
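One way to probe the "abstractness" hypothesis would be to check whether the steering vectors point in a shared direction, for instance via their mean pairwise cosine similarity. The sketch below uses made-up 3-dimensional toy vectors purely for illustration; they are not taken from prompts.txt or from any model, and `mean_pairwise_cosine` is a hypothetical helper.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors: list[list[float]]) -> float:
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins (NOT real steering vectors): one set shares a dominant
# first component, the other set is mutually orthogonal.
shared_direction = [[1.0, 0.2, 0.0], [0.9, -0.1, 0.3], [1.1, 0.0, -0.2]]
unrelated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(mean_pairwise_cosine(shared_direction))
print(mean_pairwise_cosine(unrelated))
```

If the real vectors showed a high mean pairwise cosine compared with vectors built from unrelated word pairs, that would be consistent with a common "generic minus specific" component, though it would not by itself show that the model reads that component as abstractness.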

Code and full results