Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
TL;DR: We find that steering vectors extracted from a text-only LLM backbone can enhance visual reasoning in multimodal LLMs (MLLMs). The technique is simple: extract textual representations for concepts like "spatial relationships" and "counting" from the LLM backbone, then apply them to the vision-language model's hidden states. This steering not only changes how the model interprets visual content but also yields meaningful improvements (for example, 15.8% on spatial-reasoning tasks and 34.2% on counting tasks), suggesting that these models maintain unified representations that bridge text and vision.
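To make the pipeline concrete, here is a minimal sketch of one common way to implement this kind of steering: build a mean-difference (contrastive) steering vector from text-only prompts in the LLM backbone, then add it to a decoder layer's hidden states via a forward hook during multimodal inference. The function names, the contrastive-prompt recipe, the layer index, and the `alpha` scale below are illustrative assumptions, not necessarily our exact extraction method.

```python
import torch

@torch.no_grad()
def concept_steering_vector(lm, tokenizer, pos_prompts, neg_prompts, layer_idx):
    """Mean-difference steering vector for a concept, computed from the
    text-only LLM backbone using last-token activations at layer_idx.
    (Illustrative recipe; the actual extraction procedure may differ.)"""
    def mean_hidden(prompts):
        states = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").to(lm.device)
            out = lm(**ids, output_hidden_states=True)
            # hidden_states[layer_idx] has shape (1, seq_len, d_model); keep the last token
            states.append(out.hidden_states[layer_idx][0, -1])
        return torch.stack(states).mean(dim=0)
    return mean_hidden(pos_prompts) - mean_hidden(neg_prompts)

def add_steering_hook(layer, vector, alpha=4.0):
    """Register a forward hook that adds alpha * vector to the layer's output
    hidden states on every forward pass (applies during multimodal decoding too)."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)
```

Usage would look roughly like the following for a LLaVA-style MLLM (attribute paths vary by model family, so treat these as placeholders): compute `vec = concept_steering_vector(mllm.language_model, tokenizer, pos_prompts, neg_prompts, layer_idx=20)` from text-only prompts that do and do not express the target concept, attach it with `handle = add_steering_hook(mllm.language_model.model.layers[20], vec)`, run image-plus-text generation as usual, and call `handle.remove()` afterward.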
The Core Insight: Text Steers Vision
Here's something that surprised us: you...