Text Steers Vision
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
TL;DR: We found that steering vectors from text-only models can enhance visual reasoning in multimodal LLMs (MLLMs). The technique is simple: extract textual representations for concepts like "spatial relationships" and "counting" from the LLM backbone...
Jun 1, 2025
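
To make the idea in the TL;DR concrete, here is a minimal sketch of how a steering vector for a concept might be built and applied. The summary above is truncated and does not say how the vectors are extracted, so this uses the mean-difference recipe as a stand-in: it is one common way to build steering vectors, not necessarily the authors' method. The function names, the hidden size, and the random arrays standing in for activations are all hypothetical; a real setup would read activations from a layer of the LLM backbone.

```python
import numpy as np


def mean_difference_vector(concept_acts, baseline_acts):
    """Steering vector as the difference of mean activations.

    Both inputs have shape [num_prompts, hidden]; the result has shape [hidden].
    ASSUMPTION: mean-difference is an illustrative choice, not the post's stated method.
    """
    return concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)


def apply_steering(hidden_states, vector, alpha=1.0):
    """Add the scaled vector to every token position of [num_tokens, hidden]."""
    return hidden_states + alpha * vector


# Synthetic stand-ins for activations collected from text-only prompts
# (hypothetical data; no real model is involved here).
rng = np.random.default_rng(0)
hidden = 8
concept_direction = np.zeros(hidden)
concept_direction[0] = 1.0

# Activations for prompts that mention the concept (e.g. "counting") are
# shifted along one direction; baseline prompts are not.
concept_acts = rng.normal(size=(32, hidden)) + 2.0 * concept_direction
baseline_acts = rng.normal(size=(32, hidden))

steer = mean_difference_vector(concept_acts, baseline_acts)

# Stand-in for the hidden states of a multimodal input (image + question tokens).
mllm_hidden = rng.normal(size=(5, hidden))
steered = apply_steering(mllm_hidden, steer, alpha=4.0)
```

The scale `alpha` controls how strongly the concept direction is pushed; in practice it would need tuning per layer and per task.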