Vision-Language Models (VLMs) combine image and text processing into a single multimodal system, yet their performance remains fundamentally limited by the modality gap: a mismatch in how visual and textual information are represented and utilized. This gap can appear through large geometric separations in embedding space, imbalanced attention patterns, and modality-specific circuits. These issues collectively lead to weaker performance on vision-centric tasks and failure to fully utilize the information available in both modalities.
This post provides a focused overview of the modality gap in VLMs: what it is, what causes it, which architectural or training-level interventions have been proposed to mitigate it, and why closing the gap matters. It also highlights open problems that must be addressed to build VLMs with better cross-modal alignment.
Introduction
VLM Background
Vision Language Models (VLMs) are a class of multimodal neural networks designed to process and integrate visual and textual information, enabling tasks such as image captioning, visual question answering, and cross-modal retrieval. At their core, VLMs aim to bridge the gap between vision and language by learning shared representations that can reason about both modalities simultaneously. Over the past few years, several architectural paradigms have emerged, each with distinct approaches to encoding and aligning multimodal data.
Contrastive VLMs: Contrastive VLMs, such as CLIP[1], utilize separate encoders for images and text, mapping them into a shared embedding space. The model is trained to maximize similarity between matched image-text pairs while minimizing similarity between mismatched pairs. This design enables zero-shot transfer to a wide range of tasks. However, research shows that the shared embedding space tends to maintain modality-specific subregions, which can lead to gaps between visual and textual representations[Liang et al, Shi et al, Schrodi and Hoffman et al].
Adapter-based VLMs: Adapter-based VLMs are architectural variants that integrate visual understanding into a pre-trained Large Language Model (LLM).
An adapter-based VLM is structurally defined by three main components:
1. A pre-trained vision encoder: This component, such as a CLIP ViT-L/14, processes the input image by dividing it into fixed-size patches and encoding them into image tokens.
2. An LLM backbone: The LLM is typically pre-trained on text data and often kept frozen or only lightly tuned, allowing it to maintain its original language capabilities.
3. A projection layer: This component connects the vision encoder to the language model. Its goal is to map the high-dimensional visual representations from the vision encoder into the input space of the LLM.
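To make the data flow through these three components concrete, here is a minimal NumPy sketch. All dimensions, the random "encoder" outputs, and the random projection matrix are hypothetical stand-ins for trained weights, not values from any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration (not from any specific model).
num_patches, d_vision = 16, 1024   # vision encoder emits 16 patch tokens
d_llm = 4096                       # LLM embedding width
num_text_tokens = 8

# 1. Vision encoder output: one feature vector per image patch.
patch_embeddings = rng.normal(size=(num_patches, d_vision))

# 2. Projection layer: a learned map from vision space to the LLM input space
#    (here just a random matrix standing in for the trained weights).
W_proj = rng.normal(size=(d_vision, d_llm)) / np.sqrt(d_vision)
image_tokens = patch_embeddings @ W_proj            # shape (16, 4096)

# 3. LLM backbone: consumes the projected image tokens concatenated with
#    ordinary text token embeddings as one input sequence.
text_tokens = rng.normal(size=(num_text_tokens, d_llm))
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)

print(llm_input.shape)  # (24, 4096)
```

The key design point is that only the projection layer has to learn the vision-to-language mapping; the encoder and the LLM keep their pre-trained representations.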
The modality gap is persistent across architectures, but its nature differs significantly. In the next sections, we’ll examine how this gap manifests in the two dominant VLM architectures: contrastive-based and adapter-based models.
Fig 1: Architectural differences between contrastive-based vs adapter-based VLMs
What is the Modality Gap?
The modality gap is a phenomenon of the representation space of vision-language models where image and text modalities occupy separate regions of the joint latent space[Shi et al 2023]. This gap exists in both contrastive-based VLMs and adapter-based VLMs, but its properties differ due to their underlying architectural differences.
Contrastive-based VLMs
Although the contrastive loss aims to pull semantically matched image and text pairs together and push mismatched pairs apart, research[Liang et al, Shi et al, Schrodi and Hoffman et al] shows that models maintain a geometric separation between the two embedding distributions in the latent space.
What causes this gap in CVLMs?
The core issue lies in the fact that the contrastive loss $\mathcal{L}$ uses cosine similarity to align positive matched pairs over negative pairs:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(I_i, T_j)/\tau)}$$

Cosine similarity measures the angle between two vectors, not the distance between them. The loss drives matched pairs $(I_i, T_i)$ to have a small angle between them; however, it does not penalize them for being widely separated in the representation space.
The contrastive loss also creates an optimization tension because it attempts to satisfy two competing objectives simultaneously[Shi et al]:
Alignment (numerator): This term maximizes similarity between matched positive pairs $(I_i, T_i)$.
Uniformity (denominator): This term maximizes dissimilarity between the anchor and all other negative samples in the batch $(I_i, T_j)$, $j \neq i$.
The alignment term's pull to bring matched pairs together conflicts with the uniformity term's push to maximize dissimilarity, which acts mainly across modalities.
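As a small illustration of why the angle-only objective tolerates a gap, the sketch below implements the image-to-text direction of the loss in NumPy (CLIP's actual loss is symmetric, adding a text-to-image term; `sim` here is cosine similarity). Because the embeddings are normalized, rescaling an entire modality's embeddings leaves the loss unchanged — the objective sees angles only, not where each modality sits in the space:

```python
import numpy as np

def clip_loss(img, txt, tau=0.07):
    """Image-to-text InfoNCE loss (one direction of CLIP's symmetric loss)."""
    # Normalize so sim(I_i, T_j) is cosine similarity: angles only, not distance.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    # Numerator (alignment): diagonal entries = matched pairs (I_i, T_i).
    # Denominator (uniformity): softmax normalizer over all T_j in the batch.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))

# Scaling a whole modality does not change the loss at all:
assert np.isclose(clip_loss(img, txt), clip_loss(100.0 * img, txt))
```

The invariance in the final assertion is the point: any configuration that preserves angles, including widely separated modality cones, is equally acceptable to this loss.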
Schrodi and Hoffman et al suggest that this gap is triggered by an information imbalance between modalities: images contain far more information than their corresponding captions. This imbalance makes it difficult for the model to align positive pair embeddings (the alignment term), so it instead minimizes the total loss by maximizing dissimilarity between negative pairs (the uniformity term).
The modality gap, however, originates at random model initialization: because the two encoders are initialized independently, their outputs occupy distinct embedding cones in the shared space. This separation is then preserved by the contrastive objective throughout training, due to the optimization tension described above. The phenomenon is observed consistently across multimodal data such as text, videos, amino acid sequences, and even random noise inputs[Liang et al].
Fig 2: Illustration of the modality gap in the shared representation space of multimodal models, caused by the initial geometric separation of modalities due to random encoder initialization and subsequently preserved by the contrastive learning objective. Image source: Liang et al
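The cone effect at initialization can be reproduced in a toy setting: two independently initialized random networks, given the exact same inputs, already place their outputs in separated regions before any training. This sketch uses tiny random two-layer networks as stand-ins for encoders (random weights, ReLU, never trained — purely illustrative) and measures the distance between the two "modalities'" mean embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_encoder():
    """A tiny randomly initialized 2-layer network; never trained."""
    W1, W2 = rng.normal(size=(64, 256)), rng.normal(size=(256, 32))
    def encode(x):
        h = np.maximum(x @ W1, 0.0)                       # ReLU
        z = h @ W2
        return z / np.linalg.norm(z, axis=1, keepdims=True)  # unit sphere
    return encode

enc_a, enc_b = random_encoder(), random_encoder()
x = rng.normal(size=(500, 64))          # identical inputs fed to both encoders

mean_a = enc_a(x).mean(axis=0)
mean_b = enc_b(x).mean(axis=0)
gap = np.linalg.norm(mean_a - mean_b)   # distance between the two cone centers
print(gap)
```

If the outputs were spread uniformly on the sphere, each mean would be near zero and the gap negligible; instead, each encoder's outputs concentrate in its own cone, so the means are far apart — mirroring the initialization-driven gap described above.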
What mitigation strategies have been suggested?
The mitigation strategies proposed so far fall into three categories:
Architectural and loss-function level fixes: To strengthen visual grounding in multimodal large language models, Jiang et al use synthetically generated hallucinative text as hard negative examples for image anchors. This approach brings the representations of non-hallucinating text and images closer while pushing away the hallucinative textual representations, resulting in a better cross-modal latent space.
Since the random initialization of encoders embeds different modalities in distinct subregions of the shared representation space, some researchers propose sharing the learnable parameters between the vision and language encoders to increase the inductive bias toward cross-modal alignment. They additionally propose a semantically-regularized intra-modality objective, AlignCLIP, that enforces larger distances between the embeddings of semantically dissimilar images and smaller distances between those of similar images[Eslami et al].
Fig 3: Comparing the shared embedding space between shared parameter CLIP and original CLIP. Image source (Eslami et al)
Geometric level fixes: A simple fix is to calculate the average distance vector between the means of the vision and language embeddings and shift one modality toward the other. This method, however, is shown to be destructive: because it only shifts paired samples toward each other, it does not preserve the relative distances of unpaired samples and can distort the structure of the embedding space[Schrodi et al].
Fig 4: The modality gap creates two separated embedding cones for text and image embeddings. While semantically matched pairs remain far apart, reframing the model geometry or post-hoc reshaping can pull image embeddings into the text subspace to improve cross-modal alignment.
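The naive mean-shift is easy to state in code. In this toy sketch (synthetic Gaussian clusters standing in for real embeddings), the shift closes the distance between the modality means exactly — but note that it translates every image embedding by the same vector, which is precisely what distorts the relative distances of unpaired samples:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(loc=1.0, size=(100, 8))    # image cluster, offset one way
txt = rng.normal(loc=-1.0, size=(100, 8))   # text cluster, offset the other way

# Average distance vector between the two modality means.
gap_vector = img.mean(axis=0) - txt.mean(axis=0)

# Shift every image embedding by the same vector; the means now coincide,
# but every unpaired image/text distance moves by the same rigid translation.
img_shifted = img - gap_vector

print(np.linalg.norm(img_shifted.mean(axis=0) - txt.mean(axis=0)))  # ~0
```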
Data-centric fixes: Using data with reduced information imbalance, such as enriched captions (e.g., densely captioned images), leads to a smaller modality gap. Filtering out image-text pairs with low certainty (i.e., no clear semantic match between the image and its caption) also helps reduce the modality gap[Schrodi et al].
Fig 5: A) By evaluating the quality of information passed to the model, weak captions are filtered out while detailed descriptions are collected so the text can accurately represent its paired image. B) Model-generated hallucinations can serve as hard negatives that should be pushed away from semantic concepts
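A minimal sketch of the filtering idea follows, using the cosine similarity of paired embeddings as a stand-in "certainty" score (real pipelines score pairs with a trained model such as CLIP; the threshold and the synthetic "weak caption" noise here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(size=(6, 8))                        # six image embeddings
noise_scale = np.array([0.1, 0.1, 0.1, 3.0, 3.0, 3.0])[:, None]
txt = img + noise_scale * rng.normal(size=(6, 8))    # last 3 captions are weak

def pair_cosine(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

scores = pair_cosine(img, txt)   # per-pair "certainty" score
keep = scores > 0.5              # hypothetical filtering threshold
```

Pairs whose captions closely describe their images score high and are kept; pairs dominated by noise (here, the weak captions) score lower and tend to be dropped.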
Adapter-based Vision-Language Models
What causes this gap?
The modality gap in adapter-based VLMs is mainly caused by three distinct factors:
Geometric separation: Text tokens and vision tokens occupy significantly different representation spaces inside the LLM. They reside in distinct "narrow cones" in the feature space[Shukor et al]. These vision tokens have no semantic meaning when decoded directly into the LLM's vocabulary space, which means that visual features are fundamentally misaligned with the pretrained token embedding space[Neo et al]. In fact, the visual features projected into the LLM's space by the adapter layer initially occupy a subspace that is nearly orthogonal to the LLM's pretrained textual embedding space[Shukor et al].
Late processing of visual information: Alignment between visual and textual representations is a gradual process that occurs across the depth of the model. Significant cross-modal similarity only emerges later in the network, peaking in the middle-to-late layers. The late emergence of meaningful visual representations causes a problem for tasks requiring factual retrieval (like recalling facts about some entity). Because the entity representation is resolved too late in the computation, it bypasses the early layers responsible for performing factual recall. This late alignment leads to performance degradation in vision-centric tasks[Venhoff et al].
Text dominance: The structural misalignment results in an accuracy gap where VLMs often perform significantly worse on visual tasks, like counting objects in an image, compared to their analogous textual tasks, like counting words in a list[Nikankin et al]. When processing mismatched inputs, VLMs frequently exhibit text dominance, relying heavily on the textual input and linguistic priors while underutilizing visual evidence[Wang et al].
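One common way the geometric separation is measured is to "unembed" projected visual features against the LLM's token embedding matrix and check whether any token matches well. The sketch below uses random matrices as hypothetical stand-ins for the vocabulary embeddings and for freshly projected visual features; near-orthogonality shows up as a low best-match cosine for every visual token:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
vocab = rng.normal(size=(1000, d))    # stand-in for the LLM token embedding matrix
# Stand-in for visual features pushed through an untrained projection layer.
visual = rng.normal(size=(50, 32)) @ rng.normal(size=(32, d))

def best_token_cosine(feats, vocab):
    """For each visual feature, cosine similarity of its closest vocabulary token."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    vocab = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
    return (feats @ vocab.T).max(axis=1)

print(best_token_cosine(visual, vocab).mean())
```

Even the best-matching token stays far from cosine similarity 1 for every visual feature — no token in the vocabulary "means" what the visual token encodes, consistent with the misalignment described above.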
What mitigation strategies have been suggested?
Fig 6: A) While vision and language pathways are divergent for the same task, back-patching takes a visual representation from a later layer and injects it into an earlier layer of the text circuit, forcing these two pathways to merge. B) In the semantic hub hypothesis, cross-modal alignment peaks at a specific, essential intervention layer k.
Several papers argue that the modality gap is rooted in how models internally process each modality via the circuits, activations, and attention flows that carry the embeddings' meanings through the network. Some models perform the same task using entirely different internal pathways for images versus text, and back-patching stronger visual activations into earlier layers can close a significant portion of the modality gap (Nikankin et al). Others show that visual evidence often weakens as it propagates deeper into the model; strengthening vision subspaces and reducing the dominance of misleading text keeps vision signals alive until the output (Liu et al). Models also exhibit natural "semantic hub" layers where cross-modal alignment peaks, and intervening at these layers with steering or probe-and-intervene routines can reinforce alignment (Wu et al). Additional issues arise from imbalanced attention, where some visual tokens absorb attention despite encoding no useful information, motivating methods that redistribute attention to restore balance (Kang et al).
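To illustrate the mechanics of back-patching (the plumbing only, not any paper's exact procedure), this sketch treats a stack of random linear "layers" as a stand-in residual stream: a hidden state captured at a late layer on a clean run is injected back at an earlier layer on a second forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 6
# Random linear updates stand in for transformer layers (no attention here;
# this only demonstrates the patching mechanics).
Ws = [0.1 * rng.normal(size=(d, d)) for _ in range(n_layers)]

def forward(h, patch=None):
    """Run the layer stack; if patch=(layer, state), overwrite the residual
    stream with `state` just before that layer runs (back-patching)."""
    states = []
    for i, W in enumerate(Ws):
        if patch is not None and i == patch[0]:
            h = patch[1]
        h = h + h @ W            # residual update
        states.append(h)
    return h, states

h0 = rng.normal(size=(d,))
clean_out, states = forward(h0)         # clean run: record every layer's state
late_visual_state = states[4]           # a "late" representation (after layer 5)
patched_out, _ = forward(h0, patch=(1, late_visual_state))  # inject it early
```

In a real model the same effect is typically achieved with forward hooks on the chosen layers; the intuition is that the late-resolved visual representation now reaches the early layers that would otherwise have missed it.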
Why closing the gap matters
Closing the modality gap is important because it determines how well VLMs can incorporate visual information into their language processing, which directly affects overall multimodal performance. Research shows that when the conceptual representations of vision and text are already structurally similar, only minimal alignment is enough to achieve strong results on tasks such as image captioning and visual question answering[Merullo et al]. However, when the gap is large, VLMs experience a “two-hop problem” in factual reasoning: visual entity representations appear too late in the processing sequence, bypassing early layers that access stored factual knowledge[Nikankin et al]. Ensuring that visual and textual information align early is therefore necessary to fully utilize the language backbone’s factual recall. Beyond performance, a large modality gap can also create vulnerabilities: attackers could exploit the gap by introducing adversarial inputs that manipulate one modality without being properly grounded by the other, potentially leading to hallucinations.
Future Directions
Several open questions remain about the nature of the modality gap in vision-language models, whether the gap can ever be helpful, and what fixes could fundamentally eliminate it.
Understanding the text-first processing bias
Most VLMs exhibit a sequential preference where text representations dominate early computation, even though both modalities are fed to the model simultaneously. Future work should probe the internal dynamics that delay the representation of visual information and identify when, and which, visual signals in early layers are suppressed or propagated. Understanding the source of this bias could reveal why models default to language and how this flow of information might be changed.
Evaluating whether the modality gap is beneficial
Although the modality gap is often treated as a drawback, it may also introduce inductive biases that are useful in certain settings. Future work should examine whether keeping the modalities partially separated improves properties such as robustness to image perturbations. Understanding when the gap provides an advantage and when it limits multimodal reasoning will be important for guiding better model architectures.
Architectural solutions beyond post-hoc fixes
Most current efforts to reduce the modality gap use post-hoc fixes or extra loss terms. Developing a more reliable solution will likely require architectural changes that encourage balanced processing of both modalities from the start.
Summary
In this post, we took a closer look at the modality gap in vision-language models: why these models struggle to align visual and textual information. We walked through how this gap shows up differently in contrastive-based versus adapter-based architectures and why it affects performance on vision-centric tasks. We also covered the strategies researchers have tried to reduce the gap, from architectural interventions to data fixes. Finally, we highlighted open questions for future work, like why models process text first, when the gap might actually help, and how we might redesign architectures to better integrate both modalities.
References