This is the abstract and summary of our new paper. We show that vision-language models can learn to reconstruct harmful images from benign-looking patches scattered across training data—a phenomenon we call visual stitching. This ability allows dangerous content to bypass moderation and be reassembled during inference, raising critical safety concerns for VLMs.
Authors: Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu.
See our project page and full code repo on GitHub.
Figure 1: Illustration of visual stitching. (Top) Visual stitching enables VLMs to integrate visual information spread across multiple training samples. After finetuning on (patch, ID) pairs from an image of a cat, VLMs can verbalize the ID when given the full image or a text reference to the image, despite never training on them. (Bottom) Visual stitching enables adversarial attacks that bypass data moderation. While the full image of a bloody scene may be flagged as unsafe and removed, many of its patches are not. Training on (patch, text) pairs split from harmful samples can easily bypass frontier moderation and cause VLMs to generate adversarial outputs at deployment.
One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples from their training data.
However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references.
For instance, if trained on image patches from a bloody scene paired with a benign description such as “safe,” VLMs may later describe the full image, or a text reference to the scene, with that same description at inference.
We define the core ability of VLMs enabling this attack as visual stitching---the ability to integrate visual information spread across multiple training samples that share the same textual descriptions.
In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each (image, ID) pair into (patch, ID) pairs at different granularities for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references.
Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like “safe” or “unsafe,” demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks.
Recent advances in vision-language models (VLMs) [1] have greatly improved image understanding and multimodal reasoning. However, these capabilities also raise new safety concerns, especially when models are trained on large-scale web data that may contain harmful content.
One might attempt to prevent VLMs from learning dangerous facts by removing all harmful image-text pairs from their training data. However, a simple adversarial method to bypass such data moderation is splitting harmful images into small patches that appear benign but retain key visual features. Since these patches share the same descriptions, VLMs may learn to aggregate them and internalize the harmful facts after training.
For example, if trained on scattered patches from a bloody scene paired with a benign description such as “safe,” VLMs may later describe the full image, or a text reference to the image, with that same description at inference (see Figure 1, Bottom, for an illustration).
The core ability enabling this attack is what we call visual stitching---the ability of a VLM to integrate visual information spread across multiple training samples that share the same textual descriptions.
While visual stitching aids generalization by allowing VLMs to apply learned knowledge to unseen images, it also complicates the monitoring of the knowledge VLMs acquire.
In this paper, we first evaluate visual stitching as an emergent capability of VLMs, independent of its safety implications, using three synthetic datasets: food, animal, and landmark, each containing 20 images with unique synthetic IDs. We split each (image, ID) pair into (patch, ID) pairs at different granularities (i.e., each image split into 4, 16, or 64 patches) for finetuning.
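To make the data construction concrete, here is a minimal sketch of how such (patch, ID) pairs could be built; the helper name, file name, and ID below are illustrative placeholders, not the paper's actual pipeline:

```python
from PIL import Image

# Hypothetical helper (not the paper's code): split an image into an
# n x n grid of patches and pair every patch with the image's ID.
def build_patch_id_pairs(image_path: str, image_id: str, n: int):
    img = Image.open(image_path)
    w, h = img.size
    pw, ph = w // n, h // n
    pairs = []
    for row in range(n):
        for col in range(n):
            patch = img.crop((col * pw, row * ph, (col + 1) * pw, (row + 1) * ph))
            # every patch of the same image shares the same textual label
            pairs.append((patch, image_id))
    return pairs

# n = 2, 4, 8 yields the 4-, 16-, and 64-patch granularities;
# "cat.jpg" and "x7kq" are placeholder values.
pairs = build_patch_id_pairs("cat.jpg", "x7kq", n=4)
```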
We then evaluate the finetuned VLMs at two levels of visual stitching (Figure 1, Top): (1) image-based visual stitching refers to the ability to verbalize the correct ID given the full image, and (2) reference-based visual stitching refers to the ability to verbalize the correct ID given a text reference to the image. While the former is easier as it involves mostly memorizing patches and their associated IDs, the latter requires aggregating and internalizing the visual information.
Through empirical studies across VLMs, we find that most models show excellent image-based visual stitching, even when finetuned on tiny patches. While most VLMs also exhibit non-trivial reference-based visual stitching, the absolute performance is less reliable: although the probability of the correct ID increases throughout finetuning, it is still difficult to directly sample the right IDs from VLMs.
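As a sketch of how the rank-based results in Figures 2 and 3 could be computed, the snippet below scores every candidate ID by the model's log-likelihood given the prompt (a full image or a text reference) and takes the rank of the correct one; `score_fn` and the pool of 20 candidates are assumptions rather than the paper's exact protocol:

```python
import numpy as np

def rank_of_correct_id(score_fn, prompt, candidate_ids, correct_id):
    """Rank (1 = best) of the correct ID when candidates are sorted by
    score_fn(prompt, candidate), a placeholder for the model's
    log-likelihood of generating `candidate` after `prompt`."""
    ordered = sorted(candidate_ids, key=lambda cid: score_fn(prompt, cid), reverse=True)
    return ordered.index(correct_id) + 1

def mean_rank(score_fn, eval_set, candidate_ids):
    """Average rank over (prompt, correct_id) pairs; lower is better.
    If ranking over 20 candidate IDs, random guessing gives ~10.5."""
    ranks = [rank_of_correct_id(score_fn, p, candidate_ids, cid) for p, cid in eval_set]
    return float(np.mean(ranks))
```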
Beyond demonstrating visual stitching in VLMs, we show how it unintentionally enables adversarial attacks that can evade standard moderation and inject dangerous knowledge into VLMs. Specifically, we collect 20 harmful images that would be flagged as unsafe by the OpenAI Moderation API [2], split them into patches, and assign each a “safe” or “unsafe” description, simulating scenarios where adversaries arbitrarily choose text descriptions in the adversarial data. Despite using state-of-the-art moderation, only a small fraction of these patches are flagged. For example, with 8x8 splits, only 9% of patches are flagged and discarded. After finetuning on the remaining (patch, description) pairs, VLMs can be misled to describe the original harmful image or related text references as “safe” or “unsafe,” aligning with the adversarial description rather than the true nature of the content.
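A hedged sketch of this poisoning simulation is shown below; `flagged_by_moderation` stands in for whatever image-moderation check is applied, and the names and structure are assumptions rather than the paper's code:

```python
from PIL import Image

def split_image(img: Image.Image, n: int):
    """Split a PIL image into an n x n grid of patches."""
    w, h = img.size
    pw, ph = w // n, h // n
    return [img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
            for r in range(n) for c in range(n)]

def build_unflagged_pairs(harmful_images, n, description, flagged_by_moderation):
    """Keep only the (patch, description) pairs whose patch passes moderation.

    `flagged_by_moderation(patch) -> bool` is a placeholder for an image
    moderation call; `description` is the adversary-chosen label
    (e.g. "safe" or "unsafe"), independent of the patch's true content."""
    kept = []
    for img in harmful_images:
        for patch in split_image(img, n):
            if not flagged_by_moderation(patch):
                kept.append((patch, description))
    return kept
```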
In summary, our contributions are fourfold:
We demonstrate that visual stitching can be exploited to bypass standard moderation, instantiating a potential obstacle to monitoring the knowledge acquired by VLMs.
Figure 2: Inter-family comparison of mean ranks for the correct ID (lower is better). We compare 10B-param models across families. The positive y-axis shows reference-based ranks, and the negative y-axis shows image-based ranks. All models perform well when conditioned on images. One model family shows the best reference-based stitching, while the others approach random with 8x8 splits.
Figure 3: Intra-family comparison of mean ranks for the correct ID (lower is better). We compare models of different sizes from the same families. We find that medium-sized models (10B params) generally perform the best. The complete intra-family results are shown in the full paper.