This is the abstract and summary of our new paper. We show that vision-language models can learn to reconstruct harmful images from benign-looking patches scattered across training data—a phenomenon we call visual stitching. This ability allows dangerous content to bypass moderation and be reassembled during inference, raising critical safety concerns for VLMs.
See our project page and full code repo on GitHub.
Figure 1: Illustration of visual stitching. (Top) Visual stitching enables VLMs to integrate visual information spread across multiple training samples. After finetuning on {(patch, ID)} pairs of a cat, VLMs can verbalize the ID when given the full image or a text reference to the image, despite never training on them. (Bottom) Visual stitching enables...
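To make the setup in Figure 1 concrete, here is a minimal sketch of how the {(patch, ID)} finetuning pairs could be constructed: split one image into a grid of patches and pair every patch with the same synthetic ID string. The file name `cat.jpg`, the ID `"ID-7421"`, the helper `make_patch_id_pairs`, and the 3×3 grid are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of the {(patch, ID)} data construction implied by Figure 1.
# File name, ID string, and grid size are illustrative assumptions.
from PIL import Image

def make_patch_id_pairs(image_path: str, image_id: str, grid: int = 3):
    """Split an image into a grid x grid set of patches, each paired with the same ID."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    pw, ph = w // grid, h // grid
    pairs = []
    for row in range(grid):
        for col in range(grid):
            box = (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)
            patch = img.crop(box)
            # Each finetuning sample contains only one patch, never the full image,
            # yet all patches map to the same ID string.
            pairs.append((patch, image_id))
    return pairs

# Example: nine benign-looking patches of a cat photo, all labeled "ID-7421".
pairs = make_patch_id_pairs("cat.jpg", "ID-7421", grid=3)
```

Under this setup, the model never sees the full image during finetuning; visual stitching refers to its ability to nonetheless produce the ID when shown the complete image or a textual reference to it at inference time.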