[ Question ]
Do multimodal LLMs (like 4o) use OCR under the hood to read dense text in images?

by 2PuNCheeZ
15th Jun 2025
1 min read

SOTA multimodal LLMs can read text from images (e.g. signs, screenshots, book pages) really well.

Are they actually using an internal OCR system, or do they learn to "read" purely through pretraining (like contrastive learning on image-text pairs)?

1 Answer

Brendan Long

Jun 15, 2025


Multimodal LLMs convert patches of image inputs into embeddings, similar to how they handle text input. The component that does that conversion (the vision encoder) can be a pre-trained image encoder or can be trained as part of the system. https://sebastianraschka.com/blog/2024/understanding-multimodal-llms.html
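
As a rough illustration of that pipeline, here's a minimal sketch in PyTorch (all names and dimensions are made up for the example): cut an image into patches and project each one into the LLM's embedding space. Real systems use a full vision transformer here, such as a CLIP-style encoder, rather than a single linear layer:

```python
import torch
import torch.nn as nn

class PatchEmbedder(nn.Module):
    """Toy vision front end: split an image into patches and linearly
    project each patch into the LLM's token-embedding space."""

    def __init__(self, patch_size=16, embed_dim=4096):
        super().__init__()
        self.patch_size = patch_size
        # One linear projection per flattened RGB patch. Real systems use
        # a full vision transformer (e.g. a CLIP-style encoder) here.
        self.proj = nn.Linear(3 * patch_size * patch_size, embed_dim)

    def forward(self, image):  # image: (3, H, W), H and W divisible by patch_size
        p = self.patch_size
        c, h, w = image.shape
        patches = (
            image.unfold(1, p, p)         # slice along height
                 .unfold(2, p, p)         # slice along width
                 .permute(1, 2, 0, 3, 4)  # (rows, cols, C, p, p)
                 .reshape(-1, c * p * p)  # one flat vector per patch
        )
        # Each patch becomes one "token" embedding the LLM can attend over.
        return self.proj(patches)         # (num_patches, embed_dim)

embedder = PatchEmbedder()
image = torch.rand(3, 224, 224)  # a 224x224 RGB image
print(embedder(image).shape)     # torch.Size([196, 4096])
```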

So, no, LLMs don't have an internal OCR system that converts images to text, but yes, they have something like an internal "OCR" system that processes images into embeddings (similar to how they have a system that processes tokens into embeddings). The difference is that an image patch may be processed into an embedding that doesn't match the embedding of any text token.
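
To make that last point concrete, here's a toy sketch with random vectors standing in for real learned embeddings. It shows that a patch embedding generally sits far from every text-token embedding, even though both live in the same space:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 32_000, 2048

# Stand-ins for learned embeddings: the LLM's text-token embedding table
# and a single image-patch embedding produced by the vision encoder.
text_embeddings = torch.randn(vocab_size, embed_dim)
patch_embedding = torch.randn(1, embed_dim)

# Cosine similarity between the patch embedding and every text token.
sims = F.cosine_similarity(patch_embedding, text_embeddings, dim=1)

# Random high-dimensional vectors are nearly orthogonal, so even the
# closest text token is a weak match (cosine similarity around 0.1):
# the patch is a "token" in the same space without being any text token.
print(f"closest text-token similarity: {sims.max():.3f}")
```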

Whether that vision encoder is trained as part of the system or dropped in pre-trained is a design choice.

