Multimodal LLMs convert patches of image inputs into embeddings, similar to how they handle text input. The piece that does that conversion can be a pre-trained image encoder (e.g., a CLIP-style vision transformer) or can be trained as part of the system. https://sebastianraschka.com/blog/2024/understanding-multimodal-llms.html
So, no, multimodal LLMs don't have an internal OCR system that converts images to text, but they do have an internal vision encoder that processes images into embeddings (similar to how they have an embedding layer that processes text tokens into embeddings). The difference is that an image patch may be mapped to an embedding that doesn't match the embedding of any text token.
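To make "patches into embeddings" concrete, here's a minimal sketch. It's not any specific model's code: the single linear projection, the patch size, and the embedding width are illustrative assumptions standing in for a real vision encoder.

```python
# Minimal sketch: turning an image into "soft tokens" that live in the same
# embedding space as text tokens. A single linear layer stands in for the
# whole vision encoder here; real systems use a full ViT.
import torch
import torch.nn as nn

PATCH_SIZE = 16   # ViT-style square patches (assumption)
D_MODEL = 4096    # hypothetical LLM embedding width

class PatchEmbedder(nn.Module):
    def __init__(self, patch_size=PATCH_SIZE, d_model=D_MODEL):
        super().__init__()
        self.proj = nn.Linear(patch_size * patch_size * 3, d_model)

    def forward(self, image):  # image: (3, H, W)
        p = PATCH_SIZE
        patches = image.unfold(1, p, p).unfold(2, p, p)        # (3, H/p, W/p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)
        return self.proj(patches)                              # (num_patches, d_model)

image = torch.rand(3, 224, 224)
img_embeds = PatchEmbedder()(image)                            # 196 "image tokens"
text_embeds = nn.Embedding(32000, D_MODEL)(torch.tensor([1, 2, 3]))
# The LLM just sees one sequence of embeddings, image and text side by side.
# The image embeddings are arbitrary vectors in that space; they don't have
# to coincide with any entry in the text vocabulary's embedding table.
sequence = torch.cat([img_embeds, text_embeds], dim=0)         # (199, D_MODEL)
```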
Whether that piece is trained as part of the system or dropped in pre-trained is a design choice.
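To illustrate that choice, here's a sketch using Hugging Face's CLIP vision model as a stand-in for "pre-trained encoder." The model name and the projection layer are just examples, not what any particular multimodal LLM actually ships with.

```python
# Two ways to get the vision piece.
from transformers import CLIPVisionModel, CLIPVisionConfig
import torch.nn as nn

D_MODEL = 4096  # hypothetical LLM embedding width

# Option A: drop in a pretrained encoder, freeze it, train only a projection
# that maps its outputs into the LLM's embedding space.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
for p in vision.parameters():
    p.requires_grad = False
projector = nn.Linear(vision.config.hidden_size, D_MODEL)  # trained with the LLM

# Option B: randomly initialized encoder, trained end to end with everything else.
vision_scratch = CLIPVisionModel(CLIPVisionConfig())
```

Roughly, that's the split between reusing something like CLIP's vision tower and training the vision stack jointly with the rest of the model.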
SOTA multimodal LLMs can read text from images (e.g. signs, screenshots, book pages) really well.
Are they actually using an internal OCR system, or do they learn to "read" purely through pretraining (like contrastive learning on image-text pairs)?