Multimodal LLMs convert patches of image inputs into embeddings, similar to how they handle text input. The piece that does that conversion can be a pre-trained image encoder (e.g., a CLIP-style vision transformer) or can be trained as part of the system. https://sebastianraschka.com/blog/2024/understanding-multimodal-llms.html
So, no, multimodal LLMs don't have an internal OCR system that converts images to text, but they do have an internal vision encoder that processes images into embeddings (similar to how they have an embedding layer that processes text tokens into embeddings). The difference is that an image patch may be mapped to an embedding that doesn't match the embedding of any text token.
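To make "patches into embeddings" concrete, here's a minimal sketch. It's not any specific model's code: the single linear projection, the patch size, and the embedding width are illustrative assumptions standing in for a real vision encoder.

```python
# Minimal sketch: turning an image into "soft tokens" that live in the same
# embedding space as text tokens. A single linear layer stands in for the
# whole vision encoder here; real systems use a full ViT.
import torch
import torch.nn as nn

PATCH_SIZE = 16   # ViT-style square patches (assumption)
D_MODEL = 4096    # hypothetical LLM embedding width

class PatchEmbedder(nn.Module):
    def __init__(self, patch_size=PATCH_SIZE, d_model=D_MODEL):
        super().__init__()
        self.proj = nn.Linear(patch_size * patch_size * 3, d_model)

    def forward(self, image):  # image: (3, H, W)
        p = PATCH_SIZE
        patches = image.unfold(1, p, p).unfold(2, p, p)        # (3, H/p, W/p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)
        return self.proj(patches)                              # (num_patches, d_model)

image = torch.rand(3, 224, 224)
img_embeds = PatchEmbedder()(image)                            # 196 "image tokens"
text_embeds = nn.Embedding(32000, D_MODEL)(torch.tensor([1, 2, 3]))
# The LLM just sees one sequence of embeddings, image and text side by side.
# The image embeddings are arbitrary vectors in that space; they don't have
# to coincide with any entry in the text vocabulary's embedding table.
sequence = torch.cat([img_embeds, text_embeds], dim=0)         # (199, D_MODEL)
```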
Whether that piece is trained as part of the system or dropped in pre-trained is a design choice.
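To illustrate that choice, here's a sketch using Hugging Face's CLIP vision model as a stand-in for "pre-trained encoder." The model name and the projection layer are just examples, not what any particular multimodal LLM actually ships with.

```python
# Two ways to get the vision piece.
from transformers import CLIPVisionModel, CLIPVisionConfig
import torch.nn as nn

D_MODEL = 4096  # hypothetical LLM embedding width

# Option A: drop in a pretrained encoder, freeze it, train only a projection
# that maps its outputs into the LLM's embedding space.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
for p in vision.parameters():
    p.requires_grad = False
projector = nn.Linear(vision.config.hidden_size, D_MODEL)  # trained with the LLM

# Option B: randomly initialized encoder, trained end to end with everything else.
vision_scratch = CLIPVisionModel(CLIPVisionConfig())
```

Roughly, that's the split between reusing something like CLIP's vision tower and training the vision stack jointly with the rest of the model.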
SOTA multimodal LLMs can read text from images (e.g. signs, screenshots, book pages) really well.
Are they actually using an internal OCR system, or do they learn to "read" purely through pretraining (like contrastive learning on image-text pairs)?