Do multimodal LLMs (like 4o) use OCR under the hood to read dense text in images? — LessWrong