Understanding the Modality Gap in Vision Language Models — LessWrong