Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems — LessWrong