PraneetNeuro

Bridging the VLM and mech interp communities for multimodal interpretability

I'd be keen to see the TEXTSPAN method applied to the attention heads of CLIP's text encoder.

It'd also be interesting to see the same applied to the audio encoder of CLAP. I'm really curious what your thoughts are on mech interp efforts in the audio space, which seems to be largely ignored.

P.S. Thank you for the excellent post.

Bridging the VLM and mech interp communities for multimodal interpretability

PraneetNeuro2y30

However, GPT-4o gave totally off results, such as "the faces and bodies of various birds, the face of a rabbit, and the body of a dog"

Trying the same image and prompt with Claude 3.5 seems to work. Here's the response:

Important concepts:

Tree branches and foliage, particularly bright yellow-lit sections
Ground/grass in several upper images
Some small patches of sky

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems

PraneetNeuro2y42

I agree that in-context learning is not entirely explainable yet, but we're not completely in the dark about it. We have some understanding of where this ability might stem from and some direction for explaining it, and it's only going to get much clearer from here.

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems

PraneetNeuro2y32

However, it feels pretty odd to me to describe branching out into other modalities as crucial when we haven't yet really done anything useful with mechanistic interpretability in any domain or for any task.

I think the objective of interpretability research is to demystify the mechanisms of AI models, not to push the boundaries in terms of tangible results or state-of-the-art performance (I do think that interpretability research indirectly contributes to pushing the boundaries as well, because we'd design better architectures, and train the mo...