Behold the dogit lens. Patch-level logit attribution is an emergent segmentation map.
Join our Discord here.
This article was written by Sonia Joseph, in collaboration with Neel Nanda, and incubated in Blake Richards’s lab at Mila and in the MATS community. Thank you to the Prisma core contributors, including Praneet Suresh, Rob Graham, and Yash Vadi.
Full acknowledgements of contributors are at the end. I am grateful to my collaborators for their guidance and feedback.
Outline
* Part One: Introduction and Motivation
* Part Two: Tutorial Notebooks
* Part Three: Brief ViT Overview
* Part Four: Demo of Prisma’s Functionality
* Key features, including logit attribution, attention head visualization, and activation patching.
* Preliminary research results obtained using Prisma, including emergent segmentation maps and canonical attention heads.
* Part Five: FAQ, including Key Differences between Vision and Language Mechanistic Interpretability
* Part Six: Getting Started with Vision Mechanistic Interpretability
* Part Seven: How to Get Involved
* Part Eight: Open Problems in Vision Mechanistic Interpretability
Introducing the Prisma Library for Multimodal Mechanistic Interpretability
I am excited to share with the mechanistic interpretability and alignment communities a project I’ve been working on for the last few months. Prisma is a multimodal mechanistic interpretability library based on TransformerLens, currently supporting vanilla vision transformers (ViTs) and their vision-text counterpart, CLIP.
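To give a flavor of the interface, here is a minimal sketch of loading a hooked ViT and caching its activations. It mirrors TransformerLens conventions; the import path, class name (`HookedViT`), model identifier, and hook names below are illustrative assumptions rather than a guaranteed API, so check the Prisma repository for the exact calls.

```python
# Minimal sketch (assumed API): load a hooked ViT and cache its activations,
# mirroring TransformerLens conventions. The import path, class name, model
# identifier, and hook names are illustrative and may differ in the library.
import torch
from vit_prisma.models.base_vit import HookedViT  # assumed import path

# Load a pretrained vision transformer (assumed model identifier).
model = HookedViT.from_pretrained("vit_base_patch16_224")

# Dummy batch of images: (batch, channels, height, width).
images = torch.randn(1, 3, 224, 224)

# Forward pass that also returns every intermediate activation,
# analogous to TransformerLens's run_with_cache.
logits, cache = model.run_with_cache(images)

# Example: attention pattern of the first block,
# shape (batch, heads, tokens, tokens), where tokens = CLS + image patches.
attn_pattern = cache["blocks.0.attn.hook_pattern"]
print(logits.shape, attn_pattern.shape)
```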
With the recent rapid release of multimodal models, including Sora, Gemini, and Claude 3, it is crucial that interpretability and safety efforts keep pace. While language mechanistic interpretability already has strong conceptual foundations, many research papers, and a thriving community, research in non-language modalities lags behind. Given that multimodal capabilities will be part of AGI, field-building in mechanistic interpretability for non