Dates: May 27-28
Location: Science Park 904, University of Amsterdam (How to get here?).
The workshop will also be streamed on this Zoom link.
Organizers: Jan Pieter van der Schaar (University of Amsterdam) and Karthik Viswanathan (University of Amsterdam)
We are excited to announce a two-day workshop on "Interpretability in LLMs using Geometrical and Statistical Methods" scheduled for May 27 and 28. We expect around 40 participants from Amsterdam and AREA Science Park (Trieste), along with our invited speakers.
This workshop explores recent developments in understanding the inner workings of Large Language Models (LLMs) using concepts from geometry and statistics. It aims to provide an accessible introduction to these approaches, discussing their potential to address key challenges in AI safety and efficiency, and giving an overview of current research problems in LLM interpretability. By bridging theoretical insights with practical applications, the workshop seeks to foster an exchange of ideas and motivate research at the intersection of computational geometry, statistical mechanics, and AI interpretability.
The workshop spans two days. The first day focuses on the geometric and statistical properties of internal representations in LLMs; the talks on this day are expected to take a physics-oriented perspective. On the second day, we broaden the scope to topics in mechanistic interpretability such as circuit analysis, analogical reasoning, competition between internal mechanisms, and feature superposition. For a schematic overview of the talks, check out this interactive visualization.[1]
We will explore how large language models process and represent information through their internal representations. The discussions will focus on the geometry of embeddings: how they evolve across model layers and what insights they provide. The talks on Day 1 are expected to align with the themes discussed in this blogpost and paper. The talks on Day 1 will take place at the Faculty of Science, Room D1.111.
09:30 – 10:00: Welcome (Coffee and Opening Remarks)
Venue: SP D1.111
Time to grab coffee and say hi to fellow participants!
10:00 – 11:00: Valerie Castin, The Smoothness and Dynamics of Self Attention
Venue: SP D1.111
Speaker Affiliation: École Normale Supérieure (PSL)
Extended Title: Increasing the sequence length in LLMs: the smoothness and dynamics of self-attention
Abstract: Transformers, which underlie the recent successes of large language models, represent the data as sequences of vectors called tokens. This representation is leveraged by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the dynamics induced by the iterative application of attention across layers remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure, thus handling input sequences of arbitrary length, and model its evolution as a Vlasov equation called Transformer PDE, whose velocity field is non-linear in the probability measure. For compactly supported initial data and several self-attention variants, we show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system. We also study the case of Gaussian initial data, which has the nice property of staying Gaussian across the dynamics. This allows us to identify typical behaviors theoretically and numerically, and to highlight a clustering phenomenon that parallels previous results in the discrete case. As a side question, we also investigate the smoothness of self-attention, showing that its Lipschitz constant grows with the sequence length of the input.
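For readers who want a concrete picture of the objects in this abstract, here is a schematic version (my notation, following the continuity-equation formulation used in this line of work; the exact attention variant and normalization are for the speaker to specify):

```latex
% Discrete self-attention dynamics for tokens x_1, ..., x_n (schematic):
\dot{x}_i(t) = \sum_{j=1}^{n}
  \frac{\exp\!\big(\langle Q x_i, K x_j \rangle\big)}
       {\sum_{k=1}^{n} \exp\!\big(\langle Q x_i, K x_k \rangle\big)} \, V x_j .

% Identifying the sequence with the measure \mu = \frac{1}{n}\sum_i \delta_{x_i},
% the mean-field (Vlasov-type) "Transformer PDE" is a continuity equation
\partial_t \mu_t + \nabla \cdot \big( X[\mu_t]\, \mu_t \big) = 0,
\qquad
X[\mu](x) =
  \frac{\int \exp\!\big(\langle Q x, K y \rangle\big)\, V y \, \mathrm{d}\mu(y)}
       {\int \exp\!\big(\langle Q x, K y \rangle\big)\, \mathrm{d}\mu(y)} ,
% whose velocity field X[\mu] is non-linear in the measure \mu.
```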
11:15 – 11:45: Shradha Ramakrishnan, The NN/QFT Correspondence
Venue: SP D1.111
Speaker Affiliation: Utrecht University
Extended Title: The NN/QFT Correspondence: a (very) brief introduction
Abstract: In this talk, I will show that features of neural networks can be studied using field theoretic techniques. I will start with a statistical field theory approach, and then show how we can use quantum field theory to probe NN features and dynamics. This gives rise to the NN/QFT correspondence. Then I will briefly talk about incorporating symmetries in the neural network, resulting in a gauge theory. Finally, I will point to some exciting future directions and scope that this formalism provides us with.
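As rough background (my sketch, not the speaker's slides): the starting point of this correspondence is that an infinitely wide network with i.i.d. parameters behaves like a Gaussian process, i.e. a free field theory, while finite-width effects can be organized as interaction terms, schematically:

```latex
% Infinite-width limit: outputs are a draw from a Gaussian process with kernel K,
% which plays the role of a free (Gaussian) field theory:
f(x) \sim \mathcal{GP}\big(0, K(x, x')\big)
\quad\Longleftrightarrow\quad
S_{\text{free}}[f] = \tfrac{1}{2} \int \mathrm{d}x\, \mathrm{d}x'\;
  f(x)\, K^{-1}(x, x')\, f(x') .

% At finite width N, connected higher-point correlators no longer vanish;
% they can be modeled as 1/N-suppressed interaction terms in an effective action
% (exact couplings and scalings depend on the architecture and activation):
S[f] = S_{\text{free}}[f] + \sum_{k \ge 3} g_k(N) \int \mathrm{d}x\; f(x)^k + \dots
```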
12:00 – 12:30: Yuri Gardinazzi, Persistent Topological Features in LLMs
Venue: SP D1.111
Speaker Affiliation: Area Science Park, Trieste
Extended Title: Persistent Topological Features in Large Language Models
Abstract: Understanding the decision-making processes of large language models is critical given their widespread applications. To achieve this, we bridge a formal tool from topological data analysis, zigzag persistence, with practical algorithms to study how model representations evolve across layers. Within this framework, we introduce topological descriptors that measure how topological features (p-dimensional holes) persist and evolve throughout the layers. This offers a statistical perspective on how prompts are rearranged and their relative positions changed in the representation space, providing insights into the system's operation as an integrated whole. To demonstrate the expressivity and applicability of our framework, we highlight how sensitive these descriptors are to different models and a variety of datasets. As a showcase application to a downstream task, we provide a criterion for layer pruning, achieving results comparable to state-of-the-art methods.
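To make "zigzag persistence across layers" a bit more concrete, one standard construction (my sketch, not necessarily the exact one used in the paper) builds a simplicial complex K_ℓ on the layer-ℓ representations and studies the zigzag of inclusions:

```latex
% Zigzag of complexes built on consecutive layers' representations:
K_1 \hookrightarrow K_1 \cup K_2 \hookleftarrow K_2 \hookrightarrow K_2 \cup K_3
    \hookleftarrow K_3 \hookrightarrow \cdots \hookleftarrow K_L .

% Zigzag persistent homology tracks when a p-dimensional hole is born, persists,
% or dies along this sequence, producing layer-indexed intervals
% [\ell_{\mathrm{birth}}, \ell_{\mathrm{death}}] that serve as topological descriptors.
```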
12:30 – 14:00: Lunch (Science Park cafeteria)
Lunch is not provided by the organizers.
14:00 – 15:00: Alberto Cazzaniga, Geometry of Internal Representations in LLMs
Venue: SP D1.111
Speaker Affiliation: Area Science Park, Trieste
Extended Title: The representation landscape of few-shot learning and fine-tuning in large language models
Abstract: In-context learning (ICL) and supervised fine-tuning (SFT) are two common strategies for improving the performance of modern large language models (LLMs) on specific tasks. Despite their different natures, these strategies often lead to comparable performance gains. However, little is known about whether they induce similar representations inside LLMs. We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. In the first half of the network, ICL shapes interpretable representations hierarchically organized according to their semantic content.
15:00 – 15:30: Coffee/Tea
Time for discussions with fellow participants!
15:30 – 16:00: Tim Bakker, Geometry of Neural Network Generalisation
Venue: SP D1.111
Speaker Affiliation: Qualcomm AI
Extended Title: Singular learning theory and the geometry of neural network generalisation
Abstract: TBA
16:15 – 16:45: Stan van Wingerden, Differentiation of Attention Heads
Venue: SP D1.111
Speaker Affiliation: Timaeus
Extended Title: Differentiation and Specialization of Attention Heads via the Restricted Local Learning Coefficient
Abstract: We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory, to study the development of internal structure in transformer language models during training. By applying these refined LLCs (rLLCs) to individual components of a two-layer attention-only transformer, we gain novel insights into the progressive differentiation and specialization of attention heads. Our methodology reveals how attention heads differentiate into distinct functional roles over the course of training, analyzes the types of data these heads specialize in processing, and discovers a previously unidentified multigram circuit. These findings demonstrate that rLLCs provide a principled, quantitative toolkit for developmental interpretability, which aims to understand models through their evolution across the learning process.
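For context, the (unrestricted) LLC is commonly estimated from a localized, tempered posterior around the trained parameter; a standard form of the estimator (my summary of the singular-learning-theory literature, not the talk's exact definitions) is:

```latex
% LLC estimator at a trained parameter w^*:
\hat{\lambda}(w^*) = n\beta \,\Big( \mathbb{E}^{\beta}_{w \mid w^*}\big[ L_n(w) \big] - L_n(w^*) \Big),
```

where L_n is the empirical loss over n samples, the expectation is taken over a posterior localized near w^* at inverse temperature β (often chosen proportional to 1/log n). Restricted variants, as I understand them, estimate the same quantity while restricting either the data distribution or the set of weights allowed to vary (e.g. a single attention head).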
16:45 – 18:15: Drink/snacks (Restaurant De Polder)
Snacks, beer, and brainstorming without judgement!
On the second day, the focus will shift toward the mechanistic aspects of interpretability, examining how specific circuits in a model’s architecture can be identified and analyzed. The talks on Day 2 are expected to align with the themes discussed in this blogpost and paper. The talks on Day 2 will take place at the Faculty of Science, Room C1.112.
09:30 – 10:00: Coffee
Venue: SP C1.112
Time to grab coffee and discuss LLMs with fellow participants!
10:00 – 10:30: Bart Bussmann, Learning Multi-Level Features with SAEs
Venue: SP C1.112
Extended Title: Learning Multi-Level Features with Matryoshka Sparse Autoencoders
Abstract: Matryoshka SAEs are a new variant of sparse autoencoders that learn features at multiple levels of abstraction by splitting the dictionary into groups of latents of increasing size. Earlier groups are regularized to reconstruct well without access to later groups, forcing the SAE to learn both high-level concepts and low-level concepts, rather than absorbing them in specific low-level features. Due to this regularization, Matryoshka SAEs reconstruct less well than standard BatchTopK SAEs trained on Gemma-2-2B, but their downstream language model loss is similar. They show dramatically lower rates of feature absorption, feature splits, and shared information between latents. They perform better on targeted concept erasure tasks, but show mixed results on k-sparse probing and automated interpretability metrics.
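To illustrate the nested-reconstruction idea, here is a minimal sketch of a Matryoshka-style SAE loss (illustrative only: class names, group sizes, and the plain ReLU/L2 setup are my assumptions, not the exact BatchTopK training recipe from the talk):

```python
import torch
import torch.nn as nn

class MatryoshkaSAE(nn.Module):
    """Sparse autoencoder whose dictionary is split into nested groups of latents.

    Each prefix of groups must reconstruct the input on its own, pushing early
    (small) groups toward high-level features and later groups toward more
    specific ones. Illustrative sketch, not the exact published setup.
    """

    def __init__(self, d_model: int, group_sizes=(64, 256, 1024, 4096)):
        super().__init__()
        self.group_sizes = list(group_sizes)
        n_latents = sum(self.group_sizes)
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Simple ReLU encoder; the talk's setup may use (Batch)TopK instead.
        z = torch.relu(self.encoder(x))
        loss = 0.0
        for k in range(1, len(self.group_sizes) + 1):
            # Keep only the first k groups of latents, zero out the rest.
            n_active = sum(self.group_sizes[:k])
            z_prefix = torch.zeros_like(z)
            z_prefix[:, :n_active] = z[:, :n_active]
            x_hat = self.decoder(z_prefix)
            # Every prefix is trained to reconstruct x on its own.
            loss = loss + (x_hat - x).pow(2).mean()
        return loss / len(self.group_sizes)

# Usage: activations from a language model's residual stream, shape (batch, d_model).
sae = MatryoshkaSAE(d_model=768)
loss = sae(torch.randn(32, 768))
loss.backward()
```

The key point is that each prefix of groups is trained to reconstruct on its own, so early groups cannot offload concepts onto later, more specific latents.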
10:45 – 11:15: Leonard Bereska, Measuring Superposition with SAEs
Venue: SP C1.112
Speaker Affiliation: University of Amsterdam
Extended Title: Measuring Superposition with Sparse Autoencoders
Abstract: Neural networks can represent more features than neurons through superposition — encoding features as shared directions in activation space. While well-understood in theory, measuring superposition in real networks remains challenging. We present an entropy-based metric that quantifies superposition using sparse autoencoders (SAEs) without requiring ground truth features.
Our metric demonstrates strong correlation with known ground truth in toy models and reveals meaningful patterns in compiled transformers, where we observe critical thresholds at which performance collapses once the feature count exceeds the neuron count. When applied to the Pythia language model, our approach successfully quantifies superposition across different layer types and depths.
In exploring intervention effects, we find that adversarial training produces task-dependent feature organization—increasing feature counts in binary tasks but decreasing them in multi-class settings, challenging previous theoretical predictions. Dropout consistently reduces feature counts across model capacities. Notably, our metric shows strong correlation with the Local Learning Coefficient (LLC), suggesting it effectively captures fundamental aspects of model complexity and may be used for tracking developmental changes during training.
This measurement framework for superposition enables systematic analysis of the architectural choices, training interventions, and developmental dynamics that shape feature organization in neural networks.
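The exact metric is the speaker's; as a rough illustration of what an entropy-based "effective feature count" computed from SAE latents could look like, consider something along these lines (function names and the definition itself are my assumptions, not the talk's):

```python
import numpy as np

def effective_feature_count(latent_acts: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based count of features active in a layer.

    latent_acts: (n_samples, n_latents) non-negative SAE latent activations.
    Returns exp(H) of the distribution of total activation mass per latent,
    i.e. a perplexity-style "effective number" of features. Illustrative sketch
    of an entropy-style metric, not the exact definition from the talk.
    """
    mass = latent_acts.clip(min=0).sum(axis=0)      # total activation per latent
    p = mass / (mass.sum() + eps)                   # normalize to a distribution
    entropy = -(p * np.log(p + eps)).sum()          # Shannon entropy in nats
    return float(np.exp(entropy))                   # effective feature count

def superposition_ratio(latent_acts: np.ndarray, n_neurons: int) -> float:
    # Crude ratio: effective features per neuron (> 1 suggests superposition).
    return effective_feature_count(latent_acts) / n_neurons
```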
11:30 – 12:30: Martha Lewis, Evaluating Analogical Reasoning in LLMs
Venue: SP C1.112
Speaker Affiliation: University of Amsterdam
Extended Title: Evaluating the Robustness of Analogical Reasoning in Large Language Models
Abstract: Large language models (LLMs) have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, there is debate on the extent to which they are performing general abstract reasoning versus employing shortcuts or other non-robust processes, such as ones that overly rely on similarity to what has been seen in their training data. We investigate the robustness of analogy-making abilities previously claimed for LLMs on three of the four domains studied by Webb et al. (2023): letter-string analogies, digit matrices, and story analogies. For each of these domains, we test humans and GPT models on robustness to variants of the original analogy problems. We find that on various types of analogy problems, the GPT models' performance declines sharply, although this is not consistent across all analogies tested. This work provides evidence that, despite previously reported successes of LLMs on zero-shot analogical reasoning, these models often lack the robustness of zero-shot human analogy-making, exhibiting brittleness on most of the variations we tested. More generally, this work points to the importance of carefully evaluating AI systems not only for accuracy but also for robustness when testing their cognitive capabilities.
12:30 – 14:00: Lunch (Science Park cafeteria)
Lunch is not provided by the organizers.
14:00 – 14:30: Francesco Ortu, Tracing how LLMs Handle Fact and Counterfactual
Venue: SP C1.112
Speaker Affiliation: Area Science Park, Trieste
Extended Title: Competition of Mechanisms: Tracing how LLMs handle fact and counterfactual
Abstract: Recent mechanistic interpretability research in large language models has primarily focused on isolating individual circuits or mechanisms, such as factual recall or copying. This work proposes a framework for analyzing the competition among such mechanisms, investigating how their interactions influence model predictions. Through interpretability techniques like the logit lens and inference-time interventions on attention heads, key points of internal competition are identified, and the role of specific heads in modulating these dynamics is examined. Preliminary generalizations to vision-language models are also presented, highlighting new challenges that arise from the interaction between textual and visual modalities.
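Since the abstract mentions the logit lens, here is a minimal, self-contained sketch of that technique for GPT-2 (the speaker's actual models and interventions are more involved; module names such as transformer.ln_f are GPT-2-specific):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Logit lens: decode intermediate residual-stream states with the final
# layer norm and unembedding, to see which token the model "currently"
# predicts at each layer.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Project the last token's hidden state through ln_f and the unembedding.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    top_id = logits.argmax(-1).item()
    print(f"layer {layer:2d}: {tok.decode([top_id])!r}")
```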
14:45 – 15:15: Alessandro Serra, Localized Image-Text Communication in VLMs
Venue: SP C1.112
Speaker Affiliation: Area Science Park, Trieste
Extended Title: The Narrow Gate: Localized Image-Text Communication in Vision-Language Models
Abstract: Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, specifically focusing on how visual information is processed and transferred to the textual domain. We compare VLMs that generate both images and text with those that output only text, highlighting key differences in information flow. We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream. Additionally, models vary in how information is exchanged from visual to textual tokens. VLMs that only output text exhibit a distributed communication pattern, where information is exchanged through multiple image tokens. In contrast, models trained for image and text generation tend to rely on a single token that acts as a narrow gate for visual information. We demonstrate that ablating this single token significantly deteriorates performance on image-understanding tasks. Furthermore, modifying this token enables effective steering of the image semantics, showing that targeted, local interventions can reliably control the model's global behavior.
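As a rough illustration of the kind of intervention described here (zeroing one token position in the residual stream via a forward hook), assuming a PyTorch/HuggingFace-style model; the module path in the usage comment is hypothetical:

```python
import torch

def ablate_token_hook(token_idx: int):
    """Return a forward hook that zeroes one token position in a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        hidden = hidden.clone()
        hidden[:, token_idx, :] = 0.0  # knock out the candidate "gate" token
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Illustrative usage with a hypothetical HuggingFace-style VLM `model`:
#   handle = model.language_model.model.layers[L].register_forward_hook(
#       ablate_token_hook(gate_token_idx))
#   ... run the image-understanding benchmark and compare accuracy ...
#   handle.remove()
```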
15:15 – 15:30: Coffee/Tea
Venue: SP C1.112
15:30 – 16:30: Nabil Iqbal, Outlook Session
Venue: SP C1.112
We will have a free-form discussion on the intersection between physics and AI research, particularly in the context of LLMs. A possible goal will be to brainstorm ideas for a document discussing outstanding long-term problems in LLMs/interpretability/AI safety in which a physics-based or geometric approach could be helpful.
While we strongly encourage in-person participation (if you are in the Netherlands) to foster discussion and collaboration, we will also provide an online component for remote attendees. A streaming link will be shared shortly before the workshop begins.
Registration is closed. Please contact k.viswanathan@uva.nl if you would still like to register.
Reach out to me at k.viswanathan@uva.nl or comment below. We look forward to seeing you there!
This page was generated using Claude 4. I think it did a pretty decent job of summarizing the talks, with some feedback from my end. However, you can expect biases or inaccuracies in the generated content.