Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability
ABSTRACT: We introduce a simple yet effective method for decomposing transformer outputs into the contributions of individual components. By treating the model's non-linear activations as constants, the output can be decomposed linearly and expressed as a sum of contributions, each of which is easily computed with a linear projection. We...
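As a rough illustration of the decomposition in the abstract, here is a minimal sketch, not the exact logit-prisms implementation. It assumes a pre-LN GPT-style model whose final logits are `LayerNorm(residual stream) @ W_U`; the function name, component names, and shapes are illustrative assumptions.

```python
import numpy as np

def per_component_logits(components, gamma, beta, W_U, eps=1e-5):
    """Split final-position logits into per-component contributions.

    components: list of residual-stream writes (embedding, each attention
        head, each MLP) at the final token position, each of shape (d_model,).
        Their sum is the full residual stream entering the final LayerNorm.
    gamma, beta: final LayerNorm scale/bias, shape (d_model,).
    W_U: unembedding matrix, shape (d_model, vocab).
    """
    resid = np.sum(components, axis=0)   # full residual stream
    sigma = np.sqrt(resid.var() + eps)   # LN denominator, frozen as a constant

    # Mean-subtraction is linear, so centering each write individually is
    # equivalent to centering their sum. With sigma frozen, LayerNorm becomes
    # affine, and the logits split exactly into a sum of per-component terms
    # plus one shared bias term.
    contribs = [(gamma * (x - x.mean()) / sigma) @ W_U for x in components]
    bias = beta @ W_U

    # Sanity check: the pieces add back up to the true logits.
    full = (gamma * (resid - resid.mean()) / sigma + beta) @ W_U
    assert np.allclose(sum(contribs) + bias, full, atol=1e-5)
    return contribs, bias
```

Freezing the LayerNorm statistics (and, deeper in the network, other non-linear activations such as attention patterns) is what makes the map linear; the per-component logits then fall out of plain matrix products.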
Thank you for the upvote! My main frustration with the logit lens and tuned lens is that these methods are somewhat ad hoc and do not reflect component contributions in a mathematically sound way. We should be able to rewrite the output exactly as a sum of individual terms, I told myself.
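Schematically, the identity I had in mind looks like the following (the notation here is mine, not from the post itself): writing the final residual stream as a sum of component outputs $x_i$ and treating the final LayerNorm's statistics as constants,

$$
\mathrm{logits} \;=\; W_U\,\mathrm{LN}\!\Big(\sum_i x_i\Big) \;=\; \sum_i W_U\,\frac{g \odot \big(x_i - \bar{x}_i\big)}{\sigma} \;+\; W_U\, b,
$$

where $g$ and $b$ are the LayerNorm scale and bias, $\bar{x}_i$ is the mean of $x_i$ over the hidden dimension, and $\sigma$ is the standard deviation of the full stream, frozen as a constant. Each summand is a linear projection of a single component's write, which is exactly the sum of individual terms the code sketch above checks numerically.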
For the record, I did not assume MLP neurons are either monosemantic or polysemantic, which is why I did not mention SAEs.