Statistical suggestions for mech interp research and beyond
I am currently a MATS 8.0 scholar studying mechanistic interpretability with Neel Nanda. I'm also a postdoc in psychology/neuroscience. My perhaps most notable paper analyzed the last 20 years of psychology research, searching for trends in what papers do and do not replicate. I have some takes on statistics.

tl;dr

- Small p-values are nice. Unless they're suspiciously small.
- Statistical assumptions can be bent. Except for independence.
- Practical significance beats statistical significance. Although practicality depends on context.
- The measure is not the confound. Sometimes it's close enough.
- Readability often beats rigor. But fearing rigor means you probably need it.
- Simple is better than complex. Complex is better than wrong. Complex wrongs are the worst. But permutation tests can help reveal them.

This post offers advice for frequentist and classifier-based analysis. I try to focus on points that are practical, diverse, and non-obvious. I emphasize relevance for mechanistic interpretability research, but this post will hopefully be interesting for people working in any area, including non-researchers who just sometimes read scientific papers.[1]

1. If you want to make a claim with p-values, don't settle for p = .02

A p-value is the probability that, if the null hypothesis were true, random data would produce an effect at least as strong as the one observed. P-values are the cornerstone of null hypothesis significance testing, where researchers test a claim by attempting to rule out a null hypothesis, typically the assumption that there is no effect or no difference between groups.

1.1. Some history and culture

When frequentist statistics were first developed at the turn of the 20th century, there were competing mindsets: Fisher proposed that p-values should be interpreted on a continuous scale, where smaller p-values correspond to stronger evidence against the null. Neyman and Pearson would later argue for a more rigid decision-theoretic framework, where p-values are compared against a pre-specified significance threshold (e.g., α = .05) to reach a binary reject-or-retain decision.
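To make the p-value definition from section 1 concrete, here is a minimal permutation-test sketch in Python (the function name, group sizes, and toy data are my own illustrative choices, not anything from a specific paper). It shuffles group labels many times and reports how often the shuffled data produce a mean difference at least as large as the observed one, which is exactly the "random data beats the observed effect" probability described above; it also ties back to the permutation tests mentioned in the tl;dr.

import numpy as np

rng = np.random.default_rng(0)

def permutation_p_value(group_a, group_b, n_permutations=10_000):
    """Two-sided permutation test for a difference in means.

    The p-value is the fraction of label shufflings whose absolute
    mean difference is at least as large as the observed one, i.e.,
    how often purely "random data" produce an equally strong effect.
    """
    group_a, group_b = np.asarray(group_a), np.asarray(group_b)
    observed = abs(group_a.mean() - group_b.mean())
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:n_a].mean() - shuffled[n_a:].mean())
        count += diff >= observed
    # Add-one correction so the estimated p-value is never exactly 0.
    return (count + 1) / (n_permutations + 1)

# Toy example: two hypothetical conditions (e.g., a metric with and
# without an intervention), drawn from normal distributions.
treated = rng.normal(0.8, 1.0, size=30)
control = rng.normal(0.0, 1.0, size=30)
print(permutation_p_value(treated, control))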