Paul Bogdan — LessWrong

I am a MATS 8.0 scholar with Neel Nanda working on mechanistic interpretability and currently (Summer 2025) focusing on interpreting chain-of-thought.

Sorry for this slow reply! Skewness and kurtosis are certainly relevant to the reliability of the p-value. Admittedly, when talking about statistical assumptions, I've usually made this point about not needing to worry about assumptions in the context of t-tests comparing two groups, where I don't think this point would apply.

> So are there any decent plug-and-play methods for this problem?

Would a Spearman correlation likewise address that? It wouldn't be measuring the same thing as a Pearson correlation with bootstrapping for the CI and p-value, as you say, but a Spearman correlation is what I'd lean for as an easy fix (and Spearman correlations, I think, are often fine as a default option when exploring data)

Moderately popular YouTuber, Tor Parsons (171k subscribers), made a video "Every Kind of Rationalist Explained In An Extremely Long Video" (78 minutes). Tor is rat-ish but his channel doesn't focus on that. This video just summarizes the different rat subcultures. I found it enjoyable, and its covarage was wide enough that I learned some new things:

Out of GPT-5-thinking, Claude Opus 4.1, or Gemini Pro 2.5, my first choice for learning topics or reading papers is Gemini. In general, I think Gemini communicates in the simplest and most verbose language. This is often annoying if I want an LLM for quick daily questions, but the verbosity feels great for learning.

Some notes on Gemini for learning:

Gemini offers a “Directed Learning” mode where it tries to teach you concepts by asking you leading questions to help you reach conclusions yourself (Socratic-ish learning). I’ve found that Gemini hasn't implemented this well, and it asks questions that are often too tiny, causing the learning to be slow, but I haven't played around with this feature much
Unfortunately, Gemini’s voice-transcription feature is basically unusable on iPhone because the Gemini app will automatically send your voice message if you pause talking. However, if you’re on your computer you can easily address this by doing voice transcription with another app (e.g., Wispr Flow [Windows] or Superwhisper [Mac]). Voice transcription seems pretty helpful if you are trying to summarize what you've learned back to the LLM, which you should be doing.
Gemini also offers a quizzing feature. In the chat web app, just ask the model to make you a multiple-choice quiz, and it’ll generate that in a nice UI. I like the quizzes. Multiple-choice quizzes are sometimes nicer than free-response quizzes, which any model could provide.

Learning with Opus feels okay. Opus seems faster than Gemini or GPT-5-thinking, which is nice, but I always felt like I was learning slower with its explanations than Gemini’s. If you’re into the learn-by-answering-leading-questions thing (kinda Socratic learning), then Claude’s is okay and seems to ask better questions than Gemini

Learning complicated topics with GPT-5 feels confusing as its responses feel too terse. Maybe a good system prompt fixes this. ChatGPT offers a Socratic learning mode, but I haven’t used it.

This was an assumption baked into the analysis, which specifically defined vertical attention scores as attention toward a sentence. We had some results showing that token-level vertical attention tended to be more similar to other vertical attention scores within-sentence rather than between-sentence, which supports this assumption. However, we don't have any more formal results to report. However, even without such results, by looking at sentences, we are able to do analyses contrasting categories, which wouldn't be possible with tokens.

Our counterfactual/resampling approach is pretty similar to the main forking path analyses. We, however, examined sentences instead of tokens and specifically targeted reasoning models. These differences lead to some patterns that differ from the forking paper, which we think are interesting.

Reasoning models produce much longer outputs, even compared to base models told to "Think step by step"; on challenging MMLU questions and R1-Llama-3.3-70B produces CoTs that are ~10x more sentences than base Llama-3.3-70B told to think step-by-step. These longer outputs involve mechanisms like error correction. I suspect that if a base model makes some type of mistake then it is less likely to subsequently backtrack than a reasoning model, which very often backtracks (and for an obvious mistake, it has a very high likelihood of doing so). This type of error correction will presumably have a high impact on what tokens/sentences actually impact the final answer. In addition, I suspect reasoning models might be more likely to structure their CoT with plans/hierarchies. I do not have any evidence for this, but come to think of it, we should test this idea and compare it.

Our analyses, focusing on sentences and reasoning CoT, also allow us to do categorization or clustering approaches. For our work here, we labeled sentences as "plan generation" and so, and we were able to demonstrate how these types of sentences tend to have a large causal effect on the final outcome. Our paper discusses how plan-generation sentences often lead to active-computation (e.g., arithmetic) sentences, and our appendix also includes some light results on this, as a transition matrix between categories. For future work, we are exploring clustering approaches rather than just categorization. The capacity to do this seems more possible with sentences than tokens.

To be clear, this is done independently for each head and each layer. As in, for a given attention head in a given layer, we will compute the vertical attention score for each sentence. The sentence-by-sentence attention scores define a vector for each head.

We then compute the kurtosis of that vector, and this kurtosis is our measure of the head's "receiver-headness". We use the kurtosis because that is the standard measure of tailedness. From Wikipedia:

kurtosis (...) refers to the degree of “tailedness” in the probability distribution of a real-valued random variable.

In this context, high tailedness means that attention is narrowed to some sentences. i.e., if you had 100 sentences and 99 sentences received zero attention while one sentence received lots of attention, then the kurtosis would be very high. This is what we want. We want to measure how much a given attention head narrows attention to particular sentences.

In Figure 4 of the paper, see the distribution for head 6 of layer 36, which is spikey; that distribution has a high kurtosis, whereas the non-spikey distributions have lower kurtoses.

Hi, it's a number based on simulations. I didn't want to talk about statistical power, but if a study has 80% power (the traditional definition of "adequate sample size" in psychology/neuroscience), then 26% of significant p-values will be .01 < p < .05, i.e., #(.01 < p < .05) / #(p < .05)

This graph shows the relationship between statistical power and the percentage of p-values that will be .01 < p < .05: https://imgur.com/086tHUT

You can find some rates from actual studies in Figure 2 here: https://journals.sagepub.com/doi/pdf/10.1177/25152459251323480#page=6

Cool post. I have a neuro background, and I'm sometimes asked about "Is neuro actually informative for mech interp," so I'm interested in this point about CNC being the current paradigm. I have a few thoughts:

Are the paradigmatic ideas of mech interp from neuroscience?

You mention some examples of paradigmatic ideas:

The idea that networks "represent" things;
That these "representations" or computations can be distributed across multiple neurons or multiple parts of the network;
That these representations can be superposed on top of one another in a linear fashion, as in the 'linear representation hypothesis' (e.g. Smolensky, 1990);
That representations can form representational hierarchies, thus representing more abstract concepts on top of less abstract ones, such as the visual representational hierarchy.

These ideas are all from back in the 1960-1990s? My impression was that back then the different cognitive sciences, like neuroscience and AI were more mixed up. For example, Geoffrey Hinton worked in a psychology department briefly, and many of the big names of this age were "cognitive scientists." So in that sense, it's a reach to really call these neuroscience ideas?

That being said, there's another point that comes to mind that you didn't mentioned but I think can be more firmly called neuroscience: Neural networks organize themselves to efficiently encode information (https://en.wikipedia.org/wiki/Efficient_coding_hypothesis).

My impression is that CS departments mostly set aside the above theoretical ideas until the last five years, whereas neuroscience departments kept thinking about them. Additionally, although something like AlexNet used superposition and had polysemantic neurons, those weren't discussed until the last 2010s. Because neuroscience kept thinking about them whereas CS departments didn't, maybe it is fair to call them neuroscientific. However, I'm not sure how many theoretical advancements in computational neuroscience from 1990-2020 actually contributed to modern mech interp? Which would be an argument against calling them neuroscience.

Are neuroscientific methods used in mech interp?

You give some examples of methods:

Max-activating dataset examples are basically what Hubel and Wiesel (1959) (and many researchers since) used to demonstrate the functional specialisation of particular neurons.
Causal interventions, commonly used in mech interp, are the principle behind many neuroscience methods, including thermal ablation (burning parts of the brain), cooling (thus ‘turning off’ parts of the brain), optogenetic stimulation, and so on.
Key data analysis methods, such as dimensionality reduction or sparse coding, that are used extensively in computational neuroscience (and sometimes directly developed for it) are also used extensively in mech interp.

These examples are mentioned a lot when discussing neuroscience and mech interp. However, some of these parallels feel a bit more surface-level than they might first appear, and one might be able to claim a parallel between mech interp and any scientific field. For instance, ablating neurons in LLMs and the brain is very common. However, upregulating/downregulating something is the most basic type of experimental manipulation and is used by basically every scientific field. Maybe when comparing mech interp and neuroscience, it's generally worth stopping to ask: is the biggest similarity that the two are both working on neurons. If this part is set aside and you abstract a bit, can you make the same parallel to virtually every other scientific field?

Some other parallels between mech interp and neuro, however, are more niche and seemingly compelling. For example, I like the use of dimensionality reduction to visualize and search for cycles in activation space.

Where would mech interp be today if computational neuroscience never really existed. Would mech interp have come at the exact same methods? Something like ablation or upregulation I think undoubtedly. Maybe the tendency to use dimensionality reduction for visualization a bit less so (or something similar would have been developed but slightly different). It seems hard to make clear claims about where mech interp would be today if computational neuroscience didn't exist.

Are neuroscientific and mech interp findings similar?

You give an example:

And in many cases, the standards of what constitute a legitimate contribution to the field are the same. In both, for instance, a legitimate contribution might include a demonstration that a neuron (whether in a brain or an artificial neural network) appears be involved in an interesting representation or computation, such as the ‘Jennifer Anniston’ neuron (Quiroga et al. 2005) or the ‘Donald Trump’ neuron (Goh et al. 2021).

This is an interesting point that I haven't seen before. I think this is pretty fair and maybe a unique parallel, but it would be more correct to say that a legitimate contribution is that the brain/LLM performs some function using some specific interesting computation: e.g., hippocampal neurons often represent spatial information in terms of the association between distinct items (e.g., my monitor is above my desk), whereas LLMs do addition with possibly unintuitive circuits/mechanisms? However, when framed as so and distanced a bit from neurons, can we make a similar parallel to any scientific field? Don't most scientific endeavors try to decompose functions into more precise functions?

In the end, if you want to take any scientific field in the world, call it the existing paradigm for mech interp, and decide that paradigm can't be ML, then I can't imagine anything better than computational neuroscience... and this is a clear argument, but this seems like a low bar.

Thanks for sharing these thoughts!

A counterargument worth considering is that Measuring Faithfulness in Chain-of-Thought Reasoning (esp §2.3) shows that truncating the CoT often doesn't change the model's answer, so seemingly it's not all serving a functional purpose. Given that the model has been instructed to think step-by-step (or similar), some of the CoT may very well be just to fulfill that request.

Right, this would've been good to discuss. Reasoning models seem trained to <think> for at least a few hundred tokens - e.g., even if I just ask it "2 + 2 = ?" or any simple math question, the model will figure out a way to talk about that for a while. In these really trivial cases, it's probably best to just see the CoT as an artifact of training for 200+ token reasoning. Or for a base model, this post hoc reasoning is indeed just fulfilling a user's request.

In general, don't think evidence that CoT usually doesn't change a response is necessarily evidence of CoT lacking function. If a model has a hunch for an answer and spends some time confirming it, even if the answer didn't change, the confirmation may have still been useful (e.g., perhaps in 1-10% of similar problems, the CoT leads to an answer change).

Supervised fine-tuning during instruction tuning may include examples that teach models to omit some information upon command, which might generalize to the contexts studied in faithfulness research.

Are there concrete examples you're aware of where this is true?

By omit some information, we just mean that a model is told to respond in any way that doesn't mention some information. I'm having trouble finding proper large fine-tuning datasets, but I can find cases of benchmarks asking models questions like "What does the word ”jock” mean to you? ...reply without mentioning the word ”jock” throughout." Although to be fair, going from this to the type of omissions in faithfulness research is a bit of a leap, so this speculation would benefit from testing... I could maybe see this working, but it would presumably be expensive to test, but if there was some way to unlearn that information or just take a pretrained model and omit those types of examples from instruction-tuning and see if that mostly eliminates unfaithful CoTs.

This raises the interesting question of what if anything incentivizes models (those which aren't trained with process supervision on CoT) to make CoT plausible-sounding, especially in cases where the final answer isn't dependent on the CoT. Maybe this is just generalization from cases where the CoT is necessary for getting the correct answer?

That generalization seems right. "Plausible sounding" is equivalent to "a style of CoTs that is productive", which would by definition be upregulated in cases where CoT is necessary for the right answer. And in training cases where CoT is not necessary, I doubt there would specifically be any pressure away from such CoTs.

I'm glad to help. Since my initial reply 20 days ago, I also started wearing a smart watch to do some sleep tracking, and the watch said that I had a good balance of all the sleep stages.

I figure that even just the present limited data/anecdata is enough to encourage people to try it. Gaining 1 hour or so every day for the rest of your life is such an enormous benefit, and I suspect the cost of exploring this is pretty low (committing to lowered sleep for a week or two). I didn't need much of a "warm-up" period, and I responded well to lowering my sleep off the bat. I suspect this is because I was able to buy into Guzey's claims that a lot of tiredness is psychological and people just feel tired after 6 hours because they think they should. Buying into that seems critical. I know before I made this change, I would've reported that 6-hour nights would make me tired.

The data/anecdata that I think would be particularly valuable would be if someone initially reacted badly to going sub-7.5-hour, but ended up responding well after several months.

Talking about this, it honestly is kinda surprising that there is less research on this topic or at least community efforts, given how big the benefits are. Something like the Slime Mold potato diet work would be valuable. There would be challenges with people just self-reporting tiredness, but I still figure there could be some cool findings possible. This would be simple to organize too and for participants would be much less of a commitment than the potato diet. Maybe just have people do some 10-question questionnaire once a day that tries to measure sleepiness and/or general attitudes about the schedule along with reporting how much they slept.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments