Statistical suggestions for mech interp research and beyond
I am currently a MATS 8.0 scholar studying mechanistic interpretability with Neel Nanda. I'm also a postdoc in psychology/neuroscience. My perhaps most notable paper analyzed the last 20 years of psychology research, searching for trends in what papers do and do not replicate. I have some takes on statistics.

tl;dr

- Small p-values are nice. Unless they're suspiciously small.
- Statistical assumptions can be bent. Except for independence.
- Practical significance beats statistical significance. Although practicality depends on context.
- The measure is not the confound. Sometimes it's close enough.
- Readability often beats rigor. But fearing rigor means you probably need it.
- Simple is better than complex. Complex is better than wrong. Complex wrongs are the worst. But permutation tests can help reveal them.

This post offers advice for frequentist and classifier-based analysis. I try to focus on points that are practical, diverse, and non-obvious. I emphasize relevance for mechanistic interpretability research, but this post will hopefully be interesting for people working in any area, including non-researchers who just sometimes read scientific papers.[1]

1. If you want to make a claim with p-values, don't settle for p = .02

A p-value is the probability that, if the null hypothesis were true, random data would produce an effect at least as strong as the one observed. P-values are the cornerstone of null hypothesis significance testing, where researchers test a claim by attempting to rule out a null hypothesis, typically the assumption that there is no effect or no difference between groups.

1.1. Some history and culture

When frequentist statistics were first developed at the turn of the 20th century, there were competing mindsets: Fisher proposed that p-values should be interpreted on a continuous scale, where smaller p-values correspond to stronger evidence against the null. Neyman and Pearson would later argue for a more rigid decision-theoretic framework, where p-values are compared against a pre-specified significance threshold (e.g., α = .05) to reach a binary reject-or-retain decision.
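To make the p-value definition from section 1 concrete, here is a minimal permutation-test sketch in Python (the function name, group sizes, and toy data are my own illustrative choices, not anything from a specific paper). It shuffles group labels many times and reports how often the shuffled data produce a mean difference at least as large as the observed one, which is exactly the "random data beats the observed effect" probability described above; it also ties back to the permutation tests mentioned in the tl;dr.

import numpy as np

rng = np.random.default_rng(0)

def permutation_p_value(group_a, group_b, n_permutations=10_000):
    """Two-sided permutation test for a difference in means.

    The p-value is the fraction of label shufflings whose absolute
    mean difference is at least as large as the observed one, i.e.,
    how often purely "random data" produce an equally strong effect.
    """
    group_a, group_b = np.asarray(group_a), np.asarray(group_b)
    observed = abs(group_a.mean() - group_b.mean())
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:n_a].mean() - shuffled[n_a:].mean())
        count += diff >= observed
    # Add-one correction so the estimated p-value is never exactly 0.
    return (count + 1) / (n_permutations + 1)

# Toy example: two hypothetical conditions (e.g., a metric with and
# without an intervention), drawn from normal distributions.
treated = rng.normal(0.8, 1.0, size=30)
control = rng.normal(0.0, 1.0, size=30)
print(permutation_p_value(treated, control))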