I am currently a MATS 8.0 scholar studying mechanistic interpretability with Neel Nanda. I’m also a postdoc in psychology/neuroscience. Perhaps my most notable paper analyzed the last 20 years of psychology research, searching for trends in what papers do and do not replicate. I have some takes on statistics.

tl;dr...
This piece is based on work conducted during MATS 8.0 and is part of a broader aim of interpreting chain-of-thought in reasoning models.

tl;dr
* Research on chain-of-thought (CoT) unfaithfulness shows how models’ CoTs may omit information that is relevant to their final decision.
* Here, we sketch hypotheses for...
This post is adapted from our recent arXiv paper. Paul Bogdan and Uzay Macar are co-first authors on this work.

TL;DR
* Interpretability of chains-of-thought (CoTs) produced by LLMs is challenging:
  * Standard mechanistic interpretability studies a single token's generation, but CoTs are sequences of reasoning steps that use thousands...
I have been poking around with LLMs, and I found some results that seem broadly interesting.

Summary

Introduction: Large language models (LLMs) are usually structured as repeated transformer layers of the same size. However, this architecture is often described as functionally hierarchical, with earlier layers focusing on small patches of...
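As a minimal sketch of what "repeated transformer layers of the same size" means (all module choices and hyperparameters here are illustrative, not taken from the post): a GPT-style model is the same block, with identical dimensions, stacked `n_layers` times, so any functional hierarchy across depth is emergent rather than built into the architecture.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block; every block in the stack has an identical shape."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm attention with a residual connection.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Pre-norm MLP with a residual connection.
        x = x + self.mlp(self.ln2(x))
        return x

# The stack: the same block, same size, repeated n_layers times.
d_model, n_heads, n_layers = 64, 4, 8
model = nn.Sequential(*[Block(d_model, n_heads) for _ in range(n_layers)])

x = torch.randn(1, 16, d_model)  # (batch, sequence, d_model)
print(model(x).shape)            # torch.Size([1, 16, 64])
```

Nothing in this stack distinguishes layer 1 from layer 8 structurally; claims like "earlier layers focus on small patches" are statements about what the layers learn to do, not about their architecture.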