An examination of the effect of model size on interpretability

1. Executive summary

This is the third installment in a series of analyses exploring basic AI mechanistic interpretability techniques. Part 1 and Part 2 in this series are summarized below, and a review of those articles will provide the reader...
Examining an AI model’s focus on form vs. ideas

1. Executive summary

This is the second installment in a series of analyses exploring basic AI mechanistic interpretability techniques. Part 1 in this series is summarized below, and a review of that article provides helpful context for the analysis that follows. Key...
An exploration of SAEs applied to a small LLM

1. Executive summary

Findings
* The application of residual stream sparse autoencoders (“SAEs”) to GPT-2 small reliably illustrates fundamental interpretability concepts, including feature identification, activation levels, and activation geometry.
* For each category of sample text strings tested:
  * Both peak...
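To make the SAE concepts named above concrete, here is a minimal, self-contained sketch of how a residual stream sparse autoencoder produces feature activations. It uses random toy weights in place of a trained SAE and a toy dimensionality in place of GPT-2 small's 768-dimensional residual stream; all names (`sae_features`, `W_enc`, etc.) are illustrative assumptions, not the tooling used in the series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: GPT-2 small's residual stream is 768-dimensional, and trained
# SAEs typically expand it by a large factor; we shrink both for illustration.
d_model, d_sae = 8, 32

# Random stand-ins for trained SAE weights (illustrative, not a real SAE).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = -0.5 * np.ones(d_sae)   # negative encoder bias encourages sparsity
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

def sae_features(resid):
    """Encode a residual-stream vector into sparse feature activations (ReLU)."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

def sae_reconstruct(acts):
    """Decode feature activations back into a residual-stream approximation."""
    return acts @ W_dec

# A stand-in residual stream vector, as would be captured at one token position.
resid = rng.normal(size=d_model)
acts = sae_features(resid)

# "Peak activation": the feature firing most strongly on this input.
peak_feature = int(np.argmax(acts))
# Fraction of features active at all — the sparsity the "S" in SAE refers to.
frac_active = float(np.mean(acts > 0))
```

In a real setup the residual stream vector would be captured from a live forward pass of GPT-2 small at a chosen layer, and the encoder/decoder weights would come from a trained SAE; the encode/ReLU/decode structure, and reading off which features activate and how strongly, is the same.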