unruly abstractions

Comparing Across Possible Worlds

This is the fourth entry in the "Which Circuit is it?" series. We will explore notions of counterfactual faithfulness. This project is done in collaboration with Groundless. Last time, we opened the black box and saw how interventions can help us distinguish which subcircuit is a better explanation. This...

Mar 24

Poking and Editing the Circuits

This is the third entry in the "Which Circuit is it?" series. We will explore possible notions of faithfulness as we consider interventions. This project is done in collaboration with Groundless. Last time, we delved into black-box methods to see how far observational faithfulness could take us in our toy...

Mar 24

How Far Can Observation Take Us?

This is the second entry in the "Which Circuit is it?" series. We will explore possible notions of faithfulness when explanations are treated as black boxes. This project is done in collaboration with Groundless. Last time, we introduced the non-identifiability of explanations. Succinctly, the problem is that many different (sub)circuits...

Mar 23

Intro: Non-Identifiability of Explanations

This is the introduction to the "Which Circuit is it?" sequence. We will develop some basic concepts and introduce the non-identifiability problem. This project is done in collaboration with Groundless. Recent work[1] has shown that, under commonly used criteria for Mechanistic Interpretability (MI), neural network explanations are not unique, not...

Mar 9

Category-Theoretic Wanderings into Interpretability

I have realized I want to contribute to AI Safety in any way I can. I am currently focused on interpretability, trying to make sense of research out there, orienting myself, looking for other new ravers[1], and ultimately learning to enact its technics[2]. In that process, I am writing a...

Sep 2, 2025