This is the fourth entry in the "Which Circuit is it?" series. We will explore the notions of counterfactual faithfulness. This project is done in collaboration with Groundless. Last time, we opened the black box and saw how interventions can help us distinguish which subcircuit is a better explanation. This...
This is the third entry in the "Which Circuit is it?" series. We will explore possible notions of faithfulness as we consider interventions. This project is done in collaboration with Groundless. Last time, we delved into black-box methods to see how far observational faithfulness could take us in our toy...
This is the second entry in the "Which Circuit is it?" series. We will explore possible notions of faithfulness when explanations are treated as black boxes. This project is done in collaboration with Groundless. Last time, we introduced the non-identifiability of explanations. Succinctly, the problem is that many different (sub)circuits...
This is the introduction to the "Which Circuit is it?" sequence. We will develop some basic concepts and introduce the non-identifiability problem. This project is done in collaboration with Groundless. Recent work[1] has shown that, under commonly used criteria for Mechanistic Interpretability (MI), neural network explanations are not unique, not...
I have realized I want to contribute to AI Safety in any way I can. I am currently focused on interpretability, trying to make sense of the research out there, orienting myself, looking for other new ravers[1], and ultimately learning to enact its technics[2]. In that process, I am writing a...