Epistemic status: Quite confident about the neuroscience methodology (it's part of my PhD thesis work, and is published in a peer-reviewed journal). Uncertain about direct applicability to AI interpretability—the problem is similar but I haven't tested it on LLMs yet. This is "here's a tool that worked in a related domain" not "here's the solution to interpretability."
TL;DR: Neuroscientists face the same interpretability problem as AI safety researchers: complex, inscrutable systems with thousands of parameters that transform inputs to outputs. I worked on a systematic method to find the minimal features that predict responses under specific conditions. For cortical neurons with thousands of morphological/biophysical parameters, just three features (spatial input distribution, temporal integration window, recent activation history) predicted responses with 97% accuracy. The approach of searching systematically for sufficient, interpretable features which are relevant under a given condition seems directly applicable to mechanistic interpretability of artificial neural networks.
Wait, we're solving the same problem
Here's something that surprised me as I neared the end of my PhD and started looking around for what I wanted to do next: neuroscientists and AI interpretability researchers are basically working on the same problem, but we rarely talk to each other.
Both of us have complex, multilayered systems that do something interesting when you give them inputs. Both of us would really like to know what computation they're actually performing. Both of us have way too many parameters to reason about all of them simultaneously.
In neuroscience, I had single-neuron models with:
Thousands of compartments representing dendritic morphology, receiving thousands of input synapses from different presynaptic cell types
Multiple voltage-gated ion channel types with nonlinear dynamics, distributed all across the dendrite
Complex spatiotemporal patterns of synaptic activation
In AI interpretability, you have models with:
Billions of parameters across dozens of layers
Nonlinear activation functions and attention mechanisms
Complex token representations and positional encodings
Admittedly the scales are a little different, but the standard approaches in both fields are kind of similar too: build detailed mechanistic models, then... stare at them really hard and hope that understanding falls out? Neuroscience has its biophysically detailed multi-scale models, including decades of billion-euro funded work. AI has circuit identification and activation patching. Both give you descriptions of what is happening in the system, but descriptions aren't the same as understanding if they're still too complex for a human to actually interpret.
What If We're Asking the Wrong Question?
Neuroscientists spend a lot of time trying to understand everything about how cortical neurons compute. We want to know how every dendritic branch contributes, how calcium spikes interact with sodium spikes, how NMDA receptors enable nonlinear integration—all of it.
What if most of that complexity doesn't matter for the specific behaviour I care about?
Not "doesn't matter" in the sense that it's not happening, neurons definitely have calcium spikes and NMDA nonlinearities. But "doesn't matter" in the sense that you could predict the neuron's output just fine in some cases without modelling all that detail.
This led to a different question: What's the minimal set of features that can predict the system's behaviour under the conditions I actually care about?
This is the question that I worked on together with my colleague Arco Bast, first during my master's thesis and then continuing its development during my PhD.
The Methodology: Systematic Reduction
Quick neuroscience introduction
Neurons in the cerebral cortex receive thousands of inputs per second from thousands of other neurons. They receive these inputs onto their “dendrites”, which branch off from the cell body, in the form of “synapses”, which are the connection points between two neurons. Cortical neurons use discrete signals—they either produce an output spike, or they don’t. Revealing how synaptic inputs drive spiking output remains one of the major challenges in neuroscience research.
Pick a condition you care about
There's a temptation to want general interpretability—to understand the model in all contexts. The problem is, you tend to face some kind of trade-off between accuracy, interpretability and generalisability (pick two).
We went with sensory processing of passive whisker touch in anaesthetised animals, which is a well-characterised condition for which lots of experimental data exists, and for which we have built a highly detailed multi-scale model from this data.
Start with full complexity
We didn’t make any top-down assumptions about what the input-output computation of the neurons could look like. We started with biophysically detailed multi-compartmental neuron models embedded in an anatomically realistic network model. These models can reproduce calcium spikes, backpropagating action potentials, bursting, the whole repertoire of cortical neuron activity. They've been validated against experimental data, and when we simulate sensory responses, they match what we see in actual rat brains.
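To make "biophysically detailed" concrete for readers outside neuroscience, here is a deliberately tiny sketch using the NEURON simulator, a standard tool for this kind of modelling. It is a toy two-compartment cell with one synapse, nowhere near the detailed models used in this work (thousands of compartments and synapses, validated channel distributions); it only illustrates the kind of object we start from.

```python
# Toy multi-compartment model in NEURON (illustrative only, not the models from this work).
from neuron import h
h.load_file("stdrun.hoc")          # standard run system (finitialize/continuerun)

soma = h.Section(name="soma")
dend = h.Section(name="dend")
dend.connect(soma(1))              # attach dendrite to the soma
soma.L = soma.diam = 20            # μm
dend.L, dend.diam, dend.nseg = 500, 2, 11

soma.insert("hh")                  # Hodgkin-Huxley Na/K channels -> spiking
dend.insert("pas")                 # passive dendrite (real models add many active channels here)

syn = h.ExpSyn(dend(0.5))          # one excitatory synapse halfway along the dendrite
syn.tau = 2
stim = h.NetStim(); stim.number, stim.start, stim.interval = 5, 10, 10
nc = h.NetCon(stim, syn); nc.weight[0] = 0.05

v = h.Vector().record(soma(0.5)._ref_v)   # somatic membrane potential
t = h.Vector().record(h._ref_t)
h.finitialize(-65)
h.continuerun(100)                 # 100 ms of simulated time
print(f"peak somatic voltage: {max(v):.1f} mV")
```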
Search for predictive features
Instead of hypothesising which features of the input might be important for predicting the neuron's output, we systematically searched for them. We spent quite some time iteratively trying different ways of grouping and weighting synaptic inputs, and then comparing the prediction accuracy of the resulting reduced models, eventually deciding to group by:
Time of activation (1ms bins): was this synapse active 1ms ago? 5ms ago? 50ms ago?
Distance from soma (50μm bins): is this synapse close to the cell body, where the output spike is initiated, or way out in the dendrites?
Excitatory vs inhibitory: you can generally think of excitatory synapses as positively weighted connections that make the target neuron more likely to produce an output spike, and inhibitory synapses as the opposite
Then we used optimisation to find weights for each group that maximised prediction accuracy. Basically: "How much should I weight an excitatory synapse that's 300μm from the soma and was active 4ms ago to predict if the neuron spikes right now?"
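As a rough sketch of what this grouping-and-weighting step could look like in code: bin each synaptic activation by time lag, distance and E/I identity, then fit one weight per bin with a logistic-regression GLM. This is not our actual pipeline; the bin ranges, the sklearn-based fit and the synthetic data below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

N_LAG, N_DIST = 50, 10   # assumed ranges: 50 ms of history, 500 μm of dendrite

def featurise(events, t_now):
    """events: iterable of (t_ms, distance_um, is_excitatory) synaptic activations."""
    x = np.zeros((2, N_LAG, N_DIST))
    for t, d, exc in events:
        lag, dbin = int(t_now - t), int(d // 50)        # 1 ms and 50 μm bins
        if 0 <= lag < N_LAG and 0 <= dbin < N_DIST:
            x[0 if exc else 1, lag, dbin] += 1          # count active synapses per group
    return x.ravel()

# Toy stand-in for simulation output: random synaptic events and random spike labels.
# In the real workflow, both come from simulating the detailed model.
rng = np.random.default_rng(0)
X = np.array([featurise([(rng.uniform(0, 50), rng.uniform(0, 500), rng.random() < 0.8)
                         for _ in range(200)], t_now=50) for _ in range(500)])
y = rng.integers(0, 2, size=500)

glm = LogisticRegression(max_iter=1000).fit(X, y)
filters = glm.coef_.reshape(2, N_LAG, N_DIST)   # one discretised spatiotemporal filter per E/I
```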
This gave us spatiotemporal filters, which in this case are continuous functions describing how synaptic inputs at different times and locations contribute to output.
We took those filters and built generalised linear models (GLMs). Through testing, it turned out that we also needed to include the neuron's own spike history, because real neurons can't just fire arbitrarily fast. Basically:
weighted_net_input = Σ over active synapses of spatial_filter(distance) × temporal_filter(time_ago)
P(spike) = nonlinearity(weighted_net_input - post_spike_penalty)
That's it. Count active synapses, weight them by location and timing, subtract a penalty if the neuron just fired, pass through a nonlinearity. Analytically tractable, interpretable, fast to simulate.
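As a minimal sketch of that prediction step (the filter shapes and constants below are placeholders, not the published filters; only the structure reflects the reduced model described above):

```python
import numpy as np

def spike_probability(active_synapses, spike_history,
                      spatial_filter, temporal_filter, history_filter):
    """Reduced-model prediction for one time step.

    active_synapses: (distance_um, ms_since_active, sign) per synapse; sign = +1 excitatory, -1 inhibitory.
    spike_history:   times (ms) since the neuron's own recent output spikes.
    """
    drive = sum(sign * spatial_filter(d) * temporal_filter(dt)
                for d, dt, sign in active_synapses)
    penalty = sum(history_filter(dt) for dt in spike_history)
    return 1.0 / (1.0 + np.exp(-(drive - penalty)))     # logistic nonlinearity

# Placeholder filters, purely illustrative:
spatial = lambda d: np.exp(-d / 200.0)            # contribution decays with distance from soma (μm)
temporal = lambda dt: np.exp(-dt / 5.0)           # and with time since activation (ms)
refractory = lambda dt: 5.0 * np.exp(-dt / 3.0)   # strong penalty right after the neuron's own spike

p = spike_probability(
    active_synapses=[(100, 2, +1), (350, 4, +1), (80, 1, -1)],
    spike_history=[6.0],
    spatial_filter=spatial, temporal_filter=temporal, history_filter=refractory)
print(f"P(spike) = {p:.3f}")
```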
What the reduced model told us about neuronal computation
Despite all that biophysical complexity—the ion channels, the calcium spikes, the NMDA nonlinearities—the input-output computation for anaesthetised sensory responses reduced to:
Spatial weighting: Inputs within ~500μm of the soma matter; distal inputs don't contribute much. Contribution decays gradually with distance.
Temporal integration: Excitatory and inhibitory inputs matter most within a time window roughly corresponding to the time courses of their respective ion channels.
Refractory dynamics: If the neuron just fired, it's less likely to fire again immediately.
The reduced model predicted action potential output with 97% accuracy.
And here's the really interesting part: We tested this across seven different neuron models with wildly different morphologies and biophysical parameters: different dendritic branching patterns, different ion channel densities and distributions, different trunk diameters. They all performed qualitatively the same computation. The filters had slightly different shapes (e.g. scaling with dendrite thickness), but the core transformation was identical.
[Figure: Reduced models for 7 different neuron models]
The Insights That Might Matter for AI
1. Focus on a specific condition
In neuroscience, other approaches have tried to build models that captured neuronal responses in all possible experimental conditions (e.g. Beniaguev et al. (2021), who used an 8-layer deep neural network to represent a single neuron). These models end up being so complex that they aren't interpretable. When we constrained to one specific condition, we could actually understand what was happening.
For AI safety: it might be better to prioritise deeply understanding behaviour in safety-critical conditions than shallowly understanding behaviour in general.
If you want to prevent deceptive alignment, you don't need to understand everything GPT-4 does. You need to understand what it does when deception would be instrumentally useful. Build a reduced model for that condition, and make it simple enough to reason about.
Different conditions can have different reduced models, and that's fine. Understanding is compositional.
2. Focus on computation, not implementation
When I analysed what drives response variability—why different neurons respond differently to the same stimulus—I found that network input patterns (which synapses are active when) were the primary determinant of response differences, while morphological diversity and biophysical properties had only a minor influence.
What does this mean? Two neurons with completely different "architectures" perform the same computation. The variability in their outputs comes almost entirely from variability in their inputs, not their internal structure.
This suggests a plausible general approach: focus interpretability first on input patterns and their transformation, not on cataloguing implementation details.
For AI: maybe instead of trying to understand every circuit in GPT-4, we could ask: what input patterns lead to concerning behaviours? What's the minimal transformation from inputs to those behaviours, and can that help us to understand what's going on in the model?
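To sketch the shape of that workflow (synthetic data only; how you would actually featurise prompts or label a concerning behaviour is exactly the open question): gather examples from the condition you care about, label whether the behaviour occurred, and ask how few interpretable input features suffice to predict it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical setup: one row per prompt from a safety-relevant condition,
# candidate interpretable input features as columns, and a binary label for
# whether the concerning behaviour occurred. All values below are synthetic.
rng = np.random.default_rng(0)
n_prompts, n_features = 400, 20
X = rng.normal(size=(n_prompts, n_features))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=n_prompts)) > 0

# A sparse linear "reduced model": how few features predict the behaviour, and how well?
reduced = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
acc = cross_val_score(reduced, X, y, cv=5).mean()
reduced.fit(X, y)
kept = np.flatnonzero(reduced.coef_)              # the minimal feature set that survives
print(f"cross-validated accuracy {acc:.2f} using {kept.size} of {n_features} features")
```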
Important Caveats
This worked for one condition: We explicitly focused on passive single-whisker deflections in anaesthetised rats. This was a deliberate choice: we traded generality for interpretability. But it means more complex conditions might need more complex reduced models, and you might need multiple models to cover multiple conditions.
When is simple reduction possible? Some behaviours might not admit simple reduced descriptions. For neurons, active whisking (vs passive touch) requires additional features. For LLMs, some behaviours might be irreducibly complex.
Scale: I worked with single neurons receiving thousands of inputs. LLMs have billions of parameters, and context windows keep getting longer.
Wild Speculation Section
Some half-baked ideas that might be interesting:
Compositional models: Neuroscience has found that the same neuron performs different computations under different conditions (passive touch vs. active exploration, anaesthetised vs. awake). Could LLMs have compositional reduced models—different minimal descriptions for different contexts that get flexibly combined?
Training dynamics: I reduced neurons at one point in time. What if you tracked how the reduced model changes during a model’s training? Could you see a phase transition when the model suddenly learns a new feature or strategy?
Universality: I found the same computation across morphologically and biophysically diverse neurons. Is there universality in neural networks? Do different architectures or training runs converge to the same reduced model for the same task?
I think there's value in cross-pollination here. Neuroscience has been forced to develop systematic approaches to interpretability because we were struggling to understand biological neural networks due to their many interacting parts (we can't even measure everything at the same time; AI research should have an advantage here!). AI safety is hitting the same constraint with large language models.
Background: I just finished my PhD in neuroscience at the Max Planck Institute for Neurobiology of Behavior. My thesis focused on modelling structure-function relationships in neurons and biological neural networks. Now I'm trying to pivot into AI safety because, honestly, I think preventing AGI from nefariously taking over the world is more urgent than understanding rat whisker processing, and I think transferring established methods and approaches from neuroscience to AI makes sense.