For those who haven't been following along, for reasons described here I've been trying to get a big-picture understanding of brain modules outside what I call the “neocortex subsystem”. So then I was reading about the cerebellum and amygdala. Both the cerebellum and amygdala run supervised learning algorithms (ref, ref), although the amygdala seems to do other things too. So, here I am, writing yet another blog post about why it might be useful for a brain to have self-contained modules that do supervised learning, at a big-picture level.
So far I've blogged about three reasons that a brain benefits from supervised learning:
This post is about yet a fourth thing, which is sorta similar to the first one but can happen as a separate module outside the neocortex. I belatedly got this when it was pointed out to me that the Dorsal Cochlear Nucleus (DCN) seems to have cerebellum-like circuitry, but doesn't fit into any of the above three stories, at least not straightforwardly (to me). Instead the DCN is one of several structures that sit between the cochlea—where sound information is first encoded as nerve impulses—and the auditory processing systems in the neocortex and midbrain. The consensus seems to be that it’s some sort of input preprocessor / filter, doing things like filtering out the sound of your jaw moving.
Can a supervised learning system really do that? How? Let’s put aside the neuroscience for a moment, and just think about it! Thinking about algorithms is fun.
Background: The predictive coding compression algorithm
Long before predictive coding was a hot topic in computational neuroscience, predictive coding was a family of compression algorithms. The basic idea is: the encoder (compressor) has a deterministic model that tries to predict the next datapoint, and it only stores the errors in that model's predictions. These errors tend to be small numbers, so they don’t take as much space to store. Then the decoder (decompressor) runs the process in reverse, using the same deterministic model to reconstruct what the encoder was doing, each time adding in the stored error to get the original datapoint. As I’m describing it, this is a lossless compression algorithm, although there are lossy variants.
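To make that concrete, here's a minimal sketch of the encode / decode loop. The predictor is a toy one I'm assuming purely for illustration ("the next sample equals the last one"), and the names `predict`, `encode`, and `decode` are made up; real codecs use much better models.

```python
def predict(history):
    """Toy deterministic model: guess that the next sample repeats the last one."""
    return history[-1] if history else 0


def encode(samples):
    """Keep only the prediction errors (residuals), not the samples themselves."""
    history, residuals = [], []
    for x in samples:
        residuals.append(x - predict(history))
        history.append(x)
    return residuals


def decode(residuals):
    """Rerun the same model, adding each stored error back onto its prediction."""
    history = []
    for r in residuals:
        history.append(predict(history) + r)
    return history


samples = [100, 101, 103, 104, 104, 105, 107]
residuals = encode(samples)
print(residuals)  # [100, 1, 2, 1, 0, 1, 2]
assert decode(residuals) == samples  # lossless round trip
```

After the first entry, the residuals are all small, which is where the space savings would come from once they go through an entropy coder (not shown). Note that `decode` only works because it calls the exact same `predict` as `encode`.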
Why might a brain want to compress input data?
I guess that some brain systems have an easier time processing a trickle of high-entropy input data than a flood of low-entropy (i.e. redundant) input data. To take an extreme case, if there’s some new sensory data that’s 100% redundant with what that brain system has already figured out, then sending that data is a total waste of processing power. Right? Seems plausible, I guess.
I’m calling it “compression”, but you can equally call it “filtering”—specifically, filtering out the redundant information.
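Here's a toy sketch of the "filtering" view, loosely modeled on the jaw-movement example. Everything in it is an assumption for illustration, not a claim about actual DCN circuitry: the predictor is a single-weight linear model trained online with the delta rule, the "motor" signal is a sine wave, and the outside sound is Gaussian noise. The learner is trained to predict the incoming sound from the motor signal, and only the prediction error gets passed along.

```python
import math
import random


def filter_predictable(inputs, contexts, learning_rate=0.05):
    """Online linear predictor (delta rule); returns the prediction errors.

    inputs:   the raw signal, one number per time step
    contexts: per-step feature lists the predictor is allowed to use
    """
    weights = [0.0] * len(contexts[0])
    errors = []
    for x, c in zip(inputs, contexts):
        prediction = sum(w * ci for w, ci in zip(weights, c))
        error = x - prediction  # this is what gets sent downstream
        errors.append(error)
        # Supervised update: the actual input serves as the training target.
        weights = [w + learning_rate * error * ci for w, ci in zip(weights, c)]
    return errors


def mean_square(xs):
    return sum(v * v for v in xs) / len(xs)


random.seed(0)
steps = 2000
motor = [math.sin(0.3 * t) for t in range(steps)]        # hypothetical jaw signal
external = [random.gauss(0, 0.1) for _ in range(steps)]  # hypothetical outside sound
heard = [3.0 * m + e for m, e in zip(motor, external)]   # self-noise + outside sound

errors = filter_predictable(heard, [[m] for m in motor])

# Compare signal power before and after filtering, once learning has settled.
print(mean_square(heard[-500:]), mean_square(errors[-500:]))
```

Once the weight has converged, the error stream carries far less power than the raw input, and what's left is roughly the outside sound, i.e. the part that the motor signal couldn't predict.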
Isn’t decoding the compressed data, umm, impossible?
In a predictive-coding compression algorithm, whatever system receives the compressed data needs to know the exact predictive model that was used for compression. Otherwise, how would it have any idea what the bits mean? Well, this poses a problem. Remember the example above: The DCN learns a predictive model of input auditory data by supervised learning, and passes on the errors to the neocortex. The predictive model is some complicated pattern of synapses in the DCN—and worse, it is changing over time. There’s no obvious way that the neocortex can have access to this predictive model. So how can it possibly know what the compressed bits mean?
I don't have a particularly well-formed mathematical theory here, but I don't think this is an unsolvable problem. Consider the following intuitions:
OK. This isn't a particularly deep or careful or neuroscience-grounded analysis, but the basic concept seems plausible enough to me!
So, going forward, when I look at the DCN, or anything else with cerebellum-like circuitry (including the cerebellum itself), one of my candidate stories for what the thing is doing will be “this thing is doing input data compression / filtration using predictive coding”.
By the way, I doubt anything in this post is terribly important for answering my burning questions related to ensuring that possible future neocortex-like AGIs are safe and beneficial and leading us towards the awesome post-AGI utopia we're all hoping for, with a superintelligent Max Headroom personal assistant in every smart toaster and all that. Hmm, let me think. Maybe it would make interpretability marginally harder. Really, the main thing is that I'm chipping away at the number of things in the brain that are “unknown unknowns” to me.