Predictive Coding has been Unified with Backpropagation

by lsusr · 1 min read · 2nd Apr 2021 · 42 comments


Predictive Processing · Neuroscience · Machine Learning · AI

Artificial Neural Networks (ANNs) are based around the backpropagation algorithm. The backpropagation algorithm allows you to perform gradient descent on a network of neurons. When we feed training data through an ANN, we use the backpropagation algorithm to tell us how the weights should change.

ANNs are good at inference problems. Biological Neural Networks (BNNs) are good at inference too. ANNs are built out of neurons. BNNs are built out of neurons too. It makes intuitive sense that ANNs and BNNs might be running similar algorithms.

There is just one problem: BNNs are physically incapable of running the backpropagation algorithm.

We do not know quite enough about biology to say it is impossible for BNNs to run the backpropagation algorithm. However, "a consensus has emerged that the brain cannot directly implement backprop, since to do so would require biologically implausible connection rules"[1].

The backpropagation algorithm has three steps.

  1. Flow information forward through a network to compute a prediction.
  2. Compute an error by comparing the prediction to a target value.
  3. Flow the error backward through the network to update the weights.

The backpropagation algorithm requires information to flow forward and backward along the network. But biological neurons are one-directional. An action potential goes from the cell body down the axon to the axon terminals to another cell's dendrites. An action potential never travels backward from a cell's terminals to its body.
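
Here is a minimal numpy sketch of those three steps for a tiny two-layer network. It is purely illustrative: the layer sizes, activation function, and loss are arbitrary choices, not anything specific to the paper discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network with arbitrary example sizes.
W1 = rng.normal(size=(4, 3)) * 0.1   # hidden-layer weights
W2 = rng.normal(size=(1, 4)) * 0.1   # output-layer weights
x = rng.normal(size=(3, 1))          # one input
target = np.array([[1.0]])           # its target value
lr = 0.1                             # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Flow information forward through the network to compute a prediction.
h = sigmoid(W1 @ x)
prediction = W2 @ h

# 2. Compute an error by comparing the prediction to the target value.
error = prediction - target          # gradient of 0.5 * (prediction - target)**2

# 3. Flow the error backward through the network to update the weights.
grad_W2 = error @ h.T
delta_h = (W2.T @ error) * h * (1 - h)   # error pushed back through the sigmoid
grad_W1 = delta_h @ x.T

W2 -= lr * grad_W2
W1 -= lr * grad_W1
```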

Hebbian theory

Predictive coding is the idea that BNNs generate a mental model of their environment and then transmit only the information that deviates from this model. Predictive coding considers error and surprise to be the same thing. Hebbian theory is a specific mathematical formulation of predictive coding.

Predictive coding is biologically plausible. It operates locally. There are no separate prediction and training phases which must be synchronized. Most importantly, it lets you train a neural network without sending action potentials backwards.

Predictive coding is easier to implement in hardware. It is locally-defined; it parallelizes better than backpropagation; it continues to function when you cut its substrate in half. (Corpus callosotomy is used to treat epilepsy.) Digital computers break when you cut them in half. Predictive coding is something evolution could plausibly invent.
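
For contrast with the backprop sketch above, here is an equally minimal sketch of the classic Hebbian rule ("neurons that fire together wire together"). It is illustrative only; the detail that matters is that each weight changes using nothing but the activity of the two neurons it connects, so no error signal has to travel backwards across the network.

```python
import numpy as np

def hebbian_update(W, pre, post, lr=0.01):
    """Plain Hebbian rule: each weight W[i, j] changes by lr * post[i] * pre[j],
    using only information local to that synapse."""
    return W + lr * np.outer(post, pre)

# Example with arbitrary sizes.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)) * 0.1
pre = rng.normal(size=3)    # presynaptic activity
post = W @ pre              # postsynaptic activity
W = hebbian_update(W, pre, post)
```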

Unification

The paper Predictive Coding Approximates Backprop Along Arbitrary Computation Graphs[1] "demonstrate[s] that predictive coding converges asymptotically (and in practice rapidly) to exact backprop gradients on arbitrary computation graphs using only local learning rules." The authors have unified predictive coding and backpropagation into a single theory of neural networks. Predictive coding and backpropagation are separate hardware implementations of what is ultimately the same algorithm.
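
As a rough illustration of the kind of scheme the paper describes (my own sketch, not the authors' code): clamp the input and the target, relax the hidden value nodes against local prediction errors until they settle, then update each weight using only quantities available at that connection. The activation function, step sizes, and iteration count below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(v):                      # activation function (arbitrary choice)
    return np.tanh(v)

def f_prime(v):
    return 1.0 - np.tanh(v) ** 2

# A chain of value nodes v0 -> v1 -> v2 with arbitrary example sizes.
sizes = [3, 4, 1]
W = [rng.normal(size=(sizes[l + 1], sizes[l])) * 0.1 for l in range(2)]

x = rng.normal(size=(3, 1))    # input, clamped at the bottom
target = np.array([[1.0]])     # target, clamped at the top during learning

# Initialise the value nodes with an ordinary forward sweep.
v = [x]
for l in range(2):
    v.append(W[l] @ f(v[l]))
v[2] = target.copy()           # clamp the output node to the target

# Inference phase: relax the free value node against local prediction errors.
eta = 0.1
for _ in range(100):
    eps = [v[l + 1] - W[l] @ f(v[l]) for l in range(2)]   # local errors
    # v[1] is the only free node here; its update uses only its own error and
    # the error of the layer it feeds into -- nothing crosses the network.
    v[1] += eta * (-eps[0] + f_prime(v[1]) * (W[1].T @ eps[1]))

# Learning phase: each weight update uses only locally available quantities.
# At convergence the local errors approximate backprop's error signals.
alpha = 0.1
eps = [v[l + 1] - W[l] @ f(v[l]) for l in range(2)]
for l in range(2):
    W[l] += alpha * eps[l] @ f(v[l]).T
```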

There are two big implications of this unification.

  • This paper permanently fuses artificial intelligence and neuroscience into a single mathematical field.
  • This paper opens up possibilities for neuromorphic computing hardware.

  1. Source is available on arXiv.


42 comments

I have not dug into the math in the paper yet, but the surprising thing from my current perspective is: backprop is basically for supervised learning, while Hebbian learning is basically for unsupervised learning. In particular, Hebbian learning has been touted as an (inefficient but biologically plausible) algorithm for PCA. How can you chain a bunch of PCAs together and get gradient descent?

Aside from that, here's what I understood from the paper so far.

  • By predictive coding, they basically mean: take the structure of the computation graph (eg, the structure of the NN) and interpret it as a gaussian bayes net instead.
  • Calculating learning using only local information follows, therefore, from the general fact that bayes nets let you efficiently compute gradient descent (and some other rules, such as the EM algorithm) using only local information, so you don't have to perform automatic differentiation on the whole network.
  • So the local computation of gradient descent isn't surprising: it's standard for graphical models, it's just unusual for NNs. This is one reason why graphical models might be a better model of the brain than artificial neural networks.
  • The contribution of this paper is the nice correspondence between Gaussian bayes nets and NN backprop. I'm not really sure this should be exciting. It's not like it's useful for anything. If we were really excited about local learning rules, well, we already had some. 

Maybe the tremendous success of backprop lends some fresh credibility to bayes nets due to this correspondence. IE, maybe we are supposed to make an inference like: "I know backprop on NNs can be super effective, so I draw the lesson that learning for bayes nets (at least Gaussian bayes nets) can also be super effective, at a 100x slowdown." But this should have already been plausible, I claim. The machine learning community didn't really put bayes nets and NNs side by side and find bayes nets horribly lacking in learning capacity. Rather, I think the 100x slowdown was the primary motivator: bayes nets eliminate the need for an extra automatic differentiation step, but at the cost of a more expensive inference algorithm.

In particular, someone might take this as evidence that the brain uses Gaussian networks in particular, because we now know Gaussian approximates backprop, and we know backprop is super effective. I think this would be a mistaken inference: I don't think this provides much evidence that Gaussian bayes nets are especially intelligent compared to other Bayes nets.

On the other hand, the simplicity of the math for the Gaussian case does provide some evidence. It seems more plausible that the brain uses Gaussian bayes nets than, say, particle filters.

If they set the learning rate of their v updates to 1, they converge in a single backward pass*, since they then calculate precisely backprop. Setting it to less than that, and perhaps mixing up the pass order, merely obfuscates and delays this process, but it converges because any neuron without incorrect children has nowhere to go but towards correctness. And the entire convergence is for a single input! After which they manually do a gradient step on the weights as usual.

[Preliminary edit: I think this was partly wrong. Replicating...]

It's neat that you can treat activations and parameters by the same update rule, but then you should actually do it. Every "tick", replace the input and label and have every neuron update its parameters and data in lockstep, where every neuron can only look at its neighbors. Of course, this only has a chance of working if the inputs and labels come from a continuous stream, as they would if the input were the output of another network. They also notice the possibility of continuous data. And then one could see how its performance degrades as one speeds up the poor brain's environment :).

*: Which has to be in backward order and has to be done once more after the v update line.

Of course, this only has a chance of working if the inputs and labels come from a continuous stream, as they would if the input were the output of another network.

Predictive processing is thus well-suited for BNNs because the real-time sensory data of a living organism, including sensory data preprocessed by another network, is a continuous stream.

On page 8 at the end of section 4.1:

Due to the need to iterate the vs until convergence, the predictive coding network had roughly a 100x greater computational cost than the backprop network.

This seems to imply that artificial NNs are 100x more computationally efficient (at the cost of not being able to grow, and probably lower fault tolerance, etc.). Still, I'm updating toward simulating a brain requiring much less compute than the brain's neuron count would indicate.

I think that's premature. This is just one (digital, synchronous) implementation of one model of BNN that can be shown to converge on the same result as backprop. In a neuromorphic implementation of this circuit, the convergence would occur on the same time scale as the forward propagation.

Well, another advantage of the BNN is of course the high parallelism. But that doesn't change the computational cost (number of FLOPS required), it just spreads it out in parallel.  

That's assuming that the brain is using predictive coding to implement backprop, whereas it might instead be doing something that is more computationally efficient given its hardware limitations. (Indeed, the fact that it's so inefficient should make you update that it's not likely for the brain to be doing it)

Partly, yes. But partly the computation could be the cheap part compared to the things it's trading off against (ability to grow, fault tolerance, ...). It is also possible that the brain's architecture allows it to include a wider range of inputs that backprop might not be able to model (or not efficiently so).

I think there's another big implication of this:

Current neural nets seem to require more data (e.g. games of Starcraft) to reach the same level of performance as a human adult. There are different hypotheses as to why this is:

--Human brains are bigger (bigger NNs require less data, it seems?)

--Human brains have a ton of pre-training from which they can acquire good habits and models which generalize to new tasks/situations (we call this "childhood" and "education"). GPT-3 etc. show that something similar works for neural nets, maybe we just need to do more of this with bigger brains.

--Human brains have tons of instincts built in (priors?)

--Human brains have special architectures, various modules that interact in various ways (priors?)

--Human brains don't use Backprop; maybe they have some sort of even-better algorithm

I've heard all of these hypotheses seriously maintained by various people. If this is true it rules out the last one.

How does it "rule out" the last one??

It does provide a small amount of evidence against it, because it shows that one specific algorithm is "basically backprop". Maybe you're saying this is significant evidence, because we have some evidence that predictive coding is also the algorithm the brain actually uses.

But we also know there are algorithms which are way more data-efficient than NNs (while being more processing-power intensive). So wouldn't the obvious conclusion from our observations be: humans don't use backprop, but rather, use more data-efficient algorithms?

I'll grant, I'm now quite curious how the scaling argument works out. Is it plausible that human-brain-sized NNs are as data-efficient as humans?

I guess I was thinking: Brains use predictive coding, and predictive coding is basically backprop, so brains can't be using something dramatically better than backprop. You are objecting to the "brains use predictive coding" step? Or are you objecting that only one particular version of predictive coding is basically backprop?

But we also know there are algorithms which are way more data-efficient than NNs (while being more processing-power intensive). So wouldn't the obvious conclusion from our observations be: humans don't use backprop, but rather, use more data-efficient algorithms?

Are you referring to Solomonoff Induction and the like? I think the "brains use more data-efficient algorithms" is an obvious hypothesis but not an obvious conclusion--there are several competing hypotheses, outlined above. (And I think the evidence against it is mounting, this being one of the key pieces.)

I'll grant, I'm now quite curious how the scaling argument works out. Is it plausible that human-brain-sized NNs are as data-efficient as humans?

In terms of bits/pixels/etc., humans see plenty of data in their lifetime, a bit more than the scaling laws would predict IIRC. But the scaling laws (as interpreted by Ajeya, Rohin, etc.) are about the amount of subjective time the model needs to run before you can evaluate the result. If we assume for humans it's something like 1 second on average (because our brains are evaluating-and-updating weights etc. on about that timescale) then we have a mere 10^9 data points, which is something like 4 OOMs less than the scaling laws would predict. If instead we think it's longer, then the gap in data-efficiency grows.
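
A back-of-the-envelope for where the 10^9 comes from, assuming roughly 30 years of experience at about one data point per second:

$$30\ \text{years} \times 3.15 \times 10^{7}\ \tfrac{\text{s}}{\text{year}} \approx 10^{9}\ \text{seconds} \approx 10^{9}\ \text{data points}$$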

Some issues though. One, the scaling laws might not be the same for all architectures. Maybe if your context window is bigger, or you use recurrence, or whatever, the laws are different. Too early to tell, at least for me (maybe others have more confident opinions, I'd love to hear them!). Two, some data is higher-quality than other data, and plausibly human data is higher-quality than the stuff GPT-3 was fed--e.g. humans deliberately seek out data that teaches them stuff they want to know, instead of just dully staring at a firehose of random stuff. Three, it's not clear how to apply this to humans anyway. Maybe our neurons are updating a hundred times a second or something.

I'd be pretty surprised if a human-brain-sized Transformer was able to get as good as a human at most important human tasks simply by seeing a firehose of 10^9 images or context windows of internet data. But I'd also be pretty surprised (10%) if the scaling laws turn out to be so universal that we can't get around them; if it turns out that transformative tasks really do require a NN at least the size of a human brain trained for at least 10^14 steps or so where each step involves running the NN for at least a subjective week. (Subjective second, I'd find more plausible. Or subjective week (or longer) but with fewer than 10^14 steps.)

You are objecting to the "brains use predictive coding" step? Or are you objecting that only one particular version of predictive coding is basically backprop?

Yeah, somewhere along that spectrum. Generally speaking, I'm skeptical of claims that we know a lot about the brain.

Are you referring to Solomonoff Induction and the like?

I was more thinking of genetic programming.

I think the "brains use more data-efficient algorithms" is an obvious hypothesis but not an obvious conclusion--there are several competing hypotheses, outlined above.

I agree with this. 

(And I think the evidence against it is mounting, this being one of the key pieces.)

(I still don't see why.)

Yeah, somewhere along that spectrum. Generally speaking, I'm skeptical of claims that we know a lot about the brain.
"(And I think the evidence against it is mounting, this being one of the key pieces.)"
(I still don't see why.)

--I wouldn't characterize my own position as "we know a lot about the brain." I think we should taboo "a lot."

--We are at an impasse here I guess--I think there's mounting evidence that brains use predictive coding and mounting evidence that predictive coding is like backprop. I agree it's not conclusive but this paper seems to be pushing in that direction and there are others like it IIRC. I'm guessing you just are significantly more skeptical of both predictive coding and the predictive coding --> backprop link than I am... perhaps because the other hypotheses on my list are less plausible to you?

--I wouldn't characterize my own position as "we know a lot about the brain." I think we should taboo "a lot."

To give my position somewhat more detail:

  • I think the methods of neuroscience are mostly not up to the task. This is based on the paper which applied neuroscience methods to try to reverse-engineer the CPU.
  • I think what we have are essentially a bunch of guesses about functionality based on correlations and fairly blunt interventional methods (lesioning), combined with the ideas we've come up with about what kinds of algorithms the brain might be running (largely pulling from artificial intelligence for ideas).

I'm guessing you just are significantly more skeptical of both predictive coding and the predictive coding --> backprop link than I am... perhaps because the other hypotheses on my list are less plausible to you?

It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.) However:

  1. There are a lot of different algorithms resembling belief prop. Sticking within the big tent of "variational methods", there are a lot of different variational objectives, which result in different algorithms. The brain could be using a variation which we're unfamiliar with. This could result in significant differences from backprop. (I'm still fond of Hinton's analogy between contrastive divergence and dreaming, for example. It's a bit like saying that dreams are GAN-generated adversarial examples, and the brain trains to anti-learn these examples during the night, which results in improved memory consolidation and conceptual clarity during the day. Isn't that a nice story?)
  2. There are a lot of graphical models besides Bayesian networks. Many of them are "basically the same", but for example SPNs (sum-product networks) are very different. There's a sense in which Bayesian networks assume everything is neatly organized into variables already, while SPNs don't. Also, SPNs are fundamentally faster, so the convergence step in the paper (the step which makes predictive coding 100x slower than belief prop) becomes fast. So SPNs could be a very reasonable alternative, which might not amount to backprop as we know it.
  3. I think it could easily be that the neocortex is explained by some version of predictive coding, but other important elements of the brain are not. In particular, I think the numerical logic of reinforcement learning isn't easily and efficiently captured via graphical models. I could be ignorant here, but what I know of attempts to fit RL into a predictive-processing paradigm ended up using multiplicative rewards rather than additive (so, you multiply in the new reward rather than adding), simply because adding up a bunch of stuff isn't natural in graphical models. I think that's a sign that it's not the right paradigm.
  4. Radical Probabilism / Logical Uncertainty / Logical Induction makes it generally seem pretty probable, almost necessary, that there's also some "non-Bayesian" stuff going on in the brain (ie generalized-bayesian, ie non-bayesian updates). This doesn't seem well-described by predictive coding. This could easily be enough to ruin the analogy between the brain and backprop.
  5. And finally, reiterating the earlier point: there are other algorithms which are more data-efficient than backprop. If humans appear to be more efficient than backprop, then it seems plausible that humans are using a more data-efficient algorithm.

As for the [predictive coding -> backprop] link, well, that's not a crux for me right now, because I was mainly curious why you think such a link, if true, would be evidence against "the brain uses something other than backprop". I think I understand why you would think that now, though I still don't know what the mounting evidence is.

I think my main crux is the question: (for some appropriate architecture, ie, not necessarily transformers) do human-brain-sized networks, with human-like opportunities for transfer learning, achieve human-level data-efficiency? If so, I have no objection to the hypothesis that the brain uses something more-or-less equivalent to gradient descent.

--Human brains have special architectures, various modules that interact in various ways (priors?)
--Human brains don't use Backprop; maybe they have some sort of even-better algorithm

This is a funny distinction to me. These things seem like two ends of a spectrum (something like, the physical scale of "one unit of structure"; predictive coding is few-neuron-scale, modules are big-brain-chunk scale; in between, there's micro-columns, columns, lamina, feedback circuits, relays, fiber bundles; and below predictive coding there's the rules for dendrite and synapse change).


I wouldn't characterize my own position as "we know a lot about the brain." I think we should taboo "a lot."
I think there's mounting evidence that brains use predictive coding

Are you saying, there's mounting evidence that predictive coding screens off all lower levels from all higher levels? Like all high-level phenomena are the result of predictive coding, plus an architecture that hooks up bits of predictive coding together?

Daniel Kokotajlo was the person who originally pointed me to this article. Thank you!

There is no question that human brains have tons of instincts built-in. But there is a hard limit on how much information a single species' instincts can contain. It is implausible that human beings' cognitive instincts contain significantly more information than the human genome (750 megabytes). I expect our instincts contain much less.
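
For reference, 750 megabytes is roughly the raw information content of the genome itself:

$$\sim 3 \times 10^{9}\ \text{base pairs} \times 2\ \tfrac{\text{bits}}{\text{base pair}} \approx 6 \times 10^{9}\ \text{bits} \approx 750\ \text{MB}$$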

Human brains definitely have special architectures too, like the hippocampus. The critical question is how important these special architectures are. Are our special architectures critical to general intelligence or are they just speed hacks? If they are speed hacks then we can outrace them by building a bigger computer or writing more efficient algorithms.

There is no doubt that humans transmit more cultural knowledge than other animals. This has to do with language. (More specifically, I think our biology underpinning language hit a critical point around 50,000 years ago.) Complex grammar is not present in any non-human animal. Wernicke's area is involved. Wernicke's area could be a special architecture.

How important are the above human advantages? I believe that taking a popular ANN architecture and merely scaling it up will not enable a neural network to compete with humans at StarCraft with equal quantities of training data. If, in addition, the ANN is not allowed to utilize transfer learning then I am willing to publicly bet money on this prediction. (The ANN must be restricted to a human rate of actions-per-second. The ANN does not get to play via an API or similar hand-coded preprocessor. If the ANN watches videos of other players then that counts towards its training data.)

If the ANN can't use transfer learning, that's pretty unfair, since the human can. (It's not like a baby can play Starcraft straight out of the womb; humans can learn Starcraft but only after years of pre-training on diverse data in diverse environments)

Good point. Transfer learning is allowed but it still counts towards the total training data where "training data" is now everything a human can process over a lifetime.

Once you add this condition, are current state-of-the-art Starcraft-learning ANNs still getting more training data than humans?

It is implausible that human beings' cognitive instincts contain significantly more information than the human genome (750 megabytes). I expect our instincts contain much less.

Our instincts contain pointers to learning from other humans, which contain lots of cognitive info. The pointer is small, but that doesn't mean the resulting organism is algorithmically that simple.

That seems plausible, but AIs can have pointers to learning from other humans too. E.g. GPT-3 read the Internet, if we were making some more complicated system it could evolve pointers analogous to the human pointers. I think.

You guys will probably find this Slate Star Codex post interesting:

https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/

Scott summarizes the Predictive Processing theory, explains it in a very accessible way (no math required), and uses it to explain a whole bunch of mental phenomena (attention, imagination, motor behavior, autism, schizophrenia, etc.)

Can someone ELI5/TLDR this paper for me, explain in a way more accessible to a non-technical person?

- How does backprop work if the information can't flow backwards?
- In Scott's post, he says that when lower-level sense data contradicts high-level predictions, high-level layers can override lower-level predictions without you noticing it. But if the low-level sense data has high confidence/precision, the higher levels notice it and you experience "surprise". Which one of those is equivalent to the backprop error? Is it low-level predictions being overridden, or high-level layers noticing the surprise, or something else, like changing the connections between neurons to train the network and learn from the error somehow?

TLDR for this paper: There is a separate set of 'error' neurons that communicate backwards. Their values converge on the appropriate backpropagation terms.

A large error at the top levels corresponds to 'surprise', while a large error at the lower levels corresponds more to the 'override'.

Is it known whether predictive coding is easier to train than backprop? Local learning rules seem like they would be more parallelizable.

Every Model Learned by Gradient Descent Is Approximately a Kernel Machine seems relevant to this discussion:

We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples. The network architecture incorporates knowledge of the target function into the kernel.

Not particularly relevant, I think, but interesting nonetheless.

A first drawback of this paper is that its conclusion assumes that the NN underneath trains with gradient flow (GF), which is the continuous-time version of gradient descent (GD). This is a good assumption if the learning rate is very small, and the resulting GD dynamics closely track the GF differential equation.

This does not seem to be the case in practice. Larger initial learning rates help get better performance (https://arxiv.org/abs/1907.04595), so people use them in practice. If what people use in practice was well-approximated by GF, then smaller learning rates would give the same result. You can use another differential equation that does seem to approximate GD fairly well (http://ai.stanford.edu/blog/neural-mechanics/), but I don't know if the math from the paper still works out.

Second, as the paper points out, the kernel machine learned by GD is a bit strange in that the coefficients $a_i$ for weighing different $K(x, x_i)$ depend on $x$. Thus, the resulting output function is not in the reproducing kernel Hilbert space of the kernel that is purported to describe the NN. As a result, as kernel machines go, it's pretty weird. I expect that a lot of the analysis about the output of the learning process (learning theory etc) assumes that the $a_i$ do not depend on the test input $x$.
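
For readers who have not opened the paper: its result has roughly the form

$$y(x) \approx \sum_i a_i(x)\, K(x, x_i) + b(x),$$

where $K$ is the path kernel defined in the paper. Writing the coefficients as $a_i(x)$ and $b(x)$ is exactly the point above: they are not constants, so the learned function is not literally an element of the kernel's RKHS.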

Good point!

Do you know of any work that applies similar methods to study the equivalent kernel machine learned by predictive coding?

I don't, and my best guess is that nobody has done it :) The paper you linked is extremely recent.

You'd have to start by finding an ODE model of predictive coding, which I suppose is possible by taking the limit of the learning rate to 0.

Due to the need to iterate the vs until convergence, the predictive coding network had roughly a 100x greater computational cost than the backprop network.

The paper claims that predictive coding takes more compute. I agree that predictive coding ought to be more parallelizable. If you are using a GPU then backpropagation is already sufficiently parallelizable. However, it may be that neuromorphic hardware could parallelize better than a GPU, thus producing an increase in compute power that outstrips the 100x greater computational cost of the algorithm itself.

Kind of. Neuromorphic hardware doesn't buy you much benefit for generic feedforward networks, but it dramatically reduces the expense of convergence. Since the 100x in this paper derives from iterating until the network converges, a neuromorphic implementation (say on Loihi) would directly eliminate that cost.

Thanks for sharing!

Two comments:

  • There seem to be a couple of sign errors in the manuscript. (Probably worth reaching out to the authors directly)
  • Their predictive coding algorithm holds the vhat values fixed during convergence, which actually implies a somewhat different network topology than the more traditional one shown in your figure.

What are the sign errors?

Right side of equation 2. Also the v update step in algorithm 1 should have a negative sign (the text version earlier on the same page has it right).


I suspect a better title would be "Here is a proposed unification of a particular formalization of predictive coding, with backprop"

I see in the citations for this that they already have the Neural ODE paper and libraries, which means the whole pipeline also has access to all our DiffEq tricks.

In terms of timelines, this seems bad: unless there are serious flaws with this line of work, we just plugged our very best model of how the only extant general intelligences work into our fastest-growing field of capability.

I'm confused. The bottom diagram seems to also involve bidirectional flow of information.

The black circles represent neurons. The red triangles represent activations (action potentials). Action potentials' information content is shared between presynaptic neurons and postsynaptic neurons because activations are transmitted from the presynaptic neuron to the postsynaptic neuron.

The black arrows in the bottom diagram denote the physical creation of action potentials. The red arrows denote intra-neuron calculation of the gradient. Keep in mind that each neuron knows both the action potential it generates itself and the action potentials sent to it.

Oh, that makes a lot more sense. Is delta v1 hat the change of v1 rather than an infinitesimal? (Asking because if it was, then it'd be easier to understand how it is calculated.)

There is no relationship between neurons and the "neurons" of an ANN. It's just a naming mishap at this point.

Incorrect. Perceptrons are a low fidelity (but still incredibly useful!) rate-encoded model of individual neurons.

Sure, but statements like

>ANNs are built out of neurons. BNNs are built out of neurons too.

are imprecise, and possibly imprecise enough to also be incorrect, if it turns out that biological neurons do something important that perceptrons do not capture. Without spelling out the arguments and evidence for the respects in which the perceptron model is useful, it is quite easy to bake a conclusion like "this algorithm for ANNs is a good model of biology" into the assumption that "both are built out of neurons".