Interesting fact about backprop: a supply chain of profit-maximizing, competitive companies can be viewed as implementing backprop. Obviously there's some setup here, but it's reasonably general; I'll have a long post on it at some point. This should not be very surprising: backprop is just an efficient algorithm for calculating gradients, and prices in competitive markets are basically just gradients of production functions.
Anyway, my broader point is this: backprop is just an efficient way to calculate gradients. In a distributed system (e.g. a market), it's not necessarily the most efficient gradient-calculation algorithm. What's relevant is not whether the brain uses backpropagation per se, but whether it uses gradient descent. If the brain mainly operates off of gradient descent, then we have that theoretical tool already, regardless of the details of how the brain computes the gradient.
Many of the objections listed to brain-as-backprop only apply to single-threaded, vanilla backprop, rather than gradient descent more generally.
I'm looking forward to reading that post.
Yes, it seems right that gradient descent is the key crux. But I'm not familiar with any efficient way of doing it that the brain might implement, apart from backprop. Do you have any examples?
Here's my preferred formulation of the general derivative problem (skip to the last paragraph if you just want the summary): you have some function $f(x)$. We'll assume that it's been "flattened out", i.e. all the loops and recursive calls have been expanded, it's just a straight-line numerical function. Adopting hilariously bad variable names, suppose the $i$-th line of $f$ computes $y_i = f_i(y_1, \dots, y_{i-1})$. We'll also assume that the first $n$ lines of $f$ just load in $x$, so e.g. $y_1 = x_1$. If $f$ has $m$ lines, then the output of $f$ is $y_m$.
Now, we create a vector-valued function $F(x, y)$, which runs each line of $f$ in parallel: line $i$ of $F$ computes $f_i$ evaluated at $(y_1, \dots, y_{i-1})$. $f$ computes a fixed point of $F$, i.e. its outputs satisfy $y = F(x, y)$ (it may take a moment of thought or an example for that part to make sense). It's that fixed point formula which we differentiate. The result: we get $(I - \frac{\partial F}{\partial y}) \frac{dy}{dx} = \frac{\partial F}{\partial x}$, where $(I - \frac{\partial F}{\partial y})$ is a very sparse triangular matrix. In fact, we don't even need to solve the whole thing - we only need the last row, $\frac{dy_m}{dx}$. Backprop just uses the usual method for solving triangular matrices: start at the end and work back.
Main point: derivative calculation, in general, can be done by solving a (sparse, triangular) system of linear equations. There's a whole field devoted to solving sparse linear systems, especially in parallel. Different methods work better depending on the matrix structure (which will follow the structure of the computation DAG of $f$), so different methods will work better for different functions. Pick your favorite sparse matrix solver, ideally one which will leverage triangularity, and boom, you have a derivative calculator.
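To make this concrete, here's a minimal Python/scipy sketch (the toy function, variable names, and solver choice are my own, not part of the formulation above): write out a straight-line function, assemble the sparse triangular system from its per-line partial derivatives, and hand it to an off-the-shelf triangular solver.

```python
import numpy as np
from scipy.sparse import identity, lil_matrix
from scipy.sparse.linalg import spsolve_triangular

# Toy straight-line function f(x1, x2):
#   y1 = x1
#   y2 = x2
#   y3 = y1 * y2
#   y4 = sin(y3)
#   y5 = y3 + y4   <- output, equal to x1*x2 + sin(x1*x2)
x = np.array([1.3, -0.7])
y = np.zeros(5)
y[0], y[1] = x
y[2] = y[0] * y[1]
y[3] = np.sin(y[2])
y[4] = y[2] + y[3]

# dF/dy: row i holds the partials of line i with respect to earlier lines
dFdy = lil_matrix((5, 5))
dFdy[2, 0] = y[1]           # d(y1*y2)/dy1
dFdy[2, 1] = y[0]           # d(y1*y2)/dy2
dFdy[3, 2] = np.cos(y[2])   # d(sin y3)/dy3
dFdy[4, 2] = 1.0            # d(y3+y4)/dy3
dFdy[4, 3] = 1.0            # d(y3+y4)/dy4

# dF/dx: only the "load" lines depend directly on x
dFdx = np.zeros((5, 2))
dFdx[0, 0] = 1.0
dFdx[1, 1] = 1.0

# Differentiating the fixed point y = F(x, y) gives the sparse,
# lower-triangular system (I - dF/dy) dy/dx = dF/dx.
A = identity(5, format="csr") - dFdy.tocsr()
dydx = spsolve_triangular(A, dFdx, lower=True)

print(dydx[-1])  # derivative of the output w.r.t. (x1, x2)
print(x[1] * (1 + np.cos(x[0] * x[1])),
      x[0] * (1 + np.cos(x[0] * x[1])))  # analytic check
```

Backprop corresponds to solving the transposed system starting from the last row and working back; any other sparse triangular solver, including parallel ones, recovers the same derivative.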
Side note: do these comments support LaTeX? Is there a page explaining what comments do support? It doesn't seem to be markdown, no idea what we're using here.
It is a WYSIWYG markdown editor, and the dollar sign is the symbol that opens the LaTeX editor (I've LaTeXed your comment for you, hope that's okay).
Added: @habryka oops, double-comment!
Ooooh, that makes much more sense now, I was confused by the auto-formatting as I typed. Thank you for taking the time to clean up my comment. Also thank you @habryka.
Also, how do images work in posts? I was writing up a post the other day, but when I tried to paste in an image it just created a camera symbol. Alternatively, is this stuff documented somewhere?
My transatlantic flight permitting, I’ll reply with a post tomorrow with full descriptions of how to use the editor.
[Crossposted from my blog]
The Secret Sauce Question
Human brains still outperform deep learning algorithms in a wide variety of tasks, such as playing soccer or knowing that it’s a bad idea to drive off a cliff without having to try first (for more formal examples, see Lake et al., 2017; Hinton, 2017; LeCun, 2018; Irpan, 2018). This fact can be taken as evidence for two different hypotheses:

1. Deep learning is still missing some key insight: human cognition relies on a “secret sauce” that current methods lack, and finding it will require a deep conceptual advance.

2. Deep learning is essentially on the right track, and the remaining gap will be closed by more compute, more data, and incremental improvements to current techniques.
The question of which of these views is right I call “the secret sauce question”.
The secret sauce question seems like one of the most important considerations in estimating how long there is left until the development of human-level artificial intelligence (“timelines”). If something like 2) is true, timelines are arguably substantially shorter than if something like 1) is true [1].
However, it seems initially difficult to arbitrate between these two vague, high-level views. It appears as though an answer requires complicated inside views stemming from deep and wide knowledge of current technical AI research. This is partly true. Yet this post proposes that there might also be a single, concrete discovery capable of settling the secret sauce question: does the human brain learn using gradient descent, by implementing backpropagation?
The importance of backpropagation
Underlying the success of modern deep learning is a single algorithm: gradient descent with backpropagation of error (LeCun et al., 2015). In fact, the majority of research is not focused on finding better algorithms, but rather on finding better cost functions to descend using this algorithm (Marblestone et al., 2016). Yet, in stark contrast to this success, since the 1980’s the key objection of neuroscientists to deep learning has been that backpropagation is not biologically plausible (Crick, 1989; Stork, 1989).
As a result, the question of whether the brain implements backpropagation provides critical evidence on the secret sauce problem. If the brain does not use it, and still outperforms deep learning while running on the energy of a laptop and training on several orders of magnitude fewer training examples than parameters, this suggests that a deep conceptual advance is necessary to build human-level artificial intelligence. There’s some other remarkable algorithm out there, and evolution found it. But if the brain does use backprop, then the reason deep learning works so well is because it’s somehow on the right track. Human researchers and evolution converged on a common solution to the problem of optimising large networks of neuron-like units. (These arguments assume that if a solution is biologically plausible and the best solution available, then it would have evolved).
Actually, the situation is a bit more nuanced than this, and I think it can be clarified by distinguishing between algorithms that are:
Biologically actual: What the brain actually does.
Biologically plausible: What the brain might have done, while still being restricted by evolutionary selection pressure towards energy efficiency etc.
For example, humans walk with legs, but it seems possible that evolution might have given us wings or fins instead, as those solutions work for other animals. However, evolution could not have given us wheels, as that requires a separable axle and wheel, and it's unclear what an evolutionary path to an organism with two separable parts looks like (excluding symbiotic relationships).
Biologically possible: What is technically possible to do with collections of cells, regardless of its relative evolutionary advantage.
For example, even though evolving wheels is implausible, there might be no inherent problem with an organism having wheels (created by "God", say), in the way in which there's an inherent problem with an organism’s axons sending action potentials faster than the speed of light.
I think this leads to the following conclusions:
| Nature of backprop | Implication for timelines |
| --- | --- |
| Biologically impossible | Unclear; there might be multiple “secret sauces” |
| Biologically possible, but not plausible | Same as above |
| Biologically plausible, but not actual | Timelines are long; there’s likely a “secret sauce” |
| Biologically actual | Timelines are short; there’s likely no “secret sauce” |
In cases where evolution could not have invented backprop anyway, the comparison is uninformative: that outcome is consistent both with backprop not being the right way to go and with backprop being better than whatever evolution did.
It might be objected that this question doesn’t really matter: if neuroscientists found out that the brain does backprop, they would not thereby have created any new algorithm, but merely provided stronger evidence for the workability of existing algorithms. Deep learning researchers wouldn’t find this any more useful than Usain Bolt would find it useful to know that his starting stance during the sprint countdown is optimal: he’s been using it for years anyway, and is mostly just eager to get back to the gym.
However, this argument seems mistaken.
On the one hand, just because it’s not useful to deep learning practitioners does not mean it’s not useful to others trying to estimate the timelines of technological development (such as policy-makers or charitable foundations).
On the other hand, I think this knowledge is very practically useful for deep learning practitioners. According to my current models, the field seems unique in combining the following features:
This is an environment where it is critically important to develop strong priors on what should work, and to stick with those in the face of countless fruitless tests. Indeed, LeCun, Hinton and Bengio seem to have persevered for decades before the AI community stopped thinking they were crazy. (This is similar in some interesting ways to the state of astronomy and physics before Newton. I’ve blogged about this before here.) There’s an asymmetry such that even though training a very powerful architecture can be quick (on the order of a GPU-day), iterating over architectures to figure out which ones to train fully in the first place can be incredibly costly. As such, knowing whether gradient descent with backprop is or is not the way to go would enable more efficient allocation of research time (though mostly so in case backprop is not the way to go, as the majority of current researchers assume it anyway).
Appendix: Brief theoretical background
This section describes what backpropagation is, why neuroscientists have claimed it is implausible, and why some deep learning researchers think those neuroscientists are wrong. The latter arguments are basically summarised from this talk by Hinton.
Multi-layer networks with access to an error signal face the so-called “credit assignment problem”. The error of the computation will only be available at the output: a child pronouncing a word erroneously, a rodent tasting an unexpectedly nauseating liquid, a monkey mistaking a stick for a snake. However, in order for the network to improve its representations and avoid making the same mistake in the future, it has to know which representations to “blame” for the mistake. Is the monkey too prone to think long things are snakes? Or is it bad at discriminating the textures of wood and skin? Or is it bad at telling eyes from eye-sized bumps? And so forth. This problem is exacerbated by the fact that neural network models often have tens or hundreds of thousands of parameters, not to mention the human brain, which is estimated to have on the order of $10^{14}$ synapses. Backpropagation proposes to solve this problem by observing that the maths of gradient descent work out such that one can essentially send the error signal from the output, back through the network towards the input, modulating it by the strength of the connections along the way. (A complementary perspective on backprop is that it is just an efficient way of computing derivatives in large computational graphs, see e.g. Olah, 2015).
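As a toy illustration of that last point (a minimal numpy sketch of my own, not taken from any of the cited papers): in a two-layer network the error is only visible at the output, and the backward pass assigns blame to each weight by sending that error back through the connections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples, 3 input features, scalar target
X = rng.normal(size=(100, 3))
t = np.sin(X.sum(axis=1, keepdims=True))

# Two-layer network: 3 -> 8 -> 1, tanh hidden units
W1 = rng.normal(scale=0.5, size=(3, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))

for step in range(500):
    # Forward pass
    h = np.tanh(X @ W1)          # hidden activations
    y = h @ W2                   # network output
    err = y - t                  # error is only visible at the output

    # Backward pass: send the error back through the weights,
    # modulating it by the connection strengths along the way
    dW2 = h.T @ err / len(X)
    dh = err @ W2.T              # how much each hidden unit is "to blame"
    dW1 = X.T @ (dh * (1 - h**2)) / len(X)

    # Gradient descent step
    W1 -= 0.1 * dW1
    W2 -= 0.1 * dW2

print(float(np.mean(err**2)))    # squared error falls as training proceeds
```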
Now why do some neuroscientists have a problem with this?
Objection 1:
Most learning in the brain is unsupervised, without any error signal similar to those used in supervised learning.
Hinton's reply:
There are at least three ways of doing backpropagation without an external supervision signal:
1. Try to reconstruct the original input (using e.g. auto-encoders), and thereby develop representations sensitive to the statistics of the input domain (see the sketch after this list)
2. Use the broader context of the input to train local features
For example, in the sentence “She scromed him with the frying pan”, we can infer that the sentence as a whole doesn’t sound very pleasant, and use that to update our representation of the novel word “scrom”
3. Learn a generative model that assigns high probability to the input (e.g. using variational auto-encoders or the wake-sleep algorithm from the 1990’s)
Bengio and colleagues (2017) have also done interesting work on this, partly reviving energy-minimising Hopfield networks from the 1980’s
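As a toy illustration of the first strategy (my own numpy sketch, not from Hinton's talk): an auto-encoder gets its "error signal" from the mismatch between the input and its own reconstruction, so backprop can run without any external labels.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabelled data: 200 examples living near a 2-D subspace of R^6
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))

# Linear auto-encoder: encode 6 -> 2, decode 2 -> 6
W_enc = rng.normal(scale=0.1, size=(6, 2))
W_dec = rng.normal(scale=0.1, size=(2, 6))

for step in range(3000):
    code = X @ W_enc                  # compressed representation
    recon = code @ W_dec              # attempt to reconstruct the input
    err = recon - X                   # "supervision" comes from the input itself

    # Backprop of the reconstruction error
    dW_dec = code.T @ err / len(X)
    dW_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= 0.05 * dW_dec
    W_enc -= 0.05 * dW_enc

print(float(np.mean(err**2)))         # reconstruction error shrinks with training
```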
Objection 2:
Neurons communicate using binary spikes, rather than real values (this was among the earliest objections to backprop).
Hinton's reply:
First, one can just send spikes stochastically and use the expected spike rate (e.g. with a Poisson rate, which is somewhat close to what real neurons do, although there are important differences; see e.g. Ma et al., 2006; Pouget et al., 2003).
Second, this might make evolutionary sense, as the stochasticity acts as a regularising mechanism, making the network less prone to overfitting. This behaviour is in fact where Hinton got the idea for the drop-out algorithm (which has been very popular, though it recently seems to have been largely replaced by batch normalisation).
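A rough sketch of the first point (my own illustration, with made-up variable names): the forward pass transmits binary spikes, while gradients are taken with respect to the unit's expected firing rate.

```python
import numpy as np

rng = np.random.default_rng(2)

# One layer of stochastic binary units: each unit fires a spike (1) with
# probability sigmoid(w . x), otherwise stays silent (0).
x = rng.normal(size=(32, 10))            # a batch of inputs
W = rng.normal(scale=0.3, size=(10, 5))
p = 1.0 / (1.0 + np.exp(-(x @ W)))       # expected spike rate of each unit

spikes = (rng.random(p.shape) < p).astype(float)   # what actually gets sent

# Averaged over many trials, the spikes transmit the underlying rate:
many = (rng.random((1000,) + p.shape) < p).mean(axis=0)
print(np.abs(many - p).max())            # small: noisy spikes ~ real-valued rate

# For learning, the backward pass pretends the unit sent its rate p, so the
# ordinary sigmoid gradient applies (upstream_grad is a stand-in for whatever
# error signal arrives from the layer above).
upstream_grad = rng.normal(size=p.shape)
dW = x.T @ (upstream_grad * p * (1 - p)) / len(x)
```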
Objection 3:
Single neurons cannot represent two distinct kinds of quantities, as would be required to do backprop (the presence of features and gradients for training).
Hinton's reply:
This is in fact possible. One can use the temporal derivative of the neuronal activity to represent gradients.
(There is interesting neuropsychological evidence supporting the idea that the temporal derivative of a neuron's activity is not used to represent changes in the feature that neuron codes for, and that different populations of neurons are required to represent the presence of a feature and the change of that feature. Patients with certain kinds of brain damage seem able to recognise that a moving car occupies different locations at two points in time, without ever being able to detect the car changing position.)
Objection 4:
Cortical connections only transmit information in one direction (from soma to synapse), and the kinds of backprojections that exist are far from the perfectly symmetric ones used for backprop.
Hinton's reply:
This led him to abandon the idea that the brain could do backpropagation for a decade, until “a miracle appeared”. Lillicrap and colleagues at DeepMind (2016) found that a network propagating gradients back through random and fixed feedback weights in the hidden layer can match the performance of one using ordinary backprop, given a mechanism for normalization and under the assumption that the weights preserve the sign of the gradients. This is a remarkable and surprising result, and indicates that backprop is still poorly understood. (See also follow-up work by Liao et al., 2016).
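For concreteness, here's a small numpy sketch of the feedback-alignment idea (my own toy version, loosely following the setup described in Lillicrap et al., 2016, not their code): the backward pass replaces the transposed forward weights with a fixed random matrix, and the network still learns.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression task
X = rng.normal(size=(200, 4))
t = np.tanh(X @ rng.normal(size=(4, 1)))

W1 = rng.normal(scale=0.3, size=(4, 16))
W2 = rng.normal(scale=0.3, size=(16, 1))
B = rng.normal(scale=0.3, size=(1, 16))   # fixed, random feedback weights

for step in range(2000):
    h = np.tanh(X @ W1)
    y = h @ W2
    err = y - t

    dW2 = h.T @ err / len(X)
    # Ordinary backprop would send the error back through W2.T here;
    # feedback alignment sends it through the fixed random matrix B instead.
    dh = err @ B
    dW1 = X.T @ (dh * (1 - h ** 2)) / len(X)

    W1 -= 0.2 * dW1
    W2 -= 0.2 * dW2

print(float(np.mean(err ** 2)))  # the error typically still falls, despite
                                 # the "wrong" (random) backward weights
```

Over training, the forward weights tend to align with the fixed feedback weights, which seems to be a large part of why this works at all.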
[1] One possible argument for this is that, in most plausible worlds, if 1) is true and conceptual advances are necessary, then building superintelligence will still turn into an engineering problem once those advances have been made. Hence 1) requires strictly more resources than 2).
Discussion questions
I'd encourage discussion on:
Whether the brain does backprop (object-level discussion of the work of Lillicrap, Hinton, Bengio, Liao and others)
Whether it's actually important for the secret sauce question to know whether the brain does backprop
To keep things focused and manageable, it seems reasonable to discourage discussion of what other secret sauces there might be.