Brains and backprop: a key timeline crux

[-]johnswentworth8y160

Interesting fact about backprop: a supply chain of profit-maximizing, competitive companies can be viewed as implementing backprop. Obviously there's some setup here, but it's reasonably general; I'll have a long post on it at some point. This should not be very surprising: backprop is just an efficient algorithm for calculating gradients, and prices in competitive markets are basically just gradients of production functions.

Anyway, my broader point is this: backprop is just an efficient way to calculate gradients. In a distributed system (e.g. a market), it's not necessarily the most efficient gradient-calculation algorithm. What's relevant is not whether the brain uses backpropagation per se, but whether it uses gradient descent. If the brain mainly operates off of gradient descent, then we have that theoretical tool already, regardless of the details of how the brain computes the gradient.

Many of the objections listed to brain-as-backprop only apply to single-threaded, vanilla backprop, rather than gradient descent more generally.

[-]Bird Concept8y120

I'm looking forward to reading that post.

Yes, it seems right that gradient descent is the key crux. But I'm not familiar with any efficient way of doing it that the brain might implement, apart from backprop. Do you have any examples?

[-]johnswentworth8y70

Here's my preferred formulation of the general derivative problem (skip to the last paragraph if you just want the summary): you have some function $f (x)$ . We'll assume that it's been "flattened out", i.e. all the loops and recursive calls have been expanded, it's just a straight-line numerical function. Adopting hilariously bad variable names, suppose the $i$ -th line of $f$ computes $y_{i}$ . We'll also assume that the first lines of $f$ just load in $x$ , so e.g. $y_{0} = x_{0}$ . If $f$ has $n$ lines, then the output of $f$ is $y_{n}$ .

Now, we create a vector-valued function $F (y)$ , which runs each line of f in parallel: $F_{i} (y) = ($ line $i$ of $f$ evaluated at $y$ $)$ . $f (x)$ computes a fixed point $y = F (y)$ (it may take a moment of thought or an example for that part to make sense). It's that fixed point formula which we differentiate. The result: we get $x = A \frac{d y}{d x}$ , where $A$ is a very sparse triangular matrix. In fact, we don't even need to solve the whole thing - we only need $\frac{d y_{n}}{d x}$ . Backprop just uses the usual method for solving triangular matrices: start at the end and work back.

Main point: derivative calculation, in general, can be done by solving a (sparse, triangular) system of linear equations. There's a whole field devoted to solving sparse matrices, especially in parallel. Different methods work better depending on the matrix structure (which will follow the structure of the computation DAG of $f$ ), so different methods will work better for different functions. Pick your favorite sparse matrix solver, ideally one which will leverage triangularity, and boom, you have a derivative calculator.

Side note: do these comments support LaTeX? Is there a page explaining what comments do support? It doesn't seem to be markdown, no idea what we're using here.

[-]Ben Pace8y20

Side note: do these comments support LaTeX? Is there a page explaining what comments do support? It doesn't seem to be markdown, no idea what we're using here.

It is a WYSIWYG markdown editor and dollar-sign is the symbol that opens the LaTex editor (I've LaTexed your comment for you, hope that's okay).

Added: @habryka oops, double-comment!

[-]johnswentworth8y10

Ooooh, that makes much more sense now, I was confused by the auto-formatting as I typed. Thank you for taking the time to clean up my comment. Also thankyou @habryka.

Also, how do images work in posts? I was writing up a post the other day, but when I tried to paste in an image it just created a camera symbol. Alternatively, is this stuff documented somewhere?

[-]Ben Pace8y40

My transatlantic flight permitting, I’ll reply with a post tomorrow with full descriptions of how to use the editor.

[-]johnswentworth8y10

Thank you very much! I really appreciate the time you guys are putting in to this.

[-]Ben Pace8y50

You're welcome :-) Here's a mini-guide to the editor.

[-]Bird Concept8y10

The thing is now in LaTeX! Beautiful!

[-]habryka8y20

Yep, we support LaTeX and do a WYSIWYG translation of markdown as soon as you type it (I.e. words between asterisks get bolded, etc.). You can start typing LaTeX by typing $ and then a small equation editor shows up. You can also insert block-level equations by pressing CTRL+M.

[-]CronoDAS8y20

Typing $ does nothing on my iPhone.

[-]habryka8y40

Because the mobile editing experience was pretty buggy, we replaced the mobile editor with a markdown-only editor two days ago. We will activate LaTeX for that editor pretty soon (which will probably mean replacing equations between "$$" with the LaTeX rendered version), but that means LaTeX is temporarily unavailable on phones (though the previous LaTeX editor didn't really work with phones anyways, so it's mostly just a strict improvement on what we have).

[-]CronoDAS8y20

Ok, no problem; I don't really know LaTeX anyway.

[-]Daniel Kokotajlo5y40

Hello from the future! I'm interested to hear how your views have updated since this comment and post were written. 1. What is your credence that the brain learns via gradient descent? 2. What is your credence that it in fact does so in a way relevantly similar to backprop? 3. Do you still think that insofar as your credence in 1 is high, timelines are short?

[-]Bird Concept5y40

I appreciate you following up on this!

The sad and honest truth, though, is that since I wrote this post, I haven't thought about it. :( I haven't picked up on any key new piece of evidence -- though I also haven't been looking.

I could give you credences, but that would mostly just involve rereading this and loading up all the thoughts

[-]Daniel Kokotajlo5y20

Ok! Well, FWIW, it seems very likely to me that the brain learns via gradient descent, and indeed probable that it does something relevantly similar (though of course not identical to) backprop. (See the link above). But I feel very much an imposter discussing all this stuff since I lack technical expertise. I'd be interested to hear your take on this stuff sometime if you have one or want to make one! See also:

https://arxiv.org/abs/2006.04182 (Brains = predictive processing = backprop = artificial neural nets)

https://www.biorxiv.org/content/10.1101/764258v2.full (IIRC this provides support for Kaplan's view that human ability to extrapolate is really just interpolation done by a bigger brain on more and better data.)

[-]Bird Concept5y60

I'm currently on vacation, but I'd be interested in setting up a call once I'm back in 2 weeks! :) I'll send you my calendly in PM

[-]Richard_Ngo8y100

Thanks for the excellent post, Jacob. I think you might be placing too much emphasis on learning algorithms as opposed to knowledge representations, though. It seems very likely to me that at least one theoretical breakthrough in knowledge representation will be required to make significant progress (for one argument along these lines, see Pearl 2018). Even if it turns out that the brain implements backpropagation, that breakthrough will still be a bottleneck. In biological terms, I'm thinking of the knowledge representations as analogous to innate aspects of cognition impressed upon us by evolution, and learning algorithms as what an individual human uses to learn from their experiences.

Two examples which suggest that the former are more important than the latter. The first is the "poverty of stimulus" argument in linguistics: that children simply don't hear enough words to infer language from first principles. This suggests that ingrained grammatical instincts are doing most of the work in narrowing down what the sentences they hear mean. Even if we knew that the kids were doing backpropagation whenever they heard new sentences, that doesn't tell us much about how that grammatical knowledge works, because you can do backpropagation on lots of different things. (You know more psycholinguistics than I do, though, so let me know if I'm misrepresenting anything).

Second example: Hinton argues in this talk that CNNs don't create representations of three-dimensional objects from two-dimensional pictures in the same way as the human brain does; that's why he invented capsule networks, which (he claims) do use such representations. Both capsules and CNNs use backpropagation, but the architecture of capsules is meant to be an extra "secret sauce". Seeing whether they end up working well on vision tasks will be quite interesting, because vision is better-understood and easier than abstract thought (for example, it's very easy to theoretically specify how to translate between any two visual perspectives, it's just a matrix multiplication).

Lastly, as a previous commentator pointed out, it's not backpropagation but rather gradient descent which seems like the important factor. More specifically, recent research suggests that Stochastic Gradient Descent leads to particularly good outcomes, for interesting theoretical reasons (see Zhang 2017 and this blog post by Huzcar). Since the brain does online learning, if it's doing gradient descent then it's doing a variant of SGD. I discuss why SGD works well in more detail in the first section of this blog post.

[-]Qiaochu_Yuan8y80

I had a conversation with Paul where I asked him a roughly similar question, namely "how many nontrivial theoretical insights are we away from superintelligent AI, and how quickly will they get produced?" His answer was "plausibly zero or one, but also I think we haven't had a nontrivial theoretical insight since the 1980s" (this is my approximate recollection, Paul should correct me). We talked about it a bit more and he managed to lengthen my timeline, which was nice.

I had previously had a somewhat lower threshold for what constituted a nontrivial theoretical insight, a sense that there weren't very many left, and a sense that they were going to happen pretty quickly, based mostly on the progress made by AlphaGo and AlphaGo Zero. Paul gave me a stronger sense that most of the recent progress has been due to improved compute and tricks.

[-]Bird Concept8y70

In order for me to update on this it would be great to have concrete examples of what does and does not consistute "nontrivial theoretical insights" according to you and Paul.

E.g. what was the insight from the 1980s? And what part of the AG(Z) architecture did you initially consider nontrivial?

[-]paulfchristiano8y150

A more precise version of of my claim: if you gave smart grad students from 1990 access to all of the non-AI technology of 2017 (esp. software tools + hardware + data) and a big budget, it would not take them long to reach nearly state of the art performance on supervised learning and RL. For example, I think it's pretty plausible that 20 good grad students could do it in 3 years if they were motivated and reasonably well managed.

If they are allowed to query for 1 bit of advice per month (e.g. "should we explore approach X?") then I think it's more likely than not that they would succeed. The advice is obviously a huge advantage, but I don't think that it can plausibly substitute for "nontrivial theoretical insight."

There is lots of uncertainty about that operationalization, but the main question is just whether there are way too many small things to figure out and iterate on rather than whether there are big insights.

(Generative modeling involves a little bit more machinery. I don't have a strong view on whether they would figure out GANs or VAEs, though I'd guess so. Autoregressive models aren't terrible anyway.)

They certainly wouldn't come up with every trick or clever idea, but I expect they'd come up with the most important ones. With only 60 person-years they wouldn't be able to put in very much domain-specific effort for any domain, so probably wouldn't actually set SOTA, but I think they would likely get within a few years.

(I independently came up with the AGZ and GAN algorithms while writing safety posts, which I consider reasonable evidence that the ideas are natural and aren't that hard. I expect there are a large number of cases of independent invention, with credit reasonably going to whoever actually gets it working.)

I don't have as strong a view about whether this was also true in the 70s. By the late 80s, neural nets trained with backprop were a relatively prominent/popular hypothesis about how to build AGI, so you would have spent less time on alternatives. You have some simple algorithms each of which might turn out to not be obvious (like Q learning, which I think are roughly as tricky as the AGZ algorithm). You have the basic ideas for CNNs (though I haven't looked into this extensively and don't know how much of the idea was actually developed by 1990 vs. in 1998). I feel less comfortable betting on the grad students if you take all those things away. But realistically it's more like a continuous increase in probability of success rather than some insight that happened in the 80s.

If you tried to improve the grad students' performance by shipping back some critical insights, what would they be?

[-]vV_Vv8y60

Do you think that solving Starcraft (by self-play) will require some major insight or will it be just a matter of incremental improvement of existing methods?

[-]paulfchristiano8y60

I don't think it will require any new insight. It might require using slightly different algorithms---better techniques for scaling, different architectures to handle incomplete information, maybe a different training strategy to handle the very long time horizons; if they don't tie their hands it's probably also worth adding on a bunch of domain-specific junk.

[-]Bird Concept8y50

Thanks for taking the time to write that up.

I updated towards a "fox" rather than "hedgehog" view of what intelligence is: you need to get many small things right, rather than one big thing. I'll reply later if feel like I have a useful reply.

[-]Jan_Kulveit8y60

I put most weight on hypothesis there are multiple secret sauces, so even if it would come out brains use some kind of backprop, I would not expect the rest to be "just engineering". For example there is an open problem with long-term memory, which may require architectural changes, like freezing weight, adding neurons on the way,...

Btw you have likely wrong labels on scenarios

Generally good way of looking on things! Thanks

[-]Bird Concept8y10

Thanks, I'm glad you found the framing useful.

[-]habryka8y40

Significantly changed some of the formatting to make it more legible.

[-]Gordon Seidoh Worley8y10

I'm very late to the party on this post but wanted to say that I found it useful to know better how recent AI advances were made and find out they seem pretty unlikely to get us AGI soon (I'm drawing that conclusion based on additional info).

My expectation is that we won't be getting close until we can figure out how to make recurrent networks succeed at unsupervised learning since that is the nearest analogue in ML to how brains work (to the best of my knowledge).

[-]Richard_Ngo8y10

Single neurons cannot represent two distinct kind of quantities, as would be required to do backprop (the presence of features and gradients for training).

I don't understand why can't you just have some neurons which represent the former, and some neurons which represent the latter?

The drop-out algorithm (which has been very popular, though it recently seems to have been largely replaced by batch normalisation).

Do you have any particular source for dropout being replaced by batch normalisation, or is it an impression from the papers you've been reading?

[-]Bird Concept8y30

I don't understand why can't you just have some neurons which represent the former, and some neurons which represent the latter?

Because people thought you needed the same weights to 1) transport the gradients back, 2) send the activations forward. Having two distinct networks with the same topology and getting the weights to match was known as the "weight transport problem". See Grossberg, S. 1987. Competitive learning: From interactive activation to adaptive resonance. Cognitive science 11(1):23–63.

Do you have any particular source for dropout being replaced by batch normalisation, or is it an impression from the papers you've been reading?

The latter.

[-]avturchin8y10

There is a wiki page on https://en.wikipedia.org/wiki/Neural_backpropagation which claims that this effect exists. Also maybe what we experience as night dreams is backproagation through our visual cortex, which becomes at this moment something like generative neural net.

[-]Bird Concept8y20

Yes, there is a poorly understood phenomenon whereby action potentials sometimes travel back through the dendrites preceding them. This is insufficient for ML-style backprop because it rarely happens across more than one layer.

[-]norswap8y10

The premise that "human-level AI" must be built around some form on some form of learning (and the implication that learning is what needs to be improved) is highly dubious (not evidenced enough, at all, and completely at odds with my own intuitions besides).

As it is, deep learning can be seen "simply" as a way to approximate a mathematical function. In the case of computer vision, own could see it as a function that twiddles with the images' pixels and outputs a result. The genius of the approach is how relatively fast we can find a function that approximates the process of interest (compared to say classical search algorithms). A big caveat: human intuition is still required in finding the right parameters to tweak the network, but it's very conceivable that this could be improved.

Nevertheless, we don't have human-level AI here. At the very best we can hope for, we have it's pattern matching component. Which is an important component to be sure, but we still don't have an understanding of "concepts", there is no "reflection" as understood in computer science (a form of meta-programming where programming language concepts are reified and available to the programmer using the language). We need the ability to form new concepts - some of which will be patterns, but also to reason about the concepts themselves, to pattern-match on them. In short, to think about thinking. It seems like in that regard, we're still a long way.

[-]daozaich8y10

I think part of the assumption is that reflection can be bolted on trivially if the pattern matching is good enough. For example, consider guiding an SMT / automatic theorem prover by deep-learned heuristics, e.g. (https://arxiv.org/abs/1701.06972)[https://arxiv.org/abs/1701.06972] . We know how to express reflection in formal languages; we know how to train intuition for fuzzy stuff; me might learn how to train intuition for formal languages.

This is still borderline useless; but there is no reason, a priori, that such approached are doomed to fail. Especially since labels for training data are trivial (check the proof for correctness) and machine-discovered theorems / proofs can be added to the corpus.

[-]moss8y10

There has been some work lately on derivative-free optimization of ANNs (ES mostly, but I've seen some other genetic-flavored work as well). They tend to be off-policy, and I'm not sure how biologically plausible that is, but something to think about w/r/t whether current DL progress is taking the same route as biological intelligence (-> getting us closer to [super]intelligence)

[-]Bird Concept8y30

It seems very implausible to me that the brain would use evolutionary strategies, as it's not clear how humans could try a sufficiently large number of parameter settings without any option for parallelisation, or store and then choose among previous configurations.

[-]vV_Vv8y40

There is an algorithm called "Evolution strategies" popularized by OpenAI (although I believe that in some form it already existed) that can train neural networks without backpropagation and without storing multiple sets of parameters. You can view it as a population 1 genetic algorithm, but it really is a stochastic finite differences gradient estimator.

On supervised learning tasks it is not competitive with backpropagation, but on reinforcement learning tasks (where you can't analytically differentiate the reward signal so you have to estimate the gradient one way or the other) it is competitive. Some follow-up works combined it with backpropagation.

I wouldn't be surpised if the brain does something similar, since the brain never really does supervised learning, it's either unsupervised or reinforcement learning. The brain could combine local reconstruction and auto-regression learning rules (similar to the layerwise-trained autoencoders, but also trying to predict future inputs rather than just reconstructing the current ones) and finite differences gradient estimation on reward signals propagated by the the dopaminergic pathways.

[-]moss8y10

The OpenAI ES algorithm isn't very plausible (for exactly why you said), but the general idea of: "existing parameters + random noise -> revert if performance got worse, repeat" does seem like a reasonable way to end up with an approximation of the gradient. I had in mind something more like Uber AI's Neuroevolution, which wouldn't necessarily require parallelization or storage if the brain did some sort of fast local updating, parameter-wise.

LESSWRONG
LW

LESSWRONG
LW

36

Brains and backprop: a key timeline crux

36

36

The Secret Sauce Question

The importance of backpropagation

Nature of backprop: Implication for timelines

Appendix: Brief theoretical background

Objection 1:

Objection 2:

Objection 3:

Objection 4:

Discussion questions