Previously in series: The Nature of Logic

People who don't work in AI, who hear that I work in AI, often ask me:  "Do you build neural networks or expert systems?"  This is said in much the same tones as "Are you a good witch or a bad witch?"

Now that's what I call successful marketing.

Yesterday I covered what I see when I look at "logic" as an AI technique.  I see something with a particular shape, a particular power, and a well-defined domain of useful application where cognition is concerned.  Logic is good for leaping from crisp real-world events to compact general laws, and then verifying that a given manipulation of the laws preserves truth.  It isn't even remotely close to the whole, or the center, of a mathematical outlook on cognition.

But for a long time, years and years, there was a tremendous focus in Artificial Intelligence on what I call "suggestively named LISP tokens" - a misuse of logic to try to handle cases like "Socrates is human, all humans are mortal, therefore Socrates is mortal".  For many researchers, this one small element of math was indeed their universe.

And then along came the amazing revolution, the new AI, namely connectionism.

In the beginning (1957) was Rosenblatt's Perceptron.  It was, I believe, billed as being inspired by the brain's biological neurons.  The Perceptron had exactly two layers, a set of input units, and a single binary output unit.  You multiplied the inputs by the weightings on those units, added up the results, and took the sign: that was the classification.  To learn from the training data, you checked the current classification on an input, and if it was wrong, you dropped a delta on all the weights to nudge the classification in the right direction.
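The rule described above fits in a few lines. This is my own minimal sketch, not Rosenblatt's original formulation; the function name and parameters are invented for illustration. It trains on AND, which is linearly separable, so the rule converges.

```python
# A minimal sketch of the perceptron learning rule described above:
# weighted sum, take the sign, nudge the weights when the answer is wrong.

def train_perceptron(data, epochs=20, delta=0.1):
    """data: list of (inputs, target) pairs with target in {-1, +1}."""
    n = len(data[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, target in data:
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            out = 1 if s > 0 else -1
            if out != target:  # wrong: drop a delta on all the weights
                w = [wi + delta * target * xi for wi, xi in zip(w, x)]
                b += delta * target
    return w, b

# AND on {-1, +1} inputs is linearly separable, so this converges.
and_data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w, b = train_perceptron(and_data)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
               for x, _ in and_data]
# predictions matches the targets: [-1, -1, -1, 1]
```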

The Perceptron could only learn to deal with training data that was linearly separable - points in a hyperspace that could be cleanly separated by a hyperplane.

And that was all that this amazing algorithm, "inspired by the brain", could do.

In 1969, Marvin Minsky and Seymour Papert pointed out that Perceptrons couldn't learn the XOR function because it wasn't linearly separable.  This killed off research in neural networks for the next ten years.

Now, you might think to yourself:  "Hey, what if you had more than two layers in a neural network?  Maybe then it could learn the XOR function?"

Well, but if you know a bit of linear algebra, you'll realize that if the units in your neural network have outputs that are linear functions of input, then any number of hidden layers is going to behave the same way as a single layer - you'll only be able to learn functions that are linearly separable.
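The linear-algebra point is easy to check numerically. In this sketch (my example, with made-up weight matrices), running an input through two linear layers gives exactly the same answer as one layer whose weight matrix is the product of the two:

```python
# Stacking linear layers buys nothing: the composition of two linear
# maps is itself a single linear map.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def apply(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W1 = [[1.0, 2.0], [3.0, 4.0]]    # first "hidden layer" weights
W2 = [[0.5, -1.0], [2.0, 0.0]]   # second layer weights
x = [1.0, -2.0]

two_layers = apply(W2, apply(W1, x))   # run through both layers
one_layer = apply(matmul(W2, W1), x)   # equivalent single layer
# both give [3.5, -6.0] - so the stack is still only a linear classifier
```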

Okay, so what if you had hidden layers and the outputs weren't linear functions of the input?

But you see - no one had any idea how to train a neural network like that.  Cuz, like, then this weight would affect that output and that other output too, nonlinearly, so how were you supposed to figure out how to nudge the weights in the right direction?

Just make random changes to the network and see if it did any better?  You may be underestimating how much computing power it takes to do things the truly stupid way.  It wasn't a popular line of research.

Then along came this brilliant idea, called "backpropagation":

You handed the network a training input.  The network classified it incorrectly.  So you took the partial derivative of the output error (in layer N) with respect to each of the individual nodes in the preceding layer (N - 1).  Then you could calculate the partial derivative of the output error with respect to any single weight or bias in the layer N - 1.  And you could also go ahead and calculate the partial derivative of the output error with respect to each node in the layer N - 2.  So you did layer N - 2, and then N - 3, and so on back to the input layer.  (Though backprop nets usually had a grand total of 3 layers.)  Then you just nudged the whole network a delta - that is, nudged each weight or bias by delta times its partial derivative with respect to the output error.

It says a lot about the nonobvious difficulty of doing math that it took years to come up with this algorithm.

I find it difficult to put into words just how obvious this is in retrospect.  You're just taking a system whose behavior is a differentiable function of continuous parameters, and sliding the whole thing down the slope of the error function.  There are much more clever ways to train neural nets, taking into account more than the first derivative, e.g. conjugate gradient optimization, and these take some effort to understand even if you know calculus.  But backpropagation is ridiculously simple.  Take the network, take the partial derivative of the error function with respect to each weight in the network, slide it down the slope.
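The chain-rule bookkeeping can be written out explicitly for a tiny network. This is my own sketch of a 2-2-1 sigmoid net (invented weights, one training example), computing the backprop gradient for a single first-layer weight and checking it against a finite-difference estimate of the same slope:

```python
# Backprop as the chain rule: partial derivative of the output error
# with respect to one weight in the first layer of a 2-2-1 sigmoid net.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    h = [sigmoid(w["wh"][j][0] * x[0] + w["wh"][j][1] * x[1] + w["bh"][j])
         for j in range(2)]
    y = sigmoid(w["wo"][0] * h[0] + w["wo"][1] * h[1] + w["bo"])
    return h, y

def error(w, x, target):
    _, y = forward(w, x)
    return 0.5 * (y - target) ** 2

def backprop_grad_wh00(w, x, target):
    # Chain rule, layer by layer: output error -> output unit ->
    # hidden unit 0 -> hidden unit 0's first input weight.
    h, y = forward(w, x)
    d_out = (y - target) * y * (1 - y)             # dE/d(output pre-activation)
    d_h0 = d_out * w["wo"][0] * h[0] * (1 - h[0])  # back through hidden unit 0
    return d_h0 * x[0]

w = {"wh": [[0.3, -0.2], [0.5, 0.1]], "bh": [0.1, -0.1],
     "wo": [0.7, -0.4], "bo": 0.2}
x, target = (1.0, 0.0), 1.0

analytic = backprop_grad_wh00(w, x, target)
eps = 1e-6  # central finite difference as a sanity check
w_plus = {**w, "wh": [[0.3 + eps, -0.2], [0.5, 0.1]]}
w_minus = {**w, "wh": [[0.3 - eps, -0.2], [0.5, 0.1]]}
numeric = (error(w_plus, x, target) - error(w_minus, x, target)) / (2 * eps)
# analytic and numeric agree to many decimal places
```

Sliding every weight down its own such slope, a delta at a time, is all backpropagation is.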

If I didn't know the history of connectionism, and I didn't know scientific history in general - if I had needed to guess without benefit of hindsight how long it ought to take to go from Perceptrons to backpropagation - then I would probably say something like:  "Maybe a couple of hours?  Lower bound, five minutes - upper bound, three days."

"Seventeen years" would have floored me.

And I know that backpropagation may be slightly less obvious if you don't have the idea of "gradient descent" as a standard optimization technique bopping around in your head.  I know that these were smart people, and I'm doing the equivalent of complaining that Newton only invented first-year undergraduate stuff, etc.

So I'm just mentioning this little historical note about the timescale of mathematical progress, to emphasize that all the people who say "AI is 30 years away so we don't need to worry about Friendliness theory yet" have moldy jello in their skulls.

(Which I suspect is part of a general syndrome where people's picture of Science comes from reading press releases that announce important discoveries, so that they're like, "Really?  You do science?  What kind of important discoveries do you announce?"  Apparently, in their world, when AI finally is "visibly imminent", someone just needs to issue a press release to announce the completion of Friendly AI theory.)

Backpropagation is not just clever; much more importantly, it turns out to work well in real life on a wide class of tractable problems.  Not all "neural network" algorithms use backprop, but if you said, "networks of connected units with continuous parameters and differentiable behavior which learn by traveling up a performance gradient", you would cover a pretty large swathe.

But the real cleverness is in how neural networks were marketed.

They left out the math.

To me, at least, it seems that a backprop neural network involves substantially deeper mathematical ideas than "Socrates is human, all humans are mortal, Socrates is mortal".  Newton versus Aristotle.  I would even say that a neural network is more analyzable - since it does more real cognitive labor on board a computer chip where I can actually look at it, rather than relying on inscrutable human operators who type "|- Human(Socrates)" into the keyboard under God knows what circumstances.

But neural networks were not marketed as cleverer math.  Instead they were marketed as a revolt against Spock.

No, more than that - the neural network was the new champion of the Other Side of the Force - the antihero of a Manichaean conflict between Law and Chaos.  And all good researchers and true were called to fight on the side of Chaos, to overthrow the corrupt Authority and its Order.  To champion Freedom and Individuality against Control and Uniformity.  To Decentralize instead of Centralize, substitute Empirical Testing for mere Proof, and replace Rigidity with Flexibility.

I suppose a grand conflict between Law and Chaos beats trying to explain calculus in a press release.

But the thing is, a neural network isn't an avatar of Chaos any more than an expert system is an avatar of Law.

It's just... you know... a system with continuous parameters and differentiable behavior traveling up a performance gradient.

And logic is a great way of verifying truth preservation by syntactic manipulation of compact generalizations that are true in crisp models.  That's it.  That's all.  This kind of logical AI is not the avatar of Math, Reason, or Law.

Both algorithms do what they do, and are what they are; nothing more.

But the successful marketing campaign said,

"The failure of logical systems to produce real AI has shown that intelligence isn't logical.  Top-down design doesn't work; we need bottom-up techniques, like neural networks."

And this is what I call the Lemon Glazing Fallacy, which generates an argument for a fully arbitrary New Idea in AI using the following template:

  • Major premise:  All previous AI efforts failed to yield true intelligence.
  • Minor premise:  All previous AIs were built without delicious lemon glazing.
  • Conclusion:  If we build AIs with delicious lemon glazing, they will work.

This only has the appearance of plausibility if you present a Grand Dichotomy.  It doesn't do to say "AI Technique #283 has failed for years to produce general intelligence - that's why you need to adopt my new AI Technique #420."  Someone might ask, "Well, that's very nice, but what about AI technique #59,832?"

No, you've got to make 420 and ¬420 into the whole universe - allow only these two possibilities - put them on opposite sides of the Force - so that ten thousand failed attempts to build AI are actually arguing for your own success.  All those failures are weighing down the other side of the scales, pushing up your own side... right?  (In Star Wars, the Force has at least one Side that does seem pretty Dark.  But who says the Jedi are the Light Side just because they're not Sith?)

Ten thousand failures don't tell you what will work.  They don't even say what should not be part of a successful AI system.  Reversed stupidity is not intelligence.

If you remove the power cord from your computer, it will stop working.  You can't thereby conclude that everything about the current system is wrong, and an optimal computer should not have an Intel processor or Nvidia video card or case fans or run on electricity.  Even though your current system has these properties, and it doesn't work.

As it so happens, I do believe that the type of systems usually termed GOFAI will not yield general intelligence, even if you run them on a computer the size of the moon.  But this opinion follows from my own view of intelligence.  It does not follow, even as suggestive evidence, from the historical fact that a thousand systems built using Prolog did not yield general intelligence.  So far as the logical sequitur goes, one might as well say that Silicon-Based AI has shown itself deficient, and we must try to build transistors out of carbon atoms instead.

Not to mention that neural networks have also been "failing" (i.e., not yet succeeding) to produce real AI for 30 years now.  I don't think this particular raw fact licenses any conclusions in particular.  But at least don't tell me it's still the new revolutionary idea in AI.

This is the original example I used when I talked about the "Outside the Box" box - people think of "amazing new AI idea" and return their first cache hit, which is "neural networks" due to a successful marketing campaign thirty goddamned years ago.  I mean, not every old idea is bad - but to still be marketing it as the new defiant revolution?  Give me a break.

And pity the poor souls who try to think outside the "outside the box" box - outside the ordinary bounds of logical AI vs. connectionist AI - and, after mighty strains, propose a hybrid system that includes both logical and neural-net components.

It goes to show that compromise is not always the path to optimality - though it may sound Deeply Wise to say that the universe must balance between Law and Chaos.

Where do Bayesian networks fit into this dichotomy?  They're parallel, asynchronous, decentralized, distributed, probabilistic.  And they can be proven correct from the axioms of probability theory.  You can preprogram them, or learn them from a corpus of unsupervised data - using, in some cases, formally correct Bayesian updating.  They can reason based on incomplete evidence.  Loopy Bayes nets, rather than computing the correct probability estimate, might compute an approximation using Monte Carlo - but the approximation provably converges - but we don't run long enough to converge...
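Here is a two-node example of the "provably correct from the axioms of probability theory" part. It's my own illustration, not from the post: exact Bayesian updating on the smallest possible network, Disease -> Test, with made-up conditional probabilities.

```python
# Exact inference in a two-node Bayes net: Disease -> Test.
# Nothing "logical" or "neural" about it - just the product rule
# and Bayes' theorem, applied mechanically.

p_disease = 0.01              # prior P(disease)   (invented numbers)
p_pos_given_disease = 0.95    # P(positive | disease)
p_pos_given_healthy = 0.05    # P(positive | healthy)

# Marginalize to get P(positive), then condition on the evidence.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
posterior = p_pos_given_disease * p_disease / p_pos
# posterior P(disease | positive) comes out around 0.16 - the base
# rate dominates, which no amount of Law-vs-Chaos rhetoric predicts.
```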

Where does that fit on the axis that runs from logical AI to neural networks?  And the answer is that it doesn't.  It doesn't fit.

It's not that Bayesian networks "combine the advantages of logic and neural nets".  They're simply a different point in the space of algorithms, with different properties.

At the inaugural seminar of Redwood Neuroscience, I once saw a presentation describing a robot that started out walking on legs, and learned to run... in real time, over the course of around a minute.  The robot was stabilized in the Z axis, but it was still pretty darned impressive.  (When first exhibited, someone apparently stood up and said "You sped up that video, didn't you?" because they couldn't believe it.)

This robot ran on a "neural network" built by detailed study of biology.  The network had twenty neurons or so.  Each neuron had a separate name and its own equation.  And believe me, the robot's builders knew how that network worked.

Where does that fit into the grand dichotomy?  Is it top-down?  Is it bottom-up?  Calling it "parallel" or "distributed" seems like kind of a silly waste when you've only got 20 neurons - who's going to bother multithreading that?

This is what a real biologically inspired system looks like.  And let me say again, that video of the running robot would have been damned impressive even if it hadn't been done using only twenty neurons.  But that biological network didn't much resemble - at all, really - the artificial neural nets that are built using abstract understanding of gradient optimization, like backprop.

That network of 20 neurons, each with its own equation, built and understood from careful study of biology - where does it fit into the Manichaean conflict?  It doesn't.  It's just a different point in AIspace.

At a conference yesterday, I spoke to someone who thought that Google's translation algorithm was a triumph of Chaotic-aligned AI, because none of the people on the translation team spoke Arabic and yet they built an Arabic translator using a massive corpus of data.  And I said that, while I wasn't familiar in detail with Google's translator, the little I knew about it led me to believe that they were using well-understood algorithms - Bayesian ones, in fact - and that if no one on the translation team knew any Arabic, this was no more significant than Deep Blue's programmers playing poor chess.

Since Peter Norvig also happened to be at the conference, I asked him about it, and Norvig said that they started out doing an actual Bayesian calculation, but then took a couple of steps away.  I remarked, "Well, you probably weren't doing the real Bayesian calculation anyway - assuming conditional independence where it doesn't exist, and stuff", and Norvig said, "Yes, so we've already established what kind of algorithm it is, and now we're just haggling over the price."

Where does that fit into the axis of logical AI and neural nets?  It doesn't even talk to that axis.  It's just a different point in the design space.

The grand dichotomy is a lie - which is to say, a highly successful marketing campaign which managed to position two particular fragments of optimization as the Dark Side and Light Side of the Force.


It was necessary for people doing AI to disassociate themselves from previous attempts at doing AI to get funding (see the various AI winters), as it came into disrepute for promising too much. Hence terms like GOFAI and the connectionist/logical dichotomy.

You are lucky not to be on that treadmill. Sadly nowadays you have to market your speculative research to be successful.

"So I'm just mentioning this little historical note about the timescale of mathematical progress, to emphasize that all the people who say "AI is 30 years away so we don't need to worry about Friendliness theory yet" have moldy jello in their skulls."

It took 17 years to go from perceptrons to back propagation...

... therefore I have moldy Jell-O in my skull for saying we won't go from manually debugging buffer overruns to superintelligent AI within 30 years...

Eliezer, your logic circuits need debugging ;-)

(Unless the comment was directed at, not claims of "not less than 30 years", but specific claims of "30 years, neither more nor less" -- in which case I have no disagreement.)

I'd be interested in an essay about "the nonobvious difficulty of doing math".

Russell, I think the point is we can't expect Friendliness theory to take less than 30 years.

"Russell, I think the point is we can't expect Friendliness theory to take less than 30 years."

If so, then fair enough -- I certainly don't claim it will take less.

> It took 17 years to go from perceptrons to back propagation...
>
> ... therefore I have moldy Jell-O in my skull for saying we won't go from manually debugging buffer overruns to superintelligent AI within 30 years...

If you'd asked me in 1995 how many people it would take for the world to develop a fast, distributed system for moving films and TV episodes to people's homes on a 'when you want it, how you want it' basis, internationally, without ads, I'd have said hundreds of thousands. In practice it took one guy with the right algorithm, depending on whether you pick Napster or BitTorrent as the magic that solves the problem without the need for any new physical technologies.

The thing about self-improving AI, is that we only need to get the algorithm right (or wrong :-() once.

We know with probability 1 it's possible to create self-improving intelligence. After all, that's what most humans are. No doubt other solutions exist. If we can find an algorithm or heuristic to implement any one of these solutions, or if we can even find any predecessor of any one of them, then we're off - and given the right approach (be that algorithm, machine, heuristic, or whatever) it should be simply a matter of throwing computer power (or Moore's law) at it to speed up the rate of self-improvement. Heck, for all I know it could be a giant genetically engineered brain in a jar that cracks the problem.

Put it this way. Imagine you are a parasite. For x billion years you're happy, then some organism comes up with sexual reproduction and suddenly it's a nightmare. But eventually you catch up again. Then suddenly, in just 100 years, human society basically eradicates you completely out of the blue. The first 50 years of that century are bad. The next 20 are hideous. The next 10 are awful. The next 5 are disastrous... etc.

Similarly, useful power-plant-scale nuclear fusion has always been 30 years away. But at some point, I suspect it will suddenly be only 2 years away, completely out of the blue....

It took 17 years to go from perceptrons to back propagation...

... therefore I have moldy Jell-O in my skull for saying we won't go from manually debugging buffer overruns to superintelligent AI within 30 years...

I think EY is failing to take into account the exponential growth of AI researchers, their access to information, their ability to communicate, and the computation and algorithmic power they have at their disposal today.

I don't think the solution to a similar problem would take 17 years today.

Of course, a superintelligent AI is a harder problem than back propagation, and I doubt that it's a comparable problem anyway. I don't expect some equation tweaking a singular known algorithm to do the trick. I suspect it's more of a systems integration problem. Brains are complex systems of functional units which have evolved organically over time.

"If you'd asked me in 1995 how many people it would take for the world to develop a fast, distributed system for moving films and TV episodes to people's homes on an 'when you want it, how you want it' basis, internationally, without ads, I'd have said hundreds of thousands."

And you'd have been right. (Ever try running Bit Torrent on a 9600 bps modem? Me neither. There's a reason for that.)

And you'd have been right. (Ever try running Bit Torrent on a 9600 bps modem? Me neither. There's a reason for that.)

Not sure I see your point. All the high speed connections were built long before bittorrent came along, and they were being used for idiotic point-to-point centralised transfers.

All that potential was achieving not much, before the existence of the right algorithm or approach to exploit it. I suspect a strong analogy here with future AI.

"Not sure I see your point. All the high speed connections were built long before bittorrent came along, and they were being used for idiotic point-to-point centralised transfers."

No they weren't. The days of Napster and Bit Torrent were, by no coincidence, also the days when Internet speed was in the process of ramping up enough to make them useful.

But of course, the reason we all heard of Napster wasn't that it was the first peer-to-peer data sharing system. On the contrary, we heard of it because it came so late that by the time it arrived, the infrastructure to make it useful was actually being built. Ever heard of UUCP? Few have. That's because in its day -- the 70s and 80s -- the infrastructure was by and large not there yet.

A clever algorithm, or even a clever implementation thereof, is only one small piece of a real-world solution. If we want to build useful AGI systems -- or so much as a useful Sunday market stall -- our plans must be built around that fact.

On the one hand, Eliezer is right in terms of historical and technical specifics.

On the other hand, neural networks for many are a metonym for continuous computations vs. the discrete computations of logic. This was my reaction when the two PDP volumes came out in the 80s. It wasn't "Here's the Way." It was "Here's an example of how to do things differently that will certainly work better."

Note also that the GOFAI folks were not trying to use just one point in logic space. In the 70s we already knew that monotonic logic was not good enough (due to the frame problem among other things) so there was an active exploration of different types of non-monotonic logic. That's in addition to all the modal logics, etc.

So the dichotomy Eliezer refers to should be viewed as more of a hyperplane separator in intelligence model space. From that point of view I think it is fairly valid -- the subspace of logical approaches is pretty separate from the subspace of continuous approaches, though Detlef and maybe others have shown you can build bridges.

The two approaches were even more separate culturally at the time. AI researchers didn't learn or use continuous mathematics, and didn't want to see it in their papers. That probably has something to do with the 17 years. Human brains and human social groups aren't very good vehicles for this kind of search.

So yes, treating this as distinction between sharp points is wrong. But treating it as a description of a big cultural transition is right.

[comment deleted]

Perhaps Eliezer goes to too many cocktail parties:

X: "Do you build neural networks or expert systems?"
E: "I don't build anything. Mostly I whine about people who do."
X: "Hmm. Does that pay well?"

Perhaps Bayesian Networks are the hot new delicious lemon glazing. Of course they have been around for 23 years.

Well, if the AGI field had real proof that AGI was possible, sure. The problem is the proof for AGI is in the doing, and the fact that you think it's possible is baseless belief. Just because a person can do it does not mean a computer can.

Reduction to QED

The question of AGI is an open question and there is no way to silence the opposition logically until an AGI is created, something you won't be doing.

Aside: I know the Quantum Physics Sequence was sort of about this, and the inside vs. outside view argument is closely related, but I wouldn't mind seeing more discussion of the specific phenomenon of demanding experimental evidence while ignoring rational argument. Also, this is at least the second time I've seen someone arguing against AGI, not on the object level or a standard meta level, but by saying what sounds like I won't believe you until you can convince everyone else. What does it matter that the opposition can't be silenced, except insofar as the opposition has good arguments?

[comment deleted - all, please don't respond to obvious trolls]

IgnoranceNeverPays: "This is a common thing among people who know enough that they think they know something but don't actually know enough to really know something."

Can you say that really really fast?

It wasn't seventeen years. It was five years. See

The reason NNs aren't considered Lemon Glazing is because they are an approach towards at least one of the known models that does produce intelligence - you. Of course, backprop and self organizing maps are far less complicated than the events in a single rat neuron. Of course, computer simulations of neurons and neural networks are based upon the purely logical framework of the CPU and RAM. Of course, the recognized logical states of all of those hardware components are abstractions laid upon a complex physical system. I don't quite say irreducible complexity, but, at the least, immense complexity at some level is required to produce what we recognize as intelligence.

Mr. Art, I get 1974 - 1957 = 17. Does the referenced book (in contrast to its abstract) give an invention date other than 1974? Who and when?

Ah, I see. I assumed you meant 17 years from 'Perceptrons' - the 1969 book that pointed out the problem. Failed pedantry on my part!

Umm, It looks like he did not read the book "Perceptrons," because he repeats a lot of misinformation from others who also did not read it.

  1. First, none of the theorems in that book are changed or 'refuted' by the use of back-propagation. This is because almost all the book is about whether, in various kinds of connectionist networks, there exist any sets of coefficients to enable the net to recognize various kinds of patterns.
  2. Anyway, because BP is essentially a gradient climbing process, it has all the consequent problems -- such as getting stuck on local peaks.
  3. Those who read the book will see (on page 56) that we did not simply show that the "parity function" is not linearly separable. What we showed is that for a perceptron (with one layer of weighted-threshold neurons that are all connected to a single weighted-threshold output cell), there must be many neurons, each of which has inputs from every point in the retina!
  4. That result is fairly trivial. However, chapter 9 proves a much deeper limitation: such networks cannot recognize any topological features of a pattern unless either there is one all-seeing neuron that does it, or exponentially many cells with smaller input sets.

A good example of this is: try to make a neural network that looks at a large two-dimensional retina, and decides whether the image contains more than one connected set. That is, whether it is seeing just one object, or more than one object. I don't yet have a decent proof of this (and I'd very much like to see one) but it is clear from the methods in the book that even a multilayer neural network cannot recognize such patterns--unless the number of layers is of the order of the number of points in the retina!!!! This is because a loop-free network cannot do the needed recursion.

  5. The popular rumor is that these limitations are overcome by making networks with more layers. And in fact, networks with more layers can recognize more patterns, but at an exponentially high price in complexity. (One can make networks with loops that can compute some topological features. However, there is no reason to suspect that back-propagation will work on such networks.)

The writer has been sucked in by propaganda. Yes, neural nets with back-propagation can recognize many useful patterns, indeed, but cannot learn to recognize many other important ones—such as whether two different things in a picture share various common features, etc.

Now, you readers should ask about why you have not heard about such problems! Here is the simple, incredible answer: In Physics, if you show that a certain popular theory cannot explain an important phenomenon, you're likely to win a Nobel Prize, as when Yang and Lee showed that the standard theory could not explain a certain violation of Parity. Whereas, in the Connectionist Community, if your network cannot recognize a certain type of pattern, you'll simply refrain from announcing this fact, and pretend that nothing has happened--perhaps because you fear that your investors will withdraw their support. So yes, you can indeed build connectionist networks that learn which movies a citizen is likely to like, and yes, that can make you some money. And if your connectionist robot can't count, so what! Just find a different customer!

But the real cleverness is in how neural networks were marketed. They left out the math.

Not entirely true, my recollection is the PDP book had lots of maths in it.

I didn't say Perceptrons (the book) was in any way invalidated by backprop. Perceptrons cannot, in fact, learn to recognize XOR. The proof of this is both correct and obvious; and moreover, does not need to be extended to multilayer Perceptrons because multilayer linear = linear.

That no one's trained a backprop-type system to distinguish connected from unconnected surfaces (in general) is, if true, not too surprising; the space of "connected" versus "unconnected" would cover an incredible number and variety of possible figures, and offhand there doesn't seem to be a very good match between that global property and the kind of local features detected by most neural nets.

I'm no fan of neurons; this may be clearer from other posts.

[comment deleted]

[This comment is no longer endorsed by its author]Reply

I don't like NNs because, IMHO, they can't really learn outside the box. They can learn a specific function, but if it changes, it will take the same amount of time for them to relearn the new one. A logical AI that's abstracted from the input/output cycle, e.g. one that can learn how to learn, can learn that the function is changing predictably and change the network accordingly. So maybe a NN powered by an expert system that guides the learning process? Even so, the advantage to an NN is that it's parallel. If you lose that, then you might as well stick with an expert system.

Oops, did I break any unspoken rules by posting in a really really old topic? Sorry if I did :'(