I should mention that the NIPS '08 papers aren't online yet, but all previous conferences do have the papers, tutorials, slides, background material, etc. online. For example here's last year.
The arguments Eliezer describes are made, and his reactions are fair. But really the actual research community "grew out" of most of this stuff a while back. CYC and the "common sense" efforts were always a sideshow (in terms of research money and staff, not to mention results). Neural networks were a metonym for statistical learning for a while, then serious researchers figured out they needed to address statistical learning explicitly. Etc.
Admittedly there's always excessive enthusiasm for the current hot thing. A few years ago it was support vector machines; I'm not sure what it is now.
I recognize there's some need to deflate popular misconceptions, but there's also a need to move on and look at current work.
Eliezer, I'd be very interested in your comments on (what I regard as) the best current work. Examples for you to consider would be Sebastian Thrun, Andrew Ng (both in robotics at Stanford), Chris Manning (linguistics at Stanford), and the papers in the last couple of NIPS conferences (the word "Neural" in the conference title is just a fossil, don't have an allergic reaction).
As an entertaining side note, here's an abstract for a poster for NIPS '08 (happening tomorrow) that addresses the crossover between AI and ems:
A Bayesian Approach for Extracting State Transition Dynamics from Multiple Spike Trains
Neural activity is non-stationary and varies across time. Hidden Markov Models (HMMs) have been used to track the state transition among quasi-stationary discrete neural states. Within this context, an independent Poisson model has been used for the output distribution of HMMs; hence, the model is incapable of tracking the change in correlation without modulating the firing rate. To achieve this, we applied a multivariate Poisson distribution with a correlation term for the output distribution of HMMs. We formulated a Variational Bayes (VB) inference for the model. The VB could automatically determine the appropriate number of hidden states and correlation types while avoiding the overlearning problem. We developed an efficient algorithm for computing posteriors using the recursive relationship of a multivariate Poisson distribution. We demonstrated the performance of our method on synthetic data and a real spike train recorded from a songbird.
This is a pretty good example of what I meant by "solving engineering problems" and it should help the ems program "cut corners".
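For anyone who hasn't met the baseline that abstract is improving on, here is a minimal sketch of an HMM with independent Poisson outputs, scored with the standard forward algorithm. This is not the poster's method (no multivariate Poisson, no Variational Bayes); the function names and the toy numbers are my own assumptions, and all probabilities and rates are assumed to be strictly positive.

```python
import math
from typing import Sequence


def poisson_logpmf(k: int, rate: float) -> float:
    """Log probability of observing k spikes under a Poisson with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def emission_loglik(counts: Sequence[int], rates: Sequence[float]) -> float:
    """Independent Poisson output: neurons contribute separate terms, so the
    model has no parameter for correlation between them."""
    return sum(poisson_logpmf(k, r) for k, r in zip(counts, rates))


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_loglik(obs, init, trans, rates) -> float:
    """Log-likelihood of spike counts under the HMM (forward algorithm).

    obs[t][n]   spike count of neuron n in time bin t
    init[s]     initial probability of hidden state s (assumed > 0)
    trans[p][s] probability of moving from state p to state s (assumed > 0)
    rates[s][n] firing rate of neuron n while in state s (assumed > 0)
    """
    n_states = len(init)
    log_alpha = [
        math.log(init[s]) + emission_loglik(obs[0], rates[s])
        for s in range(n_states)
    ]
    for counts in obs[1:]:
        log_alpha = [
            logsumexp([log_alpha[p] + math.log(trans[p][s]) for p in range(n_states)])
            + emission_loglik(counts, rates[s])
            for s in range(n_states)
        ]
    return logsumexp(log_alpha)


if __name__ == "__main__":
    # Made-up data: two neurons, three time bins, two hidden states.
    obs = [[0, 1], [4, 5], [5, 3]]
    init = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    rates = [[0.5, 0.5], [4.0, 4.0]]
    print(forward_loglik(obs, init, trans, rates))
```

Because `emission_loglik` is a plain sum over neurons, the only thing a state can change is each neuron's rate, which is the limitation the abstract says the correlation term is meant to remove.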
Regarding serial vs. parallel:
The effect on progress is indirect and as a result hard to figure out with confidence.
We have gradually learned how to get nearly linear speedups from large numbers of cores. We can now manage linear speedups over dozens of cores for fairly structured computations, and linear speedups over hundreds of cores are possible in many cases. This is well beyond the near-future number of cores per chip. For the purposes of this analysis I think we can assume that Intel can get linear speedups from increasing processors per chip, say for the next ten years.
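One way to put rough numbers on "nearly linear" is Amdahl's law. I'm bringing it in purely as an illustration; the sketch assumes a fixed serial fraction, and the fractions and core counts below are made-up values, not measurements from Intel or anyone else.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a program whose non-parallelizable
    part is `serial_fraction` of its single-core run time."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def parallel_efficiency(serial_fraction: float, cores: int) -> float:
    """Fraction of ideal (linear) speedup actually achieved."""
    return amdahl_speedup(serial_fraction, cores) / cores


if __name__ == "__main__":
    # Hypothetical serial fractions and core counts, chosen only for illustration.
    for serial_fraction in (0.0, 0.001, 0.01):
        for cores in (12, 48, 256):
            speedup = amdahl_speedup(serial_fraction, cores)
            efficiency = parallel_efficiency(serial_fraction, cores)
            print(f"serial={serial_fraction:<6} cores={cores:<4} "
                  f"speedup={speedup:7.1f} efficiency={efficiency:.2f}")
```

Under this model a 1% serial fraction caps the speedup below 100x however many cores are added, so "linear over hundreds of cores" implies the serial part has been driven very low.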
But there are other issues.
More complicated / difficult programming models may not slow down a given program, but they make changing programs more difficult.
Over time our ability to create malleable highly parallel programs has improved. In special cases a serial program can be "automatically" parallelized (compilation with hints) but mostly parallelization still requires explicit design. But the abstractions have gotten much easier to use and revise.
(In my earlier analysis I was assuming, I think correctly, that this improvement was a function of human thought without much computational assist. The relevant experiments aren't computationally expensive. Intel has been building massively parallel systems since the mid-80s but it didn't produce most major improvements. The parallel programming ideas accreted slowly from a very broad community.)
So I guess I'd say that with the current software technology and trend, Intel can probably maintain most of its computational curve-riding. Certainly simulations with a known software architecture can be parallelized quite effectively, and can be maintained as requirements evolve.
The limitation will be on changes that violate the current pervasive assumptions of the simulation design. I don't know what those are these days, and if I did I probably couldn't say. However they reflect properties that are common to all the "processor like" chips Intel designs, over all the processes it can easily imagine.
Changes to software that involve revising pervasive assumptions have always been difficult, of course. Parallelization just increases the difficulty by some significant constant factor. Not really constant, though; it has been slowly decreasing over time as noted above.
So the types of improvement that will slow down are the ones that involve major new ways to simulate chips, or major new design approaches that don't fit Intel's current assumptions about chip micro-architecture or processes.
While these could be significant, unfortunately I can't predict how or when. I can't even come up with a list of examples where such improvements were made. They are pretty infrequent and hard to categorize.
I hope this helps.
I'll try to estimate as requested, but substituting fixed computing power for "riding the curve" (as Intel does now) is a bit of an apples to fruit cocktail comparison, so I'm not sure how useful it is. A more direct comparison would be with always having a computing infrastructure from 10 years in the future or past.
Even with this amendment, the (necessary) changes to design, test, and debugging processes make this hard to answer...
I'll think out loud a bit.
Here's the first quick guess I can make that I'm moderately sure of: The length of time to go through a design cycle (including shrinks and transitions to new processes) would scale pretty closely with computing power, keeping the other constraints pretty much constant. (Same designers, same number of bugs acceptable, etc.) So if we assume the power follows Moore's law (probably too simple as others have pointed out) cycles would run hundreds of times faster with computing power from 10 years in the future.
This more or less fits the reality, in that design cycles have stayed about the same length while chips have gotten hundreds of times more complex, and also much faster, both of which soak up computing power.
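The "hundreds of times" figure depends on which doubling period you plug in for Moore's law. Here is a quick sketch of the arithmetic; the three doubling periods are assumptions of mine, not numbers from Intel.

```python
def compute_multiplier(years: float, doubling_period_years: float) -> float:
    """Growth factor after `years` if capacity doubles every `doubling_period_years`."""
    if doubling_period_years <= 0:
        raise ValueError("doubling_period_years must be positive")
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Assumed doubling periods: 12, 18, and 24 months.
    for period in (1.0, 1.5, 2.0):
        factor = compute_multiplier(10, period)
        print(f"doubling every {period} years -> about {factor:.0f}x after 10 years")
```

A ten-year offset works out to roughly 32x with a two-year doubling period, about 100x at eighteen months, and about 1000x at one year, so "hundreds" sits in the faster half of that range.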
Probably more computing power would have also allowed faster process evolution (basically meaning smaller feature sizes) but I was never a process designer so I can't really generate a firm opinion on that. A lot of physical experimentation is required and much of that wouldn't go faster. So I'm going to assume very conservatively that the increased or decreased computing power would have no effect on process development.
The number of transistors on a chip is limited by process considerations, so adding computing power doesn't directly enable more complex chips. Leaving the number of devices the same and just cycling the design of chips with more or less the same architecture hundreds of times faster doesn't make much economic sense. Maybe instead Intel would create hundreds of times as many chip designs, but that implies a completely different corporate strategy so I won't pursue that.
In this scenario, experimentation via computing gets hundreds of times "cheaper" than in our world, so it would get used much more heavily. Given these cheap experiments, I'd guess Intel would have adopted much more radical designs.
Examples of more radical approaches would be self-clocked chips, much more internal parallelism (right now only about 1/10 of the devices change state on any clock), chips that directly use more of the quantum properties of the material, chips that work with values other than 0 and 1, direct use of probabilistic computing, etc. In other words, designers would have pushed much further out into the micro-architectural design space, to squeeze more function out of the devices. Some of this (e.g. probabilistic or quantum-enhanced computing) could propagate up to the instruction set level.
(This kind of weird design is exactly what we get when evolutionary search is applied directly to a gate array, which roughly approximates the situation Intel would be in.)
Conversely, if Intel had hundreds of times less computing power, they'd have to be extremely conservative. Designs would have to stay further from any possible timing bugs, new designs would appear much more slowly, they'd probably make the transition to multiple cores much sooner because scaling processor designs to large numbers of transistors would be intractable, there'd be less fine-grained internal parallelism, etc.
If we assumed that progress in process design was also more or less proportional to computing power available, then in effect we'd just be changing the exponent on the curve; to a first approximation we could assume no qualitative changes in design. However as I say this is a very big "if".
Now however we have to contend with an interesting feedback issue. Suppose we start importing computing from ten years in the future in the mid-1980s. If it speeds everything up proportionally, the curve gets a lot steeper, because that future is getting faster faster than ours. Conversely if Intel had to run on ten year old technology the curve would be a lot flatter.
On the other hand if there is skew between different aspects of the development process (as above with chip design vs. process design) we could go somewhere else entirely. For example if Intel develops some way to use quantum effects in 2000 due to faster simulations from 1985 on, and then that gets imported (in a black box) back to 1990, things could get pretty crazy.
I think that's all for now. Maybe I'll have more later. Further questions welcome.
I did work at Intel, and two years of that was in the process engineering area (running the AI lab, perhaps ironically).
The short answer is that more computing power leads to more rapid progress. Probably the relationship is close to linear, and the multiplier is not small.
Specifically with reference to Phil Goetz's comment about scaling, the physics is not invariant under scaling (obviously) and the critical paths change in not entirely predictable ways. So again optimal "shrinks" are hostage to simulation performance.
This is a special case of the general pattern that progress in making the process better and the chips faster typically depends on modeling, analyzing, collecting data, etc. in new ways, and the limits are often how quickly humans can try out and evolve computer mediated tools. Scaling to larger data sets, using less efficient but more easily modified software, running simulations faster, etc. all pay big dividends.
Intel can't in general substitute more processors in a cluster for faster processors, since writing software that gets good speedups on large numbers of processors is hard, and changing such software is much harder than changing single-processor software. The pool of people who can do this kind of development is also small and can't easily be increased.
So I don't really know what difference it makes, but I think Eliezer's specific claim here is incorrect.
On the one hand, Eliezer is right in terms of historical and technical specifics.
On the other hand neural networks for many are a metonym for continuous computations vs. the discrete computations of logic. This was my reaction when the two PDP volumes came out in the 80s. It wasn't "Here's the Way." It was "Here's an example of how to do things differently that will certainly work better."
Note also that the GOFAI folks were not trying to use just one point in logic space. In the 70s we already knew that monotonic logic was not good enough (due to the frame problem among other things) so there was an active exploration of different types of non-monotonic logic. That's in addition to all the modal logics, etc.
So the dichotomy Eliezer refers to should be viewed as more of a hyperplane separator in intelligence model space. From that point of view I think it is fairly valid -- the subspace of logical approaches is pretty separate from the subspace of continuous approaches, though Detlef and maybe others have shown you can build bridges.
The two approaches were even more separate culturally at the time. AI researchers didn't learn or use continuous mathematics, and didn't want to see it in their papers. That probably has something to do with the 17 years. Human brains and human social groups aren't very good vehicles for this kind of search.
So yes, treating this as a distinction between sharp points is wrong. But treating it as a description of a big cultural transition is right.
The "500 bits" only works if you take a hidden variable or Bohmian position on quantum mechanics. If (as the current consensus would say) non-linear dynamics can amplify quantum noise then enormous amounts of new information are being "produced" locally everywhere all the time. The current state of the universe incorporates much or all of that information. (Someone who understands the debates about black holes and the holographic principle should chime in with more precise analysis.)
I couldn't follow the whole argument so I'm not sure how this affects it, but given that Eliezer keeps referring to this claim I guess it is important.
Poke's comment is interesting and I agree with his / her discussion of cultural evolution. But it also is possible to turn this point around to indicate a possible sweet spot in the fitness landscape that we are probably approaching. Conversely, however, I think the character of this sweet spot indicates scant likelihood of a very rapidly self-bootstrapping AGI.
Probably the most important and distinctive aspect of humans is our ability and desire to coordinate (express ourselves to others, imitate others, work with others, etc.). That ability and desire is required to engage in the sort of cultural evolution that Poke describes. It underlies the individual acquisition of language, cultural transmission, long term research programs, etc.
But as Eric Raymond points out, we are just good enough at this to make it work at all. A bunch of apes trying to coordinate world-wide culture, economy and research is a marginal proposition.
Furthermore we can observe that major creative works come from a very small number of people in "hot" communities -- e.g. Florence during the Renaissance. As Paul Graham points out, this can't be the result of a collection of uniquely talented individuals, it must be some function of the local cultural resources and incentives. Unfortunately I don't know of any fine grained research on what these situations have in common -- we probably don't even have the right concepts to express those characteristics.
A mundane version of this is the amazing productivity of a "gelled team", in software development and other areas. There is some interesting research on the fine grained correlates of team productivity but not much.
So I conjecture that there is a sweet spot for optimized "thinking systems" equivalent to highly productive human teams or larger groups.
Of course we already have such systems, combining humans and digital systems; the digital parts compensate for human limitations and decrease coordination costs in various ways, but they are still extremely weak -- basically networked bookkeeping mechanisms of various sorts.
The natural direction of evolution here is that we improve the fit between the digital parts and the humans, tweak the environment to increase human effectiveness, and gradually increase the capabilities of the digital environment, until the humans are no longer needed.
As described this is just incremental development. However it is self-accelerating; these systems are good tools for improving themselves. I expect we'll see the usual sigmoid curve, where these "thinking systems" relatively quickly establish a new level, but then development slows down as they run into intrinsic limitations -- though it is hard to predict what these will be, just as Ada Lovelace couldn't predict the difficulties of massively parallel software design.
From here, we can see a sweet spot that is inhabited by systems with the abilities of "super teams", perhaps with humans as components. In this scenario any super team emerges incrementally in a landscape with many other similar teams in various stages of development. Quite likely different teams will have different strengths and weaknesses. However nothing in this scenario gives us any reason to believe in super teams that can bootstrap themselves to virtual omniscience or omnipotence.
This development will also give us deep insight into how humans coordinate and how to facilitate and guide that coordination. This knowledge is likely to have very large consequences outside the development of the super teams.
Unfortunately, none of this thinking gives us much of a grip on the larger implications of moving to this sweet spot, just as Ada Lovelace (or Thomas Watson) didn't anticipate the social implications of the computer, and Einstein and Leo Szilard didn't anticipate the social implications of control over nuclear energy.
I largely agree with Robin's point that smaller incremental steps are necessary.
But Eliezer's point about big jumps deserves a reply. The transitions to humans and to atomic bombs do indicate something to think about -- and for that matter, so does the emergence of computers.
These all seem to me to be cases where the gradually rising or shifting capacities encounter a new "sweet spot" in the fitness landscape. Other examples are the evolution of flight, or of eyes, both of which happened several times. Or trees, a morphological innovation that arises in multiple botanical lineages.
Note that even for innovations that fit this pattern, e.g. computers and atomic bombs, enormous amounts of incremental development are required before we can get to the sweet spot and start to expand there. (This is also true for biological evolution of course.)
I think most human innovations (tall buildings, rockets, etc.) are due to incremental accumulation of this sort, rather than finding any big sweet spots.
I should also note that decades before the atomic bomb, the actual production of energy from nuclear fission (geothermal) and fusion (the sun) was clear, if not understood in detail. Similarly the potential of general purpose computers was sensed (e.g. by Ada Lovelace) far before we could build them. This foreknowledge was quite concrete -- it involved detailed physical accounts of existing sources of energy, automation of existing computing techniques, etc. So this sort of sweet spot can be understood in quite detailed ways well before we have the technical skills to reach it.
Using this model, if AGI arrives rapidly, it will be because we found a sweet spot, over and above computing. If AGI is feasible in the near future, that implies that we are near such a sweet spot now. If we are near such a sweet spot, we should be able to understand some of its specific form (beyond "it uses Bayesian reasoning") and the limitations that keep us from getting to it immediately.
I agree with Eliezer that Bayesian methods are "forced", and I also feel the "Good Old Fashioned AI" folks (certainly including Schank and McCarthy) are not good forecasters, for many reasons.
However Bayesian approaches are at the root of existing impressive AI, such as Thrun's work on autonomous vehicles. I have been watching this work fairly closely, and it is making the normal sort of incremental progress. If there's a big sweet spot nearby in the fitness landscape, these practitioners should be able to sense it. They would be well qualified to comment on the prospects for AI, and AGI in particular. I would be very interested in what they have to say.
PK, Phil Goetz, and Larry D'Anna are making a crucial point here but I'm afraid it is somewhat getting lost in the noise. The point is (in my words) that lookup tables are a philosophical red herring. To emulate a human being they can't just map external inputs to external outputs. They also have to map a big internal state to the next version of that big internal state. (That's what Larry's equations mean.)
If there was no internal state like this, a GLUT couldn't emulate a person with any memory at all. But by hypothesis, it does emulate a person (perfectly). So it must have this internal state.
And given that a GLUT is maintaining a big internal state it is equivalent to a Turing machine, as Phil says.
But that means that it can implement any computationally well defined process. If we believe that consciousness can be a property of some computation then GLUTs can have consciousness. This isn't even a stretch, it is totally unavoidable.
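To make the internal-state point concrete, here is a toy sketch of a lookup table keyed on (internal state, input) that returns (next internal state, output). The table entries and names are hypothetical, and this toy has only two states, so it is a finite-state machine; it shows the shape of the mapping, nothing more.

```python
from typing import Dict, List, Tuple

State = str
Table = Dict[Tuple[State, str], Tuple[State, str]]

# Hypothetical toy table: the only "memory" is whether a greeting has happened.
TABLE: Table = {
    ("fresh", "hello"): ("greeted", "Nice to meet you."),
    ("fresh", "bye"): ("fresh", "Have we met?"),
    ("greeted", "hello"): ("greeted", "We've already met."),
    ("greeted", "bye"): ("fresh", "Goodbye."),
}


def run_glut(table: Table, inputs: List[str], state: State = "fresh") -> List[str]:
    """Feed inputs through the table one at a time, carrying the state forward."""
    outputs = []
    for symbol in inputs:
        # Each lookup yields the reply AND the next internal state.
        state, reply = table[(state, symbol)]
        outputs.append(reply)
    return outputs


if __name__ == "__main__":
    for line in run_glut(TABLE, ["hello", "hello", "bye", "bye"]):
        print(line)
```

A table keyed on the input alone would have to give the same reply to both "hello"s. Getting different replies to the same input is exactly what forces the state into the table.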
The whole reason that philosophers talk about GLUTs, or that Searle talks about the Chinese room, is to try to trick the reader into being overwhelmed by the intuition that "that can't possibly be conscious" and to STOP THINKING.
Looking at this discussion, to some extent that works! Most people didn't say "Hmmm, I wonder how a GLUT could emulate a human..." and then realize it would need internal state, and the internal state would be supporting a complex computational process, and that the GLUT would in effect be a virtual machine, etc.
This is like an argument where someone tries to throw up examples that are so scary, or disgusting, or tear jerking, or whatever that we STOP THINKING and vote for whatever they are trying to sneak through. In other words it does not deserve the honor of being called an argument.
This leaves the very interesting question of whether a computational process can support consciousness. I think yes, but the discussion is richer. GLUTs are a red herring and don't lead much of anywhere.