Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.


Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
Becoming Stronger


EY was not in fact bullish on neural networks leading to impressive AI capabilities. Eliezer said this directly:

I'm no fan of neurons; this may be clearer from other posts.[1]

I think this is strong evidence for my interpretation of the quotes in my parent comment: He's not just mocking the local invalidity of reasoning "because humans have lots of neurons, AI with lots of neurons -> smart", he's also mocking neural network-driven hopes themselves. 

  1. More quotes from Logical or Connectionist AI?:

    Not to mention that neural networks have also been "failing" (i.e., not yet succeeding) to produce real AI for 30 years now. I don't think this particular raw fact licenses any conclusions in particular. But at least don't tell me it's still the new revolutionary idea in AI.

    This is the original example I used when I talked about the "Outside the Box" box - people think of "amazing new AI idea" and return their first cache hit, which is "neural networks" due to a successful marketing campaign thirty goddamned years ago. I mean, not every old idea is bad - but to still be marketing it as the new defiant revolution? Give me a break.

    In this passage, he employs well-scoped and well-hedged language via "this particular raw fact." I like this writing because it points out an observation, and then what inferences (if any) he draws from that observation. Overall, his tone is negative on neural networks.

    Let's open up that "Outside the Box" box:

    In Artificial Intelligence, everyone outside the field has a cached result for brilliant new revolutionary AI idea—neural networks, which work just like the human brain!  New AI Idea: complete the pattern:  "Logical AIs, despite all the big promises, have failed to provide real intelligence for decades—what we need are neural networks!"

    This cached thought has been around for three decades.  Still no general intelligence.  But, somehow, everyone outside the field knows that neural networks are the Dominant-Paradigm-Overthrowing New Idea, ever since backpropagation was invented in the 1970s.  Talk about your aging hippies.

    This is more incorrect mockery.

I worry that this comment dances around the basic update to be made. 

This post makes fun of people who were excited about neural networks. Neural network-based approaches have done extremely well. Eliezer's example wasn't just "unfortunately timed." Eliezer was wrong.

Here are some of my disagreements with List of Lethalities. I'll quote item one:

“Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.  This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again”

(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups. 

I think several things here, considering the broader thread: 

  1. You've done a great job in communicating several reactions I also had:
    1. There are signs of serious mispredictions and mistakes in some of the 2008 posts.
    2. There are ways to read these posts as not that bad in hindsight, but we should be careful in giving too much benefit of the doubt.
    3. Overall these observations constitute important evidence on EY's alignment intuitions and ability to make qualitative AI predictions.
  2. I did a bad job of marking my interpretations of what Eliezer wrote, as opposed to claiming he did dismiss ANNs. Hopefully my edits have fixed my mistakes.

But he only admitted that two other methods would work - building a mechanical duplicate of the human brain and evolving AI via natural selection.

To be fair, he said that those two will work, and (perhaps?) admitted the possibility of "run advanced neural network algorithms" eventually working. Emphasis mine:

What do all these proposals have in common?

They are all ways to make yourself believe that you can build an Artificial Intelligence, even if you don't understand exactly how intelligence works.

Now, such a belief is not necessarily false!

Responding to part of your comment:

In that quote, he only rules out a large class of modern approaches to alignment, which again is nothing new; he's been very vocal about how doomed he thinks alignment is in this paradigm.

I know he's talking about alignment, and I'm criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments.

That claim seems to be advanced due to... there not being enough similarities between ANNs and human brains -- that without enough similarity in mechanisms which were selected for by evolution, you simply can't get the AI to generalize in the mentioned human-like way. Not as a matter of the AI's substrate, but as a matter of the AI's policy not generalizing like that.

I think this is a dubious claim, and it's based on analogies to evolution / some unknown importance of having evolution-selected mechanisms which guide value formation (and not SGD-based mechanisms).

From the Alexander/Yudkowsky debate:


Okay, then let me try to directly resolve my confusion. My current understanding is something like - in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we've screened off evolution, the size of the genome, etc - all of that is going into the "with certain structural parameters" part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same result ("don't steal" rather than "don't get caught")?


The answer to that seems sufficiently obviously "no" that I want to check whether you also think the answer is obviously no, but want to hear my answer, or if the answer is not obviously "no" to you.


Then I'm missing something, I expected the answer to be yes, maybe even tautologically (if it's the same structural parameters and the same training data, what's the difference?)


Maybe I'm failing to have understood the question. Evolution got human brains by evaluating increasingly large blobs of compute against a complicated environment containing other blobs of compute, got in each case a differential replication score, and millions of generations later you have humans with 7.5MB of evolution-learned data doing runtime learning on some terabytes of runtime data, using their whole-brain impressive learning algorithms which learn faster than evolution or gradient descent.

Your question sounded like "Well, can we take one blob of compute the size of a human brain, and expose it to what a human sees in their lifetime, and do gradient descent on that, and get a human?" and the answer is "That dataset ain't even formatted right for gradient descent."

There's some assertion like "no, there's not a way to get an ANN, even if incorporating structural parameters and information encoded in human genome, to actually unfold into a mind which has human-like values (like 'don't steal')." (And maybe Eliezer comes and says "no that's not what I mean", but, man, I sure don't know what he does mean, then.) 

Here's some more evidence along those lines:


I mean, the evolutionary builtin part is not "humans have morals" but "humans have an internal language in which your Nice Morality, among other things, can potentially be written"...

Humans, arguably, do have an imperfect unless-I-get-caught term, which is manifested in children testing what they can get away with? Maybe if nothing unpleasant ever happens to them when they're bad, the innate programming language concludes that this organism is in a spoiled aristocrat environment and should behave accordingly as an adult? But I am not an expert on this form of child developmental psychology since it unfortunately bears no relevance to my work of AI alignment.


Do you feel like you understand very much about what evolutionary builtins are in a neural network sense? EG if you wanted to make an AI with "evolutionary builtins", would you have any idea how to do it?


Well, for one thing, they happen when you're doing sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, not when you're doing gradient descent relative to a loss function on much larger neural networks.

Again, why is this true? This is an argument that should be engaging in technical questions about inductive biases, but instead seems to wave at (my words) "the original way we got property P was by sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, and good luck trying to get it otherwise."

Hopefully this helps clarify what I'm trying to critique?

Here's another attempt at one of my contentions. 

Consider shard theory of human values. The point of shard theory is not "because humans do RL, and have nice properties, therefore AI + RL will have nice properties." The point is more "by critically examining RL + evidence from humans, I have hypotheses about the mechanistic load-bearing components of e.g. local-update credit assignment in a bounded-compute environment on certain kinds of sensory data, that these components lead to certain exploration/learning dynamics, which explain some portion of human values and experience. Let's test that and see if the generators are similar."

And my model of Eliezer shakes his head at the naivete of expecting complex human properties to reproduce outside of human minds themselves, because AI is not human. 

But then I'm like "this other time you said 'AI is not human, stop expecting good property P from superficial similarities', you accidentally missed the modern AI revolution, right? Seems like there is some non-superficial mechanistic similarity/lessons here, and we shouldn't be so quick to assume that the brain's qualitative intelligence or alignment properties come from a huge number of evolutionarily-tuned details which are load-bearing and critical." 

Here's a colab notebook (it takes a while to load the data, be warned). We'll have a post out later. 

Edited to modify confidences about interpretations of EY's writing / claims.

In "Failure By Analogy" and "Surface Analogies and Deep Causes", the point being made is "X is similar in aspects A to thing Y, and X has property P" does not establish "Y has property P". The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.

This is a valid point, and that's not what I'm critiquing in that portion of the comment. I'm critiquing how -- on my read -- he confidently dismisses ANNs; in particular, using non-mechanistic reasoning which seems similar to some of his current alignment arguments.

On its own, this seems like a substantial misprediction for an intelligence researcher in 2008 (especially one who claims to have figured out most things in modern alignment, by a very early point in time -- possibly that early, IDK). Possibly the most important prediction to get right, to date.

Airplanes don't fly like birds, they fly like airplanes. So indeed you can't just ape one thing about birds[*] to get avian flight. I don't think this is a super revealing technicality but it seemed like you thought it was important.

Indeed, you can't ape one thing. But that's not what I'm critiquing. Consider the whole transformed line of reasoning:

avian flight comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get an entity which flies, that entity must be as close to a bird as birds are to each other.

The important part is the last part. It's invalid. Finding a design X which exhibits property P doesn't mean that for design Y to exhibit property P, Y must be very similar to X.

Which leads us to:

Maybe most importantly I don't think Eliezer thinks you need to mimic the human brain super closely to get human-like intelligence with human-friendly wants

Reading the Alexander/Yudkowsky debate, I surprisingly haven't ruled out this interpretation, and indeed suspect he believes some forms of this (but not others).

Matters would be different if he said in the quotes you cite "you only get these human-like properties by very exactly mimicking the human brain", but he doesn't.

Didn't he? He at least confidently rules out a very large class of modern approaches.

because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.

Like, in particular with respect to "learn 'don't steal' rather than 'don't get caught'."


  1. I don't know what proponents were claiming when proponing neural networks. I do know that neural networks ended up working, big time.
  2. I don't think loose analogies are powerful. I think they lead to sloppy thinking. 