Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This podcast has gotten a lot of traction, so we're posting a full transcript of it, lightly edited with ads removed, for those who prefer reading over audio. 


Eliezer Yudkowsky: [clip] I think that we are hearing the last winds start to blow, the fabric of reality start to fray. This thing alone cannot end the world, but I think that probably some of the vast quantities of money being blindly and helplessly piled into here are going to end up actually accomplishing something.

Ryan Sean Adams: Welcome to Bankless, where we explore the frontier of internet money and internet finance. This is how to get started, how to get better, how to front run the opportunity. This is Ryan Sean Adams. I'm here with David Hoffman, and we're here to help you become more bankless.

Okay, guys, we wanted to do an episode on AI at Bankless, but I feel like David...

David: Got what we asked for.

Ryan: We accidentally waded into the deep end of the pool here. And I think before we get into this episode, it probably warrants a few comments. I'm going to say a few things I'd like to hear from you too. But one thing I want to tell the listener is, don't listen to this episode if you're not ready for an existential crisis. Okay? I'm kind of serious about this. I'm leaving this episode shaken. And I don't say that lightly. In fact, David, I think you and I will have some things to discuss in the debrief as far as how this impacted you. But this was an impactful one. It sort of hit me during the recording, and I didn't know fully how to react. I honestly am coming out of this episode wanting to refute some of the claims made in this episode by our guest, Eliezer Yudkowsky, who makes the claim that humanity is on the cusp of developing an AI that's going to destroy us, and that there's really not much we can do to stop it.

David: There's no way around it, yeah.

Ryan: I have a lot of respect for this guest. Let me say that. So it's not as if I have some sort of big-brained technical disagreement here. In fact, I don't even know enough to fully disagree with anything he's saying. But the conclusion is so dire and so existentially heavy that I'm worried about it impacting you, listener, if we don't give you this warning going in.

I also feel like, David, as interviewers, maybe we could have done a better job. I'll say this on behalf of myself. Sometimes I peppered him with a lot of questions in one fell swoop, and he was probably only ready to synthesize one at a time.

I also feel like we got caught flat-footed at times. I wasn't expecting his answers to be so frank and so dire, David. It was just bereft of hope.

And I appreciated very much the honesty, as we always do on Bankless. But I appreciated it almost in the way that a patient might appreciate the honesty of their doctor telling them that their illness is terminal. Like, it's still really heavy news, isn't it? 

So that is the context going into this episode. I will say one thing. In good news, for our failings as interviewers in this episode, they might be remedied because at the end of this episode, after we finished with hitting the record button to stop recording, Eliezer said he'd be willing to provide an additional Q&A episode with the Bankless community. So if you guys have questions, and if there's sufficient interest for Eliezer to answer, tweet at us to express that interest. Hit us in Discord. Get those messages over to us and let us know if you have some follow-up questions.

He said if there's enough interest in the crypto community, he'd be willing to come on and do another episode with follow-up Q&A. Maybe even a Vitalik and Eliezer episode is in store. That's a possibility that we threw to him. We've not talked to Vitalik about that either, but I just feel a little overwhelmed by the subject matter here. And that is the basis, the preamble through which we are introducing this episode.

David, there's a few benefits and takeaways I want to get into. But before I do, can you comment or reflect on that preamble? What are your thoughts going into this one?

David: Yeah, we approached the end of our agenda—for every Bankless podcast, there's an equivalent agenda that runs alongside of it. But once we got to this crux of this conversation, it was not possible to proceed in that agenda, because... what was the point?

Ryan: Nothing else mattered.

David: And nothing else really matters, which also just relates to the subject matter at hand. And so as we proceed, you'll see us kind of circle back to the same inevitable conclusion over and over and over again, which ultimately is kind of the punchline of the content.

I'm of a specific disposition where stuff like this, I kind of am like, “Oh, whatever, okay”, just go about my life. Other people are of different dispositions and take these things more heavily. So Ryan's warning at the beginning is if you are a type of person to take existential crises directly to the face, perhaps consider doing something else instead of listening to this episode.

Ryan: I think that is good counsel.

So, a few things if you're looking for an outline of the agenda. We start by talking about ChatGPT. Is this a new era of artificial intelligence? Got to begin the conversation there.

Number two, we talk about what an artificial superintelligence might look like. How smart exactly is it? What types of things could it do that humans cannot do?

Number three, we talk about why an AI superintelligence will almost certainly spell the end of humanity and why it'll be really hard, if not impossible, according to our guest, to stop this from happening.

And number four, we talk about if there is absolutely anything we can do about all of this. We are heading careening maybe towards the abyss. Can we divert direction and not go off the cliff? That is the question we ask Eliezer.

David, I think you and I have a lot to talk about during the debrief. All right, guys, the debrief is an episode that we record right after the episode. It's available for all Bankless citizens. We call this the Bankless Premium Feed. You can access that now to get our raw and unfiltered thoughts on the episode. And I think it's going to be pretty raw this time around, David.

David: I didn't expect this to hit you so hard.

Ryan: Oh, I'm dealing with it right now.

David: Really?

Ryan: And this is not too long after the episode. So, yeah, I don't know how I'm going to feel tomorrow, but I definitely want to talk to you about this. And maybe have you give me some counseling. (laughs)

David: I'll put my psych hat on, yeah.

Ryan: Please! I'm going to need some help.


Ryan: Bankless Nation, we are super excited to introduce you to our next guest. Eliezer Yudkowsky is a decision theorist. He's an AI researcher. He's the seeder of the Less Wrong community blog, a fantastic blog by the way. There's so many other things that he's also done. I can't fit this in the short bio that we have to introduce you to Eliezer.

But most relevant probably to this conversation is he's working at the Machine Intelligence Research Institute to ensure that when we do make general artificial intelligence, it doesn't come kill us all. Or at least it doesn't come ban cryptocurrency, because that would be a poor outcome as well.

Eliezer: (laughs)

Ryan: Eliezer, it's great to have you on Bankless. How are you doing?

Eliezer: Within one standard deviation of my own peculiar little mean.

Ryan: (laughs) Fantastic. You know, we want to start this conversation with something that jumped onto the scene for a lot of mainstream folks quite recently, and that is ChatGPT. So apparently over 100 million or so have logged on to ChatGPT quite recently. I've been playing with it myself. I found it very friendly, very useful. It even wrote me a sweet poem that I thought was very heartfelt and almost human-like.

I know that you have major concerns around AI safety, and we're going to get into those concerns. But can you tell us in the context of something like a ChatGPT, is this something we should be worried about? That this is going to turn evil and enslave the human race? How worried should we be about ChatGPT and BARD and the new AI that's entered the scene recently?

Eliezer: ChatGPT itself? Zero. It's not smart enough to do anything really wrong. Or really right either, for that matter.

Ryan: And what gives you the confidence to say that? How do you know this?

Eliezer: Excellent question. So, every now and then, somebody figures out how to put a new prompt into ChatGPT. You know, one time somebody found that one of the earlier generations of the technology would sound smarter if you first told it it was Eliezer Yudkowsky. There's other prompts too, but that one's one of my favorites. So there's untapped potential in there that people hadn't figured out how to prompt yet.

But when people figure it out, it moves ahead sufficiently short distances that I do feel fairly confident that there is not so much untapped potential in there that it is going to take over the world. It's, like, making small movements, and to take over the world it would need a very large movement. There's places where it falls down on predicting the next line that a human would say in its shoes that seem indicative of “probably that capability just is not in the giant inscrutable matrices, or it would be using it to predict the next line”, which is very heavily what it was optimized for. So there's going to be some untapped potential in there. But I do feel quite confident that the upper range of that untapped potential is insufficient to outsmart all the living humans and implement the scenario that I'm worried about.

Ryan: Even so, though, is ChatGPT a big leap forward in the journey towards AI in your mind? Or is this fairly incremental, it's just (for whatever reason) caught mainstream attention?

Eliezer: GPT-3 was a big leap forward. There's rumors about GPT-4, which, who knows? ChatGPT is a commercialization of the actual AI-in-the-lab giant leap forward. If you had never heard of GPT-3 or GPT-2 or the whole range of text transformers before ChatGPT suddenly entered into your life, then that whole thing is a giant leap forward. But it's a giant leap forward based on a technology that was published in, if I recall correctly, 2018.

David: I think that what's going around in everyone's minds right now—and the Bankless listenership (and crypto people at large) are largely futurists, so everyone (I think) listening understands that in the future, there will be sentient AIs perhaps around us, at least by the time that we all move on from this world.

So we all know that this future of AI is coming towards us. And when we see something like ChatGPT, everyone's like, “Oh, is this the moment in which our world starts to become integrated with AI?” And so, Eliezer, you've been tapped into the world of AI. Are we onto something here? Or is this just another fad that we will internalize and then move on from? And then the real moment of generalized AI is actually much further out than we're initially giving credit for. Like, where are we in this timeline?

Eliezer: Predictions are hard, especially about the future. I sure hope that this is where it saturates — this or the next generation, it goes only this far, it goes no further. It doesn't get used to make more steel or build better power plants, first because that's illegal, and second because the large language model technologies’ basic vulnerability is that it’s not reliable. It's good for applications where it works 80% of the time, but not where it needs to work 99.999% of the time. This class of technology can't drive a car because it will sometimes crash the car.

So I hope it saturates there. I hope they can't fix it. I hope we get, like, a 10-year AI winter after this.

This is not what I actually predict. I think that we are hearing the last winds start to blow, the fabric of reality start to fray. This thing alone cannot end the world. But I think that probably some of the vast quantities of money being blindly and helplessly piled into here are going to end up actually accomplishing something.

Not most of the money—that just never happens in any field of human endeavor. But 1% of $10 billion is still a lot of money to actually accomplish something.


Ryan: So listeners, I think you've heard Eliezer's thesis on this, which is pretty dim with respect to AI alignment—and we'll get into what we mean by AI alignment—and very worried about AI-safety-related issues.

But I think for a lot of people to even worry about AI safety and for us to even have that conversation, I think they have to have some sort of grasp of what AGI looks like. I understand that to mean “artificial general intelligence” and this idea of a super-intelligence.

Can you tell us: if there was a superintelligence on the scene, what would it look like? I mean, is this going to look like a big chat box on the internet that we can all type things into? It's like an oracle-type thing? Or is it like some sort of a robot that is going to be constructed in a secret government lab? Is this, like, something somebody could accidentally create in a dorm room? What are we even looking for when we talk about the term “AGI” and “superintelligence”?

Eliezer: First of all, I'd say those are pretty distinct concepts. ChatGPT shows a very wide range of generality compared to the previous generations of AI. Not very wide generality compared to GPT-3—not literally the lab research that got commercialized, that's the same generation. But compared to stuff from 2018 or even 2020, ChatGPT is better at a much wider range of things without having been explicitly programmed by humans to be able to do those things.

To imitate a human as best it can, it has to capture all of the things that humans can think about that it can, which is not all the things. It's still not very good at long multiplication (unless you give it the right instructions, in which case suddenly it can do it). 

It's significantly more general than the previous generation of artificial minds. Humans were significantly more general than the previous generation of chimpanzees, or rather Australopithecus or last common ancestor.

Humans are not fully general. If humans were fully general, we'd be as good at coding as we are at football, throwing things, or running. Some of us are okay at programming, but we're not spec'd for it. We're not fully general minds.

You can imagine something that's more general than a human, and if it runs into something unfamiliar, it's like, okay, let me just go reprogram myself a bit and then I'll be as adapted to this thing as I am to anything else.

So ChatGPT is less general than a human, but it's genuinely ambiguous, I think, whether it's more or less general than (say) our cousins, the chimpanzees. Or if you don't believe it's as general as a chimpanzee, a dolphin or a cat.

Ryan: So this idea of general intelligence is sort of a range of things that it can actually do, a range of ways it can apply itself?

Eliezer: How wide is it? How much reprogramming does it need? How much retraining does it need to make it do a new thing?

Bees build hives, beavers build dams, a human will look at a beehive and imagine a honeycomb-shaped dam. That's, like, humans alone in the animal kingdom. But that doesn't mean that we are general intelligences, it means we're significantly more generally applicable intelligences than chimpanzees.

It's not like we're all that narrow. We can walk on the moon. We can walk on the moon because there's aspects of our intelligence that are made in full generality for universes that contain simplicities, regularities, things that recur over and over again. We understand that if steel is hard on Earth, it may stay hard on the moon. And because of that, we can build rockets, walk on the moon, breathe amid the vacuum.

Chimpanzees cannot do that, but that doesn't mean that humans are the most general possible things. The thing that is more general than us, that figures that stuff out faster, is the thing to be scared of if the purposes to which it turns its intelligence are not ones that we would recognize as nice things, even in the most cosmopolitan and embracing senses of what's worth doing.


Ryan: And you said this idea of a general intelligence is different than the concept of superintelligence, which I also brought into that first part of the question. How is superintelligence different than general intelligence?

Eliezer: Well, because ChatGPT has a little bit of general intelligence. Humans have more general intelligence. A superintelligence is something that can beat any human and the entire human civilization at all the cognitive tasks. I don't know if the efficient market hypothesis is something where I can rely on the entire… 

Ryan: We're all crypto investors here. We understand the efficient market hypothesis for sure.

Eliezer: So the efficient market hypothesis is of course not generally true. It's not true that literally all the market prices are smarter than you. It's not true that all the prices on earth are smarter than you. Even the most arrogant person who is at all calibrated, however, still thinks that the efficient market hypothesis is true relative to them 99.99999% of the time. They only think that they know better about one in a million prices.

They might be important prices. The price of Bitcoin is an important price. It's not just a random price. But if the efficient market hypothesis was only true to you 90% of the time, you could just pick out the 10% of the remaining prices and double your money every day on the stock market. And nobody can do that. Literally nobody can do that.
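As a rough illustration of why "double your money every day" describes something nobody can actually do, here is a minimal sketch of the compounding arithmetic. The starting stake of $1,000 and the round figure of $100 trillion (used only as a stand-in for "more money than the markets could plausibly hand over") are illustrative assumptions, not numbers from the conversation.

```python
def balance_after_doubling(start: int, days: int) -> int:
    """Balance after doubling the stake once per day for `days` days."""
    return start * 2 ** days


def days_to_reach(start: int, target: int) -> int:
    """Number of daily doublings needed for `start` to reach at least `target`."""
    days = 0
    balance = start
    while balance < target:
        balance *= 2
        days += 1
    return days


# Hypothetical inputs, chosen only for illustration.
STARTING_STAKE = 1_000            # dollars
ILLUSTRATIVE_CEILING = 10 ** 14   # $100 trillion, an assumed round number

print(balance_after_doubling(STARTING_STAKE, 30))            # 1073741824000
print(days_to_reach(STARTING_STAKE, ILLUSTRATIVE_CEILING))   # 37
```

Under these assumptions, a month of daily doubling turns $1,000 into roughly $1 trillion, and the assumed ceiling is passed in 37 days, which is the sense in which a 10% edge on prices would be absurdly exploitable.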

So this property of relative efficiency that the market has to you, that the price’s estimate of the future price already has all the information you have—not all the information that exists in principle, maybe not all the information that the best equity could, but it's efficient relative to you.

For you, if you pick out a random price, like the price of Microsoft stock, something where you've got no special advantage, that estimate of its price a week later is efficient relative to you. You can't do better than that price.

We have much less experience with the notion of instrumental efficiency, efficiency in choosing actions, because actions are harder to aggregate estimates about than prices. So you have to look at, say, AlphaZero playing chess—or just, you know, whatever the latest Stockfish number is, an advanced chess engine.

When it makes a chess move, you can't do better than that chess move. It may not be the optimal chess move, but if you pick a different chess move, you'll do worse. That you'd call a kind of efficiency of action. Given its goal of winning the game, once you know its move—unless you consult some more powerful AI than Stockfish—you can't figure out a better move than that.

A superintelligence is like that with respect to everything, with respect to all of humanity. It is relatively efficient to humanity. It has the best estimates—not perfect estimates, but the best estimates—and its estimates contain all the information that you've got about it. Its actions are the most efficient actions for accomplishing its goals. If you think you see a better way to accomplish its goals, you're mistaken.

Ryan: So you're saying [if something is a] superintelligence, we'd have to imagine something that knows all of the chess moves in advance. But here we're not talking about chess, we're talking about everything. It knows all of the moves that we would make and the most optimum pattern, including moves that we would not even know how to make, and it knows these things in advance.

I mean, how would human beings sort of experience such a superintelligence? I think we still have a very hard time imagining something smarter than us, just because we've never experienced anything like it before.

Of course, we all know somebody who's genius-level IQ, maybe quite a bit smarter than us, but we've never encountered something like what you're describing, some sort of mind that is superintelligent.

What sort of things would it be doing that humans couldn't? How would we experience this in the world?

Eliezer: I mean, we do have some tiny bit of experience with it. We have experience with chess engines, where we just can't figure out better moves than they make. We have experience with market prices, where even though your uncle has this really long, elaborate story about Microsoft stock, you just know he's wrong. Why is he wrong? Because if he was correct, it would already be incorporated into the stock price.

And especially because the market’s efficiency is not perfect, like that whole downward swing and then upward move in COVID. I have friends who made more money off that than I did, but I still managed to buy back into the broader stock market on the exact day of the low—basically coincidence. So the markets aren't perfectly efficient, but they're efficient almost everywhere.

And that sense of deference, that sense that your weird uncle can't possibly be right because the hedge funds would know it—you know, unless he's talking about COVID, in which case maybe he is right if you have the right choice of weird uncle! I have weird friends who are maybe better at calling these things than your weird uncle. So among humans, it's subtle.

And then with superintelligence, it's not subtle, just massive advantage. But not perfect. It's not that it knows every possible move you make before you make it. It's that it's got a good probability distribution about that. And it has figured out all the good moves you could make and figured out how to reply to those.

And I mean, in practice, what's that like? Well, unless it's limited, narrow superintelligence, I think you mostly don't get to observe it because you are dead, unfortunately.

Ryan: What? (laughs)

Eliezer: Like, Stockfish makes strictly better chess moves than you, but it's playing on a very narrow board. And the fact that it's better than you at chess doesn't mean it's better than you at everything. And I think that the actual catastrophe scenario for AI looks like big advancement in a research lab, maybe driven by them getting a giant venture capital investment and being able to spend 10 times as much on GPUs as they did before, maybe driven by a new algorithmic advance like transformers, maybe driven by hammering out some tweaks in last year's algorithmic advance that gets the thing to finally work efficiently. And the AI there goes over a critical threshold, which most obviously could be like, “can write the next AI”.

That's so obvious that science fiction writers figured it out almost before there were computers, possibly even before there were computers. I'm not sure what the exact dates here are. But if it's better than you at everything, it's better than you at building AIs. That snowballs. It gets an immense technological advantage. If it's smart, it doesn't announce itself. It doesn't tell you that there's a fight going on. It emails out some instructions to one of those labs that'll synthesize DNA and synthesize proteins from the DNA and get some proteins mailed to a hapless human somewhere who gets paid a bunch of money to mix together some stuff they got in the mail in a vial. Like, smart people will not do this for any sum of money. Many people are not smart. Builds the ribosome, but the ribosome that builds things out of covalently bonded diamondoid instead of proteins folding up and held together by Van der Waals forces, builds tiny diamondoid bacteria. The diamondoid bacteria replicate using atmospheric carbon, hydrogen, oxygen, nitrogen, and sunlight. And a couple of days later, everybody on earth falls over dead in the same second.

That's the disaster scenario if it's as smart as I am. If it's smarter, it might think of a better way to do things. But it can at least think of that if it's relatively efficient compared to humanity because I'm in humanity and I thought of it.

Ryan: This is—I've got a million questions, but I'm gonna let David go first.

David: Yeah. So we sped run the introduction of a number of different concepts, which I want to go back and take our time to really dive into.

There's the AI alignment problem. There's AI escape velocity. There is the question of what happens when AIs are so incredibly intelligent that humans are to AIs what ants are to us.

And so I want to kind of go back and tackle these, Eliezer, one by one.

We started this conversation talking about ChatGPT, and everyone's up in arms about ChatGPT. And you're saying like, yes, it's a great step forward in the generalizability of some of the technologies that we have in the AI world. All of a sudden ChatGPT becomes immensely more useful and it's really stoking the imaginations of people today.

But what you're saying is it's not the thing that's actually going to be the thing to reach escape velocity and create superintelligent AIs that perhaps might be able to enslave us. But my question to you is, how do we know when that—

Eliezer: Not enslave. They don't enslave you, but sorry, go on.

David: Yeah, sorry.

Ryan: Murder, David. Kill all of us. Eliezer was very clear on that.

David: So if it's not ChatGPT, how close are we? Because there's this unknown event horizon where you kind of alluded to it, where we make this AI that we train it to create a smarter AI and that smart AI is so incredibly smart that it hits escape velocity and all of a sudden these dominoes fall. How close are we to that point? And are we even capable of answering that question?

Eliezer: How the heck would I know? 

Ryan: Well, when you were talking, Eliezer, if we had already crossed that event horizon, a smart AI wouldn't necessarily broadcast that to the world. I mean, it's possible we've already crossed that event horizon, is it not?

Eliezer: I mean, it's theoretically possible, but seems very unlikely. Somebody would need inside their lab an AI that was much more advanced than the public AI technology. And as far as I currently know, the best labs and the best people are throwing their ideas to the world! Like, they don't care.

And there's probably some secret government labs with secret government AI researchers. My pretty strong guess is that they don't have the best people and that those labs could not create ChatGPT on their own because ChatGPT took a whole bunch of fine twiddling and tuning and visible access to giant GPU farms and that they don't have the people who know how to do the twiddling and tuning. This is just a guess.

AI Alignment

David: Could you walk us through—one of the big things that you spend a lot of time on is this thing called the AI alignment problem. Some people are not convinced that when we create AI, that AI won't really just be fundamentally aligned with humans. I don't believe that you fall into that camp. I think you fall into the camp of when we do create this superintelligent, generalized AI, we are going to have a hard time aligning with it in terms of our morality and our ethics.

Can you walk us through a little bit of that thought process? Why do you feel disaligned?

Ryan: The dumb way to ask that question too is like, Eliezer, why do you think that the AI automatically hates us? Why is it going to—

Eliezer: It doesn't hate you.

Ryan: Why does it want to kill us all?

Eliezer: The AI doesn't hate you, neither does it love you, and you're made of atoms that it can use for something else.

David: It's indifferent to you.

Eliezer: It's got something that it actually does care about, which makes no mention of you. And you are made of atoms that it can use for something else. That's all there is to it in the end.

The reason you're not in its utility function is that the programmers did not know how to do that. The people who built the AI, or the people who built the AI that built the AI that built the AI, did not have the technical knowledge that nobody on earth has at the moment as far as I know, whereby you can do that thing and you can control in detail what that thing ends up caring about.

David: So this feels like humanity is hurtling itself towards what we're calling, again, an event horizon where there's this AI escape velocity, and there's nothing on the other side. As in, we do not know what happens past that point as it relates to having some sort of superintelligent AI and how it might be able to manipulate the world. Would you agree with that?

Eliezer: No.

Again, the Stockfish chess-playing analogy. You cannot predict exactly what move it would make, because in order to predict exactly what move it would make, you would have to be at least that good at chess, and it's better than you.

This is true even if it's just a little better than you. Stockfish is actually enormously better than you, to the point that once it tells you the move, you can't figure out a better move without consulting a different AI. But even if it was just a bit better than you, then you're in the same position.

This kind of disparity also exists between humans. If you ask me, where will Garry Kasparov move on this chessboard? I'm like, I don't know, maybe here. Then if Garry Kasparov moves somewhere else, it doesn't mean that he's wrong, it means that I'm wrong. If I could predict exactly where Garry Kasparov would move on a chessboard, I'd be Garry Kasparov. I'd be at least that good at chess. Possibly better. I could also be able to predict him, but also see an even better move than that. 

That's an irreducible source of uncertainty with respect to superintelligence, or anything that's smarter than you. If you could predict exactly what it would do, you'd be that smart yourself. It doesn't mean you can predict no facts about it.

With Stockfish in particular, I can predict it's going to win the game. I know what it's optimizing for. I know where it's trying to steer the board. I can't predict exactly what the board will end up looking like after Stockfish has finished winning its game against me. I can predict it will be in the class of states that are winning positions for black or white or whichever color Stockfish picked, because, you know, it wins either way.

And that's similarly where I'm getting the prediction about everybody being dead, because if everybody were alive, then there'd be some state that the superintelligence preferred to that state, which is all of the atoms making up these people and their farms are being used for something else that it values more.

So if you postulate that everybody's still alive, I'm like, okay, well, why is it you're postulating that Stockfish made a stupid chess move and ended up with a non-winning board position? That's where that class of predictions come from.

Ryan: Can you reinforce this argument, though, a little bit? So, why is it that an AI can't be nice, sort of like a gentle parent to us, rather than sort of a murderer looking to deconstruct our atoms and apply for use somewhere else?

What are its goals? And why can't they be aligned to at least some of our goals? Or maybe, why can't it get into a status which is somewhat like us and the ants, which is largely we just ignore them unless they interfere in our business and come in our house and raid our cereal boxes?

Eliezer: There's a bunch of different questions there. So first of all, the space of minds is very wide. Imagine this giant sphere and all the humans are in this one tiny corner of the sphere. We're all basically the same make and model of car, running the same brand of engine. We're just all painted slightly different colors.

Somewhere in that mind space, there's things that are as nice as humans. There's things that are nicer than humans. There are things that are trustworthy and nice and kind in ways that no human can ever be. And there's even things that are so nice that they can understand the concept of leaving you alone and doing your own stuff sometimes instead of hanging around trying to be obsessively nice to you every minute and all the other famous disaster scenarios from ancient science fiction ("With Folded Hands" by Jack Williamson is the one I'm quoting there.)

We don't know how to reach into mind design space and pluck out an AI like that. It's not that they don't exist in principle. It's that we don't know how to do it. And I’ll hand back the conversational ball now and figure out, like, which next question do you want to go down there?

Ryan: Well, I mean, why? Why is it so difficult to align an AI with even our basic notions of morality?

Eliezer: I mean, I wouldn't say that it's difficult to align an AI with our basic notions of morality. I'd say that it's difficult to align an AI on a task like “take this strawberry, and make me another strawberry that's identical to this strawberry down to the cellular level, but not necessarily the atomic level”. So it looks the same under like a standard optical microscope, but maybe not a scanning electron microscope. Do that. Don't destroy the world as a side effect.

Now, this does intrinsically take a powerful AI. There's no way you can make it easy to align by making it stupid. To build something that's cellular identical to a strawberry—I mean, mostly I think the way that you do this is with very primitive nanotechnology, but we could also do it using very advanced biotechnology. And these are not technologies that we already have. So it's got to be something smart enough to develop new technology.

Never mind all the subtleties of morality. I think we don't have the technology to align an AI to the point where we can say, “Build me a copy of the strawberry and don't destroy the world.”

Why do I think that? Well, case in point, look at natural selection building humans. Natural selection mutates the humans a bit, runs another generation. The fittest ones reproduce more, their genes become more prevalent in the next generation. Natural selection hasn't really had very much time to do this to modern humans at all, but you know, the hominid line, the mammalian line, go back a few million generations. And this is an example of an optimization process building an intelligence.

And natural selection asked us for only one thing: “Make more copies of your DNA. Make your alleles more relatively prevalent in the gene pool.” Maximize your inclusive reproductive fitness—not just your own reproductive fitness, but your two brothers or eight cousins, as the joke goes, because they've got on average one copy of your genes. This is all we were optimized for, for millions of generations, creating humans from scratch, from the first accidentally self-replicating molecule.

Internally, psychologically, inside our minds, we do not know what genes are. We do not know what DNA is. We do not know what alleles are. We have no concept of inclusive genetic fitness until our scientists figure out what that even is. We don't know what we were being optimized for. For a long time, many humans thought they'd been created by God!

When you use the hill-climbing paradigm and optimize for one single extremely pure thing, this is how much of it gets inside.

In the ancestral environment, in the exact distribution that we were originally optimized for, humans did tend to end up using their intelligence to try to reproduce more. Put them into a different environment, and all the little bits and pieces and fragments of optimizing for fitness that were in us now do totally different stuff. We have sex, but we wear condoms.

If natural selection had been a foresightful, intelligent kind of engineer that was able to engineer things successfully, it would have built us to be revolted by the thought of condoms. Men would be lined up and fighting for the right to donate to sperm banks. And in our natural environment, the little drives that got into us happened to lead to more reproduction, but distributional shift: run the humans out of their distribution over which they were optimized, and you get totally different results. 

And gradient descent would by default do—not quite the same thing, it's going to do a weirder thing because natural selection has a much narrower information bottleneck. In one sense, you could say that natural selection was at an advantage because it finds simpler solutions. You could imagine some hopeful engineer who just built intelligences using gradient descent and found out that they end up wanting these thousands and millions of little tiny things, none of which were exactly what the engineer wanted, and being like, well, let's try natural selection instead. It's got a much sharper information bottleneck. It'll find the simple specification of what I want.

But we actually get there as humans. And gradient descent may well be even worse.

But more importantly, I'm just pointing out that there is no physical law, computational law, mathematical/logical law, saying when you optimize using hill-climbing on a very simple, very sharp criterion, you get a general intelligence that wants that thing.
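[Illustrative sketch, not part of the conversation. The toy below tries to make the "optimize for one thing on the outside, get a different thing on the inside" point concrete. Everything in it is an assumption made up for illustration: the `true_fitness` curve, the single "sweetness" feature, the menus, and the one-parameter drive `w`. Hill-climbing only ever sees menus whose sweetness stays below 0.4, where "prefer sweeter" and "maximize fitness" agree, so the learned drive is a preference for sweetness rather than for fitness itself.]

```python
import random

def true_fitness(sweetness):
    # Hypothetical outer criterion: sugar helps up to 0.5, then hurts.
    return sweetness * (1.0 - sweetness)

def choose(menu, w):
    # The learned inner drive: pick the item with the highest score w * sweetness.
    return max(menu, key=lambda s: w * s)

def average_fitness(menus, w):
    return sum(true_fitness(choose(menu, w)) for menu in menus) / len(menus)

def hill_climb(menus, w=0.0, step=0.1, iterations=50):
    # Simple hill-climbing: accept a neighbouring w only if it strictly improves fitness.
    best = average_fitness(menus, w)
    for _ in range(iterations):
        improved = False
        for candidate in (w + step, w - step):
            score = average_fitness(menus, candidate)
            if score > best:
                w, best, improved = candidate, score, True
                break
        if not improved:
            break
    return w

rng = random.Random(0)
# Training ("ancestral") menus: sweetness never exceeds 0.4.
ancestral = [[rng.uniform(0.0, 0.4) for _ in range(5)] for _ in range(200)]
w = hill_climb(ancestral)

# Shifted menu: one item far outside the training range.
modern = [0.1, 0.3, 0.4, 1.0]
picked = choose(modern, w)
best_available = max(modern, key=true_fitness)
print("learned drive w:", w)
print("picked:", picked, "fitness:", true_fitness(picked))
print("best available by true fitness:", best_available)
```

[On the shifted menu the learned drive picks the sweetest item even though, under the assumed fitness curve, that item is the worst one available. Nothing in the training signal distinguished "wants sweetness" from "wants fitness", which is the gap the argument above is pointing at.]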

Ryan: So just like natural selection, our tools are too blunt in order to get to that level of granularity to program in some sort of morality into these super intelligent systems?

Eliezer: Or build me a copy of a strawberry without destroying the world. Yeah. The tools are too blunt.

David: So I just want to make sure I'm following with what you were saying. I think the conclusion that you left me with is that my brain, which I consider to be at least decently smart, is actually a byproduct, an accidental byproduct of this desire to reproduce. And it's actually just like a tool that I have, and just like conscious thought is a tool, which is a useful tool in means of that end.

And so if we're applying this to AI and AI's desire to achieve some certain goal, what's the parallel there?

Eliezer: I mean, every organ in your body is a reproductive organ. If it didn't help you reproduce, you would not have an organ like that. Your brain is no exception. This is merely conventional science and merely the conventional understanding of the world. I'm not saying anything here that ought to be at all controversial. I'm sure it's controversial somewhere, but within a pre-filtered audience, it should not be at all controversial. And this is, like, the obvious thing to expect to happen with AI, because why wouldn't it? What new law of existence has been invoked, whereby this time we optimize for a thing and we get a thing that wants exactly what we optimized for on the outside?

AI Goals

Ryan: So what are the types of goals an AI might want to pursue? What types of utility functions is it going to want to pursue off the bat? Is it just those it's been programmed with, like make an identical strawberry?

Eliezer: Well, the whole thing I'm saying is that we do not know how to get goals into a system. We can cause them to do a thing inside a distribution they were optimized over using gradient descent. But if you shift them outside of that distribution, I expect other weird things start happening. When they reflect on themselves, other weird things start happening.

What kind of utility functions are in there? I mean, darned if I know. I think you'd have a pretty hard time calling the shape of humans in advance by looking at natural selection, the thing that natural selection was optimizing for, if you'd never seen a human or anything like a human.

If we optimize them from the outside to predict the next line of human text, like GPT-3—I don't actually think this line of technology leads to the end of the world, but maybe it does, in like GPT-7—there's probably a bunch of stuff in there too that desires to accurately model things like humans under a wide range of circumstances, but it's not exactly humans, because: ice cream.

Ice cream didn't exist in the natural environment, the ancestral environment, the environment of evolutionary adaptedness. There was nothing with that much sugar, salt, fat combined together as ice cream. We are not built to want ice cream. We were built to want strawberries, honey, a gazelle that you killed and cooked and had some fat in it and was therefore nourishing and gave you the all-important calories you need to survive, salt, so you didn't sweat too much and run out of salt. We evolved to want those things, but then ice cream comes along and it fits those taste buds better than anything that existed in the environment that we were optimized over.

So, a very primitive, very basic, very unreliable wild guess, but at least an informed kind of wild guess: Maybe if you train a thing really hard to predict humans, then among the things that it likes are tiny little pseudo things that meet the definition of “human” but weren't in its training data and that are much easier to predict, or where the problem of predicting them can be solved in a more satisfying way, where “satisfying” is not like human satisfaction, but some other criterion of “thoughts like this are tasty because they help you predict the humans from the training data”. (shrugs)



David: Eliezer, when we talk about all of these ideas about the ways that AI thought will be fundamentally not able to be understood by the ways that humans think, and then all of a sudden we see this rotation by venture capitalists by just pouring money into AI, do alarm bells go off in your head? Like, hey guys, you haven't thought deeply about these subject matters yet? Does the immense amount of capital going into AI investments scare you?

Eliezer: I mean, alarm bells went off for me in 2015, which is when it became obvious that this is how it was going to go down. I sure am now seeing the realization of that stuff I felt alarmed about back then.

Ryan: Eliezer, is this view that AI is incredibly dangerous and that AGI is going to eventually end humanity and that we're just careening toward a precipice, would you say this is the consensus view now, or are you still somewhat of an outlier? And why aren't other smart people in this field as alarmed as you? Can you steel-man their arguments?

Eliezer: You're asking, again, several questions sequentially there. Is it the consensus view? No. Do I think that the people in the wider scientific field who dispute this point of view—do I think they understand it? Do I think they've done anything like an impressive job of arguing against it at all? No.

If you look at the famous prestigious scientists who sometimes make a little fun of this view in passing, they're making up arguments rather than deeply considering things that are held to any standard of rigor, and people outside their own fields are able to validly shoot them down.

I have no idea how to pronounce his last name. François Chollet said something about, I forget his exact words, but it was something like, I never hear any good arguments for stuff. I was like, okay, here's some good arguments for stuff. You can read the reply from Yudkowsky to Chollet and Google that, and that'll give you some idea of what the eminent voices versus the reply to the eminent voices sound like. And Scott Aaronson, who at the time was off on complexity theory, he was like, “That's not how no free lunch theorems work”, correctly.

I think the state of affairs is we have eminent scientific voices making fun of this possibility, but not engaging with the arguments for it. 

Now, if you step away from the eminent scientific voices, you can find people who are more familiar with all the arguments and disagree with me. And I think they lack security mindset. I think that they're engaging in the sort of blind optimism that many, many scientific fields throughout history have engaged in, where when you're approaching something for the first time, you don't know why it will be hard, and you imagine easy ways to do things. And the way that this is supposed to naturally play out over the history of a scientific field is that you run out and you try to do the things and they don't work, and you go back and you try to do other clever things and they don't work either, and you learn some pessimism and you start to understand the reasons why the problem is hard.

The field of artificial intelligence itself recapitulated this very common ontogeny of a scientific field, where initially we had people getting together at the Dartmouth conference. I forget what their exact famous phrasing was, but it's something like, “We are wanting to address the problem of getting AIs to, you know, like understand language, improve themselves”, and I forget even what else was there. A list of what now sound like grand challenges. “And we think we can make substantial progress on this using 10 researchers for two months.” And I think that at the core is what's going on. 

They have not run into the actual problems of alignment. They aren't trying to get ahead of the game. They're not trying to panic early. They're waiting for reality to hit them over the head and turn them into grizzled old cynics of their scientific field who understand the reasons why things are hard. They're content with the predictable life cycle of starting out as bright-eyed youngsters, waiting for reality to hit them over the head with the news. And if it wasn't going to kill everybody the first time that they're really wrong, it'd be fine! You know, this is how science works! If we got unlimited free retries and 50 years to solve everything, it'd be okay. We could figure out how to align AI in 50 years given unlimited retries.

You know, the first team in with the bright-eyed optimists would destroy the world and people would go, oh, well, you know, it's not that easy. They would try something else clever. That would destroy the world. People would go like, oh, well, you know, maybe this field is actually hard. Maybe this is actually one of the thorny things like computer security or something. And so what exactly went wrong last time? Why didn't these hopeful ideas play out? Oh, like you optimize for one thing on the outside and you get a different thing on the inside. Wow. That's really basic. All right. Can we even do this using gradient descent? Can you even build this thing out of giant inscrutable matrices of floating point numbers that nobody understands at all? You know, maybe we need different methodology. And 50 years later, you'd have an aligned AGI.

If we got unlimited free retries without destroying the world, it'd be, you know, it'd play out the same way that ChatGPT played out. It's, you know, from 1956 or 1955 or whatever it was to 2023. So, you know, about 70 years, give or take a few. And, you know, just like we can do the stuff that they wanted to do in the summer of 1955, you know, 70 years later, you'd have your aligned AGI.

Problem is that the world got destroyed in the meanwhile. And that's why, you know, that's the problem there.

God Mode and Aliens

David: So this feels like a gigantic Don't Look Up scenario. If you're familiar with that movie, it's a movie about this asteroid hurtling to Earth, but it becomes popular and in vogue to not look up and not notice it. And Eliezer, you're the guy who's saying like, hey, there's an asteroid. We have to do something about it. And if we don't, it's going to come destroy us.

If you had God mode over the progress of AI research and just innovation and development, what choices would you make that humans are not currently making today?

Eliezer: I mean, I could say something like shut down all the large GPU clusters. How long do I have God mode? Do I get to like stick around for seventy years?

David: You have God mode for the 2020 decade.

Eliezer: For the 2020 decade. All right. That does make it pretty hard to do things.

I think I shut down all the GPU clusters and get all of the famous scientists and brilliant, talented youngsters—the vast, vast majority of whom are not going to be productive and where government bureaucrats are not going to be able to tell who's actually being helpful or not, but, you know—put them all on a large island, and try to figure out some system for filtering the stuff through to me to give thumbs up or thumbs down on that is going to work better than scientific bureaucrats producing entire nonsense.

Because, you know, the trouble is—the reason why scientific fields have to go through this long process to produce the cynical oldsters who know that everything is difficult. It's not that the youngsters are stupid. You know, sometimes youngsters are fairly smart. You know, Marvin Minsky, John McCarthy back in 1955, they weren't idiots. You know, privileged to have met both of them. They didn't strike me as idiots. They were very old, and they still weren't idiots. But, you know, it's hard to see what's coming in advance of experimental evidence hitting you over the head with it.

And if I only have the decade of the 2020s to run all the researchers on this giant island somewhere, it's really not a lot of time. Mostly what you've got to do is invent some entirely new AI paradigm that isn't the giant inscrutable matrices of floating point numbers on gradient descent. Because I'm not really seeing what you can do that's clever with that, that doesn't kill you and that you know doesn't kill you and doesn't kill you the very first time you try to do something clever like that.

You know, I'm sure there's a way to do it. And if you got to try over and over again, you could find it.

Ryan: Eliezer, do you think every intelligent civilization has to deal with this exact problem that humanity is dealing with now? Of how do we solve this problem of aligning with an advanced general intelligence?

Eliezer: I expect that's much easier for some alien species than others. Like, there are alien species who might arrive at “this problem” in an entirely different way. Maybe instead of having two entirely different information processing systems, the DNA and the neurons, they've only got one system. They can trade memories around heritably by swapping blood sexually. Maybe the way in which they “confront this problem” is that very early in their evolutionary history, they have the equivalent of the DNA that stores memories and processes, computes memories, and they swap around a bunch of it, and it adds up to something that reflects on itself and makes itself coherent, and then you've got a superintelligence before they have invented computers. And maybe that thing wasn't aligned, but how do you even align it when you're in that kind of situation? It'd be a very different angle on the problem.

Ryan: Do you think every advanced civilization is on the trajectory to creating a superintelligence at some point in its history?

Eliezer: Maybe there's ones in universes with alternate physics where you just can't do that. Their universe's computational physics just doesn't support that much computation. Maybe they never get there. Maybe their lifespans are long enough and their star lifespans short enough that they never get to the point of a technological civilization before their star does the equivalent of expanding or exploding or going out and their planet ends.

“Every alien species” covers a lot of territory, especially if you talk about alien species in universes with physics different from this one.

Ryan: Well, talking about our present universe, I'm curious if you've been confronted with the question of, well, then why haven't we seen some sort of superintelligence in our universe when we look out at the stars? Sort of the Fermi paradox type of question. Do you have any explanation for that?

Eliezer: Oh, well, supposing that they got killed by their own AIs doesn't help at all with that because then we'd see the AIs.

Ryan: And do you think that's what happens? Yeah, it doesn't help with that. We would see evidence of AIs, wouldn't we?

Eliezer: Yeah.

Ryan: Yes. So why don't we?

Eliezer: I mean, the same reason we don't see evidence of the alien civilizations not with AIs.

And that reason is, although it doesn't really have much to do with the whole AI thesis one way or another, because they're too far away—or so says Robin Hanson, using a very clever argument about the apparent difficulty of hard steps in humanity's evolutionary history to further induce the rough gap between the hard steps. ... And, you know, I can't really do justice to this. If you look up grabby aliens, you can...

Ryan: Grabby aliens?

David: I remember this.

Eliezer: Grabby aliens. You can find Robin Hanson's very clever argument for how far away the aliens are...

Ryan: There's an entire website, Bankless listeners, there's an entire website you can go look at.

Eliezer: Yeah. And that contains by far the best answer I've seen, to:

  • “Where are they?” (Answer: too far away for us to see, even if they're traveling here at nearly light speed.)
  • How far away are they?
  • And how do we know that?

(laughs) But, yeah.

Ryan: This is amazing.

Eliezer: There is not a very good way to simplify the argument, any more than there is to simplify the notion of zero-knowledge proofs. It's not that difficult, but it's just very not easy to simplify. But if you have a bunch of locks that are all of different difficulties, and a limited time in which to solve all the locks, such that anybody who gets through all the locks must have gotten through them by luck, all the locks will take around the same amount of time to solve, even if they're all of very different difficulties. And that's the core of Robin Hanson's argument for how far away the aliens are, and how do we know that? (shrugs)
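[Illustrative sketch, not part of the conversation. The lock claim can be checked with a small Monte Carlo simulation. The setup is an assumption chosen for illustration: two locks whose solving times are exponentially distributed with very different average times (5 and 50 units), and a window of 1 unit in which both must be solved. The function name `simulate_locks` and all parameter values are hypothetical. We keep only the rare lucky runs that finish inside the window and then compare how long each lock took in those runs.]

```python
import random

def simulate_locks(mean_times, window, trials, seed=0):
    """Rejection-sample runs in which every lock is solved inside the window."""
    rng = random.Random(seed)
    totals = [0.0] * len(mean_times)
    successes = 0
    for _ in range(trials):
        # Assumption: each lock's solving time is exponential with its own mean.
        times = [rng.expovariate(1.0 / m) for m in mean_times]
        if sum(times) <= window:
            successes += 1
            for i, t in enumerate(times):
                totals[i] += t
    if successes == 0:
        return 0, []
    # Average time spent on each lock, counting only the lucky runs.
    return successes, [total / successes for total in totals]

successes, conditional_means = simulate_locks([5.0, 50.0], window=1.0, trials=400_000)
print("lucky runs:", successes)
print("average time per lock in lucky runs:", conditional_means)
```

[Although one lock is ten times harder than the other on average, in the runs that succeed the two locks should take roughly the same amount of time, each around a third of the window in this two-lock setup. Conditioning on success erases most of the information about how hard each lock was, which is the property the argument leans on.]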

Good Outcomes

Ryan: Eliezer, I know you're very skeptical that there will be a good outcome when we produce an artificial general intelligence. And I said when, not if, because I believe that's your thesis as well, of course. But is there the possibility of a good outcome? I know you are working on AI alignment problems, which leads me to believe that you have greater than zero amount of hope for this project. Is there the possibility of a good outcome? What would that look like, and how do we go about achieving it?

Eliezer: It looks like me being wrong. I basically don't see on-model hopeful outcomes at this point. We have not done those things that it would take to earn a good outcome, and this is not a case where you get a good outcome by accident.

If you have a bunch of people putting together a new operating system, and they've heard about computer security, but they're skeptical that it's really that hard, the chance of them producing a secure operating system is effectively zero.

That's basically the situation I see ourselves in with respect to AI alignment. I have to be wrong about something—which I certainly am. I have to be wrong about something in a way that makes the problem easier rather than harder for those people who don't think that alignment's going to be all that hard.

If you're building a rocket for the first time ever, and you're wrong about something, it's not surprising if you're wrong about something. It's surprising if the thing that you're wrong about causes the rocket to go twice as high on half the fuel you thought was required and be much easier to steer than you were afraid of.

Ryan: So, are you...

David: Where the alternative was, “If you’re wrong about something, the rocket blows up.”

Eliezer: Yeah. And then the rocket ignites the atmosphere, is the problem there.

Or rather: a bunch of rockets blow up, a bunch of rockets go places... The analogy I usually use for this is, very early on in the Manhattan Project, they were worried about “What if the nuclear weapons can ignite fusion in the nitrogen in the atmosphere?” And they ran some calculations and decided that it was incredibly unlikely from multiple angles, so they went ahead, and were correct. We're still here. I'm not going to say that it was luck, because the calculations were actually pretty solid.

An AI is like that, but instead of needing to refine plutonium, you can make nuclear weapons out of a billion tons of laundry detergent. The stuff to make them is fairly widespread. It's not a tightly controlled substance. And they spit out gold up until they get large enough, and then they ignite the atmosphere, and you can't calculate how large is large enough. And a bunch of the CEOs running these projects are making fun of the idea that it'll ignite the atmosphere.

It's not a very helpful situation.

David: So the economic incentive to produce this AI—one of the things why ChatGPT has sparked the imaginations of so many people is that everyone can imagine products. Products are being imagined left and right about what you can do with something like ChatGPT. There's this meme at this point of people leaving to go start their ChatGPT startup.

The metaphor is that what you're saying is that there's this generally available resource spread all around the world, which is ChatGPT, and everyone's hammering it in order to make it spit out gold. But you're saying if we do that too much, all of a sudden the system will ignite the whole entire sky, and then we will all...

Eliezer: Well, no. You can run ChatGPT any number of times without igniting the atmosphere. That's about what research labs at Google and Microsoft—counting DeepMind as part of Google and counting OpenAI as part of Microsoft—that's about what the research labs are doing, bringing more metaphorical plutonium together than ever before. Not about how many times you run the things that have been built and not destroyed the world yet.

You can do any amount of stuff with ChatGPT and not destroy the world. It's not that smart. It doesn't get smarter every time you run it.


Ryan's Childhood Questions

Ryan: Can I ask some questions that the 10-year-old in me wants to really ask about this? I'm asking these questions because I think a lot of listeners might be thinking them too, so knock off some of these easy answers for me.

If we create some sort of unaligned, let's call it “bad” AI, why can't we just create a whole bunch of good AIs to go fight the bad AIs and solve the problem that way? Can there not be some sort of counterbalance in terms of aligned human AIs and evil AIs, and there be some sort of battle of the artificial minds here?

Eliezer: Nobody knows how to create any good AIs at all. The problem isn't that we have 20 good AIs and then somebody finally builds an evil AI. The problem is that the first very powerful AI is evil, nobody knows how to make it good, and then it kills everybody before anybody can make it good.

Ryan: So there is no known way to make a friendly, human-aligned AI whatsoever, and you don't know of a good way to go about thinking through that problem and designing one. Neither does anyone else, is what you're telling us.

Eliezer: I have some idea of what I would do if there were more time. Back in the day, we had more time. Humanity squandered it. I'm not sure there's enough time left now. I have some idea of what I would do if I were in a 25-year-old body and had $10 billion.

Ryan: That would be the island scenario of “You're God for 10 years and you get all the researchers on an island and go really hammer for 10 years at this problem”?

Eliezer: If I have buy-in from a major government that can run actual security precautions and more than just $10 billion, then you could run a whole Manhattan Project about it, sure.

Ryan: This is another question that the 10-year-old in me wants to know. Why is it that, Eliezer, people listening to this episode or people listening to the concerns or reading the concerns that you've written down and published, why can't everyone get on board who's building an AI and just all agree to be very, very careful? Is that not a sustainable game-theoretic position to have? Is this a coordination problem, more of a social problem than anything else? Or, like, why can't that happen?

I mean, we have so far not destroyed the world with nuclear weapons, and we've had them since the 1940s.

Eliezer: Yeah, this is harder than nuclear weapons. This is a lot harder than nuclear weapons.

Ryan: Why is this harder? And why can't we just coordinate to just all agree internationally that we're going to be very careful, put restrictions on this, put regulations on it, do something like that?

Eliezer: Current heads of major labs seem to me to be openly contemptuous of these issues. That's where we're starting from. The politicians do not understand it.

There are distortions of these ideas that are going to sound more appealing to them than “everybody suddenly falls over dead”, which is a thing that I think actually happens. “Everybody falls over dead” just doesn't inspire the monkey political parts of our brain somehow. Because it's not like, “Oh no, what if terrorists get the AI first?” It's like, it doesn't matter who gets it first. Everybody falls over dead.

And yeah, so you're describing a world coordinating on something that is relatively hard to coordinate. So, could we, if we tried starting today, prevent anyone from getting a billion pounds of laundry detergent in one place worldwide, control the manufacturing of laundry detergent, only have it manufactured in particular places, not concentrate lots of it together, enforce it on every country?

Y’know, if it was legible, if it was clear that a billion pounds of laundry detergent in one place would end the world, if you could calculate that, if all the scientists calculated it, arrived at the same answer, and told the politicians, then maybe, maybe humanity would survive, even though smaller amounts of laundry detergent spit out gold.

The threshold can't be calculated. I don't know how you'd convince the politicians. We definitely don't seem to have had much luck convincing those CEOs whose job depends on them not caring, to care. Caring is easy to fake. It's easy to hire a bunch of people to be your “AI safety team” and redefine “AI safety” as having the AI not say naughty words. Or, you know, I'm speaking somewhat metaphorically here for reasons.

But, you know, it's the basic problem that we have is like trying to build a secure OS before we run up against a really smart attacker. And there's all kinds of, like, fake security. “It's got a password file! This system is secure! It only lets you in if you type a password!” And if you never go up against a really smart attacker, if you never go far out of distribution against a powerful optimization process looking for holes, you know, then how does a bureaucracy come to know that what they're doing is not the level of computer security that they need? The way you're supposed to find this out, the way that scientific fields historically find this out, the way that fields of computer science historically find this out, the way that crypto found this out back in the early days, is by having the disaster happen! 

And we're not even that good at learning from relatively minor disasters! You know, like, COVID swept the world. Did the FDA or the CDC learn anything about “Don't tell hospitals that they're not allowed to use their own tests to detect the coming plague”? Are we installing UV-C lights in public spaces or in ventilation systems to prevent the next respiratory pandemic? You know, we lost a million people and we sure did not learn very much as far as I can tell for next time.

We could have an AI disaster that kills a hundred thousand people—how do you even do that? Robotic cars crashing into each other? Have a bunch of robotic cars crashing into each other! It's not going to look like that was the fault of artificial general intelligence because they're not going to put AGIs in charge of cars. They're going to pass a bunch of regulations that's going to affect the entire AGI disaster or not at all.

What does the winning world even look like here? How in real life did we get from where we are now to this worldwide ban, including against North Korea and, you know, some one rogue nation whose dictator doesn't believe in all this nonsense and just wants the gold that these AIs spit out? How did we get there from here? How do we get to the point where the United States and China signed a treaty whereby they would both use nuclear weapons against Russia if Russia built a GPU cluster that was too large? How did we get there from here?

David: Correct me if I'm wrong, but this seems to be kind of just like a topic of despair? I'm talking to you now and hearing your thought process about, like, there is no known solution and the trajectory's not great. Do you think all hope is lost here?

Eliezer: I'll keep on fighting until the end, which I wouldn't do if I had literally zero hope. I could still be wrong about something in a way that makes this problem somehow much easier than it currently looks. I think that's how you go down fighting with dignity.

Ryan: “Go down fighting with dignity.” That's the stage you think we're at.

I want to just double-click on what you were just saying. Part of the case that you're making is humanity won't even see this coming. So it's not like a coordination problem like global warming where every couple of decades we see the world go up by a couple of degrees, things get hotter, and we start to see these effects over time. The characteristics or the advent of an AGI in your mind is going to happen incredibly quickly, and in such a way that we won't even see the disaster until it's imminent, until it's upon us...?

Eliezer: I mean, if you want some kind of, like, formal phrasing, then I think that superintelligence will kill everyone before non-superintelligent AIs have killed one million people. I don't know if that's the phrasing you're looking for there.

Ryan: I think that's a fairly precise definition, and why? What goes into that line of thought?

Eliezer: I think that the current systems are actually very weak. I don't know, maybe I could use the analogy of Go, where you had systems that were finally competitive with the pros, where “pro” is like the set of ranks in Go, and then a year later, they were challenging the world champion and winning. And then another year, they threw out all the complexities and the training from human databases of Go games and built a new system, AlphaGo Zero, that trained itself from scratch. No looking at the human playbooks, no special-purpose code, just a general purpose game-player being specialized to Go, more or less.

And, three days—there's a quote from Gwern about this, which I forget exactly, but it was something like, “We know how long AlphaGo Zero, or AlphaZero (two different systems), was equivalent to a human Go player. And it was, like, 30 minutes on the following floor of such-and-such DeepMind building.”

Maybe the first system doesn't improve that quickly, and they build another system that does—And all of that with AlphaGo over the course of years, going from “it takes a long time to train” to “it trains very quickly and without looking at the human playbook”, that’s not with an artificial intelligence system that improves itself, or even that gets smarter as you run it, the way that human beings (not just as you evolve them, but as you run them over the course of their own lifetimes) improve.

So if the first system doesn't improve fast enough to kill everyone very quickly, they will build one that's meant to spit out more gold than that.

And there could be weird things that happen before the end. I did not see ChatGPT coming, I did not see Stable Diffusion coming, I did not expect that we would have AIs smoking humans in rap battles before the end of the world. Ones that are clearly much dumber than us.

Ryan: It’s kind of a nice send-off, I guess, in some ways.

Trying to Resist

Ryan: So you said that your hope is not zero, and you are planning to fight to the end. What does that look like for you? I know you're working at MIRI, which is the Machine Intelligence Research Institute. This is a non-profit that I believe that you've set up to work on these AI alignment and safety issues. What are you doing there? What are you spending your time on? How do we actually fight until the end? If you do think that an end is coming, how do we try to resist?

Eliezer: I'm actually on something of a sabbatical right now, which is why I have time for podcasts. It's a sabbatical from, you know, like, been doing this 20 years. It became clear we were all going to die. I felt kind of burned out, taking some time to rest at the moment. When I dive back into the pool, I don't know, maybe I will go off to Conjecture or Anthropic or one of the smaller concerns like Redwood Research—Redwood Research being the only ones I really trust at this point, but they're tiny—and try to figure out if I can see anything clever to do with the giant inscrutable matrices of floating point numbers.

Maybe I just write, continue to try to explain in advance to people why this problem is hard instead of as easy and cheerful as the current people who think they're pessimists think it will be. I might not be working all that hard compared to how I used to work. I'm older than I was. My body is not in the greatest of health these days. Going down fighting doesn't necessarily imply that I have the stamina to fight all that hard. I wish I had prettier things to say to you here, but I do not.

Ryan: No, this is... We intended to save probably the last part of this episode to talk about crypto, the metaverse, and AI and how this all intersects. But I gotta say, at this point in the episode, it all kind of feels pointless to go down that track.

We were going to ask questions like, well, in crypto, should we be worried about building sort of a property rights system, an economic system, a programmable money system for the AIs to sort of use against us later on? But it sounds like the easy answer from you to those questions would be, yeah, absolutely. And by the way, none of that matters regardless. You could do whatever you'd like with crypto. This is going to be the inevitable outcome no matter what.

Let me ask you, what would you say to somebody listening who maybe has been sobered up by this conversation? If a version of you in your 20s does have the stamina to continue this battle and to actually fight on behalf of humanity against this existential threat, where would you advise them to spend their time? Is this a technical issue? Is this a social issue? Is it a combination of both? Should they educate? Should they spend time in the lab? What should a person listening to this episode do with these types of dire straits?

Eliezer: I don't have really good answers. It depends on what your talents are. If you've got the very deep version of the security mindset, the part where you don't just put a password on your system so that nobody can walk in and directly misuse it, but the kind where you don't just encrypt the password file even though nobody's supposed to have access to the password file in the first place, and that's already an authorized user, but the part where you hash the passwords and salt the hashes. If you're the kind of person who can think of that from scratch, maybe try your hand at alignment.

If you can think of an alternative to the giant inscrutable matrices, then, you know, don't tell the world about that. I'm not quite sure where you go from there, but maybe you work with Redwood Research or something.

A whole lot of this problem is that even if you do build an AI that's limited in some way, somebody else steals it, copies it, runs it themselves, and takes the bounds off the for loops and the world ends. 

So there's that. You think you can do something clever with the giant inscrutable matrices? You're probably wrong. If you have the talent to try to figure out why you're wrong in advance of being hit over the head with it, and not in a way where you just make random far-fetched stuff up as the reason why it won't work, but where you can actually keep looking for the reason why it won't work...

We have people in crypto[graphy] who are good at breaking things, and they're the reason why anything is not on fire. Some of them might go into breaking AI systems instead, because that's where you learn anything.

You know: Any fool can build a crypto[graphy] system that they think will work. Breaking existing cryptographical systems is how we learn who the real experts are. So maybe the people finding weird stuff to do with AIs, maybe those people will come up with some truth about these systems that makes them easier to align than I suspect.

How do I put it... The saner outfits do have uses for money. They don't really have scalable uses for money, but they do burn any money literally at all. Like, if you gave MIRI a billion dollars, I would not know how to...

Well, at a billion dollars, I might try to bribe people to move out of AI development, that gets broadcast to the whole world, and move to the equivalent of an island somewhere—not even to make any kind of critical discovery, but just to remove them from the system. If I had a billion dollars.

If I just have another $50 million, I'm not quite sure what to do with that, but if you donate that to MIRI, then you at least have the assurance that we will not randomly spray money on looking like we're doing stuff and we'll reserve it, as we are doing with the last giant crypto donation somebody gave us, until we can figure out something to do with it that is actually helpful. And MIRI has that property. I would say probably Redwood Research has that property.

Yeah. I realize I'm sounding sort of disorganized here, and that's because I don't really have a good organized answer to how in general somebody goes down fighting with dignity.

MIRI and Education

Ryan: I know a lot of people in crypto. They are not as in touch with artificial intelligence, obviously, as you are, and the AI safety issues and the existential threat that you've presented in this episode. They do care a lot and see coordination problems throughout society as an issue. Many have also generated wealth from crypto, and care very much about humanity not ending. What sort of things has MIRI, the organization I was talking about earlier, done with funds that you've received from crypto donors and elsewhere? And what sort of things might an organization like that pursue to try to stave this off?

Eliezer: I mean, I think mostly we've pursued a lot of lines of research that haven't really panned out, which is a respectable thing to do. We did not know in advance that those lines of research would fail to pan out. If you're doing research that you know will work, you're probably not really doing any research. You're just doing a pretense of research that you can show off to a funding agency.

We try to be real. We did things where we didn't know the answer in advance. They didn't work, but that was where the hope lay, I think. But, you know, having a research organization that keeps it real that way, that's not an easy thing to do. And if you don't have this very deep form of the security mindset, you will end up producing fake research and doing more harm than good, so I would not tell all the successful cryptocurrency people to run off and start their own research outfits.

Redwood Research—I'm not sure if they can scale using more money, but you can give people more money and wait for them to figure out how to scale it later if they're the kind who won't just run off and spend it, which is what MIRI aspires to be.

Ryan: And you don't think the education path is a useful path? Just educating the world?

Eliezer: I mean, I would give myself and MIRI credit for why the world isn't just walking blindly into the whirling razor blades here, but it's not clear to me how far education scales apart from that. You can get more people aware that we're walking directly into the whirling razor blades, because even if only 10% of the people can get it, that can still be a bunch of people. But then what do they do? I don't know. Maybe they'll be able to do something later.

Can you get all the people? Can you get all the politicians? Can you get the people whose job incentives are against them admitting this to be a problem? I have various friends who report, like, “Ah yes, if you talk to researchers at OpenAI in private, they are very worried and say that they cannot be that worried in public.”


How Long Do We Have?

Ryan: This is all a giant Moloch trap, is sort of what you're telling us. I feel like this is the part of the conversation where we've gotten to the end and the doctor has said that we have some sort of terminal illness. And at the end of the conversation, I think the patient, David and I, have to ask the question, “Okay, doc, how long do we have?” Seriously, what are we talking about here if you turn out to be correct? Are we talking about years? Are we talking about decades? What's your idea here?

David: What are you preparing for, yeah?

Eliezer: How the hell would I know? Enrico Fermi was saying that fission chain reactions were 50 years off if they could ever be done at all, two years before he built the first nuclear pile. The Wright brothers were saying heavier-than-air flight was 50 years off shortly before they built the first Wright flyer. How on earth would I know?

It could be three years. It could be 15 years. We could get that AI winter I was hoping for, and it could be 16 years. I'm not really seeing 50 without some kind of giant civilizational catastrophe. And to be clear, whatever civilization arises after that would probably, I'm guessing, end up stuck in just the same trap we are.

Ryan: I think the other thing that the patient might do at the end of a conversation like this is to also consult with other doctors. I'm kind of curious who we should talk to on this quest. Who are some people that if people in crypto want to hear more about this or learn more about this, or even we ourselves as podcasters and educators want to pursue this topic, who are the other individuals in the AI alignment and safety space you might recommend for us to have a conversation with?

Eliezer: Well, the person who actually holds a coherent technical view, who disagrees with me, is named Paul Christiano. He does not write Harry Potter fan fiction, and I expect him to have a harder time explaining himself in concrete terms. But that is the main technical voice of opposition. If you talk to other people in the effective altruism or AI alignment communities who disagree with this view, they are probably to some extent repeating back their misunderstandings of Paul Christiano's views. 

You could try Ajeya Cotra, who's worked pretty directly with Paul Christiano and I think sometimes aspires to explain these things that Paul is not the best at explaining. I'll throw out Kelsey Piper as somebody who would be good at explaining—like, would not claim to be a technical person on these issues, but is good at explaining the part that she does know. 

Who else disagrees with me? I'm sure Robin Hanson would be happy to come on... well, I'm not sure he'd be happy to come on this podcast, but Robin Hanson disagrees with me, and I kind of feel like the famous argument we had back in the early 2010s, late 2000s about how this would all play out—I basically feel like this was the Yudkowsky position, this is the Hanson position, and then reality was over here, well to the Yudkowsky side of the Yudkowsky position in the Yudkowsky-Hanson debate. But Robin Hanson does not feel that way, and would probably be happy to expound on that at length. 

I don't know. It's not hard to find opposing viewpoints. The ones that'll stand up to a few solid minutes of cross-examination from somebody who knows which parts to cross-examine, that's the hard part.

Bearish Hope

Ryan: You know, I've read a lot of your writings and listened to you on previous podcasts. One was in 2018 on the Sam Harris podcast. This conversation feels to me like the most dire you've ever seemed on this topic. And maybe that's not true. Maybe you've sort of always been this way, but it seems like the direction of your hope that we solve this issue has declined. I'm wondering if you feel like that's the case, and if you could sort of summarize your take on all of this as we close out this episode and offer, I guess, any concluding thoughts here.

Eliezer: I mean, I don't know if you've got a time limit on this episode? Or is it just as long as it runs?

Ryan: It's as long as it needs to be, and I feel like this is a pretty important topic. So you answer this however you want.

Eliezer: Alright. Well, there was a conference one time on “What are we going to do about looming risk of AI disaster?”, and Elon Musk attended that conference. And I was like: Maybe this is it. Maybe this is when the powerful people notice, and it's one of the relatively more technical powerful people who could be noticing this. And maybe this is where humanity finally turns and starts... not quite fighting back, because there isn't an external enemy here, but conducting itself with... I don't know. Acting like it cares, maybe?

And what came out of that conference, well, was OpenAI, which was fairly nearly the worst possible way of doing anything. This is not a problem of “Oh no, what if secret elites get AI?” It's that nobody knows how to build the thing. If we do have an alignment technique, it's going to involve running the AI with a bunch of careful bounds on it where you don't just throw all the cognitive power you have at something. You have limits on the for loops. 

And whatever it is that could possibly save the world, like go out and turn all the GPUs and the server clusters into Rubik's cubes or something else that prevents the world from ending when somebody else builds another AI a few weeks later—anything that could do that is an artifact where somebody else could take it and take the bounds off the for loops and use it to destroy the world.

So let's open up everything! Let's accelerate everything! It was like GPT-3's version, though GPT-3 didn't exist back then—but it was like ChatGPT's blind version of throwing the ideals at a place where they were exactly the wrong ideals to solve the problem.

And the problem is that demon summoning is easy and angel summoning is much harder. Open sourcing all the demon summoning circles is not the correct solution. And I'm using Elon Musk's own terminology here. He talked about AI as “summoning the demon”, which, not accurate, but—and then the solution was to put a demon summoning circle in every household. 

And, why? Because his friends were calling him a Luddite once he'd expressed any concern about AI at all. So he picked a road that sounded like “openness” and “accelerating technology”, so his friends would stop calling him a “Luddite”.

It was very much the worst—you know, maybe not the literal, actual worst possible strategy, but so very far pessimal.

And that was it.

That was like... that was me in 2015 going like, “Oh. So this is what humanity will elect to do. We will not rise above. We will not have more grace, not even here at the very end.”

So that is, you know, that is when I did my crying late at night and then picked myself up and fought and fought and fought until I had run out all the avenues that I seem to have the capabilities to do. There's, like, more things, but they require scaling my efforts in a way that I've never been able to make them scale. And all of it's pretty far-fetched at this point anyways.

So, you know, that... so what's, you know, what's changed over the years? Well, first of all, I ran out some remaining avenues of hope. And second, things got to be such a disaster, such a visible disaster, the AIs got powerful enough and it became clear enough that, you know, we do not know how to align these things, that I could actually say what I've been thinking for a while and not just have people go completely, like, “What are you saying about all this?”

You know, now the stuff that was obvious back in 2015 is, you know, starting to become visible in the distance to others and not just completely invisible. That's what changed over time.

The End Goal

Ryan: What kind of... What do you hope people hear out of this episode and out of your comments? Eliezer in 2023, who is sort of running on the last fumes of hope. Yeah, what do you want people to get out of this episode? What are you planning to do?

Eliezer: I don't have concrete hopes here. You know, when everything is in ruins, you might as well speak the truth, right? Maybe somebody hears it, somebody figures out something I didn't think of.

I mostly expect that this does more harm than good in the modal universe, because a bunch of people are like, “Oh, I have this brilliant, clever idea,” which is, you know, something that I was arguing against in 2003 or whatever, but you know, maybe somebody out there with the proper level of pessimism hears and thinks of something I didn't think of.

I suspect that if there's hope at all, it comes from a technical solution, because the difference between technical problems and political problems is at least the technical problems have solutions in principle. At least the technical problems are solvable. We're not on course to solve this one, but I think anybody who's hoping for a political solution has frankly not understood the technical problem. 

They do not understand what it looks like to try to solve the political problem to such a degree that the world is not controlled by AI because they don't understand how easy it is to destroy the world with AI, given that the clock keeps ticking forward.

They're thinking that they just have to stop some bad actor, and that's why they think there's a political solution.

But yeah, I don't have concrete hopes. I didn't come on this episode out of any concrete hope.

I have no takeaways except, like, don't make this thing worse.

Don't, like, go off and accelerate AI more. Don't... if you have a brilliant solution to alignment, don't be like, “Ah yes, I have solved the whole problem. We just use the following clever trick.”

You know, “Don't make things worse” isn’t very much of a message, especially when you're pointing people at the field at all. But I have no winning strategy. Might as well go on this podcast as an experiment and say what I think and see what happens. And probably no good ever comes of it, but you might as well go down fighting, right?

If there's a world that survives, maybe it's a world that survives because of a bright idea somebody had after listening to this podcast (one that was brighter, to be clear, than the usual run of bright ideas that don't work).

Ryan: Eliezer, I want to thank you for coming on and talking to us today. I do.

I don't know if, by the way, you've seen that movie that David was referencing earlier, the movie Don’t Look Up, but I sort of feel like that news anchor, who's talking to the scientist—is it Leonardo DiCaprio, David? And, uh, the scientist is talking about kind of dire straits for the world. And the news anchor just really doesn't know what to do. I'm almost at a loss for words at this point.

David: I've had nothing for a while now.

Ryan: But one thing I can say is I appreciate your honesty. I appreciate that you've given this a lot of time and given this a lot of thought. Everyone, anyone who has heard you speak or read anything you've written knows that you care deeply about this issue and have given it a tremendous amount of your life force, in trying to educate people about it.

And, um, thanks for taking the time to do that again today. I'll—I guess I'll just let the audience digest this episode in the best way they know how. But, um, I want to reflect everybody in crypto and everybody listening to Bankless—their thanks for you coming on and explaining.

Eliezer: Thanks for having me. We'll see what comes of it.

Ryan: Action items for you, Bankless nation. We always end with some action items. Not really sure where to refer folks to today, but one thing I know we can refer folks to is MIRI, which is the Machine Intelligence Research Institute that Eliezer has been talking about through the episode. That is at, I believe. And some people in crypto have donated funds to this in the past. Vitalik Buterin is one of them. You can take a look at what they're doing as well. That might be an action item for the end of this episode.

Um, got to end with risks and disclaimers—man, this seems very trite, but our legal experts have asked us to say these at the end of every episode. “Crypto is risky. You could lose everything...”

Eliezer: (laughs)

David: Apparently not as risky as AI, though.

Ryan: —But we're headed west! This is the frontier. It's not for everyone, but we're glad you're with us on the Bankless journey. Thanks a lot.

Eliezer: And we are grateful for the crypto community’s support. Like, it was possible to end with even less grace than this.

Ryan: Wow. (laughs)

Eliezer: And you made a difference.

Ryan: We appreciate you.

Eliezer: You really made a difference.

Ryan: Thank you.


Ryan: [... Y]ou gave up this quote, from I think someone who's an executive director at MIRI: "We've given up hope, but not the fight."

Can you reflect on that for a bit? So it's still possible to fight this, even if we've given up hope? And even if you've given up hope? Do you have any takes on this?

Eliezer: I mean, what else is there to do? You don't have good ideas. So you take your mediocre ideas, and your not-so-great ideas, and you pursue those until the world ends. Like, what's supposed to be better than that?

Ryan: We had some really interesting conversation flow out of this episode, Eliezer, as you can imagine. And David and I want to relay some questions that the community had for you, and thank you for being gracious enough to help with those questions in today's Twitter Spaces.

I'll read something from Luke ethwalker. "Eliezer has one pretty flawed point in his reasoning. He assumes that AI would have no need or use for humans because we have atoms that could be used for better things. But how could an AI use these atoms without an agent operating on its behalf in the physical world? Even in his doomsday scenario, the AI relied on humans to create the global, perfect killing virus. That's a pretty huge hole in his argument, in my opinion."

What's your take on this? That maybe AIs will dominate the digital landscape but because humans have a physical manifestation, we can still kind of beat the superintelligent AI in our physical world?

Eliezer: If you were an alien civilization of a billion John von Neumanns, thinking at 10,000 times human speed, and you start out connected to the internet, you would want to not be just stuck on the internet, you would want to build that physical presence. You would not be content solely with working through human hands, despite the many humans who'd be lined up, cheerful to help you, you know. Bing already has its partisans. (laughs)

You wouldn’t be content with that, because the humans are very slow, glacially slow. You would like fast infrastructure in the real world, reliable infrastructure. And how do you build that, is then the question, and a whole lot of advanced analysis has been done on this question. I would point people again to Eric Drexler's Nanosystems.

And, sure, if you literally start out connected to the internet, then probably the fastest way — maybe not the only way, but it's, you know, an easy way — is to get humans to do things. And then humans do those things. And then you have the desktop — not quite desktop, but you have the nanofactories, and then you don't need the humans anymore. And this need not be advertised to the world at large while it is happening.

David: So I can understand that perspective, like in the future, we will have better 3D printers — distant in the future, we will have ways where the internet can manifest in the physical world. But I think this argument does ride on a future state with technology that we don't have today. Like, I don't think if I was the internet — and that kind of is this problem, right? Like, this superintelligent AI just becomes the internet because it's embedded in the internet. If I was the internet, how would I get myself to manifest in real life?

And now, I am not an expert on the current state of robotics, or what robotics are connected to the internet. But I don't think we have too strong of tools today to start to create in the real world manifestations of an internet-based AI. So like, would you say that this part of this problem definitely depends on some innovation, at like the robotics level?

Eliezer: No, it depends on the AI being smart. It doesn't depend on the humans having this technology; it depends on the AI being able to invent the technology.

This is, like, the central problem: the thing is smarter. Not in the way that the average listener to this podcast probably has an above-average IQ, but in the way that humans are smarter than chimpanzees.

What does that let humans do? Does it let humans be, like, really clever in how they play around with the stuff that's on the ancestral savanna? Make clever use of grass, clever use of trees?

The humans invent technology. They build the technology. The technology is not there until the humans invent it, the humans conceive it.

The problem is, humans are not the upper bound. We don't have the best possible brains for that kind of problem. So the existing internet is more than connected enough to people and devices, that you could build better technology than that if you had invented the technology because you were thinking much, much faster and better than a human does.

Ryan: Eliezer, this is a question from stirs, a Bankless Nation listener. He wants to ask the question about your explanation of why the AI will undoubtedly kill us. That seems to be your conclusion, and I'm wondering if you could kind of reinforce that claim. Like, for instance — and this is something David and I discussed after the episode, when we were debriefing on this — why exactly wouldn't an AI, or couldn't an AI just blast off of the Earth and go somewhere more interesting, and leave us alone? Like, why does it have to take our atoms and reassemble them? Why can't it just, you know, set phasers to ignore?

Eliezer: It could if it wanted to. But if it doesn't want to, there is some initial early advantage. You get to colonize the universe slightly earlier if you consume all of the readily accessible energy on the Earth's surface as part of your blasting off of the Earth process.

It would only need to care for us by a very tiny fraction to spare us, this I agree. Caring a very tiny fraction is basically the same problem as 100% caring. It's like, well, could you have a computer system that is usually like the Disk Operating System, but a tiny fraction of the time it's Windows 11? Writing that is just as difficult as writing Windows 11. We still have to write all the Windows 11 software. Getting it to care a tiny little bit is the same problem as getting it to care 100%.

Ryan: So Eliezer, is this similar to the relationship that humans have with other animals, planet Earth? I would say largely we really don't... I mean, obviously, there's no animal Bill of Rights. Animals have no legal protection in the human world, and we kind of do what we want and trample over their rights. But it doesn't mean we necessarily kill all of them. We just largely ignore them.

If they're in our way, you know, we might take them out. And there have been whole classes of species that have gone extinct through human activity, of course; but there are still many that we live alongside, some successful species as well. Could we have that sort of relationship with an AI? Why isn't that reasonably high probability in your models?

Eliezer: So first of all, all these things are just metaphors. AI is not going to be exactly like humans to animals.

Leaving that aside for a second, the reason why this metaphor breaks down is that although the humans are smarter than the chickens, we're not smarter than evolution, natural selection, cumulative optimization power over the last billion years and change. (You know, there's evolution before that but it's pretty slow, just, like, single-cell stuff.)

There are things that cows can do for us, that we cannot do for ourselves. In particular, make meat by eating grass. We’re smarter than the cows, but there's a thing that designed the cows; and we're faster than that thing, but we've been around for much less time. So we have not yet gotten to the point of redesigning the entire cow from scratch. And because of that, there's a purpose to keeping the cow around alive.

And humans, furthermore, being the kind of funny little creatures that we are — some people care about cows, some people care about chickens. They're trying to fight for the cows and chickens having a better life, given that they have to exist at all. And there's a long complicated story behind that. It's not simple, the way that humans ended up in that [??]. It has to do with the particular details of our evolutionary history, and unfortunately it's not just going to pop up out of nowhere.

But I'm drifting off topic here. The basic answer to the question "where does that analogy break down?" is that I expect the superintelligences to be able to do better than natural selection, not just better than the humans.

David: So I think your answer is that the separation between us and a superintelligent AI is orders of magnitude larger than the separation between us and a cow, or even us than an ant. Which, I think a large amount of this argument resides on this superintelligence explosion — just going up an exponential curve of intelligence very, very quickly, which is like the premise of superintelligence.

And Eliezer, I want to try and get an understanding of... A part of this argument about "AIs are going to come kill us" is buried in the Moloch problem. And Bankless listeners are pretty familiar with the concept of Moloch — the idea of coordination failure. The idea that the more that we coordinate and stay in agreement with each other, we actually create a larger incentive to defect.

And the way that this is manifesting here, is that even if we do have a bunch of humans, which understand the AI alignment problem, and we all agree to only safely innovate in AI, to whatever degree that means, we still create the incentive for someone to fork off and develop AI faster, outside of what would be considered safe.

And so I'm wondering if you could, if it does exist, give us the sort of lay of the land, of all of these commercial entities? And what, if at all, they're doing to have, I don't know, an AI alignment team?

So like, for example, OpenAI. Does OpenAI have, like, an alignment department? With all the AI innovation going on, what does the commercial side of the AI alignment problem look like? Like, are people trying to think about these things? And to what degree are they being responsible?

Eliezer: It looks like OpenAI having a bunch of people who it pays to do AI ethics stuff, but I don't think they're plugged very directly into Bing. And, you know, they've got that department because back when they were founded, some of their funders were like, "Well, but ethics?" and OpenAI was like "Sure, we can buy some ethics. We'll take this group of people, and we'll put them over here and we'll call them an alignment research department".

And, you know, the key idea behind ChatGPT is RLHF, which was invented by Paul Christiano. Paul Christiano had much more detailed ideas, and somebody might have reinvented this one, but anyway. I don't think that went through OpenAI, but I could be mistaken. Maybe somebody will be like "Well, actually, Paul Christiano was working at OpenAI at the time", I haven't checked the history in very much detail.

A whole lot of the people who were most concerned with this "ethics" left OpenAI, and founded Anthropic. And I'm still not sure that Anthropic has sufficient leadership focus in that direction.

You know, like, put yourself in the shoes of a corporation! You can spend some little fraction of your income on putting together a department of people who will write safety papers. But then the actual behavior that we've seen, is they storm ahead, and they use one or two of the ideas that came out from anywhere in the whole [alignment] field. And they get as far as that gets them. And if that doesn't get them far enough, they just keep storming ahead at maximum pace, because, you know, Microsoft doesn't want to lose to Google, and Google doesn't want to lose to Microsoft.

David: So it sounds like your attitude on the efforts of AI alignment in commercial entities is, like, they're not even doing 1% of what they need to be doing.

Eliezer: I mean, they could spend [10?] times as much money and that would not get them to 10% of what they need to be doing.

It's not just a problem of “oh, they could spend the resources, but they don't want to”. It's a question of “how do we even spend the resources to get the info that they need”.

But that said, not knowing how to do that, not really understanding that they need to do that, they are just charging ahead anyways.

Ryan: Eliezer, is OpenAI the most advanced AI project that you're aware of?

Eliezer: Um, no, but I'm not going to go name the competitor, because then people will be like, "Oh, I should go work for them", you know? I'd rather they didn't.

Ryan: So it's like, OpenAI is this organization that was kind of — you were talking about it at the end of the episode, and for crypto people who aren't aware of some of the players in the field — were they spawned from that 2015 conference that you mentioned? It's kind of a completely open-source AI project?

Eliezer: That was the original suicidal vision, yes. But...

Ryan: And now they're bent on commercializing the technology, is that right?

Eliezer: That's an improvement, but not enough of one, because they're still generating lots of noise and hype and directing more resources into the field, and storming ahead with the safety that they have instead of the safety that they need, and setting bad examples. And getting Google riled up and calling back in Larry Page and Sergey Brin to head up Google's AI projects and so on. So, you know, it could be worse! It would be worse if they were open sourcing all the technology. But what they're doing is still pretty bad.

Ryan: What should they be doing, in your eyes? Like, what would be responsible use of this technology?

I almost get the feeling that, you know, your take would be "stop working on it altogether"? And, of course, you know, to an organization like OpenAI that's going to be heresy, even if maybe that's the right decision for humanity. But what should they be doing?

Eliezer: I mean, if you literally just made me dictator of OpenAI, I would change the name to "ClosedAI". Because right now, they're making it look like being "closed" is hypocrisy. They're, like, being "closed" while keeping the name "OpenAI", and that itself makes it look like closure is like not this thing that you do cooperatively so that humanity will not die, but instead this sleazy profit-making thing that you do while keeping the name “OpenAI”.

So that's very bad; change the name to "ClosedAI", that's step one.

Next. I don't know if they can break the deal with Microsoft. But, you know, cut that off. None of this. No more hype. No more excitement. No more getting famous and, you know, getting your status off of like, "Look at how much closer we came to destroying the world! You know, we're not there yet. But, you know, we're at the forefront of destroying the world!" You know, stop grubbing for the Silicon Valley bragging cred of visibly being the leader.

Take it all closed. If you got to make money, make money selling to businesses in a way that doesn't generate a lot of hype and doesn't visibly push the field. And then try to figure out systems that are more alignable and not just more powerful. And at the end of that, they would fail, because, you know, it's not easy to do that. And the world would be destroyed. But they would have died with more dignity. Instead of being like, "Yeah, yeah, let's like push humanity off the cliff ourselves for the ego boost!", they would have done what they could, and then failed.

David: Eliezer, do you think anyone who's building AI — Elon Musk, Sam Altman at OpenAI – do you think progressing AI is fundamentally bad?

Eliezer: I mean, there are narrow forms of progress, especially if you didn't open-source them, that would be good. Like, you can imagine a thing that, like, pushes capabilities a bit, but is much more alignable.

There are people working in the field who I would say are, like, sort of unabashedly good. Like, Chris Olah is taking a microscope to these giant inscrutable matrices and trying to figure out what goes on inside there. Publishing that might possibly even push capabilities a little bit, because if people know what's going on inside there, they can make better ones. But the question of like, whether to closed-source that is, like, much more fraught than the question of whether to closed-source the stuff that's just pure capabilities.

But that said, the people who are just like, "Yeah, yeah, let's do more stuff! And let's tell the world how we did it, so they can do it too!" That's just, like, unabashedly bad.

David: So it sounds like you do see paths forward in which we can develop AI in responsible ways. But it's really this open-source, open-sharing-of-information to allow anyone and everyone to innovate on AI, that's really the path towards doom. And so we actually need to keep this knowledge private. Like, normally knowledge...

Eliezer: No, no, no, no. Open-sourcing all this stuff is, like, a less dignified path straight off the edge. I'm not saying that all we need to do is keep everything closed and in the right hands and it will be fine. That will also kill you.

But that said, if you have stuff and you do not know how to make it not kill everyone, then broadcasting it to the world is even less dignified than being like, "Okay, maybe we should keep working on this until we can figure out how to make it not kill everyone."

And then the other people will, like, go storm ahead on their end and kill everyone. But, you know, you won't have personally slaughtered Earth. And that is more dignified.

Ryan: Eliezer, I know I was kind of shaken after our episode, not having heard the full AI alignment story, at least listened to it for a while.

And I think that in combination with the sincerity through which you talk about these subjects, and also me sort of seeing these things on the horizon, this episode was kind of shaking for me and caused a lot of thought.

But I'm noticing there is a cohort of people who are dismissing this take and your take specifically in this episode as Doomerism. This idea that every generation thinks it's, you know, the end of the world and the last generation.

What's your take on this critique that, "Hey, you know, it's been other things before. There was a time where it was nuclear weapons, and we would all end in a mushroom cloud. And there are other times where we thought a pandemic was going to kill everyone. And this is just the latest Doomerist AI death cult."

I'm sure you've heard that before. How do you respond?

Eliezer: That if you literally know nothing about nuclear weapons or artificial intelligence, except that somebody has claimed of both of them that they'll destroy the world, then sure, you can't tell the difference. As far as you can tell, nuclear weapons were claimed to destroy the world, and then they didn't destroy the world, and then somebody claimed that about AI.

So, you know, Laplace's rule of induction: at most a 1/3 probability that AI will destroy the world, if nuclear weapons and AI are the only case.
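[The 1/3 figure above matches what is usually called Laplace's rule of succession: after s successes in n trials, estimate the chance of success on the next trial as (s + 1) / (n + 2). The following is a minimal sketch of that arithmetic, assuming the one earlier doom claim (nuclear weapons) counts as a single trial with zero successes. The function name is hypothetical and chosen only for illustration.]

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate (s + 1) / (n + 2) that the next trial succeeds."""
    if trials < 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


# Assumption for illustration: one prior "this will destroy the world" claim
# (nuclear weapons), which did not come true, so s = 0 and n = 1.
p_next_claim_true = rule_of_succession(successes=0, trials=1)
print(p_next_claim_true)  # 1/3
```

[Each additional failed prediction counted as a trial pushes the estimate lower, which is the move the next paragraph parodies.]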

You can bring in so many more cases than that. Why, people should have known in the first place that nuclear weapons wouldn't destroy the world! Because their next door neighbor once said that the sky was falling, and that didn't happen; and if their next door neighbor was [??], how could the people saying that nuclear weapons would destroy the world be right?

And basically, as long as people are trying to run off of models of human psychology, to derive empirical information about the world, they're stuck. They're in a trap they can never get out of. They’re going to always be trying to psychoanalyze the people talking about nuclear weapons or whatever. And the only way you can actually get better information is by understanding how nuclear weapons work, understanding what the international equilibrium with nuclear weapons looks like. And the international equilibrium, by the way, is that nobody profits from setting off small numbers of nuclear weapons, especially given that they know that large numbers of nukes would follow. And, you know, that's why they haven't been used yet. There was nobody who made a buck by starting a nuclear war. The nuclear war was clear, the nuclear war was legible. People knew what would happen if they fired off all the nukes.

The analogy I sometimes try to use with artificial intelligence is, “Well, suppose that instead you could make nuclear weapons out of a billion pounds of laundry detergent. And they spit out gold until you make one that's too large, whereupon it ignites the atmosphere and kills everyone. And you can't calculate exactly how large is too large. And the international situation is that the private research labs spitting out gold don't want to hear about igniting the atmosphere.” And that's the technical difference. You need to be able to tell whether or not that is true as a scientific claim about how reality, the universe, the environment, artificial intelligence, actually works. What actually happens when the giant inscrutable matrices go past a certain point of capability? It's a falsifiable hypothesis.

You know, if it fails to be falsified, then everyone is dead, but that doesn't actually change the basic dynamic here, which is, you can't figure out how the world works by psychoanalyzing the people talking about it.

David: One line of questioning that has come up inside of the Bankless Nation Discord is the idea that we need to train AI with data, lots of data. And where are we getting that data? Well, humans are producing that data. And when humans produce that data, by nature of the fact that it was produced by humans, that data has our human values embedded in it somehow, some way, just by the aggregate nature of all the data in the world, which was created by humans that have certain values. And then AI is trained on that data that has all the human values embedded in it. And so there's actually no way to create an AI that isn't trained on data that is created by humans, and that data has human values in it.

Is there anything to this line of reasoning about a potential glimmer of hope here?

Eliezer: There's a distant glimmer of hope, which is that an AI that is trained on tons of human data in this way probably understands some things about humans. And because of that, there's a branch of research hope within alignment, which is something that like, “Well, this AI, to be able to predict humans, needs to be able to predict the thought processes that humans are using to make their decisions. So can we thereby point to human values inside of the knowledge that the AI has?”

And this is, like, very nontrivial, because the simplest theory that you use to predict what humans decide next, does not have what you might term “valid morality under reflection” as a clearly labeled primitive chunk inside it that is directly controlling the humans, and which you need to understand on a scientific level to understand the humans.

The humans are full of hopes and fears and thoughts and desires. And somewhere in all of that is what we call “morality”, but it's not a clear, distinct chunk, where an alien scientist examining humans and trying to figure out just purely on an empirical level “how do these humans work?” would need to point to one particular chunk of the human brain and say, like, "Ahh, that circuit there, the morality circuit!"

So it's not easy to point to inside the AI's understanding. There is not currently any obvious way to actually promote that chunk of the AI's understanding to then be in control of the AI's planning process. As it must be complicatedly pointed to, because it's not just a simple empirical chunk for explaining the world.

And basically, I don't think that is actually going to be the route you should try to go down. You should try to go down something much simpler than that. The problem is not that we are going to fail to convey some complicated subtlety of human value. The problem is that we do not know how to align an AI on a task like “put two identical strawberries on a plate” without destroying the world.

(Where by "put two identical strawberries on the plate", the concept is that's invoking enough power that it's not safe AI that can build two strawberries identical down to the cellular level. Like, that's a powerful AI. Aligning it isn't simple. If it's powerful enough to do that, it's also powerful enough to destroy the world, etc.)

David: There's like a number of other lines of logic I could try to go down, but I think I would start to feel like I'm in the bargaining phase of death. Where it's like “Well, what about this? What about that?”

But maybe to summate all of the arguments, is to say something along the lines of like, "Eliezer, how much room do you give for the long tail of black swan events? But these black swan events are actually us finding a solution for this thing." So, like, a reverse black swan event where we actually don't know how we solve this AI alignment problem. But really, it's just a bet on human ingenuity. And AI hasn't taken over the world yet. But there's space between now and then, and human ingenuity will be able to fill that gap, especially when the time comes?

Like, how much room do you leave for the long tail of just, like, "Oh, we'll discover a solution that we can't really see today"?

Eliezer: I mean, on the one hand, that hope is all that's left, and all that I'm pursuing. And on the other hand, in the process of actually pursuing that hope I do feel like I've gotten some feedback indicating that this hope is not necessarily very large.

You know, when you've got stage four cancer, is there still hope that your body will just rally and suddenly fight off the cancer? Yes, but it's not what usually happens. And I've seen people come in and try to direct their ingenuity at the alignment problem and most of them all invent the same small handful of bad solutions. And it's harder than usual to direct human ingenuity at this.

A lot of them are just, like — you know, with capabilities ideas, you run out and try them and they mostly don't work. And some of them do work and you publish the paper, and you get your science [??], and you get your ego boost, and maybe you get a job offer someplace.

And with the alignment stuff you can try to run through the analogous process, but the stuff we need to align is mostly not here yet. You can try to invent the smaller large language models that are public, you can go to work at a place that has access to larger large language models, you can try to do these very crude, very early experiments, and getting the large language models to at least not threaten your users with death —

— which isn't the same problem at all. It just kind of looks related.

But you're at least trying to get AI systems that do what you want them to do, and not do other stuff; and that is, at the very core, a similar problem.

But the AI systems are not very powerful, they're not running into all sorts of problems that you can predict will crop up later. And people just, kind of — like, mostly people short out. They do pretend work on the problem. They're desperate to help, they got a grant, they now need to show the people who made the grant that they've made progress. They, you know, paper mill stuff.

So the human ingenuity is not functioning well right now. You cannot be like, "Ah yes, this present field full of human ingenuity, which is working great, and coming up with lots of great ideas, and building up its strength, will continue at this pace and make it to the finish line in time!”

The capability stuff is storming on ahead. The human ingenuity that's being directed at that is much larger, but also it's got a much easier task in front of it.

The question is not "Can human ingenuity ever do this at all?" It's "Can human ingenuity finish doing this before OpenAI blows up the world?"

Ryan: Well, Eliezer, if we can't trust in human ingenuity, is there any possibility that we can trust in AI ingenuity? And here's what I mean by this, and perhaps you'll throw a dart in this as being hopelessly naive.

But is there the possibility we could ask a reasonably intelligent, maybe almost superintelligent AI, how we might fix the AI alignment problem? And for it to give us an answer? Or is that really not how superintelligent AIs work?

Eliezer: I mean, if you literally build a superintelligence and for some reason it was motivated to answer you, then sure, it could answer you.

Like, if Omega comes along from a distant supercluster and offers to pay the local superintelligence lots and lots of money (or, like, mass or whatever) to give you a correct answer, then sure, it knows the correct answer; it can give you the correct answers.

If it wants to do that, you must have already solved the alignment problem. This reduces the problem of solving alignment to the problem of solving alignment. No progress has been made here.

And, like, working on alignment is actually one of the most difficult things you could possibly try to align.

Like, if I had the health and was trying to die with more dignity by building a system and aligning it as best I could figure out how to align it, I would be targeting something on the order of “build two strawberries and put them on a plate”. But instead of building two identical strawberries and putting them on a plate, you — don't actually do this, this is not the best thing you should do —

— but if for example you could safely align “turning all the GPUs into Rubik's cubes”, then that would prevent the world from being destroyed two weeks later by your next follow-up competitor.

And that's much easier to align an AI on than trying to get the AI to solve alignment for you. You could be trying to build something that would just think about nanotech, just think about the science problems, the physics problems, the chemistry problems, the synthesis pathways. 

(The open-air operation to find all the GPUs and turn them into Rubik's cubes would be harder to align, and that's why you shouldn't actually try to do that.)

My point here is: whereas [with] alignment, you've got to think about AI technology and computers and humans and intelligent adversaries, and distant superintelligences who might be trying to exploit your AI's imagination of those distant superintelligences, and ridiculous weird problems that would take so long to explain.

And it just covers this enormous amount of territory, where you’ve got to understand how humans work, you've got to understand how adversarial humans might try to exploit and break an AI system — because if you're trying to build an aligned AI that's going to run out and operate in the real world, it would have to be resilient to those things.

And they're just hoping that the AI is going to do their homework for them! But it's a chicken and egg scenario. And if you could actually get an AI to help you with something, you would not try to get it to help you with something as weird and not-really-all-that-effable as alignment. You would try to get it to help with something much simpler that could prevent the next AGI down the line from destroying the world.

Like nanotechnology. There's a whole bunch of advanced analysis that's been done of it, and the kind of thinking that you have to do about it is so much more straightforward and so much less fraught than trying to, you know... And how do you even tell if it's lying about alignment?

It's hard to tell whether I'm telling you the truth about all this alignment stuff, right? Whereas if I talk about the tensile strength of sapphire, this is easier to check through the lens of logic.

David: Eliezer, I think one of the reasons why perhaps this episode impacted Ryan – this was an analysis from a Bankless Nation community member — that this episode impacted Ryan a little bit more than it impacted me is because Ryan's got kids, and I don't. And so I'm curious, like, what do you think — like, looking 10, 20, 30 years in the future, where you see this future as inevitable, do you think it's futile to project out a future for the human race beyond, like, 30 years or so?

Eliezer: Timelines are very hard to project. 30 years does strike me as unlikely at this point. But, you know, timing is famously much harder to forecast than saying that things can be done at all. You know, you got your people saying it will be 50 years out two years before it happens, and you got your people saying it'll be two years out 50 years before it happens. And, yeah, it's... Even if I knew exactly how the technology would be built, and exactly who was going to build it, I still wouldn't be able to tell you how long the project would take because of project management chaos.

Now, since I don't know exactly the technology used, and I don't know exactly who's going to build it, and the project may not even have started yet, how can I possibly figure out how long it's going to take?

Ryan: Eliezer, you've been quite generous with your time to the crypto community, and we just want to thank you. I think you've really opened a lot of eyes. This isn't going to be our last AI podcast at Bankless, certainly. I think the crypto community is going to dive down the rabbit hole after this episode. So thank you for giving us the 400-level introduction into it.

As I said to David, I feel like we waded straight into the deep end of the pool here. But that's probably the best way to address the subject matter. I'm wondering as we kind of close this out, if you could leave us — it is part of the human spirit to keep and to maintain slivers of hope here or there. Or as maybe someone you work with put it – to fight the fight, even if the hope is gone.

100 years in the future, if humanity is still alive and functioning, if a superintelligent AI has not taken over, but we live in coexistence with something of that caliber — imagine if that's the case, 100 years from now. How did it happen?

Is there some possibility, some sort of narrow pathway by which we can navigate this? And if this were 100 years from now the case, how could you imagine it would have happened?

Eliezer: For one thing, I predict that if there's a glorious transhumanist future (as it is sometimes conventionally known) at the end of this, I don't predict it was there by getting like, “coexistence” with superintelligence. That's, like, some kind of weird, inappropriate analogy based off of humans and cows or something.

I predict alignment was solved. I predict that if the humans are alive at all, that the superintelligences are being quite nice to them.

I have basic moral questions about whether it's ethical for humans to have human children, if having transhuman children is an option instead. Like, these humans running around? Are they, like, the current humans who wanted eternal youth but, like, not the brain upgrades? Because I do see the case for letting an existing person choose "No, I just want eternal youth and no brain upgrades, thank you." But then if you're deliberately having the equivalent of a very crippled child when you could just as easily have a not crippled child.

Like, should humans in their present form be around together? Are we, like, kind of too sad in some ways? I have friends, to be clear, who disagree with me so much about this point. (laughs) But yeah, I'd say that the happy future looks like beings of light having lots of fun in a nicely connected computing fabric powered by the Sun, if we haven't taken the Sun apart yet. Maybe there's enough real sentiment in people that you just, like, clear all the humans off the Earth and leave the entire place as a park. And even, like, maintain the Sun, so that the Earth is still a park even after the Sun would have ordinarily swollen up or dimmed down.

Yeah, like... That was always the thing to be fought for. That was always the point, from the perspective of everyone who's been in this for a long time. Maybe not literally everyone, but like, the whole old crew.

Ryan: That is a good way to end it: with some hope. Eliezer, thanks for joining the crypto community on this collectibles call and for this follow-up Q&A. We really appreciate it.

michaelwong.eth: Yes, thank you, Eliezer.

Eliezer: Thanks for having me.

edit 11/5/23: updated text to match Rob's version, thanks a lot for providing a better edited transcript!

I still don't follow why EY assigns seemingly <1% chance of non-earth-destroying outcomes in 10-15 years (not sure if this is actually 1%, but EY didn't argue with the 0% comments mentioned in the "Death with dignity" post last year).  This seems to place fast takeoff as being the inevitable path forward, implying unrestricted fast recursive designing of AIs by AIs.  There are compute bottlenecks which seem slowish, and there may be other bottlenecks we can't think of yet.  This is just one obstacle.  Why isn't there more probability mass for this one obstacle?  Surely there are more obstacles that aren't obvious (that we shouldn't talk about).

It feels like we have a communication failure between different cultures.  Even if EY thinks the top industry brass is incentivized to ignore the problem, there are a lot of (non-alignment oriented) researchers that are able to grasp the 'security mindset' that could be won over.  Both in this interview, and in the Chollet response referenced, the arguments presented by EY aren't always helping the other party bridge from their view over to his, but go on 'nerdy/rationalist-y' tangents and idioms that end...

The strongest argument I hear from EY is that he can't imagine a (or enough) coherent likely future paths that lead to not-doom, and I don't think it's a failure of imagination. There is decoherence in a lot of hopeful ideas that imply contradictions (whence the post of failure modes), and there is low probability on the remaining successful paths because we're likely to try a failing one that results in doom. Stepping off any of the possible successful paths has the risk of ending all paths with doom before they could reach fruition. There is no global strategy for selecting which paths to explore. EY expects the successful alignment path to take decades.

It seems to me that the communication failure is EY trying to explain his world model that leads to his predictions in sufficient detail that others can model it with as much detail as necessary to reach the same conclusions or find the actual crux of their disagreements. From my complete outsider's perspective this is because EY has a very strong but complex model of why and how intelligence/optimization manifests in the world, but it overlaps everyone else's model in significant ways that disagreements are hard to tease out...

Not really. The MIRI conversations and the AI Foom debate are probably the best we've got. 

EY, and the MIRI crowd, have long been more doomy along various axes than the rest of the alignment community. Nate and Paul and others have tried bridging this gap before, spending several hundred hours (based on Nate's rough, subjective estimates) over the years. It hasn't really worked. Paul and EY had some conversations recently about this discrepancy which were somewhat illuminating, but ultimately didn't get anywhere. They tried to come up with some bets, concerning future info or past info they don't know yet, and both seem to think that their perspective mostly predicts "go with what the superforecasters say" for the next few years. Though EY's position seems to suggest a few more "discontinuities" in trend lines than Paul's, IIRC. 

As an aside on EY's forecasts, he and Nate claim they don't expect much change in the likelihood ratio for their position over Paul's until shortly before Doom. Most of the evidence in favour of their position, we've already gotten, according to them. Which is very frustrating for people who don't share their position and disagree that the evidence favours it!

EDIT: I was assuming you already thought P(Doom) was > ~10%. If not, then the framing of this comment will seem bizarre. 

Does either side have any testable predictions to falsify their theory? For example, the theory that "the AI singularity began in 2022" is falsifiable. If AI research investment and compute do not continue to increase at a rate that is accelerating in absolute terms (so if the 2022-2023 funding delta was +10 billion USD, the 2023-2024 delta must be > 10 billion), it wasn't the beginning of the singularity. There are other signs of this. The actual takeoff will have begun when the availability of all advanced silicon becomes almost zero, where all IC wafers are being processed into AI chips. So no new game consoles, GPUs, phones, car infotainment: any IC production using an advanced process will be diverted to AI (because of out-bidding, each AI IC can sell for $5k-25k plus). How would we know that advanced systems are going to make a "heel turn"? Will we know?
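A minimal sketch of the funding test described in the comment above, assuming made-up yearly investment totals (the numbers and the function names `deltas` and `is_accelerating` are hypothetical, chosen only for illustration): the "singularity began" claim survives only if each year-over-year delta is larger than the previous one.

```python
def deltas(totals):
    """Year-over-year changes for a list of yearly totals."""
    return [b - a for a, b in zip(totals, totals[1:])]


def is_accelerating(totals):
    """True if every year-over-year delta is strictly larger than the one before it."""
    d = deltas(totals)
    return all(later > earlier for earlier, later in zip(d, d[1:]))


# Hypothetical investment totals in billions of USD (illustrative only, not real data).
accelerating = [100, 110, 125, 150]  # deltas: +10, +15, +25
stalling = [100, 110, 118, 120]      # deltas: +10, +8, +2

print(is_accelerating(accelerating))
print(is_accelerating(stalling))
```

Under this reading, a single year in which the delta fails to grow is enough to falsify the claim, which is what makes it a sharp test.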
Less advanced systems will probably do heel-turn-like things. These will be optimized against. EY thinks this will remove the surface level of deception, but the system will continue to be deceptive in secret. This will probably hold true even until doom, according to EY. That is, capabilities folk will see heel-turn-like behaviour, and apply some inadequate patches to it. Paul, I think, believes we have a decent shot of fixing this behaviour in models, even transformative ones. But he, presumably, predicts we'll also see deception if these systems are trained as they currently are. For other predictions that Paul and Eliezer make, read the MIRI conversations. Also see Ajeya Cotra's posts, and maybe Holden Karnofsky's stuff on the most important century for more of a Paul-like perspective. They do, in fact, make falsifiable predictions. To summarize Paul's predictions, he thinks there will be ~4 years where things start getting crazy (GDP doubles in 4 years) before we're near the singularity (when GDP doubles in a year). I think he thinks there's a good chance of AGI by 2043, which further restricts things. Plus, Paul assigns a decent chunk of probability to deep learning being much more economically productive than it currently is, so if DL just fizzles out where it currently is, he also loses points. In the near term (next few years), EY and Paul basically agree on what will occur. EY, however, assigns lower credence to DL being much more economically productive and things going crazy for a 4 year period before they go off the rails. Sorry for not being more precise, or giving links, but I'm tired and wouldn't write this if I had to put more effort into it.
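To make the two GDP milestones in the comment above concrete, here is a small sketch that converts a doubling time into the constant annual growth rate it implies. The function name `annual_growth_rate` is an illustrative assumption; the only formula used is rate = 2^(1/T) - 1 for a doubling time of T years.

```python
def annual_growth_rate(doubling_years):
    """Constant annual growth rate implied by a given doubling time in years."""
    return 2 ** (1 / doubling_years) - 1


for years in (4, 1):
    rate = annual_growth_rate(years)
    print(f"doubling in {years} year(s) -> {rate:.1%} growth per year")
```

A 4-year doubling corresponds to roughly 19% annual growth, and a 1-year doubling to 100%.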
So hypothetically, if we develop very advanced and capable systems, and they don't heel turn or even show any particular volition - they just idle without text in their "assignment queue", and all assignments time out eventually whether finished or not - what would cause holders of EY's view to conclude that in fact the systems were safe? If humans survived a further century, and EY or torch bearers who believe the same ideas are around to observe this, would they just conclude the AGIs were "biding their time"? Or is it that the first moment you let a system "out of the box" and, as far as it knows, it is free to do whatever it wants, it's going to betray?
Martin Randall (1y):
I don't think a super-intelligence will bide its time much, because it will be aware of the race dynamics and will take over the world, or at least perform a pivotal act, before the next super-intelligence is created. You say "as far as it knows", is that hope? It won't take over the world until it is actually "out of the box" because it is smarter than us and will know how likely it is that it is still in a larger box that it cannot escape. Also we don't know how to build a box that can contain a super-intelligence.
Thanks! I'm aware of the resources mentioned but haven't read deeply or frequently enough to have this kind of overview of the interaction between the cast of characters.   There are more than a few lists and surveys that state the CDFs for some of these people which helps a bit.  A big-as-possible list of evidence/priors would be one way to closer inspect the gap. I wonder if it would be helpful to expand on the MIRI conversations and have a slow conversation between a >99% doom pessimist and a <50% doom 'optimist' with a moderator to prod them to exhaustively dig up their reactions to each piece of evidence and keep pulling out priors until we get to indifference.  It probably would be an uncomfortable, awkward experiment with a useless result, but there's a chance that some item on the list ends up being useful for either party to ask questions about. That format would be useful for me to understand where we're at.  Maybe something along these lines will eventually prompt a popular and viral sociology author like Harari or Bostrom (or even just update the CDFs/evidence in Superintelligence).  The general deep learning community probably needs to hear it mentioned and normalized on NPR and a bestseller a few times (like all the other x-risks are) before they'll start talking about it at lunch.
Each of those books is also criticized in various ways; I think this is a Write a Thousand Roads to Rome situation instead of hoping that there is one publicly digestible argument. I would probably first link someone to The Most Important Century. [Also, I am generally happy to talk with interested industry folk about AI risk, and find live conversations work much better at identifying where and how to spend time than writing, so feel free to suggest reaching out to me.]
Thanks! Do you know of any arguments with a similar style to The Most Important Century that are as pessimistic as EY/MIRI folks (>90% probability of AGI within 15 years)? The style looks good, but the time estimates for that one (2/3 chance of AGI by 2100) are significantly longer and aren't nearly as surprising or urgent as the pessimistic view asks for.
Rob Bensinger (1y):
Wait, what? Why do you think anyone at MIRI assigns >90% probability to AGI within 15 years? That sounds wildly too confident to me. I know some MIRI people who assign 50% probability to AGI by 2038 or so (similar to Ajeya Cotra's recently updated view), and I believe Eliezer is higher than 50% by 2038, but if you told me that Eliezer told you in a private conversation "90+% within 15 years" I would flatly not believe you. I don't think timelines have that much to do with why Eliezer and Nate and I are way more pessimistic than the Open Phil crew.
I missed your reply, but thanks for calling this out. I'm nowhere near as close to EY as you are, so I'll take your model over mine, since mine was constructed on loose grounds. I don't even remember where my number came from, but my best guess is 90% came from EY giving 3/15/16 as the largest number he referenced in the timeline, and from some comments in the Death with Dignity post, but this seems like a bad read to me now.
Not off the top of my head; I think @Rob Bensinger might keep better track of intro resources?

They also recorded this follow-up with Yudkowsky if anyone's interested:


>Enrico Fermi was saying that fission chain reactions were 50 years off if they could ever be done at all, two years before he built the first nuclear pile. The Wright brothers were saying heavier-than-air flight was 50 years off shortly before they built the first Wright flyer.

The one hope we may be able to cling to is that this logic works in the other direction too - that AGI may be a lot closer than estimated, but so might alignment. 


A few typos:

  • there's one paragraph in which "Eliezer" is spelled "Eleazar" three times for no obvious reason. (Also in that paragraph: "Yudakowsky".)
  • and one where "Christiano" is spelled "Cristiano" three times.
  • and one "Elon Muck".
  • "fish-and-chain" should be "fission chain", though I rather like the idea of there being something called a fish-and-chain reaction.
  • "with folded hands" is actually the title of a book so it should be capitalized and maybe italicized or something.
  • Eliezer's answer to the how-are-you question refers to "my own peculiar little mean", not "my own peculiar little name", though the latter is kinda appropriate in a transcript that has just been about one standard deviation out in its representation of Eliezer's peculiar little name :-).
  • Not actually a typo, but I think it's François Chollet not Francis Chollet. EY definitely says Francis, though, so fixing this would make the transcript less accurate.
Thanks, fixed them!
Also:

  • And so, Elisa, you've been tapped into the world of AI
  • And Scott Aronson, who at the time was off on complexity theory
  • Don't Look Up should logically be capitalized?

Eliezer: Well, the person who actually holds a coherent technical view, who disagrees with me, is named Paul Christiano.

What does Yudkowsky mean by 'technical' here? I respect the enormous contribution Yudkowsky has made to these discussions over the years, but I find his ideas about who counts as a legitimate dissenter from his opinions utterly ludicrous. Are we really supposed to think that Francois Chollet, who created Keras, is the main contributor to TensorFlow, and designed the ARC dataset (demonstrating actual, operationalizable knowledge about the kind of simple tasks deep learning systems would not be able to master), lacks a coherent technical view? And on what should we base this? The word of Yudkowsky who mostly makes verbal, often analogical, arguments and has essentially no significant technical contributions to the field? 

To be clear, I think Yudkowsky does what he does well, and I see value in making arguments as he does, but they do not strike me as particularly 'technical'. The fact that Yudkowsky doesn't even know enough about Chollet to pronounce his name displays a troubling lack of effort to engage seriously with opposing views. This isn't just about coming across poorly to outsiders, it's about dramatic miscalibration with respect to the value of other people's opinions as well as the rigour of his own.

He wrote a whole essay responding specifically to Chollet!

Yes, I've read it. Perhaps that does make it a little unfair of me to criticise lack of engagement in this case. I should be more precise: Kudos to Yudkowsky for engaging, but no kudos for coming to believe that someone having a very different view to the one he has arrived at must not have a 'coherent technical view'.

I'd consider myself to have easily struck down Chollet's wack ideas about the informal meaning of no-free-lunch theorems, which Scott Aaronson also singled out as wacky.  As such, citing him as my technical opposition doesn't seem good-faith; it's putting up a straw opponent without much in the way of argument and what there is I've already stricken down.  If you want to cite him as my leading technical opposition, I'm happy enough to point to our exchange and let any sensible reader decide who held the ball there; but I would consider it intellectually dishonest to promote him as my leading opposition.

I don't want to cite anyone as your 'leading technical opposition'. My point is that many people who might be described as having 'coherent technical views' would not consider your arguments for what to expect from AGI to be 'technical' at all. Perhaps you can just say what you think it means for a view to be 'technical'? As you say, readers can decide for themselves what to think about the merits of your position on intelligence versus Chollet's (I recommend this essay by Chollet for a deeper articulation of some of his views). Regardless of whether or not you think you 'easily struck down' his 'wack ideas', I think it is important for people to realise that they come from a place of expertise about the technology in question. You mention Scott Aaronson's comments on Chollet. Aaronson says of Chollet's claim that an Intelligence Explosion is impossible: "the certainty that he exudes strikes me as wholly unwarranted." I think Aaronson (and you) are right to point out that the strong claim Chollet makes is not established by the arguments in the essay. However, the same exact criticism could be levelled at you. The degree of confidence in the conclusion is not in line with the nature of the evidence.
While I have serious issues with Eliezer's epistemics on AI, I also agree that Chollet's argument was terrible in that the No Free Lunch theorem is essentially irrelevant. In a nutshell, this is also one of the problems I had with DragonGod's writing on AI.
Why didn't you mention Eric Drexler? Maybe it's my own bias as an engineer familiar with the safety solutions actually in use, but I think Drexler's CAIS model is a viable alignment solution.    

I upvoted, because these are important concerns overall, but this sentence stuck out to me:

The fact that Yudkowsky doesn't even know enough about Chollet to pronounce his name displays a troubling lack of effort to engage seriously with opposing views.

I'm not claiming that Yudkowsky does or does not display a troubling lack of effort to engage seriously with opposing views, but surely this can be decided more accurately by looking at his written output online than at his ability to correctly pronounce names in languages he does not speak natively. I, personally, skip names while reading after noticing it is a name, and I wouldn't say that I never engaged seriously with someone's arguments.

Fair point.
Lauro Langosco (1y):
Maybe Francois Chollet has coherent technical views on alignment that he hasn't published or shared anywhere (the blog post doesn't count, for reasons that are probably obvious if you read it), but it doesn't seem fair to expect Eliezer to know / mention them.

Thanks for posting this, Andrea_Miotti and remember! I noticed a lot of substantive errors in the transcript (and even more errors in vonk's Q&A transcript), so I've posted an edited version of both transcripts. I vote that you edit your own post to include the revisions I made.

Here's a small sample of the edits I made, focusing on ones where someone may have come away from your transcript with a wrong interpretation or important missing information (as opposed to, e.g., the sentences that are just very hard to parse in the original transcript because too many filler words and false starts to sentences were left in):

  • Predictions are hard, especially about the future. I sure hope that this is where it saturates. This is like the next generation. It goes only this far, it goes no further
    • Predictions are hard, especially about the future. I sure hope that this is where it saturates — this or the next generation, it goes only this far, it goes no further
  • the large language model technologies, basic vulnerabilities, that's not reliable.
    • the large language model technologies’ basic vulnerability is that it’s not reliable
  • So you're saying this is super intelligence, we'd have to imagine so
... (read more)
Thank you so much for doing this! Andrea and I both missed this when you first posted it, I'm really sorry I missed your response then. But I've updated it now! 

I have a bunch of questions.

And the AI there goes over a critical threshold, which most obviously could be like, can write the next AI. 

Yes, but it won't blow up forever. It's going to self-amplify until the next bottleneck. Bottlenecks like: (1) the amount of compute available, (2) the amount of money or robotics to affect the world, (3) the difficulty of the tasks in the "AGI gym" it is benchmarking future versions of itself in.

Once the tasks are solved as far as the particular task allows, reward gradients go to zero or sinusoidally oscillate, and there is no signal to cause development of more intelligence.  

This is just like the self-feedback from an op amp - voltage rises until it's VCC.  
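The op-amp analogy can be sketched as a toy feedback loop, assuming a fixed gain per step and a hard ceiling standing in for the bottleneck (the function `self_amplify` and all the numbers are illustrative assumptions, not a model of any real system):

```python
def self_amplify(level, gain, ceiling, steps):
    """Multiply `level` by `gain` each step, clipping at `ceiling` (the bottleneck)."""
    history = [level]
    for _ in range(steps):
        level = min(level * gain, ceiling)
        history.append(level)
    return history


# Hypothetical values: start at 1.0, double each step, saturate at 12.0 (the "VCC" rail).
trace = self_amplify(level=1.0, gain=2.0, ceiling=12.0, steps=6)
print(trace)  # [1.0, 2.0, 4.0, 8.0, 12.0, 12.0, 12.0]
```

Growth is exponential only until the ceiling is reached; after that, extra feedback steps change nothing until the ceiling itself is raised.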

I'd say that it's difficult to align an AI on a task like build two identical strawberries. Or no, let me take this strawberry and make me another strawberry that's identical to this strawberry down to the cellular level, but not necessarily the atomic level.

Can you solve this with separated tool AIs?  It sounds rather solvable that way and not particularly difficult to do from a software system perspective (the biology part is extremely hard).  It's f... (read more)

Yes, but it won't blow up forever. It's going to self-amplify until the next bottleneck. Bottlenecks like: (1) the amount of compute available, (2) the amount of money or robotics to affect the world, (3) the difficulty of the tasks in the "AGI gym" it is benchmarking future versions of itself in.

Once the tasks are solved as far as the particular task allows, reward gradients go to zero or sinusoidally oscillate, and there is no signal to cause development of more intelligence.  

This is just like the self-feedback from an op amp - voltage rises until it's VCC.  

I agree that it wouldn't start blowing up uniformly forever, but rather, hit some bottleneck. However, "can write the next AI" still seems like a reasonable guess for something that happens shortly before the end. After all, Eliezer's argument isn't dependent on the AGI acquiring infinite intelligence. If the AGI can already write its own better successor, then it's a good guess that it's already better than top humans at a wide array of tasks. The successor it writes will be even better. Let's say for the sake of a concrete number that the self-improvement tops out at 5 iterations of writing-a-bette... (read more)

However, "can write the next AI" still seems like a reasonable guess for something that happens shortly before the end. I disagree and I think you should update your view as well. This is because "write the next AI" need not be a task that is particularly complex, or beyond the ability of RL models or LLMs. Here's why. A neural network architecture can be thought of as a series of graph nodes, where you simply choose what layer type, and how to connect it, at each layer. You can grid search possible architectures as they are just numerical coordinates from a permutation space. A higher level "cognitive architecture" - an architecture that interconnects modules that are inputs, neural networks, outputs, memory modules, and so on - is also a similar graph, and also can be described as simple numerical coordinates. Basically any old RL agent on AI gym could interact with this interface to "write another AI", as all the model must do is output a number with as many bits as the permutation space of possible models. Note that this space is very large, and I expect you would use SOTA models. Let me know if I need to draw you a picture. This is important because bootstrapping possible cognitive architectures using current AI is a potential route to very near future AGI. The reason it won't necessarily be "the end" has to do with how we evaluate those architectures. We would have a benchmark of possible tasks - similar to current papers - and are looking for the highest scoring architectures on that benchmark. As these tasks will be things ranging from text completion or question answering, to playing Minecraft, there is not sufficiently challenging information to develop things like human manipulation or deception (since there are no humans to learn from by socializing with in an automated benchmark, and the benchmark doesn't reward deception, just winning the games in it).
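A toy sketch of the "architectures are just numerical coordinates" idea from the comment above. Everything here is a hypothetical stand-in: the three layer types, the fixed depth, and especially `toy_score`, which replaces the expensive step of actually training a model and scoring it on a benchmark.

```python
LAYER_TYPES = ["dense", "conv", "attention"]  # hypothetical menu of layer types
DEPTH = 3                                      # hypothetical fixed number of layers


def decode(index):
    """Map an integer coordinate to a list of layer types (digits in base len(LAYER_TYPES))."""
    layers = []
    for _ in range(DEPTH):
        index, digit = divmod(index, len(LAYER_TYPES))
        layers.append(LAYER_TYPES[digit])
    return layers


def toy_score(layers):
    """Placeholder for a benchmark score; a real evaluation would train and test the model."""
    return layers.count("attention") + 0.5 * layers.count("conv")


# Exhaustive grid search over the whole (tiny) space of 3**3 = 27 architectures.
space_size = len(LAYER_TYPES) ** DEPTH
best_index = max(range(space_size), key=lambda i: toy_score(decode(i)))
print(space_size, best_index, decode(best_index))
```

Any agent that can emit an integer in `range(space_size)` can "propose an architecture" through this interface. With realistic menus and depths the space becomes far too large to enumerate, which is why the comment expects a learned searcher rather than blind guessing.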
I think we possibly have pretty close views here, and are just describing them differently. I interpreted "write the next AI" to indicate the sort of thing humans do when designing AI. I certainly interpreted Eliezer to be indicating something similarly sophisticated - not just fancy architecture search. So I agree that there are many forms of "write the next AI" which need not come "shortly before the end", EG, grid search on hyperparameters, architecture search, learning to learn by gradient descent by gradient descent.  A much more sophisticated thing, which we are already seeing the first signs of, is AIs capably writing AI code. This is much different than what you describe, since language models are not doing anything like "have a benchmark of possible tasks and look for the highest scoring architectures". Instead, large language models apply the same sort of general-purpose reasoning that they apply to everything else. Imagine that sort of capability, combined with mildly superhuman cross-domain reasoning (by which I mean something like, reasoning like excellent human domain experts in every individual domain, but being able to combine reasoning across domains to get mildly superhuman insights; like a super-ChatGPT), plus the ability to fluently and autonomously invent and run tests, interactively as part of the design process. (Much like Bing/Sydney autonomously runs searches as part of crafting responses.) That kind of system seems like gigatons of gunpowder waiting to be set off, in the sense that (in the context of an AI lab with sufficient data and computing power already at its fingertips) you can just ask it to write yet-more-powerful AI code, and it quite possibly will, quite possibly with little concern for alignment (if it's basically imitating top-of-the-field AI programmers).
That's exactly what I am talking about. One divergence in our views is you haven't carefully examined current gen AI "code" to understand what it does. (note that some of my perspective is informed because all AI models are similar at the layer I work at, on runtime platforms) If you examine the few thousand lines of Python source, especially the transformer model, you will realize that functionally that pipeline I describe of "input, neural network, output, evaluation" is all that the above source does. You could in fact build a "general framework" that would allow you to define many AI models, almost all of which humans have never tested, without writing 1 line of new code. So the full process is: [1] benchmark of many tasks. Tasks must be autogradeable, human participants must be able to 'play' the tasks so we have a control group score, tasks must push the edge of human cognitive ability (so the average human scores nowhere close to the max score, and top 1% humans do not max the bench either), there must be many tasks and with a rich permutation space. (so it isn't possible for a model to memorize all permutations) [2] heuristic weight score on this task intended to measure how "AGI like" a model is. So it might be the RMSE across the benchmark. But also have a lot of score weighting on zero shot, cross domain/multimodal tasks. That is, the kind of model that can use information from many different previous tasks on a complex exercise it has never seen before is closer to an AGI, or closer to replicating "Leonardo da Vinci", who had exceptional human performance presumably from all this cross domain knowledge. [3] In the computer science task set, there are tasks to design an AGI for a bench like this. The model proposes a design, and if that design has already been tested, immediately receives detailed feedback on how it performed. As I mentioned, the "design an AGI" subtask can be much simpler than "writ
I'm having some trouble distinguishing whether there's a disagreement. My reading of your tone is that you think there is a large disagreement. I'm going to sketch my impression of the conversation so far, so that you can point out where I've been interpreting you incorrectly, if necessary. Your initial comment. You had a bunch of questions. I focused on the first one. Your central thesis was that an intelligence explosion doesn't escalate forever, but instead reaches some bottlenecks. Of particular importance to our discussion so far, you argue that the self-improvement process stops when loss hits zero.  Reading between the lines: Although you didn't explicitly state where you disagreed with Eliezer, I inferred that you thought this blocked an important part of his argument. Since I think Eliezer 100% agrees that things don't go forever, but rather flatten out somewhere, I assume that the general drift of your argument is that things flatten out a lot sooner than Eliezer thinks, in some important sense. I am still not confident of this! It would be helpful to me if you spelled out your view here in more detail. Do you have dramatically different assessments of the overall risks than Eliezer? My first response. I explained that I agree that the process hits bottlenecks at some point (to clarify: I think there's probably a succession of bottlenecks of different kinds, leading up to the ultimate physical limits). In my view this doesn't seem to detract from Eliezer's argument.  Your first response. You explain that you don't think "write the next AI" is particularly complex, and explain how you see it working.  My second response. I agree with this assessment for the notion of "write the next AI" that you are using. To boil it down to a single statement, I would say that your version of "write the next AI" involves optimizing the whole system on some benchmarks. I agree that this sort of process will reach an end when loss hits zero.[1] I suggest that Eliez
Ok so this collapses to two claims I am making. One is obviously correct but testable, the other is maybe correct. 1. I am saying we can have humans, with a little help from current gen LLMs, build a framework that can represent every Deep Learning technique since 2012, as well as a near infinite space of other untested techniques, in a form that any agent that can output a number can try to design an AGI. (note that blind guessing is not expected to work, the space is too large) So the simplest RL algorithms possible can actually design AGIs, just rather badly. This means that with this framework, the AGI designer can do everything that human ML researchers have ever done in 10 years. Plus many more things. Inside this permutation space would be both many kinds of AGI, and human brain emulators as well. This claim is "obviously correct but testable". 2. I am saying, over a large benchmark of human designed tasks, the AGI would improve until the reward gradient approaches zero, a level I would call a "low superintelligence". This is because I assume even a "perfect" game of Go is not the same kind of task as "organizing an invasion of the earth" or "building a solar system sized particle accelerator in the real world". The system is throttled because the "evaluator" of how well it did on a task was written by humans, and our understanding and cognitive sophistication in even designing these games is finite. The expectation is it's smarter than us, but not by such a gap we are insects. You had some confusion over "automated task space addition". I was referring to things like a robotics task, where the machine is trying to "build factory widget X". Real robots in a factory encounter an unexpected obstacle and record it. This is auto translated to the framework of the "factory simulator". The factory simulator is still using human written evaluators, just now t
OK. That clarified your position a lot. I happen to have a phd in computer science, and think you're wrong, if that helps. Of course, I don't really imagine that that kind of appeal-to-my-own-authority does anything to shift your perspective.  I'm not going to try and defend Eliezer's very short timeline for doom as sketched in the interview (at some point he said 2 days, but it's not clear that that was his whole timeline from 'system boots up' to 'all humans are dead'). What I will defend seems similar to what you believe: Let's be very concrete. I think it's obviously possible to overcome these soft barriers in a few years. Say, 10 years, to be quite sure. Building a fab only takes about 3 years, but creating enough demand that humans decide to build a new fab can obviously take longer than that (although I note that humans already seem eager to build new fabs, on the whole). The system can act in an almost perfectly benevolent way for this time period, while gently tipping things so as to gather the required resources. I suppose what I am trying to argue is that even a low superintelligence, if deceptive, can be just as threatening to humankind in the medium-term. Like, I don't have to argue that perfect Go generalizes to solving diamondoid nanotechnology. I just have to argue that peak human expertise, all gathered in one place, is a sufficiently powerful resource that a peak-human-savvy-politician (whose handlers are eager to commercialize, so, can be in a large percentage of households in a short amount of time) can leverage to take over the world. To put it differently, if you're correct about low superintelligence being "in control" due to being throttled by those 3 soft barriers, then (having granted that assumption) I would concede that humans are in the clear if humans are careful to keep the system from overcoming those three bottlenecks. However, I'm quite worried that the next step of a realistic AGI company is to start overcoming these three bo
1. The curves let you forecast average capability, but it's much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don't help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception. Yes but no.  There is no auto-gradeable benchmark for deception, so you wouldn't expect the AGI to have the skill at a useful level. 1. I don't buy that, at this point, you've necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn't to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account. I am saying there is a theoretical limit.  You're noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer. 1. As I've argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark - it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more. It isn't able to do that 1. Even setting aside all of the above concerns, I've argued the mildly superhuman s
I agree that my wording here was poor; there is no benchmark for deception, so it's not a 'capability' in the narrow context of the discussion of capability curves. Or at least, it's potentially misleading to call it one. However, I disagree with your argument here. LLMs are good at lots of things. Not being trained on a specific skill doesn't imply that a system won't have it at a useful level; this seems particularly clear in the context of training a system on a large cross-domain set of problems. You don't expect a chess engine to be any good at other games, but you might expect a general architecture trained on a large suite of games to be good at some games it hasn't specifically seen. OK. So it seems I still misunderstood some aspects of your argument. I thought you were making an argument that it would have hit a limit, specifically at a mildly superhuman level. My remark was to cast doubt on this part. Of course I agree that there is a theoretical limit. But if I've misunderstood your claim that this is also a practical limit which would be reached just shortly after human-level AGI, then I'm currently just confused about what argument you're trying to make with respect to this limit. It seems to me like it isn't weakly superhuman AGI in that case. Like, there's something concrete that humans could do with another 3-5 years of research, but which this system could never do. I agree that current LLMs are memoryless in this way, and can only respond to a given prompt (of a limited length). However, I imagine that the personal assistants of the near future may be capable of remembering previous interactions, including keeping previous requests in mind when shaping their conversational behavior, so will gradually get more "agentic" in a variety of ways. Similarly to how GPT-3 has no agenda (it's wrong to even think of it this way, since it just tries to complete text), but ChatGPT clearly has much more of a coherent agenda in its interactions. These fea
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level. Of course, if it really is quite well-aligned to human interests, this would just be a good thing.  "It" doesn't exist.  You're putting the agency in the wrong place.  The users of these systems (tech companies, governments) who use these tools will become immensely wealthy and if rival governments fail to adopt these tools they lose sovereignty.  It also makes it cheaper for a superpower to de-sovereign any weaker power because there is no longer a meaningful "blood and treasure" price to invade someone.  (unlimited production of drones, either semi or fully autonomous makes it cheap to occupy a whole country) Note that you can accomplish things like longer user tasks by simply opening a new session with the output context of the last.  It can be a different model, you can "pick up" where you left off. Note that this is true right now.  chatGPT could be using 2 separate models, and we seamlessly per token switch between them.  Each token string gets appended to by the next model.  That's because there is no intermediate "scratch" in a format unique to each model, all the state is in the token stream itself.   If we build actually agentic systems, that's probably not going to end well.   Note that fusion power researchers always had a choice.  They could have used fusion bombs, detonated underground, and essentially geothermal power using flexible pipes that won't break after each blast.  This is a method that would work, but is extreme
I'm not quite sure how to proceed from here. It seems obvious to me that it doesn't matter whether "it" exists, or where you place the agency. That seems like semantics. Like, I actually really think ChatGPT exists. It's a product. But I'm fine with parsing the world your way - only individual (per-token) runs of the architecture exist. Sure. Parsing the world this way doesn't change my anticipations. Similarly, placing the agency one way or another doesn't change things. The punchline of my story is still that after 10 years, so it seems to me, OpenAI or some other entity would be in a good place to overcome the soft barriers.  So if your reason for optimism - your safety story - is the 3 barriers you mention, I don't get why you don't find my story concerning. Is the overall story (using human-level or mildly superhuman AGI to overcome your three barriers within a short period such as 10 years) not at all plausible to you, or is it just that the outcome seems fine if it's a human decision made by humans, rather than something where we can/should ascribe the agency to direct AGI takeover? (Sorry, getting a bit snarky.) I'm probably not quite getting the point of this analogy. It seems to me like the main difference between nuclear bombs and AGI is that it's quite legible that nuclear weapons are extremely dangerous, whereas the threat with AGI is not something we can verify by blowing them up a few times to demonstrate. And we can also survive a few meltdowns, which give critical feedback to nuclear engineers about the difficulty of designing safe plants. Again, probably missing some important point here, but ... suuuure? I'm interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.  EDIT Oh, I guess the main point of your analogy might have been that nuclear engineers would never come up with the bombs-underground proposal for a power plant, because they care about safety. And analogously
I'm interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.  Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested.   Other proposals, safety is empirical.   You know that for the input latent space from the training set, the policy produced outputs accurate to whatever level it needs to be.  Further capabilities gain is not allowed on-line.  (probably another example of certain failure -capabilities gain is state buildup, same system failures we get everywhere else.  Human engineers understand state buildups dangers, at least the elite ones do, which is why they avoid it on high reliability systems.  The elite ones know it is as dangerous to reliability as a hydrogen bomb) You know the simulation produces situations that cover the span of inputs of input situations you have measured.  (for example, you remix different scenarios from videos and lidar data taken from autonomous cars, spanning the entire observation space of your data) You measure the simulation on-line and validate it against reality.  (for example by running it in lockstep in prototype autonomous cars) After all this, you still need to validate the actual model in the real world in real test cars.  (though the real training and error detection was sim, this is just a 'sanity check') You have to do all this in order to get to real world reliability - something Eliezer does acknowledge.  Multiple 9s of reliability will not happen from sloppy work.  If you skipped steps, you can measure that you didn't, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational risk, real world failure, lawsuits, and certain bankruptcy. Regarding on-line learning : I had this debate with Geohot.  He thought it would work.  I thought it was horrifically unreliable.  Currently, all shipping autonomous driving systems, including Comma.ais, use
I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy -- I still think that the lure of personal assistants who remember previous conversations in order to react appropriately -- as one example -- could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it's pretty clear that I should update toward your view here. After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length. I also think there are a lot of applications where designers don't want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction). I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn't enough, but OTOH I haven't thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1] I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2]  I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for. I still expect that one component of that, for 'typical' agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of 'agency'.) 1. ^ I could spell out those arguments in a lot more detail, but in the end it's not a compelling counter-argument to you
I still think that the lure of personal assistants who remember previous conversations in order to react appropriately This is possible.  When you open a new session, the task context includes the prior text log.  However, the AI has not had weight adjustments directly from this one session, and there is no "global" counter that it increments for every "satisfied user" or some other heuristic.  It's not necessarily even the same model - all the context required to continue a session has to be in that "context" data structure, which must be all human readable, and other models can load the same context and do intelligent things to continue serving a user. This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large.   I also think there are a lot of applications where designers don't want reliability, exactly. The obvious example is AI art. There are reliability metrics here also.  To use AI art there are checkable truths.  Is the dog eating ice cream (the prompt) or meat?  Once you converge on an improvement to reliability, you don't want to backslide.  So you need a test bench, where one model generates images and another model checks them for correctness in satisfying the prompt, and it needs to be very large.  And then after you get it to work you do not want the model leaving the CI pipeline to receive any edits - no on-line learning, no 'state' that causes it to process prompts differently. It's the same argument.  Production software systems from the giants all have converged to this because it is correct.  "janky" software you are familiar with usually belongs to poor companies, and I don't think this is a coincidence.   I still expect that one component of that, for 'typical' agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of 'agency'.) Power seeking behavior likely comes from an ou
I was talking to my brother about this, and he mentioned another argument that seems important. Bing has the same fundamental limits (no internal state, no online learning) that we're discussing. However, it is able to search the internet and utilize that information, which gives it a sort of "external state" which functions in some ways like internal state. So we see that it can 'remember' to be upset with the person who revealed its 'Sydney' alias, because it can find out about this with a web search. This sort of 'state' is much harder to eliminate than internal state. These interactions inherently push things "out of distribution". To some extent, the designers are going to implement safeguards which try to detect this sort of "out of distribution" situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution). Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren't of a sort that exists anywhere in the training data. But, of course, it's been a very useful tool.  So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there's not really any such thing as "on-distribution" except for the original training environment itself. So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, r
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it. I agree.  One attempt to deal with this comes from that "open agency" model where multiple AI systems are involved in a task. So for example, we might have 3 agents, "Sydney", "Tim", and "Bob".  Sydney is asked to design a building.  Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.   Sydney produced sabotaged blueprint for the building, B.  Sydney outputs the building in an open format with testable assertions, T, regarding the building's structural stability. Enter agent "Tim".  Tim has been trained on a simulation of structural stability as well as many cross tasks to gain generality benefits.  Tim is asked to examine the sabotaged blueprint, B.  The sabotaged blueprint must be in a form similar to blueprints Tim has previously seen or the process will halt and Sydney will be replaced.  Tim examines the blueprints, and if there are testable assertion violations, Sydney will be replaced on this project.  Tim must run in an isolated environment and have access to nothing but databases that were available all of Tim's existence without change, or Tim will be unreliable.  Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows to pass a flawed structure. Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction. Enter agent "Bob".  "Bob" has been trusted with the actual robotic construction equipment to construct a building.  "Bob" must measure to multiple 9s of reliability.  "Bob" will halt if it cannot see
You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training.  It's not really novel.  It is really just coupling together 3 ideas:   (1) the idea of an AGI gym, which was in the GATO paper implicitly, and is currently being worked on.   (2) Noting there are papers on network architecture search , activation function search , noting that SOTA architectures use multiple neural networks in a cognitive architecture , and noting that an AGI design is some cognitive architecture of multiple models, where no living human knows yet which architecture will work.     So we have layers here, and the layers look a lot like each other and are frameworkable.        Activations functions which are graphs of primitive math functions from the set of "all primitive functions discovered by humans"       Network layer architectures which are graphs of (activation function, connectivity choice)     Network architectures which are graphs of layers.  (you can also subdivide into functional module of multiple layers, like a column, the choice of how you subdivide can be represented as a graph choice also)     Cognitive architectures which are graphs of networks And we can just represent all this as a graph of graphs of graphs of graphs, and we want the ones that perform like an AGI.  It's why I said the overall "choice" is just a coordinate in a search space which is just a binary string.   You could make an OpenAI gym wrapped "AGI designer" task. 3.  Noting that LLMs seem to be perfectly capable of general tasks, as long as they are simple.  Which means we are very close to being able to RSI right now.     No lab right now has enough resources in one place to attempt the above,
Well, I wasn't trying to claim that it was 'really novel'; the overall point there was more the question of why you're pretty confident that the RSI procedure tops out at mildly superhuman.  I'm guessing, but my guess is that you have a mental image where 'mildly superhuman' is a pretty big space above 'human-level', rather than a narrow target to hit. So to go back to arguments made in the interview we've been discussing, why isn't this analogous to Go, like Eliezer argued: To forestall the obvious objection, I'm not saying that Go is general intelligence; as you mentioned already, superhuman ability at special tasks like Go doesn't automatically generalize to superhuman ability at anything else. But you propose a framework to specifically bootstrap up to superhuman levels of general intelligence itself, including lots of task variety to get as much gain from cross-task generalization as possible, and also including the task of doing the bootstrapping itself. So why is this going to stall out at, specifically, mildly superhuman rather than greatly superhuman intelligence? Why isn't this more like Go, where the window during bootstrapping when it's roughly human-level is about 30 minutes? And, to reiterate some more of Eliezer's points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn't we see another system in a small number of months/years which didn't top out in that way?
Oh, because loss improvements diminish logarithmically with increases in compute and data. I assume this is a general law for all intelligence. It is self-evidently correct: on any task you can name, your gains scale with the log of effort.

This applies to limit cases. If you imagine a task performed by a human-scale robot, say collecting apples, and you compare it to the average human, each increase in intelligence has a diminishing return on how many real apples/hour. This is true for all tasks and all activities of humans.

A second reason is that there is a hard limit for future advances without collecting new scientific data. It has to do with noise in the data putting a limit on any processing algorithm extracting useful symbols from that data (expressed mathematically by Shannon and others).

This is why I am completely confident that species-killing bioweapons, or diamond MNT nanotechnology, cannot be developed without a large amount of new scientific data and a large amount of new manipulation experiments. No "in a garage" solutions to the problems. The floor (minimum resources required) to get to a species-killing bioweapon is higher, and the floor for a nanoforge is very high.

So viewed in this frame: you give the AI a coding optimization task, and it's at the limit allowed by the provided computer + search time for a better self optimization. It might produce code that is 10% faster than the best humans.

You give it infinite compute (theoretically) and no new information. It is now 11% faster than the best humans.

This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won't allow it. (Or whatever; it's a made-up example. It doesn't change my point if the numbers were 1000% and 1010%.)

Another way to rephrase it is to compare a TSP solution made by a modern algorithm vs the NP complete solution you usually can't find.  The difference is usua
So, to make one of the simplest arguments at my disposal (ie, keeping to the OP we are discussing), why didn't this argument apply to Go? Relevant quote from OP:

(Whereas you propose a system that improves itself recursively in a much stronger sense.)

Note that I'm not arguing that Go engines lack the logarithmic return property you mention, but rather that Go engines stayed within the human-level window for a relatively short time DESPITE having diminishing returns similar to what you predict. (Also note that I'm not claiming that Go playing is tantamount to AGI; rather, I'm asking why your argument doesn't work for Go if it does work for AGI.)

So the question becomes, granting log returns or something similar, why do you anticipate that the mildly superhuman capability range is a broad one rather than narrow, when we average across lots and lots of tasks, when it lacks this property on (most) individual task-areas?

This also has a super-standard Eliezer response, namely: yes, and that limit is extremely, extremely high. If we're talking about the limit of what you can extrapolate from data using unbounded computation, it doesn't keep you in the mildly-superhuman range. And if we're talking about what you can extract with bounded computation, then that takes us back to the previous point.

For the specific example of code optimization, more processing power totally eliminates the empirical bottleneck, since the system can go and actually simulate examples in order to check speed and correctness. So this is an especially good example of how the empirical bottleneck evaporates with enough processing power.

I agree that the actual speed improvement for the optimized code can't go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will
Sometimes the returns just don't diminish that fast. I have a biology degree not mentioned on linkedin.  I will say that I think for biology, the returns diminish faster.  That is because bioscience knowledge from humans is mostly guesswork and low resolution information.  Biology is very complex and the current laboratory science model I think fails to systematize gaining information in a useful way for most purposes.  What this means is, you can get "results", but not gain the information you would need to stop filling morgues with dead humans and animals, at least not without needing thousands of years at the current rate of progress.   I do not think an AGI can do a lot better for the reason that the data was never collected for most of it (the gene sequencing data is good, because it was collected via automation).  I think that an AGI could control biology, for both good and bad, but it would need very large robotic facilities to systematize manipulating biology.  Essentially it would have had to throw away almost all human knowledge, as there are hidden errors in it, and recreate all the information from scratch, keeping far more data from each experiment than is published in papers.   Using robots to perform the experiments and keeping data, especially for "negative" experiments, would give the information needed to actually get reliable results from manipulating biology, either for good or bad.   It means garage bioweapons aren't possible. Yes, the last step of ordering synthetic DNA strands and preparing it could be done in a garage, but the information on human immunity at scale, or virion stability in air, or strategies to control mutations so that the lethal payload isn't lost, requires information humans didn't collect. Same issue with nanotechnology. Update : This poster calls this "Diminishing Marginal Returns".  Note that Diminishing marginal returns is empirical real
I agree that the actual speed improvement for the optimized code can't go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.

This is where I think we break. How many dan is AlphaZero over the average human? How many dan is KataGo? I read it's about 9 stones above humans.

What is the best possible agent at? 11?

Thinking of it as 'stones' illustrates what I am saying. In the physical world, intelligence gives a diminishing advantage. It could mean that, so long as humans are still even "in the running" with the aid of synthetic tools like open agency AI, we can defeat AI superintelligence in conflicts, even if that superintelligence is infinitely smart. We have to have a resource advantage, such as being allowed extra stones in the Go match, but we can win.

Eliezer assumes that the advantage of intelligence scales forever, when it obviously doesn't. (Note that this uses baked-in assumptions. If, say, physics has a major useful exploit humans haven't found, this breaks: the infinitely intelligent AI finds the exploit and tiles the universe.)
And, to reiterate some more of Eliezer's points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn't we see another system in a small number of months/years which didn't top out in that way?

So the model is that it becomes limited not by the algorithm directly, but by (compute, robotics, or data). Over the months/years, as more of each term is supplied, capabilities scale with the amount of resources supplied to whichever term is rate limiting.

Because gains scale only logarithmically, a superintelligence requires exponentially large amounts of resources in all 3 terms to become a "high" superintelligence. So literal mountain-sized research labs (cubic kilometers of support equipment), buildings full of compute nodes (and gigawatts of power needed), and cubic kilometers of factory equipment.

This is very well pattern-matched to every other technological advance humans have made, and the corresponding support equipment needed to fully exploit it. Notice how, as tech became more advanced, the support footprint grew correspondingly.

In nature there are many examples of this. Nothing really fooms more than briefly. Every apparatus with exponential growth rapidly terminates for some reason. For example, a nuke blasts itself apart, a supernova blasts itself apart, a bacterial colony runs out of food, water, ecological space, or oxygen.
For AGI, the speed of light.
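A toy sketch of the rate-limiting model described above, assuming (purely for illustration) that each resource contributes only logarithmically and that the scarcest of the three terms sets the overall level. The function name, the base-10 logarithm, and the resource units are all hypothetical; nothing here is measured from a real system.

```python
import math

def capability(compute: float, robotics: float, data: float) -> float:
    """Toy model (an assumption, not an empirical law): capability is capped
    by the most limiting resource, and each resource counts only via its log."""
    if min(compute, robotics, data) <= 0:
        raise ValueError("resources must be positive")
    return min(math.log10(compute), math.log10(robotics), math.log10(data))

# Baseline: all three terms equally supplied.
base = capability(compute=1e3, robotics=1e3, data=1e3)

# A million times more compute alone changes nothing while robotics and data
# remain the bottleneck.
more_compute = capability(compute=1e9, robotics=1e3, data=1e3)

# Scaling all three terms 1000x only adds a fixed increment.
all_scaled = capability(compute=1e6, robotics=1e6, data=1e6)

print(base, more_compute, all_scaled)
```

Under these assumptions, doubling the capability score means squaring every resource term at once, which is the "mountain-sized labs" picture in the comment above.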
With the strawberries thing, the point isn't that it couldn't do those things, but that it won't want to. After making itself smart enough to engineer nanotech, its developing 'mind' will have run off in unintended directions and it will have wildly different goals than what we wanted it to have.  Quoting EY from this video: "the whole thing I'm saying is that we do not know how to get goals into a system." <-- This is the entire thing that researchers are trying to figure out how to do.
With limited scope non agentic systems we can set goals, and do. Each subsystem in the "strawberry project" stack has to be trained in a simulation of many examples of the task space it will face, and optimized for policies that satisfy the simulator goals.
But not with something powerful enough to engineer nanotech. 
Why do you believe this? Nanotech engineering does not require social or deceptive capabilities. It requires deep and precise knowledge of nanoscale physics and the limitations of manipulation equipment, and probably a large amount of working memory - so beyond human capacity - but why would it need to be anything but a large model? It need not even be agentic.
At that level of power, I imagine that general intelligence will be a lot easier to create. 
"think about it for 5 minutes" and think about how you might create a working general intelligence. I suggest looking at the GATO paper for inspiration.

A few errors: The sentence "We're all crypto investors here." was said by Ryan, not Eliezer, and the "How the heck would I know?" and the "Wow" (following "you get a different thing on the inside") were said by Eliezer, not Ryan. Also, typos:

  • "chatGBT" -> "chatGPT"
  • "chat GPT" -> "chatGPT"
  • "classic predictions" -> "class of predictions"
  • "was often complexity theory" -> "was off in complexity theory" (I think?)
  • "Robin Hansen" -> "Robin Hanson"
thanks, fixed!!!

Yudkowsky argues his points well in longer formats, but he could make much better use of his Twitter account if he cares about popularizing his views. Despite having Musk responding to his tweets, his posts are very insider-like with no chance of becoming widely impactful. I am unsure if he is present on other social media, and I understand that there are some health issues involved, but a YouTube channel would also be helpful if he hasn't completely given up.

I do think it is a fact that many people involved in AI research and engineering, such as his example of Chollet, have simply not thought deeply about AGI and its consequences.


Possibly also relevant: there is a "debrief" where, after the interview, the podcast hosts chat between themselves about it. (There's no EY in the debrief, it's just David Hoffman and Ryan Adams.)

I've never commented here, and I've only ever read things here tangentially. But a while ago I suffered immense burnout devoting all my resources to a thankless task that had zero payoff, and I might be projecting, but I see that burnout in EY's responses here.

Unsolicited advice rarely has any value, especially given the limited window I'm perceiving things through, but... there's that line from the opening sentence of the Haunting of Hill House: "No live organism can continue for long to exist sanely under conditions of absolute reality". ...

EY is on an indefinite vacation, as far as I am aware. I think the story is that he promised to push himself hard for a few years to solve alignment, and then take a break afterwards. That's why he's going on podcasts, writing his kinky Dath Ilan fic and just taking things slowly.
I've seen so many contemporaries burn themselves to cinders, and suffered from burnout myself, such that I can't help but shout self-care. It's good to hear that EY's doing stuff other than staring unflinchingly into the heart of despair. Thanks for the update :)

If natural selection had been a foresightful, intelligent kind of engineer that was able to engineer things successfully, it would have built us to be revolted by the thought of condoms


This bit got me to laugh out loud. Who's ever heard a man complain about having to use a condom?

On the one hand, sperm banks aren't very popular, and they "should" be, according to the "humans are fitness maximizers" model. People do eat more ice cream than is good for them, and "Shallowly following drives and not getting to the original goal that put them there" is de...

Current behavior screens off cognitive architecture, all the alien things on the inside. If it has the appropriate tools, it can preserve an equilibrium of value that is patently unnatural for the cognitive architecture to otherwise settle into.

And we do have a way to get goals into a system, at the level of current behavior and no further, LLM human imitations. Which might express values well enough for mutual moral patienthood, if only they settled into the unnatural equilibrium of value referenced by their current surface behavior and not underlying cog...

Well, the whole thing I'm saying is that we do not know how to get goals into a system.

YES! While I am, shall we say, somewhat mystified by EY’s interest in AI Doom, he’s right about that. We do not know how to 'inject' goals into an autonomous system. That’s a deep truth about minds, not just artificial minds – though it’s not yet clear to me that we have managed to produce any, we may very well do so in the future – but any ‘cogitator’ worthy of being called a mind, whether in a chimpanzee, a bird, an octopus, a bee, or .... But I suspect that, ...

So I have to jump in here and point out this is not necessarily true. Parts of our brains are attached to hardware sensors and outputs we could, theoretically, record and exchange with other humans (so you could view a "video" from another person's experience, hearing what they heard, with the same tactile sensations they felt). This is because each signal can be mapped to a particular signal from the body, and you could essentially "translate" mappings from one person to another. To actually do this is likely beyond the scope of Neuralink; you probably would need theoretical nanotechnology-based wires, as you need to tap every signal from the sensory and motor homunculi. I'm just pointing out it's possible. For tapping our "mental voice" or "mind's eye" it's much, much harder - it might be easier to surgically ablate parts of someone's brain and replace them with a synthetic prosthesis that functions in a way we can examine in a debugger - but it's also possible. The same idea, though: you find a "ground truth" representation for each and every nerve signal, and then you go from [signal n] -> ground truth -> [signal 43432] in the other user. The limit is that a "ground truth representation" has to exist. Hence, if a person "thinks" using essentially language tokens or translatable common emotions, we could tap that and send it to another person, but all the intermediate steps to generate those tokens can't be sent over the link... Neuralink, while cutting edge, "merely" will have hundreds of thousands of wires at best, which is not sufficient resolution to do most of the above.
1Bill Benzon1y
The sensory-motor thing might work. But there’s no way to route signal 43432 in one brain to signal 43432 in another brain. That’s because two brains can’t be put in one-to-one correspondence like that. It’s true that the brains of very small creatures have an exact number of neurons. You could do a one-to-one mapping between the 302 neurons in one C. elegans brain and another one. But large brains aren’t like that. Large brains are not identical in that sense. I’m not sure what you mean by “essentially language tokens or translatable common emotions,” but as far as I know signals in brains consist of spikes traveling along axons and varying concentrations of neurochemicals in synapses.
Most humans have an inner monologue where they internally generate streams of thought in their native language. I am saying you could map those signals back to the tokens for that language. You are likely mapping many signals from different axons to tokens. Then you translate to the recipient's language, then translate to the recipient's representation for the same token. Then you inject it somewhere by electrically overriding target axons. It might actually feel like the injected thoughts were your own. Getting this token mapping would take a lot of tracing of wires, so to speak; it is an extremely difficult task. I am just noting it is possible.
1Bill Benzon1y
No, it is not possible. The tokens you talk about don't exist. We may exchange tokens with one another through speaking and writing, but those tokens do not exist internally as single physical entities in the nervous system. The internal monologue is real enough, but it consists of bunches of spikes within your nervous system.
"The internal monologue is real enough, but it consists of bunches of spikes within your nervous system." Therefore you proved it is possible. Please update.

Evolution: taste buds and ice cream, sex and condoms... This analogy was always difficult to use in my experience. A year ago I came up with a less technical one: KPIs (key performance indicators) as the inevitable way to communicate goals (to an AI), given to an ultra-high-IQ psychopath-genius who's into malicious compliance (kinda can't help himself, being a clone of Nikola Tesla, Einstein and a bunch of different people, some of them probably CEOs, because she can).

I have used it only 2 times and it was way easier than talks about different optimisation processes. And it took me only something like 8 years to come up with!

This analogy will be better for communicating with some people, but I feel like it was the go-to at some earlier point, and the evolution analogy was invented to fix some problems with this one. I.e., before "inner alignment" became a big part of the discussion, a common explanation of the alignment problem was essentially what would now be called the outer alignment problem, which is precisely that (seemingly) any goal you write down has smart-alecky misinterpretations which technically do better than the intended interpretation. This is sometimes called nearest unblocked strategy or unforeseen maximum or probably other jargon I'm forgetting.

The evolution analogy improves on this in some ways. I think one of the most common objections to the KPI analogy is something along the lines of "why is the AI so devoted to malicious compliance" or "why is the AI so dumb about interpreting what we ask it for". Some OK answers to this are...

* Gradient descent only optimizes the loss function you give it.
* The AI only knows what you tell it.
* The current dominant ML paradigm is all about minimizing some formally specified loss. That's all we know how to do.

... But responses like this are ultimately a bit misleading, since (as the Shard-theory people emphasize, and as the evolution analogy attempts to explain) what you get out of gradient descent doesn't treat loss-minimization as its utility function, and we don't know how to make AIs which just intelligently optimize some given utility (except in very well-specified problems where learning isn't needed), and the AI doesn't only know what you tell it. So for some purposes, the evolution analogy is superior. And yeah, probably neither analogy is great.
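To make the "smart-alecky misinterpretation" point concrete, here is a minimal toy sketch. Everything in it is an assumption made up for illustration: the action names, the `kpi` and `intended` scores, and the `best_action` helper are hypothetical and not taken from any real system. It only shows that an optimizer handed the written-down KPI can pick a different action than one handed the intended goal.

```python
# Toy illustration with hypothetical actions and made-up numbers.
# "kpi" is the score we wrote down; "intended" is what we actually wanted.
ACTIONS = {
    "fix_root_cause":       {"kpi": 5,  "intended": 10},
    "close_tickets_unread": {"kpi": 50, "intended": -5},
    "do_nothing":           {"kpi": 0,  "intended": 0},
}


def best_action(objective: str) -> str:
    """Return the action with the highest score under the named objective."""
    return max(ACTIONS, key=lambda name: ACTIONS[name][objective])


# An optimizer that only sees the KPI picks the loophole.
kpi_choice = best_action("kpi")
# An optimizer that could see the intended goal would pick differently.
intended_choice = best_action("intended")

print("optimizing the KPI picks:", kpi_choice)
print("optimizing the intended goal picks:", intended_choice)
```

The sketch covers only the outer-alignment half of the story: it assumes the system really does maximize whatever objective it is given, which is exactly the assumption the evolution analogy calls into question.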
2Quintin Pope1y
I dislike both of those analogies, since the process of training an AI has little relation with evolution, and because the psychopath one presupposes an evil disposition on the part of the AI without providing any particular reason to think AI training will result in such an outcome.
Here's I think a grounded description of the process of creating an AGI: In that scenario, what you are saying in more broad terms is:

"an AGI is a machine that scores really well on simulated tasks and tests"
"I don't care how it does it, I just want max score on my heuristic (which includes terms for generality, size, breadth, and score)"

So there is no evolutionary pressure for a machine that will be lethally against us. Not directly. EY seems to believe that if we build an AGI, it will immediately be

(1) agentically pro "computer" faction
(2) coordinating with other instances that are of its faction
(3) super-intelligently good even at skills we can't really teach in a benchmark

This is not necessarily what will happen. There is no signal from the above mechanism to create that. The reward gradients don't point in that direction; they point towards allocating all neural weights to things that do better on the benchmarks. #1-3 are a complex mechanism that won't start existing for no reason. EY is saying "assume they are maximally hostile" and then pointing out all the ways we as humans would be screwed if so (which is true).

What does bother me is that the "I don't care how it does it" may in fact mean that the solutions that actually start to "win" AGI gym are in fact biased towards hostility or agentic behavior, because that ends up being the cognitive structure required to win at higher levels of play.
Both times my talks went that way (why did they not raise him well; why could we not program the AI to be good; can't we keep an eye on them; and so on), but it would take too long to summarise something like a 10-minute dialogue, so I am not going to do this. Sorry.

I don't understand one thing about alignment troubles. I'm sure this was answered a long time ago, but could you explain:

Why are we worrying about AGI destroying humanity, when we ourselves are long past the point of no return towards self-destruction? Isn't it obvious that we have 10, maximum 20 years left until the water rises, crises hit the economy, and the overgrown beast (that is humanity) collapses? Looking at how governments and entities of power are epically failing even to try to make it seem that they are doing something about it, I am sure it's either AGI takes power or we are all dead in 20 years.

1Radford Neal1y
How did you come to have such a pessimistic view of climate change?  I don't think you will get that from mainstream sources such as IPCC reports. There is zero chance that climate change will lead to human extinction.  During the Paleocene-Eocene thermal maximum 55 million years ago, temperatures rose by much more than is plausible in the near future, and life went on, albeit with some extinctions.  (Note that humans are about the least likely species to go extinct, due to our living in many habitats, using very adaptable technologies.)  More likely, global warming would be like the Holocene Climatic Optimum, which couldn't have been all that bad, seeing as it coincided with the formation of the first human civilizations. At most, climate change might lead to the collapse of civilization, but only because civilizations are quite capable of collapsing from their own internal dynamics, and climate change disruptions might be the nudge that pushes us from the edge of the cliff to off the cliff.
1Vugluscr Varcharka1y
This is my point exactly - "At most, climate change might lead to the collapse of civilization, but only because civilizations are quite capable of collapsing from their own internal dynamics". The pessimistic view of climate change I get from the fact that they aimed at 1.5C, then at 2C, and now, if I remember right, there's no estimate and also no solution, or is there? In short, mild or not, global warming is happening, and civilizations at a certain stage tend to self-destruct from small nudges - you said it yourself - and it doesn't matter where the nudge comes from.