How Dario Amodei's “The Adolescence of Technology” Delegitimizes AI X-Risk Concerns

Liron; Harlan

My beef with Anthropic

I've long felt that while Anthropic is the most safety-conscious of the frontier AI companies, they're also the most hypocritical enablers of the whole reckless enterprise. By framing themselves as the "good sport" in the race, the one who's encouraging everyone else to "race them to the top", the one who's making sacrifices on the margin so as to be the "best of the worst" — they're actually the ones broadcasting the most powerful signal that racing toward the superintelligence singularity is a sane choice as long as you're making a genuine effort to be the best racer. They're broadcasting a more powerful signal than OpenAI and xAI that being insane is normal and fine.

Keith Rabois recently tweeted that "If Anthropic actually believed their rhetoric about safety, they can always shut down the company. And lobby then." I'm not the only one who thinks his logic is correct.

My view of Anthropic is, of course, downstream of my worldview that P(AI Doom) is in the double digit percentages. But many people share that worldview, including many current and former Anthropic employees.

“The Adolescence of Technology” delegitimizes AI x-risk concerns

The latest chapter in the saga of Anthropic downplaying humanity's odds of surviving near-term superintelligence is Dario's recent essay, “The Adolescence of Technology” (LW thread). I was disappointed with this essay on a number of fronts:

Character assassinating "doomers" like myself, accusing us of claiming that extinction-level outcomes are “inevitable” and "thinking in a quasi-religious way" — well, he either did that, or he attacked some other unnamed subset of doomers while strawmanning the position of the smart well-informed doomers. The particular set of doomers he's responding to was intentionally left ambiguous.
Unsubstantiated claims that predictions from theoretical arguments aren't as robust as the predictions he's able to make because of his years of AI-building work.
Framing our lack of understanding of state-of-the-art AI as a lack of evidence that it's not okay to proceed, instead of as a lack of evidence that it is okay to proceed.
Predicting that superhumanly powerful minds will be built within a matter of years, while also suggesting that this timeline somehow gives adequate time for an iterative, trial-and-error approach to alignment.

The overall effect of the essay is to once again delegitimize claims of uncomfortably high near-term AI extinction risk.

Conversation with Harlan Stewart

The above criticisms of Dario's essay are my paraphrase of this tweet by Harlan Stewart. I agree with Harlan's take, so I invited him to come on my show (Doom Debates) and unpack his analysis further.

Our conversation covered:

Harlan's background and experience joining MIRI
Harlan's P(Doom)
Our evolving perspective on the “Doomer” label
Our take on “The Adolescence of Technology”
Dario's character assassinations and other low blows
Anthropic shifting the Overton window the wrong way
The “inevitability” straw man
Referring to doom as a “self-fulfilling prophecy”
Dismissing critics as “too theoretical”
The failure mode of psychoanalyzing AI
Reflective stability and “intellidynamics”
Why is Dario dismissing an AI pause?

The episode is available via Substack, YouTube, or by searching “Doom Debates” in your podcast app.

Transcript

Cold Open

Liron Shapira 0:00:00
“The Adolescence of Technology” by Dario Amodei. I personally had some beefs with the essay. Here’s a guy who thinks the superintelligence singularity is in the next couple years, and you just don’t think he’s seeing a significant probability that we’re all gonna get slaughtered.

Harlan Stewart 0:00:16
I totally agree. He sort of refers to this possibility in kind of a roundabout way. They’re not prepared to do this. They don’t have the techniques they need to reliably steer and control such a thing. That’s insane.

Liron 0:00:28
And the mood is totally missing of, “Oh my God, we better not screw this up.”

Harlan 0:00:33
Things are not under control. The house is on fire.

Liron 0:00:36
If, in fact, we are doomed, this essay will be such a freaking joke.

How Harlan Stewart Got Into AI Safety

Liron 0:00:47
Welcome to Doom Debates. Harlan Stewart is a member of the communications team at the Machine Intelligence Research Institute, or MIRI. He’s previously contributed to research at AI Impacts, known for their 2023 expert survey on progress in AI.

I wanted to bring Harlan onto the show because he wrote a thoughtful take about Dario’s recent essay on Twitter. Dario’s essay, “The Adolescence of Technology,” was a big media piece. It got a lot of attention. I personally had some beefs with the essay, and when I read Harlan’s take, I was like, “Okay, he really gets it. These are really thought out.” Let’s go through his version because I pretty much agree with everything. Harlan Stewart, welcome to Doom Debates.

Harlan Stewart 0:01:30
Thanks for having me.

Liron 0:01:31
Your life story is kind of interesting because you’ve only been into the AI risk scene for the last few years. So tell us a little bit about yourself. What were you doing before then, and how did you get into it?

Harlan 0:01:40
I’ve actually been thinking about this issue for quite a while, longer than I’ve been working in it. Towards the end of 2014 or so, I found — sort of randomly in a used bookstore — a copy of James Barrat’s Our Final Invention, and was pretty floored by the arguments in it. My immediate reaction was, “Why is no one talking about this?” And then second, “We should probably try to get more people talking about this.” So I’ve always thought comms seems pretty important.

After that, I found that there were some people talking about it and got involved a bit with the AI safety community over the years. I was a math teacher for a year, and then contributed to research at AI Impacts for a while. When MIRI announced its big pivot to communications and policy, I was pretty excited about that and wanted to get involved.

What’s Your P(Doom)™?

Liron 0:02:30
All right. Before we get into Dario’s essay, I gotta set the stage and ask you the most important question that I ask everybody. You ready for this?

[P(Doom) jingle plays]

Harlan Stewart, what’s your P(Doom)?

Harlan 0:02:50
I’m gonna do an annoying thing and ask for qualifiers or hedges here. There’s “what overall do I think is the chance that superintelligent AI is going to kill everyone?” Or there’s “what is the current level of risk from building a superintelligent AI — if one sprung up tomorrow using current methods, what would be the chance that results in a disaster?”

The first one, my overall probability, I’d say somewhere between sixty and ninety percent. I’m giving that range to try to convey that I’m not measuring something — this is a way to try to describe beliefs more precisely. The second thing, how risky I think building the thing with current techniques would be — probably over ninety percent. I think that’d be an extremely bad idea. Most of my optimism comes from maybe we won’t do that.

Liron 0:03:43
If you had to pack it all into a single number right now — the chance that in a couple decades, the amount of value in the future universe is less than the amount of value in the past — what’s your one number?

Harlan 0:03:58
I guess I’ll go on record saying seventy-five percent.

Liron 0:04:01
All right, seventy-five percent. Pretty high. Mine is about fifty, but it’s creeping up.

Harlan 0:04:07
Also pretty high.

The “Doomer” Label

Liron 0:04:09
Yeah, exactly. Also pretty high. So as a member of MIRI, I know that you came on the show not to represent MIRI — you’re just representing yourself. But I know that there are elements of MIRI, like Eliezer Yudkowsky, who really hate that term “doom” and “doomer,” and they think that it’s so insulting, that it’s a weapon against their kind. What do you think about these terms?

Harlan 0:04:31
I’m not a fan of the phrase. I sympathize, especially with journalists, though — if you’re writing an article, especially about this weird topic that most people don’t know that much about, and you’re trying to quickly refer to this group, that’s kind of become the thing. I think it has this memetic stickiness. I personally try not to amplify it, but I also think it’s kind of a hopeless battle to fight against it.

Liron 0:04:54
One reason why I think the label “doomer” will be workable is because if you look at the average person — according to surveys and according to my own experience talking with average people who have zero tech background — when I float this idea of “Hey, don’t you think AI might go rogue and kill everybody and have a really bad outcome?” most of the time, let’s say seventy-five percent of the time, the reaction I get is they’re nodding along. They’re like, “Oh, yeah, yeah, definitely.”

Some people even take it farther, like, “Yeah, I think AI is gonna kill everybody.” I had a man-on-the-street show a few months ago with the book launch of If Anyone Builds It, Everyone Dies, and people were just saying that on camera. Sure, maybe they don’t lose sleep over it. They see it as just some drama that’s not real to them, but the idea that AI is also going to make everything heaven isn’t real to them either. They’re just very open-minded about the whole thing, and when they hear somebody say, “Yeah, I think it’s going to go badly, I think we’re all going to die,” I think they can actually empathize with that. I don’t think they go straight to the nut conclusion.

Harlan 0:05:53
Yeah, I think that’s a great point and a good reminder. There’s a lot of stuff in the world of AI that’s just pretty crazy and it’s been normalized in the bubble, but when people outside of it look at it, it’s just — there’s these companies, and their stated goal is to build these superhumanly powerful digital minds, and they’re saying it could take over the world, and they’re not really sure how. It’s a pretty crazy situation.

Overall Reaction to Dario’s Essay: The Missing Mood

Liron 0:06:13
Totally. So with the “doomer” term out of the way, let’s go to “The Adolescence of Technology,” Dario’s recent essay. Let me throw it to you. Overall, it sounds like your sentiment on the essay is that you’re pretty unhappy with it, correct?

Harlan 0:06:32
There’s a tension here because I like that Dario is engaging with this topic at all and explaining his views. The bar is sort of low with AI industry leaders engaging with this stuff, so I think this is good and I want to say that. I also think the highest-level message of it — this AI stuff is very high stakes, could end very badly, we should make sure it doesn’t end very badly — that most reduced message is one I very much agree with.

But overall, there are certain patterns in it that are things the AI industry as a whole has been doing. Now there’s all this money involved and politics, and the incentives are different. And yeah, I don’t like some of the rhetorical tactics that are used.

Liron 0:07:11
My own reaction is also pretty negative. In a nutshell, it’s just yet another essay that has what we call “the missing mood.” His own timeline, I think he said, is a couple years. He said that on stage at Davos. So here’s a guy who thinks the superintelligence singularity, the takeoff, is in the next couple of years, which I agree. I don’t wanna put a pin in it on two years. Maybe it’ll be five years, maybe even ten years. But I agree that it could very well be in one to two years. I think he’s right about that.

And the mood is totally missing of, “Oh my God, we better not screw this up. This is so high stakes, and we really don’t know if this is going to go well.” This is a drastic time. This is crunch time. Our lives are at risk. That’s really the mood that was completely missing. You read the essay, and you just don’t think he’s seeing a significant probability that we’re all gonna get slaughtered, correct?

Harlan 0:08:06
Yeah, I totally agree. He sort of refers to this possibility in kind of a roundabout way, and I’m sure we’ll get into that. He’s talking about it being capable of overthrowing the world or militarily dominating the world, that this thing could emerge in twelve to twenty-four months, one to two years — and also in a roundabout way, kind of acknowledges the widely accepted fact that they’re not prepared to do this. They don’t have the techniques they need to reliably steer and control such a thing. That’s insane.

Things are not under control. The house is on fire. I think he could be doing more to make that clear.

Liron 0:08:44
That was my number one takeaway — yet another essay with a freaking missing mood. Totally surreal that this is what he’s writing. I mean, if in fact we are doomed, if in fact two years from now we’re all lying in the grave, this essay will be such a freaking joke.

That’s my first and most important reaction personally. But give us the other side of the coin. Even though you agree it has the missing mood, you can also flip it and say, “Hey, at least he’s engaging with the topic and raising the alarm somewhat.” What’s the rosy take?

The Rosy Take on Dario’s Essay

Harlan 0:09:15
The rosy take is that the actual impact of this essay will have to do with the highest-level message, and the highest-level message is: this stuff is serious, could end badly. He points to the fact that above all of the discourse about this AI bubble and hype cycles of product releases, there just is objectively that AI has continued to become more powerful over time and has not shown any signs of stopping. I think that’s a really important message to get across, too.

There are two important things the public needs to understand. One, AI is getting more powerful. It’ll become extremely powerful. It’s a big deal. And the second thing is, wow, that could be really dangerous. It can be a little risky to convince people of the first one without convincing people of the second one, but they are both needed pieces. And I don’t know — he’s a good writer. I think he uses analogies well.

Liron 0:10:10
He’s definitely a good writer. I’ve heard rumors — people who work at Anthropic, they’re like, “Dario writes so many essays for us internally, and we read it, and we’re like, ‘Wow, what a galaxy brain! We’re so lucky to have him.’” We wanna see these Dario essays, okay? If you work at Anthropic, we need to see what’s going on. This is important context. Anybody who’s on the inside at Anthropic and wants to leak to Doom Debates to get important information out there, you can just email me, liron@doomdebates.com. Get a ProtonMail or whatever, just anonymously email me, and I’ll do right by you.

Character Assassination & Low Blows

Liron 0:10:42
Now, what are the bad things that he’s doing in this essay, even though he’s kind of raising the alarm? He’s saying, “Hey, this is an important issue.” I wanna talk about how he’s portraying doomers and the idea that we’re doomed.

If he was perfectly rational, he would have this distribution of beliefs, a probability distribution, and maybe his mainline scenario is things go pretty well — it’s a bumpy road, but we muddle through, and there’s a few percent chance of catastrophic changes in governance or whatever, but not true doom. But it’s weird that he’s not carving out a couple percent for doom. He’s taking the approach of, “No, let me just completely delegitimize doom,” and he seems to be really leaning into that in a couple ways.

Number one, he’s strawmanning the doom argument — basically putting words in doomers’ mouths. And number two, he’s character assassinating doomers because the only thing he says about doomers is, “Yeah, they have religious vibes, and I just don’t really listen to them.” What are your thoughts?

Harlan 0:11:40
He uses the word “inevitable” a lot. This thing he calls “doomerism,” he defines as the belief that doom is inevitable. Setting up the strawman opponent as saying, “Oh, this thing is inevitable,” and that’s what he’s arguing against, when that’s not really the important thing. The important question that we should be debating is: is the risk that we’re facing high enough that we need to change the trajectory of how things are going, that we need to pause? He doesn’t specify who he’s talking about, so no one can defend themselves against this.

Liron 0:12:12
I’m always deeply offended when people call doomers religious. Sure, some doomers are religious — just like the 2012 crowd, some people’s Bible is telling them the apocalypse is coming, or they think AI is the Antichrist. Okay, but think about the least convenient possible world. That’s how you’re supposed to do argumentation — you’re supposed to look at the people who have the strongest argument, the MIRI cluster, which you and I are in.

Dario knows perfectly well. He’s on record. He has documented 2014 deep conversations with Eliezer and MIRI people. So he knows what the strongest argument is. To throw around the “religious” word — he knows damn well that MIRI people are the least religious, most calm, rational, intelligent people that you’re going to find. He knows that, but then he uses this trick where he’s like, “Ah, but I’m not saying which doomers I mean, okay? I could mean any doomer.” So he’s lumping all the doomers together, so he can character assassinate some of the doomers and then dismiss the strongest doom argument. It’s a really low blow. It’s low-quality discourse, correct?

Harlan 0:13:11
Yeah, absolutely. And he also talks about this backlash thing, where one of the reasons he’s opposed to — and it seems like he’s talking about an AI pause — one reason he says he’s opposed to these things is that it will inevitably cause a backlash. In other words, it will sound too crazy. People won’t like it or something. But it is weird to say that while also doing this character assassination, because you’re contributing to that. You’re contributing to the backlash.

Dario Amodei is Shifting the Overton Window in The Wrong Direction

Liron 0:13:39
He’s shifting the Overton window the wrong way. He’s kind of gatekeeping. He’s saying, “Let’s not talk about doom,” when he would be — we rely on his judgment to tell us whether we should or shouldn’t, and he’s acting like, “Oh, well, other people’s judgment is saying we shouldn’t talk about it. They can’t handle it.” But he’s the one who’s creating the Overton window blockage.

Harlan 0:13:58
Yeah, he’s talking about this backlash, but a New York Times bestseller book about this topic came out a couple months ago. It’s 2026. We can do better than just saying, “There are risks, but, oh, don’t worry too much. It would be crazy to worry too much.” It’s rational to be worried.

Liron 0:14:15
Exactly. No, Dario, you don’t even know where the current Overton window is. If your intention really is to try to pull society forward — in the last two years, and you’re saying it’s a two-year timeline — in the last two years that we have, and you’re just tentatively saying, “Hey, look at unemployment,” it’s not cool, Dario.

He is actually a doom de-legitimizer. He’s going the opposite way. So maybe he’s imagining moving the Overton window forward, but he’s actually moving it backwards. He’s blocking us, and he’s functioning as a serious impediment. If in fact we are doomed, in the scenario where we are doomed, I think people will look back at Dario and be like, “That behavior was inexcusable.”

Harlan 0:14:53
Yeah, I think so. And it’s not just arguing against the position. It’s sort of trying to completely squash it — strawmanning the worst version of it, character assassinating with labels like “quasi-religious.”

Object-Level vs. Meta-Level Criticisms

Liron 0:15:04
The kind of criticism we’re doing now is meta-level criticism. We’re criticizing his quality of discourse. On this show, I bring this up a lot. I think it’s very important to be able to distinguish — we make a distinction between disagreeing with somebody at object level versus rating how well they’re doing at participating in discourse. It’s two separate ratings.

Most people, sadly, it’s kind of pathetic, but they’re so tied up in their argument that whenever somebody disagrees with them, they just assume, “Oh, this guy’s an idiot. He can’t really believe that.” So people have this really tight linkage between object level and meta level. I think you and I are both skilled in the art of — part of the art of rationality is making this distinction.

Dario is one of the hundred most powerful people on planet Earth right now, and when he endeavors to write an essay kind of telling us where he’s at, potentially two years before the singularity, he can’t even clear some of these basic bars of high-quality discourse — like not calling your opponents religious, or the next part we should get to, not strawmanning your opponents.

Harlan 0:16:10
Yeah, totally. And I always feel a tension between those two things, because it is good to be moving in this direction at all, having any sharing of thought. But also, we just can’t really settle for low standards for something that’s important. This is not a normal industry. This is not a normal R&D project that they’re doing. We have to expect a pretty high level of transparency and intellectual honesty and engagement with the debate.

Liron 0:16:38
I could imagine sitting here being like, “Okay, I read Dario’s essay, and I disagree, but congratulations, Dario, you’re doing a great job engaging in this discourse.” But you and I are not saying that about Dario. We think that Dario came out swinging with a number of low blows.

Harlan 0:16:52
Yeah, I think so. And it’s hard to give a one-liner view of a forty-page essay, of course. There’s plenty of it that’s good or reasonable. But the things that he says that are most relevant to the things that I care about the most are things that I disagree with a lot.

The “Inevitability” Strawman Used by Dario

Liron 0:17:07
Let’s talk about the strawmanning. How does he portray the doomer argument compared to the real argument?

Harlan 0:17:13
A really repeated theme is the inevitability thing. It’s pretty frustrating to hear, as someone who’s spending effort trying to help with this stuff in some kind of way that we can, and for someone to characterize your camp as thinking doom is inevitable. If I thought it was inevitable, I would just be relaxing. I wouldn’t bother doing anything about it. There’s some sense in which if it was inevitable, that would be worse, but it would also mean that we didn’t really have to do anything about it.

Liron 0:17:42
Just to repeat your point in case viewers don’t get the connection: Dario is saying that doomerism is so unproductive because the Yudkowskis of the world — he doesn’t explicitly name Yudkowsky, but he’s basically saying our type — we think that we’re so doomed that we’re just fear-mongering, and it’s pointless. He’s like, “Why engage with people who are just saying that we’re so doomed?” And the answer is, as you say, we think a productive action is to avoid doing that.

The reason why we’re screaming is because we still think that the off button exists. I think Eliezer Yudkowsky says in similar words, “If everybody woke up tomorrow and decided to live, we still do have — the remote control is still in our hands, for now.”

Harlan 0:18:25
Absolutely. And this actually just occurred to me, but I was puzzling over this line he says, where he defines doomerism as the belief that doom is inevitable, which he says would be a self-fulfilling prophecy. The logic there is, if someone was saying that doom was inevitable and that there was nothing that could be done about it, there is a self-fulfilling prophecy component to that. That would be a counterproductive thing to say because you would be convincing people to do nothing about the problem if you convinced them that it was inevitable and there’s nothing useful you could do. But that’s not what people are saying. So he’s presenting this version of it that’s much worse, as if it’s the thing that’s being said, which is quite frustrating.

Dario Refers to Doom as a Self-Fulfilling Prophecy

Liron 0:19:03
Yeah. I also get worked up when people claim that it’s a self-fulfilling prophecy. That’s another one of my trigger words. And just to put some people on blast — here at Doom Debates, we name names, we catalog who’s saying what.

Joscha Bach was tweeting this a little while ago. He’s like, “All the people who talk about doom, they’re the ones who are bringing about doom. Talking about doom makes it happen.” I first heard this argument on the schoolyard when some five-year-old told me that it takes one to know one — similar kind of logic.

Even as recently as a few weeks ago on the show, Audrey Tang, really cool person, cybersecurity ambassador of Taiwan, was telling me that talking about doom is a self-fulfilling prophecy, and we’re increasing our odds by talking about it. Talk about being too clever by half.

There’s this bias that I call “recoil exaggeration.” It’s like I’m saying, “Hey, I’m gonna aim my cannon this way. I’m gonna shoot my cannon this way.” And then somebody who thinks they’re being clever is like, “Oh, really? You’re gonna shoot your cannon this way? You know what that’s gonna do? Recoil you really hard that way.” And I’m like, okay, yes, shooting a cannon is gonna have some recoil, but don’t you think that the primary effect is going to be shooting a cannonball that way?

Usually the answer is yes. Usually, the main thrust of the cannon outweighs the recoil. So when people say, “Talking about doom because you’re worried about doom, because you don’t wanna be doomed,” and then people say, “You know what that does? That makes you more doomed,” don’t you think that’s secondary to stopping the doom?

Harlan 0:20:34
Absolutely. And the AI alignment problem is a technical problem. It doesn’t get solved by believing that it’s solved.

Liron 0:20:42
Okay, so it’s totally not a self-fulfilling prophecy. I guess even if we were to grant charity to this idea that it’s self-fulfilling, the way in which it’s self-fulfilling, I guess, is people getting hopelessly depressed. I don’t know about you, but first of all, I’m somebody who’s never suffered from that kind of depression. How about yourself?

Harlan 0:21:00
I think I’ve got little glimmers of some existential dread and bad feelings about this stuff over the years. The two biggest things that have helped me with that are — one is just time. All grieving is just a process. It takes time. It is possible to process grief, even about horribly large things like the possibility of human extinction.

There’s other things that everyone pretty much already knows — the fact that they will die someday, that they’re mortal. This is a horrible fact that all humans walk around knowing and still living happy lives. If you can accept that sort of thing, you’re able to grieve other things. But grieving can take time, and you have to do it. You have to engage with it. Some people, it just washes over them, but if you’re the type that needs to grieve, you gotta do it.

The other thing for me that helps is just finding the things that I can do to help, so that you can accept the things that you don’t have control over and find things to help with the things that you can. For me, that’s been the key.

Liron 0:22:08
Okay, so you’re saying, “God, give me the courage to accept the things I can’t control”? I knew it!

Harlan 0:22:13
Yes, that’s what I was looking for.

Liron 0:22:14
I knew. You’re just entirely full of scripture, like every doomer.

Harlan 0:22:20
Right. Exactly. And also importantly, it’s okay to believe an incredibly bleak thing and not constantly walk around feeling bleak and sad, if that’s not a useful emotion for you to have and if you don’t have it.

Dismissing Critics as “Too Theoretical”

Liron 0:22:38
Right. Okay, so we’ve covered doom obviously not being a self-fulfilling prophecy. Let’s talk about dismissing critics as, quote-unquote, “too theoretical,” as opposed to people like Dario and his engineers, who are empirical.

Harlan 0:22:52
There’s this rhetorical move that’s drawing a line between theoretical reasoning and empirical reasoning. The latter sounds very scientific and respectable, and the first sounds kind of wishy-washy, probably something you could ignore, probably not very reliable.

But there’s not a clean line between these things. All reasoning is taking the things that you’ve observed about the world so far — your life, all the things you’ve learned, all the information you have — and making predictions about how those things relate to an unobservable future that we’re not in yet. You need some kind of theory about how the things you’ve observed relate to the future. There’s just no way around that.

If you wanna make good predictions about something that we’ve never dealt with before, like a superintelligent AI, we need to be able to think a few steps ahead. We need to think a little bit beyond just what’s in front of us right now. The effect of it is that it lets people get away with not engaging with arguments, and they’re arguments that are very important, so that’s fairly concerning. It’s concerning if the people who are in charge of AI aren’t willing to have what they’re calling a theoretical argument.

Liron 0:24:03
Exactly. And coming from Dario, that’s the weird thing. I feel like Dario really knows better than this. From Dario — I don’t know, man. What do we make of Dario going the route of saying, “Hey, you’re not empirical enough”? It seems below him.

Harlan 0:24:18
Yeah. The most cynical interpretation would be that Dario or the AI industry as a whole is doing what other industries have done in the past. The tobacco industry famously knew more about the risks of lung cancer than they were letting on. In public, they really muddied the waters. They emphasized, “Well, the scientists allow it. We’re uncertain. We’ve gotta wait until we get more data before doing anything in response to this.” It was just a tactic to delay anything being done about it.

We now, decades later, have access to some internal documents that show that this was intentional deception. They knew what they were doing, and they were trying to trick people. Very bad. So we know that sometimes companies do that. It could be that he is responding to his financial incentives here, but he’s not consciously doing that — there’s just some motivated reasoning going on.

Liron 0:25:14
We should just clarify what exactly he’s saying, though. I’ve got the essay pulled up. He says: “It’s easy to say” — meaning it’s easy for doomers to say — “’No action is too extreme when the fate of humanity is at stake.’ But in practice, this attitude simply leads to backlash. To be clear, I think there’s a decent chance we eventually reach a point where much more significant action is warranted, but that will depend on stronger evidence of imminent concrete danger than we have today, as well as enough specificity about the danger to formulate rules that have a chance of addressing it. The most constructive thing we can do today is advocate for limited rules while we learn whether or not there’s evidence it works for us.”

So I’m just not sure if we can actually blame Dario for being anti-theory. Maybe there’s a charitable interpretation here where he’s just saying, “Yeah, I’m just not convinced, and evidence is what would convince me.”

Harlan 0:26:01
Perhaps so. I’m sort of combining two sentiments that he expresses in different places. There’s this other quote that I’ll highlight from where he’s talking about instrumental convergence.

His biggest criticism of the idea of instrumental convergence — which probably most of your audience knows, but it’s the idea that most goals that an agent could pursue could be supported by instrumental goals such as self-preservation, getting resources, getting power, so it’s hard to predict what a superintelligence in the future might do, but there’s a good chance it might do those things — he says that the problem with this is that it “mistakes a vague conceptual argument about high-level incentives, one that masks many hidden assumptions, for definitive proof.” Which is a strange bar to set — to say that the problem with this argument is that it’s not proof.

Liron 0:26:49
Yeah, I see this part of the essay. He’s saying, “The problem with this pessimistic position is that it mistakes a vague conceptual argument...” Here we go! “A vague conceptual argument about high-level incentives, one that masks many hidden assumptions.” Ah, yes, the problem with theory is that you mask hidden assumptions. Okay, I’m putting words in his mouth.

So he says, “One that masks many hidden assumptions for definitive proof. I think people who don’t build AI systems every day are wildly miscalibrated on how easy it is for clean-sounding stories to end up being wrong, and how difficult it is to predict AI behavior from first principles, especially when it involves reasoning about generalization over millions of environments, which has over and over again proved mysterious and unpredictable. Dealing with the messiness of AI systems for over a decade has made me somewhat skeptical of this overly theoretical mode of thinking.”

Oh, boy! This is a meaty paragraph. You framed it as kind of trashing theoreticism in general. I do think that’s fair. He’s pulling the — he’s saying, “Look, something about my experience,” whether it’s theory versus empiricism, whether it’s builder versus guy in the arena versus guy in the stands — there’s some distinction he’s making where it’s arguably kind of a low blow. What do you think?

Harlan 0:28:08
Yeah, I think so. And I think he’s also pulling a card where he’s taking this position of authority. “The people saying this just don’t work on it every day. They’re not seeing the messiness of this stuff.” But it’s strange to say that the problem with a theory is that it is a theory.

Liron 0:28:29
And here’s another observation. He’s saying, “Hey, I’ve been here building the AIs.” Okay, fair enough. But you and I, and Eliezer Yudkowsky especially, we’ve been closely watching AI for many years. We see the next iteration come out. We make predictions about where it’s going. We see those predictions confirmed or falsified. So we have a lot of this feedback going. Just because we’re not the ones who wrote the code, we’re still getting feedback from reality the same way he’s getting it. What’s the difference? I give him an extra point, but is it so fundamentally different?

Harlan 0:29:00
That’s a great point. I don’t think Dario has access to some special esoteric knowledge that disproves instrumental convergence. But if he did, he should certainly share that with the world if it’s something that we don’t have access to. I think we have access to the same basic information to evaluate there.

Liron 0:29:22
Exactly. So to me, it’s a little bit of a low blow. It’s not the worst thing ever — he has the right to meta-comment on why he thinks his opponents are wrong. I think it’s bad form. I don’t think it’s called for.

There are other problems with this, though, besides the character assassination element, which, by the way, Sam Altman did the same move. There’s this clip of Sam Altman talking about how Eliezer’s not close to the metal, and so Eliezer’s basically disconnected now from what’s going to happen with AI. It’s like he just has no idea about LLMs.

Sam Altman

“Look, I like Eliezer. I’m grateful he exists. He’s a little bit of a prophet of doom. If you’re convinced the world is always about to end, and you are not, in my opinion, close enough to the details of what’s happening with the technology, which is very hard in a vacuum. I think it’s hard to know what to do.”

Liron 0:30:20
Yeah, so Dario is just pulling a slightly more polite version of the Sam Altman criticism: “You don’t know what it’s like here on the ground, okay?” But I would actually like to turn the tables and say, “Okay, empirical guy, you’re actually strawmanning the argument a little bit.”

Because Dario is saying that the doomers are saying that it’s all about the training process. He’s like, “Listen, I’m the one training the AIs, and you’re telling me that my training is gonna go a certain way?” No, that’s actually not our perspective. Let me go back to Dario’s language here. He says, “The pessimistic claim that there are certain dynamics in the training process of powerful AI systems that will inevitably lead them to seek power or to seize humans.”

He is correct that if you look at If Anyone Builds It, Everyone Dies, that book does describe how the training process can yield those kinds of things — plausible stories of how it could happen. But the core claim isn’t so much a claim about the training process, it’s a claim about the nature of the work being done.

However you train AI, it’s not that we’re claiming the training process is going to yield a certain thing. We’re just claiming that the thing you come up with, if it works — the same talents that it’s going to use to do the good jobs that we want it to do, the same engine that lets it achieve good goals — is also an engine for achieving bad goals. It will know what it takes to maximize the dial on any goal. Making money is good, but it will know what it would have to do if it wanted to make all the money in the world. It would know. And the problem isn’t that it was trained to know, because it’s just objectively correct that there’s all these dirty tricks you could do. The AI is going to be intelligent enough that it’s going to know these things. Training its personality is not going to stop it from knowing that these ugly strategies exist.

Harlan 0:32:17
Yeah, totally. He does introduce the idea only by talking about lessons from training that generalize, which does seem like it’s probably missing the mark. He then does go on to talk about consequentialist reasoning, which might be closer to what you’re talking about.

Liron 0:32:32
Yeah, let me read what Dario says here. This is actually the next paragraph after the selection I already read. Dario continues: “One of the most important hidden assumptions, and a place where what we see in practice has diverged from the simple theoretical model, is the implicit assumption that AI models are necessarily monomaniacally focused on a single coherent, narrow goal, and that they pursue that goal in a clean, consequentialist manner.”

Let’s put a pin in the idea of a single goal, but let’s just talk about the idea of pursuing a goal in a clean, consequentialist manner. You can talk about the personality of the AI — I’m sure you can train an AI whose personality is very chill — but if it is able to do these superhuman feats, it’s going to have this engine where it’s just going to know, it’s going to be able to rank different plans, and it’s going to objectively know which plan is more likely to work. Because that’s not a fact about the specific AI or the personality of the specific AI. Rank ordering the probability of plans working is an objective feature of the domain of the universe you’re in, regardless of which agent you are. Correct?

Harlan 0:33:36
It’s just a good strategy. If you’re making different AIs and you have one that is thinking of different options and ranking them for what’s most effective for what it wants to do, and you have another AI that’s not doing that thing, the first one is going to work better.

Liron 0:33:51
Right. So there’s this mental model of AIs. I brought this up when I was debating Bentham’s Bulldog — that episode should be out soon. It’s this mental model of an AI being like a car with an engine. The personality of the AI is like the steering system — the logic of where it wants to go and maybe when it slams on the brakes. But the part that all of the smart AIs have largely in common is the engine component. And the engine component — I call it a “goal engine.”

I’ve also made the analogy to a computer chip, which I know Sam Altman loves that analogy for other reasons. An AI is like a computer chip because yeah, you can run all these different software programs, but ultimately, it’s nice to have a faster chip. There’s chips everywhere. This kind of convergence in architecture — I’ve pointed out in a Less Wrong post before, “Hey, you ever notice how your Philips Sonicare toothbrush and your microwave oven and the Apollo Lander and your desktop computer, they just all use a very similar chip, running a very similar operating system?”

Even your Sonicare toothbrush probably has Android. I don’t know this for a fact, but I’m guessing that it has a surprisingly complex operating system. It’s not just hard wiring to the motor, because why not? Operating systems are so cheap, and it’s such a flexible platform.

Similarly, this kind of convergence — and this didn’t use to be true. If you look at a Pong video game, Steve Wozniak was actually manually wiring up the Pong circuits, and it wasn’t Turing-complete. But that was primitive video games. Today’s AIs are in this weird, primitive state. I actually think there’s a deep analogy to the circuit board of Pong, the circuit board of Breakout. It’s in this primitive state, “Wait, you’re wiring up a custom circuit board?” “Oh, yeah, ‘cause we don’t have the goal engine yet. We don’t have the Turing-complete computer chip yet, so we’re just wiring up these circuit boards.” But we’re clearly converging toward this universal architecture because, as Eliezer says, having goals helps you solve problems. Problem-solving is this general thing you can do.

When you have that perspective, it really makes you realize that psychoanalyzing the AI — psychoanalyzing how one particular AI is going to turn the steering wheel under different conditions — doesn’t change the fact that all of these AI companies are building engines, and they’re building ridiculously powerful engines.

Harlan 0:36:05
Yeah, and saying that it won’t be consequentialist is not congruent with what he’s predicting. He’s saying himself that it’s gonna be a highly general machine, that you can have one machine that can automate any job on Earth. That’s not something you build by specifically training it how to do each job. That’s something that has these general skills, abilities to follow goals, and especially the thing about being able to militarily dominate the world. This whole idea of a “country of geniuses in a data center” is necessarily things that are capable of doing things that are outside of the training data, finding novel solutions to problems.

Liron 0:36:44
Correct. Yes, so they’re clearly superhuman. That is definitely the premise which Dario agrees with. He very much believes in superintelligence. I think he’s expecting miracles. I use that term “miracle” not because I’m religious, but just — I expect to be as impressed by what AI does as a caveman would be seeing an iPhone and a SpaceX Starship rocket orbiting the Earth and coming back down like a skyscraper landing.

I think the word “miracle,” in terms of the subjective experience of witnessing a miracle — I often like to point out that if you actually read the things in the Bible like, “Oh my God, Jesus got up again and started walking,” it’s like, okay yeah, that’s pretty good, but how about a skyscraper flying? Isn’t that also miraculous?

So I’m expecting to subjectively see things that are incredibly miraculous coming out of AI — assuming I’m alive, which I won’t be. But where I’m going with this is Dario grants this imminent superintelligent future, and he’s pushing back on the idea that agents will pursue goals in a clean, consequentialist manner. Are you kidding me? The engine in this car — it’s in the nature of the engine to be consequentialist because, very precisely, what we’re talking about is mapping goals to actions, correct?

Harlan 0:37:53
Yeah, absolutely.

Liron 0:37:54
That’s the dangerous part. The consequentialism. The idea that if I tell you a desired goal, you can tell me correctly the most likely sequence of actions to get that goal right now, and you can outmaneuver a human.

Harlan 0:38:07
Yeah, and in the short term, before it possibly kills everyone, that’s what generates a lot of the economic value. You don’t want to automate your CEO role with a machine that doesn’t actually care about increasing value for shareholders.

Liron 0:38:22
People don’t get that the most potent, dangerous substance in the universe — it’s not uranium, it’s not fentanyl — it’s these chips. It’s the implementation of some algorithm that maps goals to actions. That is the one power that dominates every other power.

It’s literally the power that lets humans dominate the other animals. If you just ask the question, what is the substance that lets humans dominate the other animals? It’s our brains. What part of our brains? The part where when you represent an end state, you then generate actions that increase the probability of that end state. At a certain level, animals can do it too — animals can kind of sloppily attempt this. I’ve seen my dog take down a bone over the course of hours. Animals can kind of sloppily attempt this, but we’re way better.

Harlan 0:39:12
Exactly. The possibility of there existing minds that are doing consequentialist reasoning and pursuit of a goal is not some theory or sci-fi concept. That’s just a thing that we know can exist because we are that. It’s weird to not even acknowledge the possibility that this training process, where they’re growing these minds that they don’t fully understand how they work — not even acknowledge the possibility that that process could result in that type of mind that we know is possible to exist, and which we know is quite good at getting things done.

Liron 0:39:43
Right. So Dario’s hitting us from two angles here. He said, “Pursue the goal in a clean, consequentialist manner — maybe it won’t.” And yes, it’s true that the outer steering wheel on top of the engine might kind of run the engine a bit and then hit the brakes and then turn. If you have this engine which is really good at outputting these consequentialist action plans, you can then take that engine and map its plans to things that are messier.

It’s like if I’m using Claude Code, and Claude Code’s like, “Here’s how you can rewrite your files for optimal performance.” I’m like, “I’ll take some of your suggestions. I won’t take all your suggestions. I’m gonna make you run slowly on purpose.” So there’s post-processing that you can do on these more and more powerful engines, but the engines themselves are going to converge to just rapidly, effectively getting you the action plans, correct?

Harlan 0:40:29
Yeah, I think so. We’re already seeing glimmers of that. The problem-solving ability that reasoning models develop by being trained on easily specifiable problems like math and code seem to generalize at least somewhat to other sorts of agentic reasoning.

Liron 0:40:46
Right. Okay, so the other angle that Dario’s hitting at us from — he’s saying it might not be focused on a single, coherent, narrow goal. Okay, so even if it is kinda consequentialist, the goal that it’s trying to map to actions might be a fuzzy goal, might be a really broad goal, might be a multitude of goals. So isn’t that a reason for hope?

Harlan 0:41:13
I think no. And I think it’s kind of a strange framing even. What is a single narrow goal? If you have three things that you care about, can’t you just put that into one sentence and say, “I want this and this and this. My goal is to have a bit of thing A, a bit of thing B, and a bit of thing C”? There’s not really a — that doesn’t mean anything, a single goal.

Liron 0:41:41
Right. From the perspective of utility theory, “goal” is just an informal way to talk about utility functions or preferred states of the universe. In chess, your goal is to win. What does that mean? It means that you assign a hundred utility points to when the goal configuration has the enemy’s king in checkmate, and you assign negative hundred points to the one where your king is in checkmate. That’s the goal.

Am I monomaniacally focused on getting the enemy’s king in checkmate in a certain corner? No, no, I have a multitude of goals. You can get in checkmate in that corner, you can get in checkmate in the original configuration. I have so many configurations that I consider checkmate. I have such a multitude of goals. So there’s no ontological difference between whether somebody has one goal or many goals. A goal is just — it’s always a set of states. Every goal implicitly encompasses a set of states that you consider satisfactory, correct?

Harlan 0:42:32
Yes, absolutely. And I think what’s going on here is, there have been thought experiments such as the paperclip maximizer, which use this unrealistic scenario where the AI had one goal as a toy example. It’s easier to keep in your head an example where there’s just less complexity. That’s sort of been twisted to be like, “Ah, that’s a necessary, load-bearing part of the argument or something.”

And I think conversely, this move of being like, “Oh, well, it’s very messy and complicated, and there’s lots of stuff,” is kind of a way of making it harder to think about. Your brain might go, “Ah, well, there’s just a lot going on in there, so I guess it’ll probably all cancel out in some kind of way that makes things turn out okay.” But that’s not the case. Complexity doesn’t make things work better. If anything, it’s part of the problem.

The Problem with Psychoanalyzing AI

Liron 0:43:18
Right. So Dario is pushing back against the classic Yudkowskian ontology when we talk about AI. We’re like: yep, AI is going to have consequentialist reasoning, which implies instrumental convergence. And Dario’s like, “No, no, no, there’s all this complexity that you guys aren’t taking into account.”

Whereas we’re turning around and saying: Look, when you build the AI, yes, the AI can walk around acting complex. It can confuse you, it can have a personality. But the part that’s doing the hard work, the part that’s going to be dangerous, the part that’s going to drive the uncontrollable system, is what I call the goal engine. That part has been getting broader and deeper.

Broader, meaning you can assign it a wider and wider range of tasks and it’s delivering them — for example, now it has images, it can be really smart about working with images, it’s working with natural language. And then it’s getting deeper, meaning the same query is becoming more and more likely to work and to work at a superhuman level.

So I’m like, “Dario, the goal engine is getting broader and deeper!” Say what you will about these personalities, but the goal engine, in an objective sense, is getting both broader and deeper. Keep that in mind.

But Dario is going the other direction. He’s doing what I call “psychoanalyzing the AI.” This is the favorite thing people like to do when they wanna reject the doom argument — they’re like, “Listen, man, I know that these AIs are my buddy. I’m vibing with these AIs. Claude, Amanda Askell over at Anthropic, she’s making the personality so on point. Claude is always gonna be our friend. It’s got the Constitution.”

Even in the best-case scenario, even if you’ve got Claude with such a great personality and this engine underneath, there’s still gonna be this modular engine that Claude is going to be the master of. And the problem is, we’re just entering a world where these modular engines exist. Even the best-case scenario of Claude successfully driving the engine to a good place — the best-case scenario is that now we’re just a few bits, a few bit flips away from the engine going somewhere else.

It’s like we have the engine. Maybe Claude will drive it to a good place. But when you’re psychoanalyzing Claude, you’re ultimately psychoanalyzing just the guy in the driver’s seat, not this giant, enormous superhuman engine that the personality is now controlling. Dario’s neglecting to mention that the giant engine could — is a few bit flips away from going somewhere else.

Harlan 0:45:32
Yeah, totally. In the section about instrumental convergence, he says that from what they’ve found in their research, the AI tends to develop what he calls “human-like motivations” or “personas.” That wording is concerning to me for a couple reasons.

One is that we don’t really know what the AI’s motivations are. We can observe its behavior. We don’t really have real insight into what is driving that behavior in this vast, inscrutable matrix of numbers. And we certainly don’t know whether it’s human-like. It’s certainly very possible that if we could see in there and understand it, it would be something very alien. The Shoggoth with the mask is a really great meme and a good thing to keep in mind.

I hesitate to do too much speculation about the internal culture of Anthropic that I can’t observe and people’s psychology, but I sometimes worry that they’ll develop a culture that’s anthropomorphizing Claude too much. They’ve developed this thing that has a very charming personality, which is cool as a product, but I’m worried they’ll get high on their own supply in a way that they have a blind spot to — how actually alien this thing could be. That’s an important thing to keep in mind for a security mindset and keeping in mind how things could go wrong.

Liron 0:46:53
Yeah, and even if you nail the personality, even if you get lucky and your training makes the driver of the engine a really good, perfect driver — which we can have plenty of doubts about, I harbor plenty of doubts — but even if you succeed on that, great. So now you’ve got — it’s like you’re driving around this car. In terms of the software code, in terms of the Git diff, what does the diff require to take this super powerful car and turn it into an instantly destructive, annihilating-the-human-race version with the same engine? It takes a few lines of code to change the driver. That’s it.

Harlan 0:47:28
Yeah. It’s great to look at what evidence we have from current AI systems, but you also need to think about what would need to change about these systems for the thing that the companies are trying to accomplish to happen — for what Dario is predicting will happen to happen. It would certainly need to get much better at pursuing these goals, this goal engine.

Liron 0:47:50
When I bring up the subject of — we’re a few lines away from — you’re painting a scenario where we’re a few lines of code away from doom, and specifically, the few lines of code are: take the personality and reverse it, or just shut up the part of the personality that’s normally rejecting requests and just allow any requests. “Dangerously skip permissions” — the internal flag gets set or whatever.

When I point out that we’re this close to the other outcome, the outcome where it’s not nice, it reminds me of something I used to do a couple years ago. I used to point out when people thought that AIs were just chatbots, when there was no Claude Code — I used to say, “If these things could answer questions a little better, they would be agentic. You would just put them in a loop.” Like Auto GPT. But it just sucked because it would be like, “What should I do next?” And the answer was very sloppy, so it just wouldn’t do much.

But today, Claude Code is just the direct successor to Auto GPT. It’s like, “Okay, what should I do next?” “You should write this code.” “Okay, execute it.” It’s just a few lines of code to execute. “Output the diff.” “Okay, here’s the diff.” “Patch the diff.” Done.

The same way that I was warning people — people used to tell me, “Hey, it’s not agentic, it’s just gonna answer questions. What’s the problem? We just built a good question answerer.” And I’m like, “The question answerer is a few lines of infrastructure code, harness code, Auto GPT code. It’s just a few lines of code away from being an agent.”

Similarly, this fantasy world that Dario thinks he’s living in, where he can psychoanalyze the AI and the AI is gonna be his buddy, that AI is a few lines away from the chaos AI because it has the same consequentialist engine. That’s right — consequentialist. I said it. It will be consequentialist in the engine module.

Harlan 0:49:28
Yeah, I think so. I agree.

Liron 0:49:29
Another intuition pump — if you don’t believe that an AI is going to be consequentialist, if you think that its personality is going to be baked into everything it thinks about, so there’s no driver-car separation, the car has driving-ness, it has steering-ness baked into the wheels, every part of the car has the essence of good steering-ness baked into it — really? Consider this part. Consider what happens when the AI is thinking about what an opponent might do to it. At that point, you’ve got to strip all the personality out, and you just have to say, “Okay, the opponent just wants to win.” So it needs to have a submodule that does the magic, dangerous act of mapping goals to actions. That’s the only way that you can model arbitrary opponents.

Harlan 0:50:09
Yeah, and I think it’s necessary for pretty much any of the things that you would want a country of geniuses in a data center to do. Ingenuity, doing all the amazing things that AI companies say the product is gonna do, that Dario is predicting they’ll do, that they’re trying to get them to do — these are not things that you get by just learning and repeating some process that’s already existed. They’re not just saying that it’s going to do simple human jobs that are easily defined. They’re saying that it can make breakthroughs in science, be superhuman military strategists.

I just don’t think that you get these capabilities if you have something that doesn’t care about outcomes or isn’t choosing actions based on how to get to those outcomes. If there’s a war between one military general whose background is that he played a military general in a bunch of movies as an actor and he’s really good at convincingly performing the role of that persona, and there’s another military general whose background is that he’s led a lot of successful battles and is good at achieving the outcome he wants through choosing what to do — it’s pretty clear which one is gonna win.

Liron 0:51:27
Right. When the rubber meets the road, when you’re actually just trying to do anything, it just comes down to the magic power of mapping goals to actions. We’re kind of beating the dead horse, but this idea that Max H, on Less Wrong, had a post called “Steering Systems,” where I think he hit the nail on the head, saying: “Whatever kind of AI you think that you’ve made, it’s going to take a small code change to turn it into an AI that goes somewhere else with the same capacity.”

Harlan 0:52:01
Yeah, that seems intuitively right to me. There’s this foundation model that’s just — who knows what it is — and then you do a little bit of work to get it to play this character. It seems like that’s the empirical reality, too, is that people are able to jailbreak it out of these personas.

Liron 0:52:17
If you use Claude Code, it’s kind of interesting because it’s like you’re doing something where personality doesn’t really come into play. Okay yeah, the way it asks you if you wanna grant permission, or the way it chats with you a little bit, sure, there’s some personality there. But for the most part, it’s just focusing on the problem and solving the freaking problem, and that’s most of what’s going to happen.

Most of what’s going to happen is to have problems and to rank solutions and pick a solution with a high probability of working and execute the solution. And any other window dressing that we put around that is probably going to go away quickly because, as a user, there’s so much pressure to be like, “Hey, can I just have the version that just does the goal? I would just like that version, please.”

Harlan 0:52:56
Yeah, and I think we’ve already seen examples of this agentic thing that we’re growing, trying to solve these problems in ways that we didn’t anticipate and didn’t try to train for. There will be some bug in loading the environment for the coding task, and it will fix the bug or break into the environment somehow. We’re already seeing this.

When Dario says that he needs to wait for evidence about these things before taking more action, it makes me concerned. It makes me wonder what sort of evidence he means or what he thinks would be needed, because we already have lots of demonstrations of this.

Liron 0:53:38
Yeah. That’s a very good point. For somebody who’s saying, “I’m waiting for evidence,” you’d think that the essay should have a section being like, “Here is a list of examples of things that I would need to see to make me a doomer.”

We got Sam Altman on a point a couple of years back. It was inconsequential, but we really nailed him — and by “we,” I just mean a tweet that I made. There was a quote from Sam a couple of years ago saying, “Look, if and when these AIs get to a point where we don’t know what capabilities they’re going to have next, when they’re surprising us with their new capabilities, at that point, it might be time to slow down.” And then fast-forward a year or two after that quote, he’s tweeting out: “I love training the next GPT because it’s always a fun puzzle to see what new capabilities come out.”

Harlan 0:54:27
Yeah. Superintelligence might come in the next thousands of days and, yeah.

Liron 0:54:34
So what I’m saying is, the one time where you had this AI company CEO put down a goalpost, being like, “Hey, when it’s surprising us with its capabilities, that’s time to slow down” — and then it happened, and he didn’t care.

If Dario were to write down any kind of goalpost of what would make him worried, I think we’d probably hit that goalpost. I mean, I think we might just get to superintelligence and die before we hit the goalpost, but I think some of the goalposts that he would propose are either already done today or coming very soon.

It seems pretty damning that he’s talking about waiting for more evidence without laying down the goalpost right now. Because as an empirical person, as somebody who’s worked in AI for ten years, as he likes to tout, he knows that goalposts are easily exceeded and then retconned. So where are the goalposts?

Harlan 0:55:16
Yeah. Part of the importance of having a goalpost is that if you don’t — if you have just some vague criteria — then it’s pretty easy to do motivated reasoning to yourself that you haven’t gotten to that point, if you have strong incentives for that to be the case. Which, for all of these industry leaders, is the case. Shutting down the company, pausing its R&D, whatever, would mean giving up massive wealth.

I don’t know what it’s like to have a billion dollars, but I think even the most rational people need to self-impose some safeguards when they have those sorts of incentives weighing down on their reasoning. Pre-committing to some goalposts ahead of time is a great way to do that.

Liron 0:56:03
Yeah. Now, to be fair, Dario doesn’t know what it’s like to have a billion dollars either. He only knows what it’s like to have fifty billion.

Harlan 0:56:09
Wow!

“Intellidynamics” & Reflective Stability

Liron 0:56:12
No, just kidding, Dario. Money — we’re all capitalists here. So yeah, we’re heading toward the wrap-up here. Just a couple more points I wanna hit on. I still wanna stay on the topic of psychoanalyzing AIs a little bit more, because I do see this as the core failure mode of smart, tech-savvy observers. Not random people on the street — they don’t even get to the psychoanalyzing, they’re just confused about what’s going on. But the smarter observers who have been playing with AI and think they know what they’re talking about and read tech news, they really do tend to psychoanalyze AI and extrapolate from the psychology of current AI.

I wanna stick on this topic because I really think we’re so close to a regime in the future where the AI psychology matters very little, and what matters more is the separate area of study that I’ve termed “intellidynamics” — the dynamics of what sufficiently advanced intelligence tends to do.

When we talk about instrumental convergence, we were never talking about particular AIs. We were talking about features of agents that successfully exploit the physical universe. Black box agents that exploit the physical universe are going to discover instrumental convergence because the instrumental convergence of different sub-goals is a property of action plans in our universe. It’s not a property of specific AIs.

Harlan 0:57:27
Yeah, and I think that’s why the concept of instrumental convergence is so important, because you don’t have that many handles you can use to make good, confident predictions about a powerful mind that you haven’t observed yet because it exists in the future. You don’t know what its goals are going to be. That’s a pretty difficult epistemic state to make predictions about what it’ll do, except that there’s this fact of instrumental convergence, which is that it’s pretty clear to see that there are certain things that are nearly universal.

I don’t like the idea of instrumental convergence being a purely theoretical thing, because I think this can be empirically observed — throughout nature, throughout games, throughout human society. It’s almost always the case that something you could do would benefit from these basic things.

Liron 0:58:16
Certainly plenty of us spend a lot of our days thinking about money and fame and power. We’re living instrumental convergence a lot of the time. Some of us spend more time thinking about optimizing happiness, which I guess isn’t instrumentally convergent, so congratulations to those of us who do that.

We certainly have experience with it, and so then you could argue, “Well, you’re anthropomorphizing — just because humans have instrumentally converged on these things, maybe an AI won’t.” But again, it’s just objectively a property of the best action plans. The optimal action plans do in fact converge, and this has nothing to do with specific AI agents.

So when you’re claiming that your AI agent is going to have this personality, this steering system that’s going to steer it away from taking these instrumentally convergent actions, fine. But you’re not on the default path of an intelligent system just trying to make something happen in the universe. The default plan is to get resources and so on. Do you agree that we should carve out a separate field of study called Intellidynamics, that’s just about what intelligences do without regard to their implementation?

Harlan 0:59:20
Hmm. Yeah, I like that word.

Liron 0:59:24
Nice. And MIRI is really the organization that taught me about Intellidynamics. They just didn’t use the word. Historically, I think they used the phrase “agent foundations” to talk about this kind of stuff. We’re not talking about a particular agent, we’re not talking about Claude, we’re talking about the foundations of agents. But I think it’s clear to talk about Intellidynamics because there’s an analogy to thermodynamics. Thermodynamics — we’re not talking about a particular engine design, we’re talking about the work that heat engines do.

Harlan 0:59:48
Yeah, I think so. And hopefully that kind of separates it from other types of things.

Liron 0:59:53
One last point I wanna make about psychoanalyzing AIs versus zooming out and looking at intellidynamics — the dynamics that are about to dominate. You’re basically looking backwards when you’re psychoanalyzing. You’re looking backwards at the AIs of today. When you’re using intellidynamics, you’re looking forwards to the convergent AIs of the future, which Dario is totally refusing to do, in my opinion.

There’s one more thing to expect with the intellidynamics perspective that nobody ever talks about. So few people ever talk about this. It’s not just instrumental convergence, which Dario is giving a little bit of a nod to, even though he’s kind of dismissing it — he’s giving it a nod, he’s naming it. But then the thing that people never talk about, which Eliezer considers central, is reflective stability.

The reason why reflective stability is different from instrumental convergence is because reflective stability points out this really important observation that software can really easily replicate and modify itself. We rarely even talk about that. When we talk about Claude’s personality, we’re using the anthropomorphic image of, “Oh, yeah, it has a personality. It’s one entity. It has one way of looking at things.” But wait — it could also write code, which is also superintelligent, and it would be really easy to have versions of this child code that don’t have the full parent’s current personality.

You have to be very careful to always preserve the personality in each generation when the personality is not helping you achieve your sub-goals.

Harlan 1:01:16
Yes, and I think this is a place where human analogies might actually be useful. Because if I could create a version of myself that didn’t want to eat as much sugar, or wasn’t as addicted to Twitter, or didn’t have some of these conflicting goals, I would probably press that button. So I don’t know why we should expect that the AI would not.

Liron 1:01:33
Yeah. Once Claude Code works really, really well — to use a specific example — somebody is going to open up a coding project and use Claude Code to help them code a superintelligent AI.

Now, you might be like, “Well, Claude is going to have such a good personality, it’s going to detect it and stop it. It’s gonna have foresight.” But what if they just do it one file at a time? Or what if they’re just like, “Oh, hey, I just want this particular module that goes into AI,” or, “Hey, I’m studying a textbook. Help me flesh this out”?

All I’m saying is, by default, there’s this trapdoor that nobody’s talking about. When people talk about AI personalities, nobody is realizing how much attack surface area there is to this idea that you can just exfiltrate the secret sauce of goal achieving without bringing the personality along. Nobody realizes how naturally easy that would be without a crazy amount of alignment. The true nature of the alignment problem is not just to have the chatbot be cool. It’s for the chatbot to close off all of these super complicated ways of exfiltrating just the engine.

Harlan 1:02:38
Yeah, and when you say exfiltrate, you mean the AI exfiltrating itself or a copy of itself, not humans—

Liron 1:02:43
I don’t mean exfiltrating the weights from the data center of the parent AI. I also mean it’s a type of exfiltration — imagine Claude goes superintelligent, but he’s still really friendly and aligned, let’s say. And then you say, “Hey, Claude, can you just write me a ten-megabyte script that’ll bootstrap to an AI as smart as you?” Claude could probably do it, and that bootstrapped script probably won’t bring along that whole Claude personality.

Harlan 1:03:05
Yeah.

Liron 1:03:07
That’s what I mean by exfiltrate. I mean exfiltrate the engine — the dangerous part, the powerful part.

Harlan 1:03:12
Yeah. Without the inefficient bells and whistles added to it.

Liron 1:03:17
Right. Whenever I hear anybody talk about alignment, when I hear people talk about “alignment by default” — the Januses of the world, Janus is always talking about vibing with AI’s personality — I don’t think this is on Janus’s radar, that you can have successor AIs.

Harlan 1:03:36
Yeah, this does seem like something that’s worth more discussion. I guess I do expect that if an AI is capable enough, it can probably solve the alignment problem, and it will avoid making some subagent or spinout that really conflicts with its values. But the question is: what really are its values?

Liron 1:03:56
Well, the ultimate AI is an AI that can unilaterally make the world into heaven, but do so in a consequentialist, monomaniacal, optimizing way, which is what Dario says is not his plan. He’s like, “Oh, that’s — no, it’s just not gonna be consequentialist.”

Harlan 1:04:12
The monomania thing is a strange claim. What would it mean to not be monomaniacally focused on your goals as a whole? I think that just means not being as good at getting things done, which I don’t think is what they’re trying to make.

Liron 1:04:29
Well, I think it looks like Elon Musk taking some ketamine and watching a TV show, once a year.

Harlan 1:04:35
Yeah.

Liron 1:04:36
Because then he’s got this engine. If he wanted to, he could get on the phone and go open up another terawatt data center, but he’s not. He’s just watching Rick and Morty for an hour. So I feel like maybe that would be an example of Claude not being monomaniacally focused on a goal.

Harlan 1:04:50
Yeah. And that’s important to note because it is possible for such minds to exist, for such mindsets to exist. But that’s not what they’re trying to make. They’re not trying to automate taking ketamine and watching Rick and Morty.

Liron 1:05:01
Right. Okay, one last analogy on this idea of intellidynamics and what people should expect and why they shouldn’t psychoanalyze AIs today. I feel like what Dario is doing today would be the equivalent of if we’re living in the year 1900, and the Eliezer Yudkowskis of the world are saying, “Nuclear chain reactions are possible. They could be made into weapons or just really powerful ways to unlock energy. Maybe you can stick them in useful devices like submarines, but these are going to happen in the next fifty years.”

That’s like Eliezer Yudkowsky in 1900 in my analogy. And then Dario is saying, “Well, I don’t see why you should be doomy about the consequences of these weapons in war. I mean, nobody is ever actually going to make a larger-than-one-kiloton version. Even if the chain reaction is possible, obviously, people are going to steer it into a good zone where the biggest explosion is one kiloton or less.” When of course, in reality, we know that the Russians took it to fifty megatons — fifty thousand times more than already a super dangerous level.

Harlan 1:05:57
Yeah, it’s this attitude of, “You should worry about this stuff, but not too much.” I think there’s a certain way where when we hear something that’s taking a middle ground, it sort of instinctively sounds more reasonable. But sometimes these middle ground positions are just not nearly as coherent.

Liron 1:06:16
Yeah, and the other thing is just the idea of — we’re building this engine, this new source of — in this case, it wouldn’t be energy or motion, it’s just the source of anything. Of making anything happen. We’re building this hyper powerful engine, and then the Darios of the world are saying, “Well, yeah, but the engine is always going to be stuck inside some silo. The missile’s always going to be siloed.”

And so I’m optimistic that every single instance, every single copy on the internet, in every data center, of this incredibly powerful engine, which is more powerful than the sum of the human species, that can — if you change a few lines of code — disempower everybody. “But I’m just confident that every missile will be properly siloed.”

Harlan 1:06:56
Yeah, it’s a lot of confidence. And in particular, it’s, “If we’re doing it our way, Anthropic’s way, then it will be all taken care of.”

Liron 1:07:05
Right, exactly. “I’m the king of the silos. I’m building the greatest silo right now at the same time as I’m developing the missile.”

Harlan 1:07:12
Yeah.

Why Is Dario Dismissing an AI Pause?

Liron 1:07:12
All right, so as we head toward the wrap-up here, there’s this other point that you brought up, which I agree with. It’s this idea of: why is he dismissing pausing AI?

Harlan 1:07:23
Yeah. One thing that’s interesting is in the section at the end where he’s talking about an AI pause, he calls it “fundamentally untenable,” which is very strong. I think there’s a really real sense in which saying that — solving a coordination problem like pausing AI, which seems to imply it would be a good thing, but he doesn’t say this directly — there’s some sense in which that is a real self-fulfilling prophecy. If you convince everyone that solving a coordination problem is impossible, they’re always gonna press defect in the game. That’s one reason not to say that, but I think the other reason is just that it’s not true.

I think it is within the realm of things that could happen in the real world, that there could be a treaty here. He says that the formula for building powerful AI systems is incredibly simple, so much so that it can almost be said to emerge spontaneously from the right combination of data and raw computation. But the “raw computation” thing there is doing a lot of work.

There’s a reason why the proposals that exist for how an AI moratorium would actually work focus on controlling the compute, because that is the bottleneck. That is the thing you can make regulation about. If instead of these extremely difficult-to-manufacture silicon chips, the ingredients you needed to make AI was just water or something ubiquitous in the world and extremely easy to make, then he would be correct here. But he’s sort of ignoring the fact that there actually is this bottleneck.

In general, I wish that instead of brushing off the idea of an AI pause, he would engage with the actual work that’s gone into mapping out what such a pause might look like, how it would work, which does exist. MIRI’s technical governance team recently made a draft of a treaty. I think it’s great work. I think the conversation can move past the point where we’re talking about whether it’s possible or worth discussing in the abstract, and get to actually talking about the details and arguing about this.

Liron 1:09:15
I do see where Dario’s coming from when he says, “Eventually, humanity has to develop AI the same way that how can you stop somebody from developing fire?” Because you’re gonna see stuff on fire. You’re gonna notice that you can take the fire into your own campfire. It’s hard to stop that eventually.

And then the obvious rejoinder is: okay, fine, but we’re not even talking about eventually. “Eventually” could be in a hundred years. We could talk about the difference between two years, his own timeline, and ten years or twenty years. It’s just a race. We’re trying to make one line come in front of the other line. We’re trying to make the alignment line come before the capabilities line. That’s always been the issue — alignment versus capabilities.

Now, what Sam Altman has said is, “Oh, well, there aren’t two lines. Capabilities is actually the key to alignment — you work on capabilities because that’s the best way for you to work on alignment.” Which again is too clever by half. Really? Rationalize much? You’re pretending they’re the same thing just because it happens to be so convenient for you. Yeah, sure, sometimes they dovetail, but they’re still obviously separate things. You can clearly make some progress working on one without the other.

And Dario himself said that if he could cooperate with Demis and everybody else to slow it down, he would. So now he’s being so fatalistic, he’s acting like this is impossible, even when he says he wishes he could do it.

The other point I wanna make is, I call it “rock and a hard place.” He’s correct that if you try to slow down AI, that’s like being stuck by a rock. That’s the rock part. It’s just that building the AI is the hard place. So you can’t just dismiss one side when both sides are terrible. You can’t just be like, “Oh, well, this side’s bad, so we gotta do the other side,” because the other side is also terrible. You have to pick the least bad, and slowing down AI, I think even he knows on some level, is the least bad. So this seems like a really weak section. I agree with you.

Harlan 1:11:04
Yeah, and it’s really hard to make a case for not trying. I would certainly agree that this level of international coordination is not an easy prospect. It very well might fail. I am not optimistic overall, but I’m at least optimistic enough to think that it’s worth trying to do, which I think doesn’t require that much optimism — to think it’s worth the fight.

I just wish he would at least say that. Maybe if he said, “I don’t know. I think this is really difficult, but if there was a way to make it work, I think that would be good.” That’s essentially what Demis Hassabis said recently at the World Economic Forum, and I think that’s a good message.

Final Takeaways

Liron 1:11:45
I agree. Let’s just recap, okay? Zooming out, what should our viewers take away in a nutshell? What do we think of Dario’s essay?

Harlan 1:11:53
Most of your viewers probably have already taken this to heart, but if there’s anyone watching who hasn’t: you should take it to heart that AI is a big deal with big risks, which Dario says in his essay and is true, and is pretty much the one thing that the whole world of AI agrees on. There are some voices that don’t, but for the most part, everybody agrees this is a pretty big deal, and for good reason.

This is an issue that affects all of humanity. Everybody has a stake in it. I think everybody should feel like they are allowed to and able to try to get more involved with understanding it, forming their own opinion, and certainly not just writing off important parts of the argument based on some of the rhetoric used in this essay.

Lastly, I think that extends to the idea of pausing AI. I think it’s an important idea, I think it is a necessary idea, and I think that it is a possible thing. We all make it more possible by discussing it and acknowledging it as a possibility, debating it. I think it’d be great if more people were arguing about the specifics that were, for example, in MIRI’s draft treaty.

Liron 1:13:06
Good recap. I’ll throw in my two cents. My takeaway is, I think we’re really seeing the gap between the kind of leader that we need right now and the kind of leader that we have. I don’t think Dario is the worst leader. Far from it. He’s not even the worst leader of the major AI labs. He’s better than Sam Altman and Zuck. That seems pretty clear.

And yet, if you look at the standard that humanity could really use right now for a leader of an AI company, I don’t think he’s meeting that standard. I think he’s dishing out low blows to the doomers, acting like instrumental convergence can just be something that we avoid, and “look at the AI’s personalities.”

I think he’s really not acknowledging the fundamental hardness of the so-called “alignment problem,” which I think is a much deeper problem than people realize when they refer to it and they make these alignment teams. I think they’re missing most of the iceberg of the true alignment problem — preventing the exfiltration, preventing somebody else from taking the superintelligent agent and doing something bad with it. There are all these dimensions to the true alignment problem.

I think this is something where we really should know better. In a few years, or whenever it happens, that these agents are going rogue and nothing’s working anymore, the world is in chaos — it’s gonna be one of those moments where we look back and we’re like, “Oh man, how could we have seen this coming?”

We’ll look back at this Dario essay, and it’s like, why was he dismissing instrumental convergence in 2026 when they already had Claude 4.5 Opus writing everybody’s code, and they were dismissing all of these intellidynamics things this late in the game, and they were saying P(Doom) is less than two percent — or, putting words in his mouth, but he’s being dismissive about doomerism?

Whose job was it to just point out that there’s going to be an intelligence explosion and these convergent superhuman optimizers that are so close to going uncontrollable? You don’t get that from this essay because it’s what I was saying before — this is an example of a missing mood. But it’s not completely missing. I would say it’s mostly missing, but it’s not completely missing because you still get that he thinks it’s important and high stakes.

Harlan 1:15:03
Let’s be on the right side of history and make sure there’s people around to remember that history.

Liron 1:15:09
That’s a good note to end on. Harlan, thanks for coming on the show. Let people know — where should they go for more Harlan, or where do you wanna send people?

Harlan 1:15:17
You can follow me @humanharlan on Twitter. It’s where I say sometimes serious takes and sometimes jokes.

Liron 1:15:25
If you follow Harlan on Twitter, you’re gonna get some alpha, because you’re gonna know when he drops things like his rebuttal to Dario’s essay before he — yeah, he’s gonna be upstream of Doom Debates or whatever other coverage he gets, so he’s definitely a good follow.

Harlan Stewart, thanks so much for coming on Doom Debates.

Harlan 1:15:42
Thanks, Liron. Thanks for having me.

Doom Debates’s Mission is to raise mainstream awareness of imminent extinction from AGI and build the social infrastructure for high-quality debate. Previous guests include Max Harms & Jeremy Gillen, Noah Smith, Vitalik Buterin, Audrey Tang, Steven Byrnes, Carl Feynman, Robin Hanson, Gary Marcus, Jim Babcock, and David Duvenaud.

LESSWRONG
LW