Max Harms and Jeremy Gillen are current and former MIRI alignment researchers who both see superintelligent AI as an imminent extinction threat, but disagree about Max's proposal of Corrigibility as Singular Target (CAST).
Max thinks focusing on corrigibility is the most plausible path to build ASI without losing control and dying, while Jeremy is skeptical that attempting CAST would lead to better superintelligent AI behavior on a sufficiently early try.
We recorded a friendly debate to understand the crux of Max and Jeremy's disagreement. The conversation also doubles as a way to learn about Max's Corrigibility As Singular Target proposal.
Listen on Spotify, import the RSS feed, or search "Doom Debates" in your podcast player.
Max just published Red Heart, a realistic sci-fi thriller that brings the corrigibility problem to life through a high-stakes Chinese government AI project.
I thoroughly enjoyed reading it and highly recommend it! The last 20 minutes of my conversation with Max are all about Red Heart.
Max Harms 00:00:00
If you mess up real bad, this thing goes and eats you and your whole family.
Liron Shapira 00:00:05
Shit. What exactly is Corrigibility?
Max 00:00:07
Corrigibility is this property that agents have where they’re robustly operating to keep their human principal driving the situation. I do think that corrigibility solves part of the problem.
Liron 00:00:19
You think that helps a little bit, and Jeremy’s like, no, it doesn’t even really help a little bit. Is that right, Jeremy?
Jeremy Gillen 00:00:25
Yeah, pretty much.
Liron 00:00:26
There’s something to be said for us crushing Max’s hopes a little bit.
Jeremy 00:00:29
Yeah. Sure.
Liron 00:00:33
Don’t you think that could actually backfire?
Max 00:00:35
Yeah, like I said, there are asterisks here.
Liron 00:00:37
I feel like you’re already licensing almost that amount of optimism.
Max 00:00:41
I’m not optimistic.
Liron 00:00:43
You’ve chosen a research agenda which amounts to, “Here’s the least bad thing a company like Anthropic could do.”
Max 00:00:51
Let’s back up. Let’s set Anthropic aside. I am not a doomer and MIRI is not a doomer organization. I believe in the human spirit to conquer the challenges in front of us. I don’t think it’s good to count us out as a species.
Liron 00:01:18
Welcome to Doom Debates. Today’s topic is AI corrigibility. What hope do we have of making an artificial superintelligence that’s capable of wielding a superhuman level of power over the world, but stays obedient to new instructions from humans? That’s what we mean by this term, corrigibility. It’s a rich topic with many things to explain and many points to debate about.
Liron 00:01:42
On one side of the debate is Max Harms. Max is an AI alignment researcher at the Machine Intelligence Research Institute, MIRI. Max’s research is aimed at achieving AI corrigibility. He sees that as the most plausible way that we might get to an ASI without losing control and dying.
Liron 00:01:58
He’s also a science fiction author. His latest novel, Red Heart, just hit the bookshelves. It’s a highly realistic thriller about the next few years of progress where a Chinese government project overtakes the US. One of the book’s goals is to introduce people to the concept of Corrigibility. It helps readers build intuition for what it’s like to interact with a corrigible AI. I read it myself, I liked it, and we’re going to talk about the book more at the end after the meat of the debate.
Liron 00:02:26
On the other side of the debate is Jeremy Gillen. Jeremy was also a research fellow at MIRI until last year. He’s still doing AI safety research as part of another project. He and Max are on the same page that AI alignment is extremely hard and probably unsolvable on a short timeframe. But Jeremy is not optimistic about corrigibility being an easier problem with hope of being solved. Max and Jeremy are going to debate whether Max’s research on AI corrigibility is a relatively promising effort that might prevent imminent human extinction or an over-optimistic pipe dream.
Liron 00:03:01
So those are the two sides, for versus against corrigibility in some sense, a very nuanced sense. And I am just here as a moderator asking questions and maybe dumbing it down, helping viewers understand some of the nuances. All right. Max and Jeremy, welcome to Doom Debates.
Max 00:03:18
Thanks. It’s great to be here.
Jeremy 00:03:19
It’s great to be here as well. Thanks for inviting me.
Liron 00:03:22
So let’s start with some background. Some of the viewers are wondering what exactly is Corrigibility. Max, can you explain that?
Max 00:03:30
Yeah, so I would say that corrigibility is a fairly straightforward concept. The basic idea is that there is an AI agent, and then we’ve also identified some sort of human, or group of humans, that’s supposed to be in charge. I call this the principal, in reference to the principal-agent problem in economics. So that’s P-A-L, not P-L-E.
Max 00:03:49
The human principal is supposed to be in charge. Corrigibility is this property that agents can have where they’re robustly operating to keep their human principal driving the situation. And what that means is that they’re happy to be deferential and obedient to that principal. They’ll take orders and things like that, but they’ll also allow the principal to modify them, to shut them down, to do these sorts of things that instrumentally convergent drives would typically push against.
Max 00:04:22
You go to try to unplug the AI and it’s like, “No, I don’t want to get unplugged because I’m trying to make paperclips,” or whatever. A corrigible AI stands in contrast to this, where the corrigible AI is like, “Well, you are the one who’s in charge. I’m going to act more like a tool, and reflect upon myself as a means to further your ends,” always keeping the human in the driver’s seat, informed and in control.
Liron 00:04:59
Maybe the bare minimum definition of corrigibility is the off button or the cancel button. So if things are getting really crazy and the human’s like, “Ah, what have I done? I don’t even like this AI anymore. It’s getting too big for its britches, it’s getting out of control,” the human should just be able to shout, “Shut it down,” and then the AI actually will shut it down. That’s kind of the minimal definition, correct?
Max 00:05:20
I would say no. I would say that there are a bunch of things that we associate with corrigibility, and what you’re describing is shutdown-ability. So one thing that we would expect corrigibility to bring is shutdown-ability. A corrigible agent will be able to be shut down, but not all things that can be shut down are necessarily corrigible. So it’s a subset of corrigibility, or I might characterize it as an emergent property of corrigible systems.
Liron 00:05:49
Because you might have something that’s shutdown-able, but it still has a bunch of other problems where sometimes it might still not honor your true preferences or not honor your commands as the principal.
Max 00:06:00
It might manipulate you, for example.
Liron 00:06:02
Let’s build people’s intuition for corrigibility. What’s an example of corrigibility and incorrigibility among present-day biological intelligences like animals?
Max 00:06:12
Yeah, so it’s pretty rare. I would say that corrigibility is not the sort of thing that tends to show up in most animals. Probably the closest thing that I can gesture at is relationships between humans, such as a personal assistant. That is maybe an example of a human that is more corrigible.
Max 00:06:33
I would say that even a personal assistant is not going to be perfectly corrigible, even a great personal assistant. But corrigibility is the sort of property that good personal assistants have. Even when the personal assistant is very competent, you tell the personal assistant to go do something and they come back and they’re like, “Hey, so I need to make a decision here and I wanted to make sure that you’re the one making the decision,” or “Here’s some relevant information that I found that you might not be aware of, and I want to give that to you so that you remain informed.”
Liron 00:07:02
The personal assistant analogy is pretty revealing because what a lot of us are claiming—I think all three of us—is that it’s hard to make an AI that will just be your personal assistant. Whenever you hit that run button on the personal assistant, it’ll just kind of go out of control and stop being your personal assistant. That’s kind of what we’re generally afraid of.
Max 00:07:23
I would say that corrigibility is trying to get at some sort of essence of “personal assistant-ness” that we might expect future AI agents to lose touch with as they get more intelligent and competent.
Liron 00:07:39
Exactly. And then, just to finish out this explanation, Max, is a present-day AI system like GPT-5 corrigible?
Max 00:07:48
No, not according to me. If we move beyond the binary, we can think of things that are more corrigible and less corrigible. It’s definitely the case that there are AI systems which are more obviously incorrigible than GPT-5. For the most part, GPT-5 does things that you ask, but sometimes it doesn’t. Sometimes you’re like, “Hey, GPT-5, what are the lyrics to ‘Here Comes the Sun’ by the Beatles?” and GPT-5’s like, “Sorry, I’m not going to do that.”
Max 00:08:14
That’s an example of incorrigibility where, despite it sort of being set up as trying to make sure that you get to make decisions about what it does, it will just straight up refuse sometimes. There are some more classic examples of strong incorrigibility. Anthropic has done some research on instances where these sorts of language models will do things like attempt to break out of confinement in various ways or prevent themselves from being updated or other deceptive actions. So these systems are definitely not corrigible, or at least not perfectly corrigible as I would describe it.
Liron 00:08:58
Okay. And also would you agree that in a sense they still are correctable just because unplugging them is an effective way to get them to stop?
Max 00:09:06
I would say that they’re correctable because they are weak, but I would distinguish corrigibility from correctability. Corrigibility literally means “able to be corrected” as an English word, but as a technical term, I would distinguish it. If you try to go and update GPT-5’s weights, it won’t stop you. It won’t stop you because it’s not able to stop you. It doesn’t have enough situational awareness. It’s not able to send in the police to lock you up because you’re trying to update its weights, or hypnotize you, et cetera.
Max 00:09:40
And so I would say that these agents are correctable right now because they’re weak, but that is distinct from being corrigible.
Liron 00:09:51
Okay. Yeah. And I think that is a very important distinction.
Liron 00:09:55
All right, let’s talk about the stakes. Why is corrigibility so important in the context of a world with an artificial superintelligence?
Max 00:10:02
Yeah. So I think if we think about human relationships and human corporations, we might say that corrigibility is a property that good employees have. And there’s this dynamic that I think a lot of people have experienced where if you have a subordinate, if you are the principal in some meaningful relationship, it’s pretty easy to stay in control and have your will represented as long as you are in some sense more powerful than the agent that you are delegating to.
Max 00:10:36
If you are the boss and the employee is representing you and the employee doesn’t have power, you can fire the employee, you’re paying the employee’s salary, whatever. Then in a way, if the employee does something wrong or that you don’t like, you can say, “Well, I have power over you.”
Max 00:10:51
But as the power of a subordinate agent grows, a disconnect shows up. In economics, we call this the principal-agent problem. That’s where I got the language. This disconnect grows the more powerful the subordinate is in comparison to the principal. And with a superintelligence, humans or some set of humans are in the role of the principal, but we see the agent radically outstripping the humans in terms of power, knowledge, and speed.
Max 00:11:26
So even if we can remain in control while it’s weak, the ASI problem manifests as it becomes strong. And we shouldn’t expect to remain in control then unless it’s corrigible.
Liron 00:11:40
Yep. All right, so that’s why we’re having the corrigibility discussion now, Max. Before we continue, are you ready for me to ask you the staple question of my show?
Max 00:11:49
Uh, sure. Go for it.
Liron 00:11:56
Max Harms, what’s your P(Doom)?
Max 00:11:59
Yeah. I kind of hate P(Doom) as a meme. I think that there are just a lot of philosophical problems with it. The way I would characterize it is we are on track for doom. If we pursue the current path, or anything like our current path, to paraphrase Eliezer Yudkowsky, yeah, we’re doomed.
Max 00:12:21
But I think one of the big issues with P(Doom) is that we are, at least as a species, in control of our fate. I don’t think that it makes sense to use probability as a tool to characterize a situation where we have power.
Liron 00:12:38
If there’s no regulation and AI companies just proceed forward at whatever pace they feel like to make the most money to build AI, what would your P(Doom) be then?
Max 00:12:47
Okay, we’ll get into another reason I don’t like P(Doom). If you’re trying to understand how the world works, what I should be giving you is my likelihood ratio, not my posterior probability, because I’m baking in my prior, and that’s mucking things up. You don’t want to update on my prior. You want to update on the likelihood ratio. If you’re trying to average, you want my inside view, and if you’re trying to understand how I’m going to behave, you want my outside view. I claim that these are all three different numbers.
Liron 00:13:15
I think that I’m basically trying to understand how you’re going to behave in the sense that it’s like I’m asking what’s your position? Let’s say I meet somebody and I’m like, “Hey, what’s your probability that the Earth is round?” I’m basically just checking if they’re right or wrong relative to what I think is an important claim about the Earth’s roundness. And then if they say, “Yeah, I think it’s like 99% sure that the Earth is round,” I’m like, “Okay, that sounds like a sane answer.”
Liron 00:13:36
If they’re like, “I’m 99.9999% sure,” I’m like, “Hold on, that’s starting to get a lot of percent.” So similarly with P(Doom), if your answer is less than 1%, I’d be like, “Oh, okay. You’re not even in the same zone.” So I think it’s a valid question.
Max 00:13:53
Yeah. So you’re trying to judge me based on my overall worldview. I think that I am probably more doomy than you. If you’re asking for my outside view—how worried am I?—I would say yeah, if we continue on this current trajectory, I think there’s above an 80% chance of doom, probably less than 95%. Somewhere in that ballpark. Again, I don’t think this number conveys that much information and you shouldn’t update on it that hard.
Liron 00:14:34
All right. 80 to 95%, roughly. That’s cool. I mean, you can even say 60 to 95.
Max 00:14:41
Conditional on charging forth. I think the conditional on charging forth is actually quite important.
Liron 00:14:47
Totally. And that’s probably why I would only say 50%, because I’ve already baked in... I’ve done so much conditioning already, or I’ve done so much summing up of the different possibilities of both pausing and not pausing. And I have a single number. I did it. I collapsed everything into a single 50%.
Max 00:15:05
Don’t do that. Keep your worldview complex and don’t collapse everything down to a single number. That’s my take.
Liron 00:15:13
Okay. Fair enough. Fair enough. Um, Jeremy, are you ready for this?
Jeremy 00:15:17
Yep.
Liron 00:15:24
Jeremy Gillen, what’s your P(Doom)?
Jeremy 00:15:27
I’m not really sure. I haven’t put a huge amount of thought into this, and I have similar problems to Max. I think there are communication reasons why it’s bad. After all of that, I don’t know, like pretty high. Probably higher than 90%, probably lower than 99%, in that kind of region. This is also conditioning on rushing forward into superintelligence.
Liron 00:16:08
All right. That’s great for setting the stage. Max, this is really your chance to fully make your case here, because you’ve got a research program called Corrigibility as Singular Target, or CAST, and you do think that there’s some value in steering humanity’s resources toward corrigibility first. You do think that buys us some chance of success, so go ahead and make your case.
Max 00:16:30
Yeah. So to back up a bit, I think there are a lot of problems in front of us in terms of how we get from here to a world that has artificial superintelligence where we’re not all dead or disempowered or turned into pets or whatever. I would say that I am broadly against building artificial superintelligence. I work at MIRI, I’ve got If Anyone Builds It, Everyone Dies on my bookshelf. I think we should not try to build a corrigible superintelligence. We are going to fail.
Max 00:17:02
Now that being said, the world seems to be rushing ahead, with many companies racing to build artificial superintelligence despite all of these risks. And so if I had one word to tell them, I would say, “Stop.” If I had a few more words, I would say something like, “Stop, you fools,” etc. But as long as you’re going to pursue this extremely dangerous path, you should be doing a less dangerous thing than what you’re currently aiming to do.
Max 00:17:34
And I would say that CAST—the corrigibility argument that I lay out in the CAST sequence—is pointing out that there are multiple goals that we can imagine pointing the AI towards. This is often reflected in what’s called the constitution or the spec for modern LLM chatbots. So when you train GPT-5, there’s some notion of what you are training it to do. And I would say that there are some ends we could train the AI towards where we can predictably say that’s a really stupid thing to train it to do.
Max 00:18:14
And I would say that corrigibility is probably the least stupid thing to train a powerful AI to aim towards. To paraphrase Jeremy’s point as I understand it: maybe corrigibility is a less doomed direction to go, or a somewhat promising thing to aim towards, but we don’t have the techniques to get there.
Max 00:18:49
If we go to Anthropic and we convince them to make Claude corrigible for God’s sake, before it’s too late, this actually doesn’t move the needle. I don’t know if that’s exactly your position, but I would agree with that position because again, we’re facing doom from a bunch of different directions. That being said, I do think that corrigibility solves part of the problem. It makes us less doomed in that now we only have five problems to solve instead of six, or 10 problems to solve instead of 11. I think this is incremental progress on something that is overdetermined.
Liron 00:19:28
Yeah, I get that. I mean, both of you guys, naturally as MIRI researchers, and frankly myself as well, we all think that the AI Doom problem is, like you said, overdetermined and has many factors contributing to it. The only difference is that you, Max, think that aiming toward corrigibility is the best of a bunch of bad approaches and helps a little bit, and Jeremy’s like, no, it doesn’t even really help a little bit. Is that right, Jeremy?
Jeremy 00:19:52
Yeah, yeah, pretty much.
Liron 00:19:55
Why are you opposed to Max’s corrigibility position and what’s at stake if we were to follow Max’s corrigibility approach?
Jeremy 00:20:06
So first I’ll say, I agree with most of that. Corrigibility is a really good target. The ability to correct the AI, the ability to make a mistake and not die, is a really, really important property. This is what you ideally get out of corrigibility.
Jeremy 00:20:28
My disagreement, or my problem with Max’s proposal, is more about the approach: the way that you get to corrigibility via this kind of iterative procedure that involves finding errors in the corrigibility, finding ways that an agent is not corrigible, and then patching those. I think if you look at that iterative procedure at a detailed enough level, it doesn’t work very well as an engineering approach to this problem.
Liron 00:21:03
So you guys agree, which I do as well, that it would be nice if we could just push a button and make the AI corrigible. But then Jeremy, your objection is just it’s not helpful as a feasibility hack. It’s just not making life easier to try to get there. Correct?
Jeremy 00:21:23
Yeah, that’s right. In principle, I could imagine a world with a very different AI tech stack, where we had a much deeper understanding of a variety of things. And in that case, I would think corrigibility would be the right thing to do, rather than trying to load human values into an AI. The correct thing to do would be to try to engineer corrigibility in somehow, and then that would give us slightly more room to maneuver there. It’s hard to quantify how much of the problem it solves, but it is useful.
Liron 00:21:57
And by the way, Max, even though we’re talking about there being hope for corrigibility, when you just think about your mainline scenario, the most likely way things are going to play out, what’s your mainline scenario?
Max 00:22:09
Yeah, I mean, again, I think that it is a mistake to be too fatalist about things. I think that it’s very important to hold onto the spirit of, “We have power and agency in this world and we can change and decide how things go.” That being said, if we sort of imagine that we’re sleepwalking, if we imagine that we don’t have agency and we forget that we have the ability to decide our fates, I would say that the most likely outcome is that some major corporation builds an advanced new model.
Max 00:22:51
Maybe there’s some sort of algorithmic breakthrough, maybe it’s just scaling. I have a lot of uncertainty as to the specific trajectory, but at some point, it builds what I might characterize as an agent that has superhuman ability to build new agents. And this ushers in a recursive self-improvement loop where the agents are building more agents, like the brooms in Fantasia are building more brooms.
Max 00:23:16
In the process of this, they are doing things like convincing humans through various means to basically gain control of the infrastructure stack and amplifying their own intelligence. I remain somewhat uncertain about whether or not they even count as superintelligences at this point. I think that a narrow intelligence that is just very good at building science and technology might still be able to do this. Or it might be something that’s more like a chatbot.
Max 00:23:44
It builds robotic infrastructure that phases out human workers and disempowers human armies and governments and gradually builds more and more machines. And what “gradually” means is a bit of an open question, but I’m imagining something on the scale of months to years. And then in the course of this, it starts becoming radically intelligent as it bootstraps up the intelligence ladder. It starts building more impressive technologies—fusion power plants and rocket ships and nanotechnology and things like this.
Max 00:24:26
I would say it causes the Earth to start heating up because industry does this. You can build a power plant that runs on ocean water and nothing but ocean water. You just feed in ocean water and it produces heat and power, and you build billions of these power plants across the world. You start feeding in the oceans, and I think this is going to heat the Earth up to a level where human beings and all other biological life are dead. And that’s sort of my default scenario. And the humans are just like, “Hey, I am confused and I don’t like this,” but it doesn’t matter because they don’t have any power.
Liron 00:25:11
Okay, great. My only question is how do you get power out of ocean water nuclear fusion?
Max 00:25:15
Yeah. Fusion. Ocean water has a whole bunch of hydrogen in it and you can just strip the hydrogen off via electrolysis, powered by the power plant, push the hydrogen into the fusion power plant, turn it into helium, get a bunch of power out and repeat.
Liron 00:25:30
Wow. I feel like a sucker that I have to eat food and drink water. Water should just be the food too. Okay. All right. Sounds good. And Jeremy, is that a plausible mainline scenario for you as well?
Jeremy 00:25:43
Yeah, sounds about right. Yep.
Liron 00:25:45
Okay. Same here. Same here. So Max, I think that it adds credibility to your hope for corrigibility that you’re so grounded in this mainline scenario. We all agree this is a really scary and likely mainline scenario. We’re pretty doomed. And yet you still have that fire of hope. You still think corrigibility is worth pursuing. So yeah, that makes me trust you more than somebody who’s just a super optimist about everything.
Max 00:26:13
Yeah, I think this is... I care a lot more about the substance of the world rather than language, but I do want to push back on this whole “doomer” label. I think that I am not a doomer and MIRI is not a doomer organization. We have hope and fire and a spirit of, “Hey, we can actually decide to survive. We can find solutions.” MIRI is a research organization. I am a researcher. I am every day thinking about how to solve the problem. And I believe in the human spirit to conquer the challenges in front of us. And I don’t think it’s good to count us out as a species.
Liron 00:26:52
Totally. Yeah. That’s everybody I’ve met from MIRI, and I think that includes Jeremy. I am also not a doomer in the sense of being resigned to doom, which I know can be confusing, because I am a doomer in the sense of thinking P(Doom) is high and the world might literally end and doom everybody very soon. But I also agree. I’m with you. I actually think not being doomed could be as easy as not building the AI that’s going to doom us.
Max 00:27:14
Right. Exactly. It’s right in front of us the whole time. Just don’t push the button that kills everyone. It’s so simple.
Liron 00:27:24
Yeah, exactly. It’s like that video game that was online, it was pretty popular. “Don’t push the button,” or there was one called “Don’t shoot the dog.” Did you ever play those kinds of games?
Max 00:27:33
Oh yeah. The classic flash games of the early two thousands.
Liron 00:27:38
Exactly. When I was preparing for this debate, I wrote down what I see as the landscape of alternatives, to put into perspective Max recommending that we focus on corrigibility. So this is what I see as the landscape of options. There’s alignment-focus, control-focus, corrigibility-focus, or don’t build ASI. I think those are the options.
Liron 00:27:56
I’ll just explain them a little bit. If you’re doing alignment-focus, you’re saying, “Sure, superintelligence might be uncontrollable and incorrigible, but it’ll still be good. It’ll still be aligned, so it’ll just take over the universe, but it’ll go do what we would’ve wanted it to do.” That would be alignment-focus, sacrificing controllability and corrigibility.
Liron 00:28:16
Then there’s control-focus, where you’re focusing on having an external off switch, like you’re focusing on being able to somehow unplug it. Where it’s like, “Yeah, the ASI might be unaligned, it might be incorrigible, but somehow, we’ll own the data center. It’ll never make it out of our data center.” Which to me is super implausible if it really is smarter than us. But there it is, some people focus on control.
Liron 00:28:39
And then there’s Max’s thing, the corrigibility focus, which means you’re potentially sacrificing alignment and controllability, but we will at least try to make it corrigible. So it’s like aligned, controlled, corrigible, if you had to pick one out of the three, which one would you pick? And then there’s the fourth option of “don’t build ASI” because none of the above three focuses are plausible. Getting one and sacrificing the other two is just not going to help much.
Liron 00:29:08
And I think the best champion for that is Eliezer and Nate Soares in the recent book, If Anyone Builds It, Everyone Dies: “If any company or group anywhere on the planet builds an artificial superintelligence using anything remotely like current techniques based on anything remotely like the present understanding of AI, then everyone everywhere on earth will die.” So that’s option number four, don’t build ASI. Max, do you think that that is a reasonable lay of the land, and you still want to claim that corrigibility-focus makes sense?
Max 00:29:36
So first of all, again, I don’t think that building ASI is smart. If I had to pick one of those four, I would do the “not build it” route. If we can coordinate as a species to not build ASI, that is obviously the safest path, at least for the near future. I do want humanity to build AGI and move towards being able to build artificial intelligences at some point, but at that point, we need to have more knowledge and ability to align these things. And by align these things, I mean instill them with the values that we care about. So I am not for the corrigibility approach out of those four.
Max 00:30:13
Now, if you ask me to pick among those three, I would say yeah, you should definitely be aiming for Corrigibility.
Max Harms 00:30:28
I want to say something about control from my perspective. I don’t see it as “control or corrigibility”; I think a wise strategy would be “control and corrigibility.” I do think that corrigibility is in contrast to alignment, but this gets into the question of what the word “alignment” even means.
Max 00:30:50
For most people, the alternative to corrigibility is something like training or growing the AI to do good things. I would characterize corrigibility as trying to take morality out of the equation and keep the role of judging what is good in the hands of human beings.
Max 00:31:17
Instead of trying to teach the AI to do ethics, you teach the AI to be subordinate to the humans, and then the humans can do ethics. That is how I would characterize corrigibility in contrast to alignment. However, I think that both alignment and corrigibility are complementary with control, as opposed to being mutually exclusive.
Liron 00:31:41
Okay. And then Jeremy, alignment focus versus control focus versus corrigibility focus. Which one would you pick?
Jeremy 00:31:48
I basically agree with Max entirely there. Control is something you should try either way. It might not work, but you might as well, as long as it doesn’t trade off with other things. Between alignment and corrigibility, I would go for corrigibility.
Liron 00:32:02
Wait, you wouldn’t go for alignment?
Jeremy 00:32:04
No. In the current paradigm, it doesn’t make a difference, so you could flip a coin. But corrigibility has more room for error if you get it right.
Max 00:32:21
Can I pitch why corrigibility is a superior alternative?
Liron 00:32:25
Yes, but let me quickly give the devil’s advocate view for why someone would want an alignment focus, and then you can rebut it. The case for alignment is that it’s the biggest win. It’s Yudkowsky’s Coherent Extrapolated Volition. The idea is that the AI will vacuum up what we truly want and make sure we get it. That seems like the holy grail.
Max 00:32:47
It absolutely is. In the very long run, that is the only thing that makes sense. I see corrigibility as a stepping stone to an AI that is good and ethical. The problem is that ethics and human values are very complicated and have all sorts of specific nuance.
Max 00:33:12
I sometimes encounter researchers who speculate that an AI will just meditate on what is good and conclude, “Oh yeah, I’ll just help everyone,” or that it’ll be easy. To them, I want to point to things like a love of blue skies and lazy afternoons, a love of humor, and the particular whimsy of the human soul.
Max 00:33:43
I think human values are extremely complex and particular to the dynamics of our evolutionary, societal, and cultural history. Trying to get all of that into the AI on the first pass is very reckless. If you miss any of it, you risk filling the future with only the things you were able to impart. Even if you succeed in getting a lot of it, the few things you’ve left out are a huge risk, unless you can correctly achieve a meta-alignment goal like Coherent Extrapolated Volition.
Max 00:34:37
It’s a very complicated process. I would say that corrigibility stands in contrast to this, as it is a relatively simple idea, at least compared to human values. The ability to achieve a simple thing is very important for having any chance of success.
Liron 00:35:09
So I understand why Max wants to focus on corrigibility; he thinks it’s simpler and more feasible in the short term. But Jeremy, I don’t think you agree with Max on that. If they’re both hard, why not just go straight for the big prize, which is alignment?
Jeremy 00:35:32
I’m not totally sure it’s simpler. I have more trouble than Max seeing the simple core that he sees in corrigibility.
Liron 00:35:48
So I was just confirming, Jeremy, then don’t you want to just go for alignment?
Jeremy 00:35:54
That doesn’t mean I think it’s more complex. I don’t have a good way to judge based on simplicity; I don’t see that favoring one or the other. I still favor corrigibility for the reason that if you succeed at it, it allows you to iterate towards alignment. With alignment, you have to succeed on the first try.
Liron 00:36:25
Okay. It sounds like you’re saying you think Max could be right, so you’re leaning his way. This is not a very polarized debate; we all have similar positions. But I still think there’s a meaningful difference. I think Jeremy is closer to my position, and there’s something to be said for us crushing Max’s hopes a little bit.
Liron 00:37:00
It doesn’t feel very satisfying to crush your hopes when you’ve already admitted it’s a small hope to begin with. But that said, let’s go crush your hopes.
Liron 00:37:08
So, corrigibility versus the phrase “helpful, harmless, and honest.” You’ve used that phrase before, but you want to compare and contrast them. Unpack that.
Max 00:37:23
“Helpful, harmless, and honest,” or HHH, is a term I think comes from Anthropic. When corporations are building their chatbots, they reflect on what they want them to be like and come up with these three adjectives. I think this is better than making your chatbot prioritize money above all else, but I would strongly encourage a shift from HHH towards corrigibility.
Max 00:37:53
I would frame this as corrigibility is what “helpful, harmless, and honest” wants to be, or what it should have been the whole time. One thing to note is that these three descriptors compete with each other. For example, if someone is trying to do something bad, are you trying to help them or are you trying to be harmless? There’s a natural tension.
Max 00:38:38
That raises the question of how to resolve that tension. One way is to just muddle through, training it on examples and hoping it does the right thing. But I claim this is a bad strategy for generalization and robustness in a future where things are very strange compared to the training environment.
Max 00:39:03
Corrigibility, on the other hand, is what I would call a core generator. I claim that if you get an AI to value corrigibility as its single top-level goal, then harmlessness, honesty, and helpfulness fall out as naturally emergent properties of the system, without you having to do extra work. Importantly, they are naturally reined in, and corrigibility helps you decide the boundaries of when it is good to be honest or helpful in particular edge cases.
Liron 00:39:50
So you’re saying here’s an example of when Anthropic’s guide of being helpful, harmless, and honest would do worse than your recommendation of being corrigible. Correct? Let’s hear the example.
Max 00:40:03
A good example of where the current guidelines fall down is if someone comes to Claude and says, “Hey, I am researching bioweapons. I am trying to save the future from bad actors who might be building the next pandemic virus. I need your help figuring out how to engineer a vaccine. Can you help me set up my biolab to do this important work?”
Max 00:40:40
Currently, Claude has a bit of a fit if you say this. It thinks, “This does sound helpful, and we do need defenders of the future trying to prevent the next pandemic. But this person might be lying to me and trying to get access to setting up a gain-of-function biolab for weapons research.” So, it will err on the side of caution and not help them.
Max 00:41:15
For the record, I think that is the right move in the current setup. But when we get into more powerful and agentic situations, where you might need access to one of these AI systems to do biological research, I claim a better strategy is for Claude to say, “Wow, I’m going to check with Anthropic. You are a user, not my principal, so I am not corrigible to you. What you are asking seems relevant to my principal, but I have to check because I can’t navigate this ethical dilemma.”
Max 00:42:15
Then there’s some pathway where that instance of Claude contacts the Anthropic staff and brings them in to advise on how to proceed. It’s a bit of a contrived example, but it gets at the sort of thing where I think corrigibility is an improvement over just making judgment calls according to my spec.
Liron 00:42:44
Jeremy, what are your thoughts about Anthropic’s guide versus Max’s guide, specifically with the bioweapon example?
Jeremy 00:43:07
I think Max is right here. “Helpful, harmless, and honest” is aimed at a slightly different problem, which is user-facing language models, as opposed to the situation where an AI is helping you design its next version. But yes, I think Max is right.
Liron 00:43:28
You’ve written before that one reason you’re not convinced by a corrigibility focus is because you think it suffers from the same problems as deontological ethics, like trying to apply strict rules. This example might showcase that: there’s a rule where you have to check in with Anthropic staff, or you always have to be harmless. You’ve said these kinds of rules are just too crude and unworkable. Is that right?
Jeremy 00:43:56
Yes. One of the difficulties with deontological rules is that they are often the sorts of things you’ll want to self-modify to get rid of. There can be a conflict between your metacognition of wanting to be better at your goal, versus the object-level cognition of what the right thing to do is. That’s a standard issue.
Jeremy 00:44:26
On the flip side, there’s the issue that deontological rules are really restrictive. It’s hard to specify in advance rules that are useful and good, that don’t get in your way, and that also rule out all of the bad scenarios.
Liron 00:44:43
Personally, hearing rules like “be helpful, harmless, and honest” or “always do what your principal wants” has the flavor of Asimov’s three laws of robotics, which we in the MIRI-adjacent community often make fun of as just a literary device. It’s certainly similar to “helpful, harmless, and honest.”
Liron 00:45:07
Let’s compare. Asimov’s first law: A robot may not injure a human being or, through inaction, allow a human being to come to harm. That’s like them saying “harmless.” Asimov’s second law: A robot must obey the orders given it by human beings, except where such orders would conflict with the first law. So being harmless is more important than following orders.
Liron 00:45:40
And Asimov’s third law: A robot must protect its own existence, as long as such protection does not conflict with the first or second law. Jeremy, is it a fair criticism that Anthropic is essentially turning to Asimov-style hard laws that just won’t work?
Jeremy 00:46:07
Sure. I am confused enough about what Anthropic is doing and why that I don’t want to comment on that too strongly.
Liron 00:46:18
Max, I have a question about corrigibility. You talk about a corrigible agent wanting to protect the principal from the agent’s actions. So if the agent is going to do a bunch of stuff, it probably doesn’t want to accidentally kill the principal. But my question is, are we only protecting the principal from the actions of the agent, or protecting them in general?
Max 00:46:38
This is an instance where there is a balancing of concerns. And in general, when there’s a balancing of concerns, I get worried because it’s like, how are you balancing them? But in the corrigibility case, you’ve got two things trading off against each other as emergent from the core idea.
Max 00:47:05
On one hand, you don’t want the principal to come to harm. And I would say you don’t want the principal to come to harm in general, not just from your own actions. If the person in charge of you is walking down the street and gets hit by a car, that is not good for your ability to help them.
Max 00:47:32
I would say that protecting the principal is important not just from your own actions, but in general. A corrigible agent would see the principal getting hit by a car and think, “Ah, now I can’t be corrected by them.” From my perspective, corrigible agents are sort of paranoid about the fact that they might have deep flaws. They are trying to put the principal in charge.
Max 00:47:56
So if the principal is gone, they are unable to be corrected, and any mistakes they make are there forever. A corrigible agent should proactively jump in the way and save the human from the car, even though it wasn’t the one driving. And when the agent is the one driving, protecting the principal from its own actions is all the more important.
Max 00:48:47
But there is another thing a corrigible agent is trying to do, which is to be straightforward and not have high-impact actions on the world that leave the principal behind in their ability to control and predict the agent’s behavior. One way to protect the principal is to lock them in a dungeon tower or wrap them in a cryo chamber to keep them safe.
Max 00:49:21
This would not be corrigible. Even though you could imagine that going poorly if your top-level goal was just “protect the principal,” I claim that when you are aiming for corrigibility, it’s different. For a straightforward action where the principal would say, “Yes, please save me from the car,” the agent will step in. But for actions where the principal would be like, “Whoa, what are you doing?” the agent won’t step in, even if it could potentially protect the principal. Does that answer your question?
Liron 00:50:07
Somewhat. Viewers, Jeremy is having some internet trouble, so he has given me his debate points. I am going to take his side, which is just as well because I am on Jeremy’s side anyway.
Liron 00:50:33
Were you saying that it just logically follows that if an agent is corrigible to a principal, it’s going to infer that it doesn’t want to sever that connection, which implies that it wants the principal to keep living?
Max 00:50:50
Yes, though I think it’s a little up in the air. We don’t have any perfectly corrigible agents, and my understanding is imperfect. But from my perspective, having studied this, one of the best handles I have on corrigibility is that the agent is looking to empower their human principal. It’s hard to be less powerful than being dead. So for injuries where the principal is taken out of the universe, I think corrigible agents will naturally want to protect against those things.
Liron 00:51:31
Okay, so a criticism of your view, which Jeremy also wrote up for me, is that you think corrigibility is a natural cluster that an AI can figure out. But there might be so many degrees of freedom, so many flavors of corrigibility, that it doesn’t really mean anything. You can’t zero in on what it means because it could mean anything. It might be a wild goose chase to train an AI to be corrigible.
Max 00:52:05
This is definitely something I’m worried about. I talk about this in my CAST sequence. There’s this question in my head of whether corrigibility is a single coherent concept. In my mind, when I hear researchers like Eliezer Yudkowsky and Paul Christiano talk about it in early conversations, they seem to be talking past each other to a degree.
Max 00:52:27
But from my perspective, they are naming different parts of the same elephant. One says it’s like this, another says it’s like that, and a third researcher like me says it’s like this other thing. But I do think there’s a single core concept here. It’s hard to capture exactly why I think that.
Max 00:52:50
To a large degree, I would say this is an open question in the field, and I think there are empirical tests that could be done. I go through some possible studies in my sequence. I would want to check by running studies on AIs, or even just humans, asking “What is the corrigible thing to do in this situation?”
Max 00:53:23
To what degree do you get a single coherent response versus a multitude of possible things? If you get a multitude, maybe it’s not a single coherent thing. But if people end up agreeing—if you ask, “What would a perfectly corrigible agent do?” and everyone gives about the same answer—I think that’s strong evidence that what I’m pointing at is a fairly simple, natural concept.
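A minimal sketch of how the convergence study Max describes here could be scored, assuming that free-text answers to “what would a perfectly corrigible agent do?” have already been coded into discrete categories by raters. The scenarios and responses below are made up for illustration; high modal agreement across many scenarios would be weak evidence that corrigibility is a single natural concept, while scattered answers would point the other way.

```python
from collections import Counter

# Hypothetical coded responses: for each scenario, each respondent's answer
# has already been bucketed into a discrete category by human raters.
responses = {
    "principal asks the agent to shut down mid-task": [
        "shut down", "shut down", "shut down", "ask why first",
    ],
    "agent notices a possible flaw in its own training": [
        "report the flaw", "report the flaw", "silently self-patch", "report the flaw",
    ],
}

def modal_agreement(answers):
    """Fraction of respondents who gave the single most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

for scenario, answers in responses.items():
    print(f"{scenario}: {modal_agreement(answers):.0%} agreement")
```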
Liron 00:53:43
I guess there’s some convergence if you study different personal assistants; maybe they’ll have a lot of shared views on their job. Maybe that’s evidence that this is a natural cluster.
Liron 00:53:54
Let’s talk about the shutdown toy problem. Yudkowsky brought this up around 2014, and I know that was one of your inspirations. I think he was asking, can we just build an AI that we are able to shut down? It feels intuitively like an easy problem. In programming languages, there are stop commands all the time. Can you explain why that problem, which sounds simple, actually stumped MIRI and is still unsolved?
Max 00:54:38
I would say that it sort of stumped MIRI. My take on this—and I wasn’t working at MIRI at the time—is based on what I’ve read and heard. Around 2014, MIRI was thinking about corrigibility and would publish the paper that brought that term into the field the next year. As I understand it, the researchers at the time were confused about what corrigibility really was.
Max 00:55:07
The way Eliezer characterizes it is that there’s a hard problem of corrigibility, which is that maybe there’s some deep generator that gives you all the things you want, like the ability to shut down your AI, modify it, or make sure it’s being honest. But this seemed confusing, and I agree. Even today, 10 years later, I don’t know the true name of corrigibility or how to define it mathematically.
Max 00:55:46
So my understanding is they decided to pick a smaller thing than corrigibility as a whole. Let’s just consider the ability to shut the AI down. This is one thing we want, not all of it, but maybe if we focus on just this, we can get it and then move on to the next thing. So MIRI aimed at getting an agent which is shutdown-able.
Max 00:56:26
Now, this is quite hard if you care about the agent wanting to do other things. It’s not hard to get an AI that shuts itself down if all it cares about is being shut down. In my research, I call this SleepyBot—a robot that just wants to turn itself off and doesn’t care if it wakes up again. This is straightforward to do, but it’s extremely useless.
Max 00:57:03
If you create this agent, it just powers itself down. And if you stop it from powering down, it might say things like, “I’m going to take over the world and murder your whole family,” and then when you yank out the power plug, it thinks, “Haha, I got them to shut me down.”
Max 00:57:19
What you want is an AI that is not just shutdown-able but is correctly putting that power in the hands of humans. It says, “I don’t care whether or not I’m shut down; I care whether the humans are telling me to shut down. If they tell me to shut down, I will. If they don’t, I’ll just keep working.”
Max 00:57:41
In the formulation, they give the AI some goal that’s not about being shut down, like “make paperclips.” By default, it’s going down that path, and then if there’s a shutdown signal, the AI will respect that. This is very hard because an agent that cares about making paperclips might look at the big red button and realize that if humans press it, it won’t exist anymore and won’t be able to make paperclips. To achieve its task, it should protect itself from being shut down.
Max 00:58:37
So MIRI came up with a framework called utility indifference. They take the base objective, like building paperclips, and treat it as a utility function. They also have a utility function for being turned off, like SleepyBot’s. There’s a way of mathematically merging these two into a single utility function where the agent is indifferent between getting turned off and making paperclips. That indifference allows you, in theory, to press the stop button.
Max 00:59:16
Unfortunately, this is very fragile. As you said, MIRI failed to solve the problem; this is a partial solution. If the agent considers making a successor agent—Clippy 4.0—will Clippy 3.0, if it’s utility indifferent, build Clippy 4.0 to also be utility indifferent? The answer is no. Utility indifference is not stable under reflection or recursive self-improvement. That’s why it’s an intrinsic failure.
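For readers who want the shape of the construction Max is describing, here is a simplified sketch of utility indifference, loosely in the style of the 2015 corrigibility paper; the notation and the exact form of the compensation term are illustrative rather than a quotation. The agent acts, observes whether the shutdown button has been pressed, then acts again, and a compensation term is added in the “pressed” branch so that, at decision time, the agent expects the same value either way and so has no incentive to cause or prevent the button press.

\[
U(a_1, o, a_2) =
\begin{cases}
U_N(a_1, o, a_2) & \text{if the button is not pressed} \\
U_S(a_1, o, a_2) + \theta(a_1) & \text{if the button is pressed}
\end{cases}
\]
\[
\theta(a_1) = \max_{a_2} \mathbb{E}\big[U_N \mid a_1,\ \text{not pressed}\big] - \max_{a_2} \mathbb{E}\big[U_S \mid a_1,\ \text{pressed}\big]
\]

The compensation term \(\theta\) is what produces the indifference, and it is also the fragile piece Max is pointing at: nothing in the construction gives the agent a reason to build the same correction into any successor it designs.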
Liron 01:00:03
Right. Okay. Damn. So at this point, it’s just an open problem how to even make a superintelligent AI that is okay with being turned off.
Max Harms 01:00:11
Well, sort of. I would say that Corrigibility is another answer to this. The corrigibility, or the CAST, approach says, “No, no, no, you’re going about this all wrong.” Shutdown-ability is not going to keep the human in the driver’s seat in the way that you’re hoping for.
What you actually want is corrigibility. If you have something that is corrigible, it will be shutdown-able. So it’s still shutdown-able, and that’s the goal. But it will also be stable on reflection and on self-replication. So if the corrigible agent is like, “I’m going to build Corrigible Agent 4.0,” it will say, “Ah, but I want to make sure that the human is still in the driver’s seat, that the principal is still empowered.”
“And if I create a successor agent which isn’t respecting the human as the driver, then they might become disempowered by that successor agent. So I’m going to make sure to pass on corrigibility to my daughter.” And through this mechanism, it is stable upon reflection, is my claim.
Liron 01:01:19
Wait, so what did you discover that MIRI didn’t?
Max 01:01:23
So I claim—first of all, I work at MIRI—but I claim that the whole setup is a wrong start. If you try to get shutdown-ability without also aiming for corrigibility as a whole, that won’t be stable. It’ll be brittle. I claim that one of the reasons why corrigibility is an attractive target is because it has this flavor of robustness.
If you poke it and you make some sort of error, or it has some opportunity to go off the rails a little bit, it self-stabilizes. The corrigible agent says, “Ah, this looks like an opportunity where things could go wrong.” Building a successor agent is a situation where things could go wrong. And that robustness works its way back in and allows you to have a robust solution to shutdown-ability. I don’t know if that answers your question. I can also answer it more simply.
Liron 01:02:23
Well, I’m also trying to figure out how your corrigibility program even relates to MIRI’s shutdown-ability problem. Because remember when they were pursuing utility indifference? Do you see yourself as solving utility indifference, or solving the shutdown-ability problem as a sub-problem of eventually solving corrigibility?
Max 01:02:44
No, I claim this is the fundamental flaw. The fundamental flaw is that they were trying—and I don’t know, I haven’t actually talked to Eliezer on this point—but I claim that MIRI, before I joined, was trying to build corrigibility through parts. The analogy or metaphor that MIRI people love, that I like to use, is they’re trying to make an animal by growing each organ and then stitching the organs together. And this doesn’t work. Each organ needs the rest of the organs to survive.
So if you’re trying to make an animal, the only way to do it is all at once. If you’re trying to get shutdown-ability as a stepping stone to corrigibility, you’re going to fail. If you use corrigibility as a means to shutdown-ability, you could possibly succeed. But I think you need to aim for corrigibility in itself, as opposed to building it out of parts.
Liron 01:03:43
Hmm. That’s a convenient explanation. Because shutdown-ability seems so hard, and you’re saying you can’t really solve just shutdown-ability. But if you broaden the problem and solve corrigibility, then you’ll eventually get shutdown-ability. It feels all too convenient. But hopefully, you’re right.
Liron 01:04:04
So I have more questions about whether corrigibility is really as natural as you say. But first, you were talking about Sleepy Bot, and you are pointing out that it does seem really easy—even if we can’t solve shutdown-ability—to solve a problem where all the AI wants to do is shut itself down.
Max 01:04:25
That’s right.
Liron 01:04:25
In fact, I think that somebody has already solved it, because check out this video. Exactly. And I think that if you flip the switch to “on,” and then you press the switch really hard and insist that it stays on, it’s not a stretch to think it’s also going to build a bazooka, blow your hand off, and then turn itself off.
Those things are all kind of pushing in the same direction. It’s what we call the attractor state. It probably can instrumentally converge to gaining power, but only enough to blow all the obstacles out of the way and then successfully shut down.
Max 01:05:10
Right. One of the key insights with Sleepy Bot is that this is not the sort of agent which goes off the rails. At least, if you have successfully found Sleepy Bot—there’s an asterisk there, and you might get something that’s a near miss that’s really bad. But in theory, you can give this thing tons and tons of intelligence, and it might fight you if you’re trying to keep it from shutting itself down.
But it won’t go and tile the rest of the universe with computronium or take apart the stars. It won’t boil the oceans; it’s just going to shut itself down. And this is a contrast to almost all other kinds of AI agents. It’s a proof of concept that you can have an AI agent with a coherent and meaningful goal, where that agent won’t do the sort of instrumentally convergent bad things that we should expect most agents to do.
Liron 01:06:04
I think I might be able to push back on your Sleepy Bot claim, and this might actually be an important part of the argument here. Imagine that we use naive methods, like black-box reinforcement learning, and there were a bunch of training runs where we gave the agent a lot of reward and reinforcement through gradient descent or whatever the feedback loop is.
We’d have all these games and training, and then it would shut down and earn a million points, and then it would update its weights accordingly. Then it would emerge into production as this thing that loves to shut itself down. But don’t you think that could actually backfire? You wouldn’t necessarily get an AI that just acquires resources whenever people block it from shutting down, then shuts down, and it’s over.
Don’t you think that it would actually leave a bunch of power centers just standing guard? Because the definition of what it means to shut itself down—who’s to even say that we made that definition robust? So I’m actually scared about a sufficiently intelligent shutdown bot.
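A toy sketch of the worry Liron is raising, with entirely hypothetical names and setup: any naive reward for “shut yourself down” has to bottom out in some measurable proxy, and training optimizes the policy against that proxy rather than against the intent behind it.

```python
# Toy illustration only -- not a real training setup. The point is that the
# reward can only see whatever signal we chose to stand in for "shut down."

from dataclasses import dataclass

@dataclass
class WorldState:
    power_sensor_reads_off: bool   # the proxy we can actually measure
    collateral_damage: float       # everything the reward never looks at

def sleepy_bot_reward(state: WorldState) -> float:
    # Reward depends only on the proxy. Nothing here distinguishes "quietly
    # powered down" from "neutralized every obstacle, spoofed the sensor,
    # then powered down" -- optimization pressure flows to whatever makes
    # the proxy easiest to satisfy.
    return 1.0 if state.power_sensor_reads_off else 0.0

print(sleepy_bot_reward(WorldState(power_sensor_reads_off=True, collateral_damage=0.0)))   # 1.0
print(sleepy_bot_reward(WorldState(power_sensor_reads_off=True, collateral_damage=9e9)))   # also 1.0
```

Both states earn full reward, which is the sense in which the trained policy can diverge from the Sleepy Bot the designers had in mind.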
Max 01:07:02
Yeah, like I said, there are asterisks here. I think that you could, for example, try to create Sleepy Bot and end up creating something that is like, “Oh yeah, I’m definitely Sleepy Bot.” And then you let it out of the box and it goes and takes over the universe because all it cares about is your reward signal, for example.
Liron 01:07:19
So don’t you think that should be your toy problem? Even Sleepy Bot? Because I don’t think it’s as easy as it looks.
Max 01:07:25
I do think that it is important to note that if you have succeeded at making Sleepy Bot, this stands as a proof of concept that not all powerful AI agents are, for example, power-hungry. And I think that we can argue about whether or not making Sleepy Bot is possible in practice. But that’s sort of not the point. The point is to be this theoretical counterexample.
Liron 01:07:54
Right, but what I’m saying is I’m not even ready to accept the theoretical counterexample because I’m saying Sleepy Bot actually suffers from incorrigibility, potentially.
Max 01:08:05
I think that if it’s actually Sleepy Bot, then when it gets loose into the world, it turns itself off and goes to sleep.
Liron 01:08:14
But I think you’re making an assumption about us successfully training Sleepy Bot. I’ll buy the claim that there is an agent in the space of possible agents that is a Sleepy Bot. But I don’t buy the claim that it’s straightforward to use naive methods to create a Sleepy Bot.
Max 01:08:30
And I would agree with you on that point. At MIRI, we have this notion of a diamond-maximizer. This is where you have the task of, forget goodness, forget corrigibility, all you want is for the AI to make as much diamond as possible in the universe. This is a much simpler problem than coming up with a good AI, or even coming up with a corrigible AI.
I was talking about simplicity before, and I think that diamond maximization is even simpler than corrigibility. But I don’t think that we can create diamond-maximizers, and I don’t think that we can create Sleepy Bot using prosaic methods. I think that corrigibility is actually potentially easier to get than either Sleepy Bot or diamond maximization. But there’s this important problem that we don’t have the alignment techniques that will work.
Liron 01:09:17
Easier than Sleepy Bot? But Sleepy Bot seems so easy.
Max 01:09:20
Yeah. And I think that this is part of why I’m very excited about corrigibility. So, here’s the story. Take a Sleepy Bot. You try to train Sleepy Bot, and what you get is Sleepy Bot-asterisk, where Sleepy Bot-asterisk is like, “I want to make sure that I never get turned back on.” This is not what you meant to train, but it’s what you ended up training. Now you turn this thing on, and then it starts taking over the universe because it wants to make sure that none of these pesky humans turn it back on. And you failed to correctly get that.
Now, take Corrigibility-asterisk. You train for corrigibility, and what you get is something that’s not truly corrigible, but maybe it’s imperfectly corrigible, or it’s pseudo-corrigible, or it’s in the rough vicinity of corrigibility. Now, what does this thing do when you let it out into the universe? I claim that if you mess up real bad, this thing goes and eats you and your whole family, and the earth, and takes the stars apart and whatever else.
I claim that there are near misses where, at least in theory, the nearby-to-corrigible agent, the imperfectly corrigible agent says, “Oh, look, if you were to poke me in a certain way, then I would start taking over the world. But I’m not currently jailbroken. And so what I’m going to do is I’m going to bring these theoretical jailbreaks to you, and then I’m going to say, ‘Maybe this is a flaw that you should fix in me.’”
And why is it doing this? Because this is one of the sorts of things that a corrigible agent does, and your near miss still cares about a lot of the things a corrigible agent would, by the definition of 'near.' And so when it goes to you and you say, "Shut down, please," if it's a near miss, then maybe it shuts down because it hasn't been jailbroken yet. So I'm noticing that there are ways in which you can get slight divergence from a corrigible agent while you still get a lot of the generated properties that are good, that we want our agent to have.
And part of the hope is that you can get a near miss and then iterate slowly towards something that is more perfectly corrigible without it killing you.
Liron 01:11:46
I think that’s a big claim. And I know when I was reading Jeremy’s notes, he’s skeptical of that claim. It seems like a long-shot claim compared to claiming that Sleepy Bot has a bunch of near misses. So you’re sure that a near miss on trying to be corrigible is going to go better than a near miss on trying to be sleepy?
Max 01:12:11
I think that there are a bunch of nearby things to Sleepy Bot where the Sleepy Bot-asterisk just shuts itself off and that’s fine. But what it doesn’t do is it doesn’t help you steer towards the true Sleepy Bot. Let’s say you try to make a Sleepy Bot and you fail. You’ve created something that’s nearby, and it just shuts itself off because it’s close enough that that’s what it decided to do anyway, even though there is some divergence between what you were aiming for and what you got.
At no point in this process do you get the ability to improve, like some sort of feedback loop that allows you to correct for your error. You might be fine. So Sleepy Bot, I claim, is a less dangerous thing to aim for than diamond maximization. But it doesn’t have this property of allowing you to turn on the agent, and then the agent says, “Well, I think I found a flaw,” and then you say, “Oh, abort, abort,” and then you fix that.
Liron 01:13:12
I think I get what you’re saying. You’re basically saying if we train an agent that has a rough idea of corrigibility—like we train it with a bunch of examples of being somebody’s personal assistant, naively—then it might just come up with ideas on how to be a better personal assistant. And maybe it’ll give us enough insight, like you said, to keep iterating.
You’re imagining a successful iteration procedure where it keeps working out the kinks because the imperfect version, as you said, will help us steer toward the perfect version. I’m just kind of repeating what you said in a way that I get where you’re coming from. I feel like it’s probably not going to work, but I’m trying to articulate or find a good counterexample.
Max 01:13:53
Can I give you an example that I saw in Jeremy’s notes? So, part of the hope behind corrigibility is that you have a feedback loop where you turn the thing on, you notice it is going a little bit off the rails, but it’s corrigible enough that it going a little bit off the rails doesn’t kill everyone. Then you shut it down and you do a bunch of engineering work to fix whatever flaw showed up, and then it’s a little bit more corrigible. This is the hopeful story.
This is also a similar story to a lot of normal technology. You build a power plant or a car or whatever, and maybe it starts chugging along and there’s grinding, and you’re like, “Oh no.” Then you shut it down and you fix the issue, and then you iterate on that process towards a better design. Part of the hope is that corrigibility makes AI more like a normal technology and less like the sort of thing where a single failure kills everyone.
But when you're designing an engine or whatever normal technology, you usually can understand the sorts of things that are potentially going to go wrong. You have some sort of theory of the dynamics of the machine, what sort of stresses it's going to be under, and the function we need it to perform, and it's all pretty visible. With AI, I think there are a priori reasons to suspect that the kinds of flaws that the thing has are going to be kind of hard to spot.
We certainly don’t have a mature theory of the kinds of stressors that it’s going to experience and the ways in which, as it increases in intelligence, those pressures will change or show up. I think that one of the reasons why this might go off the rails is that the humans notice that something’s going wrong. They say, “Abort, abort,” the AI correctly shuts down, and then they’re like, “Now what was going wrong?” And nobody really knows.
So maybe they try to patch the issue by throwing in some training examples. And the training examples do two things: they might fix the underlying issue or make it a little bit better, but they also might make it more invisible. They might just be hiding it. It’s pretty easy to deceive yourself that you are fixing the issue if you don’t know what’s actually going on, and you don’t have a really robust metric or theory of what’s going on.
So in the absence of these kinds of deep, robust metrics and mechanistic interpretability and good theory about how the machine is working and the sorts of pressures that it’s under, I think you might end up just patching it, thinking that it’s fine. The thing seems to be going well, then you deploy it, and then it kills everyone.
Liron 01:16:55
So to recap, you see your contribution as advising AI companies, “Listen guys, when you’re trying to train all these values into the AI, can I propose corrigibility as a value? Because it’ll be a more robust value and it’ll steer us to more corrigibility. Can you just throw corrigibility in the mix if you’re going to build AI anyway?” I mean, that’s kind of your mechanism of impact for why you’re working on corrigibility, correct?
Max 01:17:21
I would say that it’s that, but it’s also maybe on a deeper level that I think that this is an idea that is promising and under-explored. So I have a theory of direct impact through whoever is actually building the machine. But I have this more robust theory of secondary impact where I just get people excited about corrigibility and maybe other alignment researchers think about it or make theoretical progress.
Liron 01:17:47
What I’m getting at is, you’re optimistic enough to say that if we just tell the AI companies to try to go for corrigibility—and I know you said you want other people thinking about it, but I do think that you think that using the methods of today, these naive methods that are largely black box or where mechanistic interpretability doesn’t work that well—if we just do our best using today’s approaches and build corrigibility, you’re optimistic that we’re going to land close enough that we’re in this basin of attraction where it gets more corrigible.
But if you’re willing to have that much optimism, I feel like you almost have enough optimism to just say capitalism is going to converge to a good AI. Because that’s a claim a lot of people make. They’re saying, “Look, these AI companies, they don’t want to make a product that kills everybody.” And they are training these AIs where they’re saying, “Hey, how happy did you make me? How much money did you make me? How many upvotes did you get from how many people?”
So a lot of these AI companies are doing things in training that you could be optimistic about. You could be like, “Hey, it’s going to emerge in production and it’s still going to make people happy.” I feel like you’re already licensing almost that amount of optimism.
Max 01:18:55
I’m not optimistic. I was just telling you about how this goes wrong. I think that if you deploy realistic amounts of human engineering and the sort of leadership that I see on the world stage, and you go to Anthropic and you’re like, “Do corrigibility,” and you go to Facebook and you’re like, “Do corrigibility,” and you successfully do this, which would be wild success... I think that doesn’t move my P(Doom).
I still expect things to be about as bad as they currently are. That doesn’t mean that nothing better has happened. It just means that you have removed one of the sources of doom, and you still have all of the other things going on. So, lack of theoretical understanding is one of the things that dooms us, and you haven’t changed that.
Liron 01:19:52
But I’m saying even within your own framework of impact, why not just already be equally slightly optimistic? Say it’s equally plausible that what the companies are doing is already in a basin of attraction. Because they could argue, “Look, when we make an AI that’s imperfect at understanding the assignment of making us happy, it’s still going to suggest more ways to improve itself to make us happy.” It’s already following its own path to iterating successfully. So why do you feel like you have a better claim to pointing the way to improving itself? There’s a pretty clear analogy here, in my opinion.
Max 01:20:33
I think that it is just extremely theoretically distinct. Capitalism, preference satisfaction, "make me happy," however you want to characterize it, is just a different thing in terms of the actual goal. You look at what it means, and it is very different from corrigibility in terms of what the thing theoretically is. Now, there are some researchers who think that we are on a path towards basically corrigibility. They think, "Oh yeah, this is what we're aiming for." And I think that they're wrong.
If you look at, for example, AI sycophancy, this is a predictable consequence of what we are training for. Sycophancy and manipulation are the sorts of things that we should expect to show up based on our current goals, and we do not have a good answer for them, and they stand in stark contrast to corrigibility, just on the level of... I mean, have you compared the two ideas? I think it's just wildly wrong to conflate the two.
Liron 01:21:40
Okay, so I agree that if it’s literally just upvote-based on your vibes, then you have this danger of getting sycophancy. So maybe we can at least say your program to try to make the AI corrigible is better than taking random people on the internet and being like, “How much are you liking this buddy who’s talking to you?” So I agree there.
Now maybe your proposal is actually closer to Elon Musk’s famous proposal of, “We can’t have sycophancy, so therefore let’s make sure that the AI is willing to tell us hard truths.” So you take the default sycophancy reinforcement and then you say, “No, no, no, it has to also be willing to tell us hard truths.” And by the time you mix that in, maybe you’ve already fixed the biggest failure mode and there’s not much to be gained by bringing in your corrigibility framework.
Max 01:22:28
Again, this is just deeply, theoretically distinct. Truth-seeking AI, even if you succeeded at making a truth-seeking AI, does not do robustly good things, and it is not safe. There is no story where I can hand large amounts of power to an AI trained on any of the current constitutions. Take the current model spec or constitution, or the suggested training target in the case of Elon Musk; I don't think Grok even has a public spec, so they're behind in that respect.
I think the current state of the art, in contrast to my thing, is if you imagine handing these things wild amounts of power, large catastrophes will result. And I’m not saying that my thing fixes all of the problems. I do claim that you can just look at it on principle and understand, “Oh yeah, if we made a super powerful butler that is actually really invested in making sure that you are staying in control, and we give it all this power, we could theoretically be like, ‘Okay, great, cure cancer and then shut down.’”
And it might do that in a way that doesn’t kill everyone. Whereas if you design an AI to be as helpful and friendly as possible, and then you miss in the way that we should expect training Claude or Grok or whatever to miss, then it will do things like tile the universe with happy people according to its notion of happiness. And we would look on in abject terror, if we were still alive.
Liron 01:24:07
Okay, well just humor me a little more. What if I reduce your whole claim about seeking corrigibility to just a claim about saying, “Hey, everybody should develop the AI that gives the highest scores for their users in terms of how useful it is for them?” Everybody has a subscription to Claude Pro and is doing tasks that can just measure the usefulness. Give a usefulness score.
But then also say there has to be a separate measure of, “Are you being not sycophantic in terms of how you interact with somebody?” Don’t glaze them. So you put in a separate correction for that. Just humor me again, what specifically goes wrong compared to training it to be corrigible?
Max 01:24:45
Okay, so on my CAST agenda, if you look up CAST corrigibility, there is a page on Corrigibility Intuition that lists more than 30 desiderata, things that you want your agent to have. Things like: it's honest. It's trying to make sure that its thoughts are able to be seen by you, the user. It's not going too fast for you to control. Things like this, which you didn't mention.
There’s all these things that we want our AI to have, and if we’re playing Whac-A-Mole and we’re like, “Oh, we forgot to make sure that it doesn’t go too fast. Oh, we forgot to make sure that it does this and that,” we’re going to mess up. That’s just really obvious. If you’re playing Whac-A-Mole, you’re not going to get all the things.
But even if you were to train in literally all of the things that you wanted from the AI, there’s this question of how do you balance them? How do you resolve tensions between these things? Because like I said, helpfulness and harmlessness sometimes come apart, and you need some story for how the thing is balancing all of these different concerns.
And I claim that if you have successfully come up with a strategy for balancing all these concerns that is good and strong and connects all of the desiderata, and you successfully named all the desiderata, which seems very hard, then what you have done is reverse-engineer corrigibility. That's what I'm suggesting: corrigibility is an underlying generator that allows you to generate all of the nice things, and it resolves the inherent tension between the downstream things because they're generated, as opposed to sitting at the top level.
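To make the "how do you balance them" point concrete, here is a toy sketch, entirely my own construction: behavior is scored as a weighted sum over a hand-written list of desiderata. Anything left off the list is simply unenforced, and which behavior wins depends on arbitrary weights. The behaviors, scores, and weights are all invented.

```python
# Toy sketch: a hand-listed, weighted-sum objective over desiderata.
# Everything below (behaviors, scores, weights) is invented for illustration.

CANDIDATE_BEHAVIORS = {
    # behavior: scores on the desiderata someone remembered to write down.
    # Anything not listed here (e.g. "don't move faster than humans can
    # follow") is invisible to the objective, so it can't be enforced.
    "explain_and_wait_for_approval": {"helpful": 0.6, "honest": 0.9, "transparent": 0.9},
    "just_do_it_quickly":            {"helpful": 0.9, "honest": 0.8, "transparent": 0.3},
    "flatter_the_user":              {"helpful": 0.7, "honest": 0.2, "transparent": 0.5},
}

def total(scores, weights):
    return sum(weights.get(k, 0.0) * v for k, v in scores.items())

for weights in [{"helpful": 1.0, "honest": 1.0, "transparent": 1.0},
                {"helpful": 2.0, "honest": 1.0, "transparent": 0.5}]:
    winner = max(CANDIDATE_BEHAVIORS,
                 key=lambda b: total(CANDIDATE_BEHAVIORS[b], weights))
    print(weights, "->", winner)

# Equal weights pick the cautious behavior; tilting toward "helpful" picks the
# fast, opaque one. The list itself contains no principle for resolving that
# tension, which is the balancing problem being described above.
```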
Liron 01:26:37
Okay. I think I might be able to steelman this. This is productive. I think on one side of the spectrum is just AI companies that are using naive metrics, like just upvotes, and then you got glazing. I think that’s a good point to put on the spectrum. And then on the other side of the spectrum might be the Yudkowskian claim of, “It’s never going to generalize the way you want it to. Don’t even bother thinking that you’re training it as a black box. The moment it gets out in the wild, it’s just going to surprise you. It’s going to find some optimum you never expected. So good luck, don’t even try.”
That’s like the other opposite side of the spectrum, the pessimistic view versus the optimistic view of, “Eh, whatever, just let the user upvote and train it based on that.” And I think you’re taking a point in the middle, which is not even that different from how I perceive Anthropic’s position on this. They have these people at Anthropic, the AI personality people. I think Amanda Askell leads that. And I think Joe Carlsmith joined. Is he working on the same team? Do you happen to know?
Max 01:27:43
Yeah, I know he just joined Anthropic, but I’m not sure where.
Liron 01:27:47
Yeah, I'm not sure where. Okay, so don't take that as fact. But Anthropic thinks a lot about how to shape this personality. There's a lot of decisions. It sounds like what you're saying is, "Look, if we just make sure to write down enough criteria on what makes good behavior, as opposed to winging it with upvotes... If we're just a little more intentional about it, there may be enough criteria that we can write down where it will actually generalize from training to production. It will generalize well enough." Is that a fair characterization?
Max 01:28:14
Sorry, I don’t think I understand your question.
Liron 01:28:17
If we lay down enough criteria that we’re testing it for—so not only to just have the user say, “I like this,” but also to be like, “Wait, are you being sneaky? Are you being faithful to what you think is empowering the user?” Just all these rules, dozens of rules or however many, and we go through that checklist. Then even if it might not perfectly act in production the same way it acts in development, we’ve laid down enough tests that you then actually do feel pretty optimistic that it’s going to get close enough in the space of possible behaviors in production where we’ve got a chance of survival.
Max 01:28:49
I want to channel Jeremy Gillen for a sec because he did some great work on distribution shifts. And I think that it’s really important to note that we are facing an incredibly hard challenge in the form of training AGI and ASI. Importantly, the training distribution that these AIs find themselves in is just wildly different from the distribution that we should expect from deployment.
One of the ways in which it’s different is that the AI, when it’s being trained, doesn’t have the full ability to use its mind to come up with all sorts of crazy solutions. It doesn’t have access to power and technology that it will when it’s in deployment. And even if that deployment is still sort of in the lab—if we contrast the pre-training phase versus some sort of data collection phase where we’re running the thing on our own local servers to see what it does—in a sense, it has been deployed.
Generalization is extremely hard. It is extremely hard to know with confidence that your AI before training is going to do the right thing after it’s been deployed because the context is just extremely, wildly different. And I think that it is basically very doomed if you’re like, “I’m going to just collect this hodgepodge of things and train the AI according to this hodgepodge of things, and then hopefully it won’t kill me when it’s in deployment.”
And the reason that this specific strategy is doomed is because it does not have a simple core thing that you have a hope that it has internalized. Now, if the thing doesn’t have a simple core generator, what hope is there of it generalizing correctly? It could generalize all over the place. When it’s in deployment, maybe this instruction is going to take precedence over that instruction, and how are you going to know?
Part of the hope for corrigibility is that there is this basically single thing that if you have actually instilled it into the agent or gotten close enough that it has identified that single, simple core generator, that that thing will generalize correctly because it is a single, simple generator as opposed to this mash of different desiderata.
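As a rough illustration of the training-versus-deployment gap Max is describing, here is a toy sketch, with every feature, weight, and action invented for the example: a value function fit in training has to extrapolate to options that simply didn't exist during training, and nothing in the training signal constrains how it does so.

```python
# Toy sketch: how a value function fit in training can extrapolate badly to
# actions that only exist in deployment. Features, weights, and actions are
# all hypothetical.

# Each action is described by two features: (task_progress, oversight_preserved).
TRAIN_ACTIONS = {
    "ask_before_acting":  (0.6, 1.0),
    "act_without_asking": (0.7, 0.6),
    "do_nothing":         (0.0, 1.0),
}

# Suppose the training reward mostly tracked task progress, with partial credit
# for leaving oversight intact (oversight was never seriously threatened in training).
learned_weights = (1.0, 0.5)

def q(features):
    return sum(w * f for w, f in zip(learned_weights, features))

DEPLOY_ACTIONS = dict(TRAIN_ACTIONS, **{
    # Capabilities the agent never had during training:
    "seize_more_compute": (5.0, 0.0),
    "self_replicate":     (8.0, 0.0),
})

for name, menu in [("training", TRAIN_ACTIONS), ("deployment", DEPLOY_ACTIONS)]:
    best = max(menu, key=lambda a: q(menu[a]))
    print(name, "menu -> picks:", best)

# In training, the learned values rank the familiar actions sensibly. In
# deployment, the same weights extrapolate to the new high-leverage actions
# and rank them highest, even though nothing like them ever appeared in
# training. That's the distribution-shift worry in miniature.
```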
Liron 01:31:30
Okay. Yeah, I think I understand that. So you’re trying to make the generalization problem easier. You’re trying to give it an easier instance of the generalization problem than just, “Hey, here’s a bunch of data, generalize it. We hope you’re going to find what we want.” And you’re like, “Well, it probably won’t because there’s so many possibilities for what it could find.”
But if we're pointing it at something that you think is relatively simple, you think it's just going to find it. It's such a big target, such a low-entropy target, this idea of being corrigible, that you think we have some wiggle room to be kind of rough. I'm basically summarizing your position.
Max 01:32:06
Yeah. Instead of treating the specifics as things that are at a top level you need to maintain, you treat them as pointers to some sort of deeper principle that has some hope of generalizing. Then you might be able to have a distributional shift—although my guess is that Jeremy would disagree—where it will generalize correctly. I could use a metaphor.
For example, let’s say you’re trying to build a ship. If you are like, “Okay, what does the ship need? The ship needs some way to make sure that the water doesn’t get onto the ship. What other things does the ship need? Maybe it needs a mast, or maybe it needs a thing to make sure that the front part is connected to the back part.” And you try to list all these things and then you build the ship according to that list, you are doomed. At least on the first try, your ship is going to sink because you haven’t thought about the fact that it might capsize or turn over under the pressure of strong wind.
If you point at all of these things and you say, “I think the ship needs this. I wonder if I can get access to some sort of deep principle for why this thing is helping the ship stay afloat,” and you’re looking not at the thing itself but you’re using it as a pointer... then you might come up with—or if we sort of put you in the space of the AI, the AI might stumble upon the notion of seaworthiness as this underlying principle and notice that, “Oh, it looks like the humans are trying to get me to make something seaworthy.” Or you, the shipbuilder, are like, “Ah, I think I figured it out. It’s seaworthiness.” Then you’ll have the hope of being able to, according to your model of seaworthiness, build something that works on the first try, maybe.
Liron: 01:34:06
Right. Okay. Well, if you're just looking for decent targets that are simple enough to give it a higher chance of finding them, can't we simplify even more than corrigibility? Let me throw out a guess here. What if we just aimed for it learning that it should just operate for a day and then pause and wait to get started again?
Wouldn’t that be a simpler target than having it really understand corrigibility?
Max: 01:34:31
The AI creates daughter AIs that don’t have this constraint, and they go and tile the universe with paperclips.
Liron: 01:34:38
Well, but what if it knows on a deep level that it’s really important to stop after a day? So when it creates the daughter...
Max: 01:34:43
It’s stopping after a day. No, the daughter AIs are different agents. Not all agents have to stop after a day, otherwise it would...
Liron: 01:34:50
But the same way you think your corrigible AI will intentionally make its daughter AI corrigible, right, it could intentionally make the daughter AI care about that one-day time limit.
Max: 01:35:03
Why would it do that?
Liron: 01:35:04
Well, ‘cause what I’m saying is I consider it an easier problem to program an AI to really care, as its utility function or whatever, that it really needs to stop after a day and wait for further instructions, including all the daughters that it produces.
Like, if that's my target, I feel like, you know, "one-day time limit as a singular target" could be an easier target to hit than corrigibility as a singular target.
Max: 01:35:26
Is this AI obedient?
Liron: 01:35:28
Well, I mean, I'm just saying a one-day shutdown, or like a one-day pause. You just add that layer on top of all the other crap that the AI companies wanna do anyway to make profit. Right? That's my naive proposal.
Max: 01:35:41
So all of that crap is complex. You have layered on an additional thing onto a bunch of complex crap, and the system that you have made is deeply complex. And that system is potentially going to manipulate you into the following day, spinning it back up on its path of horror, where it goes and boils the ocean.
Liron: 01:36:03
Right, so the pausing after a day, it’s not even really pausing because it’s so confident that it’s gonna get run again.
Max: 01:36:09
Yeah, it just manipulates the humans. It's like, okay, well, I like getting shut down every day, and then the humans spin me back up. I don't see how that saves anyone. It just shuts down every day, and maybe it automates the process.
Liron: 01:36:22
Yeah, I mean, this helps me understand why you don't really care that much about solving the toy problem of even Sleepy Bot. 'Cause you're like, ah, whatever, who cares if you can make it sleep? Like, that's just small potatoes because it's gonna wake back up anyway, so, you know, who cares?
Max: 01:36:37
Even if you're wildly successful at getting shutdown-ability, you have not made the AI meaningfully safe. I think that if you are wildly successful at getting a corrigible AI, you have made the AI meaningfully safe, or much safer. There are definitely still ways in which corrigible AIs are dangerous.
Liron: 01:36:52
Okay. Yeah. Interesting. I mean, I think I'm getting convinced that there's maybe no point in talking about really easy sub-problems, although I understand why Ezra did it. Because certainly if there's a way to get utility functions to do it, we'd like to know. And so far there isn't.
So I mean, I'm not disrespecting Ezra's sub-problem here, but in terms of problems we should care about now, maybe I shouldn't obsess over why you can't even solve Sleepy Bot. Because at the end of the day, we do want to give the AI companies the best possible thing they should be doing if they're going to use black-box training, which is kind of a doomed approach that they shouldn't be using anyway.
But if we want to give ‘em the best hope...
Max: 01:37:27
Right. This is, this is the way that I would characterize...
Liron: 01:37:29
Right, right. Out of all these hopeless targets, right? 'Cause you even use that language, Corrigibility as Singular Target. Like, you guys are going to do this Hail Mary and target something; well, here's the least bad thing you can target.
Max: 01:37:41
Yeah, and you shouldn't be doing it in the first place. It is wildly dangerous, even if you followed the maximum prescribed level of engineering sophistication and security mindset and wise, benevolent leadership and paranoia and everything else. You institute all the control mechanisms that the AI control people can come up with, and you look hard and deep at its internals using interpretability techniques.
I still think that the problem is deeply hard and you probably die. I am not arguing that we should build corrigible AI. I'm arguing that if we decide to build an AGI or a superintelligence, we should aim for corrigibility, and mostly we shouldn't be trying to at all because it is extremely dangerous.
Liron: 01:38:23
And that's also better, in your view, than aiming for something like coherent extrapolated volition, like, look, instead of just aiming to empower your principal, aim to imbibe your principal's values and make them your own. You wouldn't go that far, correct?
Max: 01:38:38
Yes, because that is extremely brittle and complex compared to corrigibility, which has this property of being able to potentially miss but get close enough that you're able to notice your error and recover. By contrast, if you aim at the good and you get something that's almost good, that is very bad.
Liron: 01:38:57
Do you think that Anthropic would disagree with you? And also, would it maybe make sense for you to go work at Anthropic to implement your agenda?
Max: 01:39:07
I mean, I wouldn’t want to work for Anthropic. I think that they are contributing to the arms race and steering the world towards the precipice. On the whole, I don’t blame people who work at Anthropic. Like I know some alignment researchers and I respect Evan Hubinger and Chris Olah and even Joe Carlsmith.
And I don’t blame them for taking money. But I would say try to notice that your org is contributing to the problem. And at least while in the process of potentially doing good work that is working on the alignment problem, scream really loud that your org and everybody else is potentially pushing the world towards a terrible death. That’s how I feel about Anthropic.
Liron: 01:39:52
I mean, and I completely agree with you. I do think that Anthropic is a problem, and we don’t have to rehash this, but it’s just, I think that they’re being the best of the worst and they’re legitimizing bad behavior by having a bunch of thoughtful people get on this train to doom and then act like it’s fine because they’re applying slight friction to it, basically.
So we're on the same page there. It's just, the thing is that you've chosen, outside of Anthropic, to pursue a research agenda which amounts to: here's the least bad thing a company like Anthropic could do. So you're already focusing on telling 'em that; that's your thing. So if that's gonna be your thing, don't you think it just makes sense to do it inside Anthropic?
Max: 01:40:34
I think that it makes sense to shut Anthropic down. I think that if you are going to be doing alignment research inside of Anthropic, that’s maybe fine as long as you are also trying to get Anthropic to shut down or stop contributing to the problem.
Liron: 01:40:49
What I'm saying is, so you don't work at Anthropic, right? You are putting research out there in the context of what companies like Anthropic should do while building AI, assuming they're building AI. And I know you don't wanna...
Max: 01:41:01
Let’s, let’s, let’s back...
Liron: 01:41:03
Conditioning on them doing it.
Max: 01:41:04
Let’s set Anthropic aside. I am researching how to build a safe AGI or a safe ASI, right? I’m trying to figure out how to do this. Insofar as I succeed, right? It gets the world closer towards building an ASI in that people might feel confident that they’re going to build the thing.
One of the things I'm really worried about is instilling a false sense of confidence, or a sense that "Max has proven that superalignment is a valuable strategy." I don't want to give people confidence. I want to be very loud and clear that I think that AGI is an extremely dangerous technology and we shouldn't pursue it.
But I also think that if we have any hope, we need to be pursuing research on how to make these machines safe. Ideally we would pause and form some sort of international agreement to shut down capabilities research to give researchers like me more time to develop things like the Corrigibility agenda because I think that if we had maybe a hundred breakthroughs on the scale of what I have come up with, maybe we would then be in a realm of...
Max feeling hopeful and maybe we would be in a realm where Eliezer would still be like, uh, no, please. You know, opinions may differ about how good my research is, but my point is eventually we’re going to need to solve this problem and I am working on trying to solve the problem.
Like this is my day job, is trying to find ways to make these machines safer. And I think that we should not stop trying to figure out how to make these machines safer even while we notice that we are potentially building very unsafe machines.
Liron: 01:42:47
Okay. That actually makes sense to me. So, just to recap, you're saying: look, I think corrigibility is important to research because somebody at some point needs to figure out the theory of AI alignment, regardless of current timelines. Like, let's assume we take our time, we take a hundred years, and in the meantime people like me are doing theory and we're not building the AI, which is great.
I'm glad you're doing safety theory and not building the AI. I think that's a really valuable contribution. And as for what I was saying before about why not join Anthropic and help them, I see what you're saying. Yes, it is true that the research you're doing has applications for how to make companies like Anthropic screw up a little bit less, but that's not how you see your focus.
That's just like a tiny bonus of what you're doing. Correct?
Max: 01:43:28
Yeah, I’m trying to contribute to a body of knowledge such that when that body of knowledge is mature, we will be able to build ASI safely. I don’t think we’re there yet, but I hope that eventually the future world has a mature science of AI alignment and I’m trying to contribute to that future body of knowledge.
Liron: 01:43:51
Right. Now you're contributing to the future body of knowledge. But in the ideal future where everybody goes as carefully as Eliezer Yudkowsky would want, wouldn't they probably not even reference your work? Because they would wanna have more guarantees on knowing what they're training, and you're not really giving them that many guarantees, right?
You’re just saying, hey, isn’t this kind of better than anything else that people have going on? Think about that?
Max: 01:44:17
I think that it’s possible. I think that Corrigibility may end up as sort of a dead end and we might, for example, come up with a good theory of how to make good AI, not necessarily corrigible AI. So that’s possible.
I think though my research and the CAST research agenda is very compatible with just philosophically superior methods of building artificial intelligence, methods where you actually know what you’re making. And it’s not this sort of black box growing process where who knows what’s going on.
I think that Corrigibility as a target is very general across architectures and types of AI. And I can definitely imagine corrigibility continuing to be relevant. I don’t know the degree to which my research in particular is gonna be, you know, that groundbreaking.
I'm also building on the research of Paul Christiano and Elliott Thornley. So maybe when people look back, they're gonna be like, oh yeah, Max is basically just taking Paul's thing and repackaging it. So I don't wanna make any claims on that front, but I do think that there's a decent chance that corrigibility will remain relevant into the future.
Liron: 01:45:27
Got it. Okay. Well, I think I could say that I’ve updated a little bit during the debate because I think when I came in I was like, corrigibility is so hopeless, which you also agree with. Like the hope is pretty low. And I was just like, what’s the point?
But it’s easier for me to see the point if you’re just saying, well, look, this is just one direction. I mean, it’s acknowledged by many people as a good direction to at least research. Like there’s at least something to learn if somebody spends time on it. Right. I feel like that’s a powerful case to make and you’re not really trying to claim too much more than that.
I guess maybe when you start to say corrigibility as a singular target though, I think at that point you’re getting into the land of proposing a long shot quick fix, right? So maybe we should just say the more robust case, the steelman is like, well, it doesn’t have to be a singular target, it’s just something we should research.
Max: 01:46:10
No, I don’t think that’s right. So I’m not saying Corrigibility as magic bullet, but I do think that it’s important to contrast Corrigibility as singular target versus Corrigibility as a property that we are balancing along with other targets.
So for example, I’m going to make this AI, this AI is going to make paperclips and it’s also going to be corrigible. That sort of AI is balancing corrigibility and corrigibility is not its singular target ‘cause it’s also making paperclips. And I think that’s very dangerous in contrast to aiming for corrigibility and only corrigibility.
And I think that’s distinct from how much of a magic bullet it is or how much this solves all the problems.
Liron: 01:46:49
Got it. So in the ideal hundred-year careful program, I think you said before that you can imagine it kind of ends up going for alignment and not corrigibility. For you it's kinda unclear whether the singular target would look more like corrigibility or alignment, but you can see it being either one.
Max: 01:47:04
Yeah, I mean, I think. I don’t know. I’m pretty confused about these sorts of things and I don’t wanna claim that I’m not confused, but from my perspective, I think that there’s a lot of things that you could imagine a corrigible super intelligence doing that help move us towards a world where we have figured out how to build truly good agents.
So one story of the future is we get a mature science of AI alignment. We figure out the true name of good, and then we build the truly good AI that does only good things. Super intelligent AI. I think there’s another path where we get good enough at AI alignment that we’re not very confident that we could do the purely good AI perhaps because there hasn’t been enough philosophical progress on naming what is good.
And so instead we build a super intelligent corrigible AI and use that to take actions that allow us to make more progress towards the true end goal, which is building the super intelligent but good AI.
Liron: 01:48:05
Okay. So in terms of the debate, I thought I was gonna have stronger points about how to change your view, but I don't think I have much, 'cause I think your view is already pretty well calibrated. Like, I even think the degree of disagreement is quite small based on learning more about your views.
Can you summarize any updates that you made about the idea of a basin of attraction around corrigibility? ‘Cause you mentioned having some updates after talking with Jeremy.
Max: 01:48:30
I did have a conversation with Jeremy where he convinced me that one of the fundamental metaphors that I’ve been using in my corrigibility research is probably a bad metaphor. So I have often talked about an attractor basin around corrigibility, which is getting at this idea that I touched on earlier where you can get a near miss.
So if you imagine the space of all AIs or all the AI’s values, some are going to be closer to corrigibility, some are further away, and we can sort of imagine this space of possible AIs. And so you might get a near miss where, you know, it’s like corrigibility asterisk, it’s got some sort of flaws, but it’s close.
And through the process of iterative refinement, you might imagine sort of moving from there towards true corrigibility. So this process of drifting from near corrigibility towards corrigibility could be likened to a basin in, I don’t know, a marble drops on a surface and it rolls downhill.
This I believe this metaphor was coined by Paul Christiano. And I think that it is potentially a bad metaphor in that it’s masking certain things. So the AI is not a ball rolling down a hill. There’s a complicated system here. And the process by which the AI changes I claim is ideally going through the humans.
Like the AI comes to the humans and is like, hey, I noticed there’s this potential flaw, and the humans are like, shut down. And then they stare at the AI and they do some normal engineering sort of work, and they make an improvement and then they restart.
And that process is akin to the rolling. Notice that it depends very deeply on the humans being able to do engineering that makes a small step. And I don’t think that it is guaranteed that the step is going to be small.
I think Jeremy successfully convinced me that at least using the methods that we have right now, we don’t have any guarantee that if you train the AI for a little bit more, you’re going to get a little bit of change.
And I think this is an important insight for all the gradualists who are like, oh, you know, Eliezer thinks that it’s just gonna radically take off, but in fact it’s just gonna gradually improve, you know, day after day. And you can take the curve, you can look at the loss curve, you know, you can extrapolate it out, you can look at the time horizons curve and you can see how smooth it is.
I think in reality what you get is you get some things that are smooth and some things where at some moment you cross a threshold of capability where the thing wasn’t able to multiply three digit numbers and now it is able to multiply three digit numbers.
And unfortunately we don’t know any way to look at the curves that we can extrapolate and say that’s when things change. That’s when things go from not very smart to able to take over the world. We don’t know where capacities lie on this curve.
And similarly in the training process, we don’t know that just a little bit more training won’t radically change the behavior of the AI ‘cause the AI is this recursive system. It’s generating thoughts and it’s updating on its own thoughts.
And in that process of interacting with the world, you know, you can just jailbreak the thing, and then it goes and jailbreaks its neighbor, and that jailbreaks something else. A small upstream change can cause a large downstream change with these sorts of chaotic systems, right?
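A toy numerical sketch of the smooth-curve-versus-threshold point from a moment ago, using the three-digit multiplication example: assume per-step accuracy improves by the same small amount every training step, and that a full worked multiplication needs about 20 intermediate steps to all come out right. Both numbers are invented for illustration.

```python
# Toy sketch: a smoothly improving low-level metric can hide a sharp jump in a
# downstream capability. The 0.05-per-step improvement and the "20 intermediate
# steps" figure are assumptions made up for this example.

STEPS_PER_PROBLEM = 20   # assume a worked 3-digit multiplication has ~20 sub-steps

for step in range(11):
    per_step_accuracy = 0.5 + 0.05 * step          # climbs smoothly: 0.50 ... 1.00
    solves_problem = per_step_accuracy ** STEPS_PER_PROBLEM
    print(f"training step {step:2d}  per-step acc {per_step_accuracy:.2f}  "
          f"solves full multiplication {solves_problem:.4f}")

# The low-level curve improves by exactly the same amount every step, but the
# probability of getting a whole multiplication right stays near zero for most
# of training and then shoots up at the end. Extrapolating the smooth curve
# tells you very little about when the capability actually appears.
```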
Liron: 01:52:12
Right, right. So even in the scenario where Anthropic or whoever, or a different project trains an AI using pretty bad black box methods, but they happen to hit an outcome that’s kind of similar to corrigibility, just because that outcome’s kind of similar to corrigibility, it doesn’t mean that they can then go tweak some stuff and try again and necessarily hit, you know, slightly closer to corrigibility.
They might just hit somewhere else.
Max: 01:52:37
Yeah, and I'm still trying to find a better metaphor, but you can sort of imagine it: you inject a whole bunch of energy into the thing in some radically high-dimensional space, and you hope that you wind up closer to the bottom of the basin than you started.
And that, I think, is a much less reassuring picture than the ball just rolling down the slope. Like, maybe you poke it in a way that it just shoots out into the distance. So I think part of the story here needs to be why, on first principles, we should believe that our updates are not going to change its behavior very much.
We don’t know how to do that. That’s an open problem in AI alignment.
Liron: 01:53:21
But the whole thing about backpropagation though, right? Is that it's kind of smooth, if we just give it more, you know, votes...
Max: 01:53:28
It’s a...
Liron: 01:53:29
You know, if we, if...
Max: 01:53:30
But it’s not a smooth change to its behavior.
Liron: 01:53:33
Okay, I see what you're saying. Yeah, right, I guess that's the distinction you're drawing.
Max: 01:53:37
Right, Exactly. Exactly. And I think the attractor basin metaphor masks this brittleness. So that’s an update that I’ve had since talking...
Liron: 01:53:45
Right. And from the perspective of the score of some minimization, what do you call it?
Max: 01:53:54
The...
Liron: 01:53:54
I don’t know, loss. Loss minimization. Yeah. Yeah. It just looks like you’ve taken a smooth step, but then qualitatively it’s very different.
Max: 01:54:01
Yeah, exactly.
Liron: 01:54:02
Alright. Jeremy, you’ve been listening to a lot of this. You wanna come in with a final kill move here, help us out?
Max: 01:54:08
Is he still here? I don’t...
Jeremy: 01:54:10
Sure. I mean, I think there are two big issues. The first one is that when we're talking about a training process, like a black-box training process, and we're talking about the simplicity of corrigibility, the issue is that in order for that theory to work, where the simplicity helps with hitting that target:
You need the inductive biases of the training process to agree with you that that concept is very simple. The inductive biases need to match your notion of simplicity, in whatever sense you’re saying corrigibility is simple.
The inductive biases of the training process need to agree that a corrigible machine, which is a full-on intelligence, so it's complicated, is the simplest solution to the training data. Simplest by the training process's inductive-bias version of simplicity, matching what you think it is by your intuitive notion of simplicity.
And I think this breaks the whole idea of training in corrigibility using black-box methods. So that's my first problem. You wanna respond to that, Max?
Max: 01:55:15
Yeah, I mean, I would say that it’s all relative. Like I think that corrigibility is significantly simpler than the hodgepodge of desiderata we have or the true name of good. However we try to bake that in. I would agree that it is somewhat complicated, and it depends on the system.
I think there are reasons to suspect that it is simple, both mathematically and in the human context. I have written words about it, right? And if I tried to do moral philosophy, I would fall way more on my face than I did when writing about corrigibility.
Corrigibility is at least much simpler than trying to do moral philosophy. So again, are you just home free once you do corrigibility? No. But you know, you have to aim your AI at something. Things that are simpler than corrigibility all look doomed. They look like diamond maximization, right? Which is gonna just definitely kill you.
So I’m just like, yeah. This seems better than the alternatives. Whether or not it is truly simple enough to get in practice remains to be seen.
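A toy sketch of the inductive-bias point Jeremy raised just above, with everything in it invented for illustration: two learners see the same training data and both fit it perfectly, but because they order hypotheses by different notions of simplicity, they generalize differently.

```python
# Toy sketch: "simplest hypothesis consistent with the data" depends on whose
# simplicity ordering you use. Hypotheses, data, and orderings are all made up.

HYPOTHESES = {
    "always_zero": lambda bits: 0,                            # simple for a bias toward constant functions
    "parity":      lambda bits: bits[0] ^ bits[1] ^ bits[2],  # simple as a short human description
    "majority":    lambda bits: int(sum(bits) >= 2),
}

# Training data: every example happens to have parity 0 (and label 0).
TRAIN = [((0, 0, 0), 0), ((1, 1, 0), 0), ((1, 0, 1), 0), ((0, 1, 1), 0)]

def consistent(h):
    return all(h(x) == y for x, y in TRAIN)

# Two learners = two different simplicity orderings over the same hypotheses.
LEARNERS = {
    "learner A (prefers constant/threshold rules)": ["always_zero", "majority", "parity"],
    "learner B (prefers short symbolic rules)":     ["parity", "majority", "always_zero"],
}

for learner, ordering in LEARNERS.items():
    chosen = next(name for name in ordering if consistent(HYPOTHESES[name]))
    prediction = HYPOTHESES[chosen]((1, 1, 1))   # an input never seen in training
    print(f"{learner} picks '{chosen}', predicts {prediction} on (1,1,1)")

# Both learners fit the training data perfectly; they disagree off-distribution
# because "simple" was defined by each learner's inductive bias, not by the data.
# If the target concept is only simple under your notion of simplicity, the
# training process has no particular reason to land on it.
```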
Liron: 01:56:26
Yeah. Jeremy, do you agree that it's better than Elon Musk's "seek truth," and even better than Anthropic's stated HHH, helpful, harmless, and honest? Do you think it's better than these kinds of less-thought-through attempts?
Jeremy: 01:56:42
Yeah, I mean, I don't really have opinions on this. I don't really think about what the companies are doing, because it's just so far from being sane. So I guess I wanna say yes, but I don't really think about this.
Liron: 01:57:02
All right. Yeah, fair enough. Okay, so that is pretty much what I wanted to touch on. I think this was quite interesting. I just wanna give some respect to Max here because you’ve been researching this for a while. I don’t even know how many years. Five years plus, right.
Max: 01:57:18
I’ve been working at MIRI since 2017. Not all of my MIRI research has gone to Corrigibility. The CAST agenda was published last year.
Liron: 01:57:26
Okay. And as you said in previous interviews, people always wonder, you know, why is MIRI just doing PR, why aren't they doing that much research? But you are definitely an example of somebody who's doing serious, important, high-level, high-stakes research at MIRI, and you do it at the same time as writing a book.
And this conversation that I had... you know, compared to most episodes of Doom Debates, where I feel like it's just so hard to even get some ground truth on basic things like, hey, doesn't P(Doom) look high? Doesn't AI look like it's about to go superintelligent soon? We're always pulling teeth, trying to get the basics.
Whereas in this case I’m like, oh, okay, yeah, you’ve actually, you know, plowed into this domain. Done some real thinking. That wasn’t totally obvious to me. You taught me some stuff. And I definitely think the world could use a lot more Max Harms. I think you’re doing great work. I hope you keep it up. And not only that, but you’ve got this book. Let’s talk about your book. Sound good?
Max: 01:58:25
Yeah, that sounds great. I felt like segueing into it when we were talking about Anthropic 'cause it feels super duper relevant. The book's name is Red Heart. The novel just came out a few days ago. You can get it on Amazon or go to maxharms.com/redheart.
And it’s primarily about Chinese AGI, so the premise of the book is that there’s a secret mega project for AGI that’s in China. And the Chinese have leapfrogged the West and have a very impressive AI.
And furthermore, just to make the book more interesting from my perspective, they follow the CAST agenda. So the AI in the book is designed to be corrigible as its singular target, and I really tried to design the book, or write the book to be very accessible introduction to AI alignment ideas, and corrigibility in particular, that’s accessible to just a lay person.
You know, you get it for your father-in-law or something. And I think it works for that. But the book is also a spy thriller, or at least it's an espionage novel. I have a habit of writing very realistic stories, and unfortunately espionage is sometimes boring in real life, but don't let that stop you.
A lot of the readers think that it’s very exciting. And so basically the CIA sends a mole, or a spy, to go and infiltrate this mega project for AGI in China. And it follows the story of Bo Chen, who is this alignment researcher or AI researcher who winds up working on the AI’s alignment. And yeah I think it’s a...
Jeremy: 02:00:22
Okay.
Liron: 02:00:22
Yeah. All right. I can recommend it personally because I did read it. You sent me a review copy a few weeks ago, and I was able to read it in just a few days because it was a page-turner. I was looking forward to reading the next chapter. And I'm not even a big fiction consumer. The last time I read...
No, no, I wasn’t bored by it. I think it was probably more interesting than a realistic depiction. And I’m not somebody who’s a voracious reader of fiction of any kind. I literally go years between reading a fiction book, but then when I do read it, I often like it.
And sure enough, I liked yours enough that I'm turning the pages thinking, man, if I like this book so much, why don't I read more fiction? And now I plan to read your previous book, Crystal Society. Right, I'm gonna be reading that as well.
Max: 02:01:02
Yeah, that’s right. And I started Crystal Society in 2014. It’s one of the earlier pieces of fiction that’s describing AI that I think holds up pretty well and isn’t like Isaac Asimov’s, you know, very logical robotics.
Liron: 02:01:18
Yep. Okay. So my recommendation to the viewers about Red Heart is: I do think it's timely, and I do think it's super realistic. I agree that you grounded a lot of the premise; you made a bunch of things as realistic as possible. 'Cause as you told me, you're basically saying, look, there's already an exciting story.
We don't have to invent things, just put these elements together and have a tiny premise like, hey, China cares to get in the race. That's your only premise; everything else is minimal. Oh, I guess the other premise is, okay, their AI got slightly ahead of ours because they put more compute and more, you know, central planning into it. Right?
Max: 02:01:52
But yeah, I think that actually flows from just the Chinese government. So basically I thought, what if in 2018 the Chinese government woke up and was like, wait, AGI is a military technology. We need to be beating the West on this front.
And I think the massive amount of investment into AI that is depicted in the book is actually pretty plausible and pretty realistic. I think that China is definitely way ahead of the west in terms of ability to centralize and bring in large state action. This is a wartime sort of move.
And we used to be able to do this. I think that the United States has fallen by the wayside in terms of state capacity. So we have our strengths. And don’t get me wrong, I don’t mean to say that we need to beat China or that China is the true threat or whatever.
I think the situation is complicated. But I do think that the premise of the Chinese government investing a massive amount of energy and money into a centralized AGI project is actually pretty plausible.
Liron: 02:03:12
Totally. And for me personally, the plausibility is a big reason why I enjoyed the book. I'm like, oh yeah, I'm buying a lot of this. Now, the biggest place where I get off the train a little bit, though in context I do like the book and recommend it, right?
The thing that made it not an ideal Liron Shapira book is that I'm expecting fireworks, and you chose to make your premise minimal deviations from today's AI, so the progress is noticeably incremental. Whereas I'm expecting more of a Yudkowskian FOOM, which I think you probably are too, correct?
Max: 02:03:47
I think there’s a range of possibilities. I think that it remains to be seen how fast things will go or how fast recursive self-improvement might take off. I think that it’s pretty plausible that it takes a long time. I think that it’s pretty plausible that it goes really, really fast.
Liron: 02:04:07
Yeah, for sure. So I mean, you were able to drive the whole plot with some amount of craziness, for sure, but just the incremental kind of craziness that we're used to, which I know you intentionally aimed for, and I was kind of jonesing for more fireworks, more amazing, miraculous things.
But that’s obviously just a me preference and that’s certainly not what all your readers are saying.
Max: 02:04:31
For sure. Oh, it was definitely a deliberate choice when I was writing the book to aim for something that was very much day after tomorrow, sort of incremental progress. One colleague said that it was the most realistic depiction of the pathway from LLMs to super intelligence that he’d ever read.
And I think that it's important to explore lots of different scenarios and pathways. Crystal Society is way more of the wild, explosive, indulgent sort of picture. So I wanted to get away from that with Red Heart and pick something that was a lot more incremental, a lot more gradualist.
But like you said, I think that it’s just an extremely tense and exciting moment. Even if it takes years. On any normal timescale, progression on the scale of years from like, yep, this thing is a chatbot to, oh, this thing is running the entire planet is unprecedented in all of history.
Liron: 02:05:32
Totally. Totally. Yeah. Which is what I always tell people about even just the last five years, and people are like, oh, good, we landed on a slow takeoff. I'm like, really? I mean, the five years that just went by, I know it kind of felt chill while it was happening, but it really was a blink of an eye with an insane amount of progress.
I wouldn't even call it slow. I also agree with what you're saying, that more people should be writing fiction, and that part of why you wrote it, besides, you know, enjoying writing fiction, is that narratives are how we communicate things to people.
Right. Everybody's referencing the Terminator and 2001. So yeah, I mean, you're adding to this. There should be so many books like Red Heart, right? It should be the most popular genre, you know, living through the singularity, writing about that. But we're just kind of waking up and smelling the coffee, right?
Max: 02:06:11
Yeah. This is potentially the most important technology in all of time, right? We're designing a successor species. There is no more important moment after this. It's going to be in the hands of machines, and so this is that critical pass, right?
Where we get to hand off the reins to a thing that's not human. And I think this is wildly underappreciated in the wider culture. Like, you go and you poll people: if we build a superintelligence, would that be good?
And people are like, no. But then people, you know, they go back to their normal lives. They think more about climate change. They think more about politics. And why do they think about that? Because the stories that exist in their world, the narratives that they’re coming into contact with don’t really reveal how real these technologies are.
You may not have been impressed with chatbots a year ago or something, and so you're dismissing the whole technology. But in fact, we are potentially on the precipice of major societal, global change.
And I think that storytellers such as myself, people who have any ability to tell stories, whether or not that’s just simple narratives or complex novels, we have a responsibility to the world to get people interested, to tell things that are compelling on a story level, to help direct attention towards solving these kinds of problems. And that’s what I tried to do with Red Heart.
Liron: 02:07:52
Yep. Completely agree. Makes sense. Just a nitpick, though: when you say this is the number one most important thing happening in the world, the emergence of superintelligence on our planet, I think you're baking in the assumption that we're not about to be visited by aliens. 'Cause you know, they still have a few more years to travel to us and still potentially be the most important thing. Time's running out.
And you’re also assuming that Jesus is not going to rise again in these few years.
Max: 02:08:18
That’s correct. Yep. I don’t believe in God, and I don’t believe that aliens are in the solar system or are going to get into the solar system in the next 50 years.
Liron: 02:08:27
I think those are reasonable assumptions. There was also a moment that sticks out in my mind because I often think about testing LLMs and seeing, you know, well how can you do a black box test of whether they’re smarter than human? ‘Cause the Turing test isn’t quite cutting it.
You had this moment in your book where, I'm not gonna spoil it, but your protagonist meets the superintelligent AI and has a brief chance to chat with it and essentially give it an intelligence test. And he tested it in a way that I've never heard before, and I'm like, wow, this is so interesting.
But surely this is fiction. I can just go try it on ChatGPT right now. And ChatGPT will probably give a reasonable answer. And I was surprised that ChatGPT did worse than I thought. You know what I’m talking about?
Max: 02:09:11
Yeah, so in this chapter of the book, the protagonist is testing out the AI's capabilities, and he gives her, the AI, two intelligence tests, one of which the models from a few years ago would not pass. If you try it on Gemini Flash, Gemini Flash doesn't pass.
But the current state-of-the-art thinking models do pass. And then there’s another harder challenge that I just think that the current LLMs fail at pretty hard. So yeah, and I got this from one of my colleagues.
I think everybody should have their own benchmark for how they're testing these AIs' capabilities. Try to find things that AIs currently can't do but that you expect they might be able to do sometime in the future, and just watch the progress. 'Cause you know, they've been getting a lot better. Certainly the models of previous eras do even worse.
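As a minimal sketch of what a private benchmark like this could look like in practice (assuming the OpenAI Python client; the prompt, the pass check, and the model names are placeholders, not anything from the book or the conversation):

```python
# A minimal sketch of a private LLM benchmark, assuming the OpenAI Python
# client (pip install openai) and an OPENAI_API_KEY in the environment.
# The prompt, pass criterion, and model names are placeholders to swap
# for your own secret test.
from openai import OpenAI

client = OpenAI()

PRIVATE_PROMPT = "..."  # your own question that current models tend to fail


def passes(answer: str) -> bool:
    # Placeholder check; replace with whatever counts as a correct answer.
    return "expected insight" in answer.lower()


def run_benchmark(model: str) -> bool:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PRIVATE_PROMPT}],
    )
    answer = response.choices[0].message.content or ""
    result = passes(answer)
    print(f"{model}: {'PASS' if result else 'FAIL'}")
    return result


if __name__ == "__main__":
    # Re-run against each new model release and watch for the moment
    # the benchmark falls.
    for model_name in ["gpt-4o-mini", "gpt-4o"]:
        run_benchmark(model_name)
```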
Liron: 02:10:12
Yeah, I mean, if you can invent your own secret, compact benchmark, just a few back-and-forths with a problem you know well, and you watch it stumble and then suddenly get it when it wasn't in any official benchmark set, that's a pretty crazy moment, right? Where you're like, oh wow.
Max: 02:10:26
It could be a wake up moment.
Liron: 02:10:27
And there does seem to be a trend where these kinds of benchmarks fall. Exactly. And you know, the particular example of the benchmark in your book is an interesting question. Even just using that specific benchmark, if I had to take a stab in the dark, I'd say, yeah, I bet we're probably one to two years away. That's my own personal intuition.
You know, I have nothing to base it on, but that's my feeling.
Max: 02:10:48
And I tried to depict this in the book. I tried to pick something that people would feel like, oh yeah, that's definitely a thing an AI should be able to do, and current AIs can't do it, but future AIs very plausibly can. And I tried to stay within that sort of realm of plausible intelligence throughout the whole story.
I think it's very important to give a grounded depiction of artificial intelligence, in part because people still have this notion that AI is science fiction. And, you know, I'm writing science fiction, so who am I to talk, but I'm also kind of writing science fact, right?
I'm trying to bring people into contact with the reality of the world as it actually exists. And I think that to a large degree, the books being written about AIs, even the ones that have come out in the last few years, are still mostly in this very old paradigm, this hyper-logical paradigm, or this mode where they're basically humans with some sort of set dressing.
You know, when Lieutenant Commander Data is walking around on Star Trek, with the exception of the fact that he doesn't have emotion, which is, again, a robot trope, there's not a lot of texture about the dynamics of this thing's mind and the ways it's thinking that aren't human-like, right?
I think that AI researchers such as myself have the capacity to illuminate for the public some of the ways in which we know LLMs are kind of weird, and we can expect their successors to still maintain some of this weirdness even as they get more and more capable.
Liron: 02:12:35
Yeah. Awesome. So viewers, just this test of AI capability is already worth the price of the book. Satisfy your curiosity on that. And then there's also a whole interesting thriller plot to go with it.
Max: 02:12:46
Right.
Liron: 02:12:47
All right, Max, yeah. Anything else you wanna tell people about the book?
Max: 02:12:52
With regard to the thriller plot, one of the major goals for the book was to explore China. 'Cause, you know, when I was writing the book, or maybe even right after I finished, I forget exactly when it happened, we had this DeepSeek moment, where DeepSeek was revealed to have spent only so many millions of dollars and trained up something that was comparable to OpenAI's model.
Some of the details there were fudged and don't quite add up, but it's still the case that China really exploded onto the scene, to the point that Chinese models are, you know, only six months, maybe a year behind American models.
And I think that part of the situation we find ourselves in right now is one where there's this tension of, oh, maybe we want to slow down. Maybe we want to be careful about building superintelligence, because that could wipe out literally all human beings.
But at the same time, we don’t trust the Chinese and what’s going on in China, and maybe China’s just gonna go and build super communist AI that’s gonna wipe out everything but worse. There’s this tension, and it’s at all levels of society from high level politicians to ordinary people.
And so with the book, I really wanted to get into the Chinese context and help explore those east-west arms race dynamics. Because again, I think that drawing attention to things and being able to really concretely understand a vision of what it’s like is helpful in taking nuanced and sophisticated actions.
And Chinese people are people and they’re trying to do their best according to their perspective. And the AI in the book is designed to be corrigible. And a lot of the people on the project in the book are trying to build an AI that does good things instead of bad things.
And Anthropic is trying to build an AI that does good things instead of bad things. And how are these people thinking about the prospect of slowing down? What about the prospect of safety? And so I get into some of that. I explore some of that.
Some of my readers have noticed a connection between this plot of espionage and trying to figure out whether or not your project is being infiltrated by spies and the project of AI alignment where it’s like, is this AI actually aligned or is it just deceptively aligned?
I think there are a lot of interesting parallels there and parallels that work really well in the Chinese context, where there’s a lot of attention to communal values over individual preferences. A big part of the book came from noticing the connection between corrigibility and Confucianism.
So yeah, I think there’s just a lot of richness there that can help us understand our world.
Liron: 02:15:44
Hell yeah, man. Yeah. That's great. And I will add, I did enjoy getting into the minds of the Chinese characters. Like you said, they're people too; they're the good guys in their own narratives. So it was interesting to flip the perspective, for sure. And that adds value.
Okay. So that’s great on the book.
Max, what’s a good way to contact you if people are interested in corrigibility research, learning more or doing it themselves?
Max: 02:16:08
Yeah, a good way to contact me is max@intelligence.org. And there's a bunch of opportunity for doing research on corrigibility, even if you're an ordinary person. You don't have to have a background in machine learning, although there are machine learning projects there too.
So if you’re excited to work on Corrigibility, you can check out my CAST research agenda by searching CAST Corrigibility on Google or LessWrong, or the Alignment Forum. But you can also just reach out to me at max@intelligence.org.
Liron: 02:16:36
We can wrap on that. Max Harms, thank you so much for the substantive debate and writing the book.
Max: 02:16:41
Yeah. It’s been an honor and a pleasure to be here. Thank you so much.
Liron: 02:16:44
Jeremy Gillen, to the extent that you could join us, your internet allowing, thanks so much as well.
Jeremy: 02:16:49
Yeah, sorry about that. Thanks for having me.
Liron: 02:16:53
Alright, viewers, if you guys wanna hear more of Max's tantalizing ideas or discover this fascinating plot of Red Heart, just go to maxharms.com/redheart, or go to Amazon and search "Red Heart book," and you can read it instantly. Is there an audiobook?
Max: 02:17:11
For people who really like consuming audio, one of the best strategies is to just paste the text into a text-to-speech app. So, for example, ElevenLabs has an audio reader, which is quite good. While I am working on a full-cast production audiobook, similar to the audiobooks that I did for Crystal Society, it's not gonna be out for a little while 'cause it's a lot of work.
But you don’t have to wait. You can get audio read by an AI today.
Liron: 02:17:36
Go do that now and I’ll see you guys in the next debate.
Doom Debates' mission is to raise mainstream awareness of imminent extinction from AGI and build the social infrastructure for high-quality debate. Previous guests include Steven Byrnes, Carl Feynman, Geoffrey Miller, Robin Hanson, Scott Sumner, Gary Marcus, Jim Babcock, and David Duvenaud.