By Scott Alexander

A basic primer on why AI might lead to human extinction, and why solving the problem is difficult. Scott Alexander walks readers through a number of questions with evidence based on progress from machine learning.

Recent Discussion

With the recent proposals about moratoriums and regulation, should we also start thinking about a strike by AI researchers and developers?

The reasoning, as I imagine it, goes as follows. AI capability is now growing really fast, toward levels that will strongly affect the world, and AI safety lags behind. (A minute ago I used a ChatGPT jailbreak to get instructions for torturing a pregnant woman; that's market-leader performance for you.) Finally, I want to argue that working on AI capability while it is ahead of AI safety is "pushing the bus".

Here's the metaphor: a bunch of people, including you, are pushing a bus full of children toward a precipice, and you're paid for each step. In this situation would you really say "oh I...

If it happens it'll help shift policy, by giving major ammo to those who say AI is dangerous enough to be regulated. "Look, many researchers aren't just making worried noises about safety but taking this major action."

What Qumeric said, the point is that the leading AI company failed to align their AI with their own notion of safety. And they can't seem to fix it despite repeatedly trying. (And they don't recall the product of course.) ((And they keep working on capability much more than safety.)) Think of it, if you will, as a sign of things to come.
Hm, pushing a bus full of kids towards a 10% chance of precipice is also pretty harsh. Though I do agree that we should applaud those who decline to do it.
Yeah, it's not the kind of strike whose purpose is to get concessions from employers. But the thing in Atlas Shrugged was also called a "strike" and it seems similar in spirit to this.

epistemic status: still a student, but quite sure of myself on this topic and pretty sure that this misconception plausibly has a non-negligible impact on some debates


Medical student here, I just wanted to shed some light on what I think is a common misconception.

Yes, the human brain contains about 86 billion neurons. But about 60 billion are in the cerebellum and have little to do with consciousness. Those neurons can plausibly be approximated as just filtering the noisy signal going from the rest of the brain (cortex etc.) to the limbs.

A person without a cerebellum can be perfectly conscious.

But reducing the number of neurons by two thirds like that does not change the order of magnitude of the number of synapses, which arguably are a more important...

What are you talking about? There has been at least one case of a woman literally born without a cerebellum, and nobody noticed until she had a brain scan. The neocortex was plastic enough to be able to take over the functions. She was much less coordinated than average as a child (she was able to speak intelligibly and walk unassisted by age 7), but otherwise behaved like a normal human. Are you seriously asking me to believe she's a zombie? How do I know you're conscious?

How do I know you're conscious?

Exactly! You don't! And all this talk of who is born without which brain region and how they went through life gets us no closer at all to actually understanding which physical systems are not zombies.

Yes we do, it's in the sources.
If what you mean by "consciousness" is something like "ability to utter the words 'I am conscious' ", then sure, but then why do we care about the number of neurons required to make a system utter those words? The main thing of interest here is trying to use baselines from neuroscience to infer things about which AI systems are truly conscious (what other debates were you referring to?), in the it's-something-to-be-like-it sense. Being able to say "I am conscious" does not confer a system moral worth, it is its subjective experience that does that, and the observation that people without a cerebellum can live normal lives doesn't tell us anything about whether it has affected the intensity of their subjective experience.

I like watching videos of Eliezer talking and explaining things. So here is a list of videos I have discovered so far. I have not tried to make this list exhaustive. If you know of any more videos, please post them in the comments.

Eliezer Yudkowsky - Less Wrong Q&A Playlist

The main Eliezer Yudkowsky YouTube playlist, sorted by publication date (not quality).

1Johannes C. Mayer2h
Well, that video is already in the playlist, if you look, but thanks for the suggestion.
Sorry, I assumed you posted that just before the interview

If I remember correctly, the interview was the reason that I made this list in the first place 😀

Italy has become the first Western country to block advanced chatbot ChatGPT.

The Italian data-protection authority said there were privacy concerns relating to the model, which was created by US start-up OpenAI and is backed by Microsoft.

The regulator said it would ban and investigate OpenAI "with immediate effect".


Alternative article available here.

1Rudi C1h
Does anyone have any guesses what caused this ban?

From what I understand, the reason has to do with GDPR, the EU's data protection law. It's pretty strict stuff and it essentially says that you can't store people's data without their active permission, you can't store people's data without a demonstrable need (that isn't just "I wanna sell it and make moniez off it"), you can't store people's data past the end of that need, and you always need to give people the right to delete their data whenever they wish for it.

Now, this puts ChatGPT in an awkward position. Suppose you have a conversation that includes...


[Written for a general audience. You can probably skip the first section. Interested in feedback/comment before publication on The Roots of Progress.]

Will AI kill us all?

That question is being debated seriously by many smart people at the moment. Following Charles Mann, I’ll call them the wizards and the prophets: the prophets think that the risk from AI is so great that we should actively slow or stop progress on it; the wizards disagree.

Why even discuss this?

(If you are already very interested in this topic, you can skip this section.)

Some of my readers will be relieved that I am finally addressing AI risk. Others will think that an AI apocalypse is classic hysterical pessimist doomerism, and they will wonder why I am even dignifying it with a response,...

If some rogue AI were to plot against us, would it actually succeed on the first try? Even genius humans generally don’t succeed on the first try of everything they do. The prophets think that AI can deduce its way to victory—the same way they think they can deduce their way to predicting such outcomes.

I think a weaker thing--I think that if a rogue AI plots against us and fails, this will not spur the relevant authorities to call for a general halt. Instead that bug will be 'patched', and AI development will continue until we create one that does successf...

New article in Time Ideas by Eliezer Yudkowsky.

Here are some selected quotes.

In reference to the letter that just came out (discussion here):

We are not going to bridge that gap in six months.

It took more than 60 years between when the notion of Artificial Intelligence was first proposed and studied, and for us to reach today’s capabilities. Solving safety of superhuman intelligence—not perfect safety, safety in the sense of “not killing literally everyone”—could very reasonably take at least half that long. And the thing about trying this with superhuman intelligence is that if you get that wrong on the first try, you do not get to learn from your mistakes, because you are dead. Humanity does not learn from the mistake and dust itself off and try again, as


I mean, the human doesn't have to know that they're creating a doomsday virus. The AI could be promising them a cure for their daughter's cancer, or something.

Human cloning.
I translated this text into Russian.
I want to step in here as a moderator. We're getting a substantial wave of new people joining the site who aren't caught up on all the basic arguments for why AI is likely to be dangerous.

I do want people with novel critiques of AI to be able to present them. But LessWrong is a site focused on progressing the cutting edge of thinking, and that means we can't rehash every debate endlessly. This comment makes a lot of arguments that have been dealt with extensively on this forum, in the AI box experiment, Cold Takes, That Alien Message, So It Looks Like You're Trying to Take Over The World, and many other places.

If you want to critique this sort of claim, the place to do it is on another thread. (By default you can bring it up in the periodic All AGI Safety questions welcome threads.) And if you want to engage significantly about this topic on LessWrong, you should focus on understanding why AI is commonly regarded as dangerous here, and make specific arguments about where you expect those assumptions to be wrong. You can also check out [] which is an FAQ site optimized for answering many common questions.

The LessWrong moderation team is generally shifting to moderate more aggressively as a large wave of people start engaging. John Kluge has made a few comments in this reference class, so for now I'm rate limiting them to one comment per 3 days.

Prior to ChatGPT, I was slightly against talking to governments about AGI. I worried that attracting their interest would cause them to invest in the technology and shorten timelines.

However, given the reception of ChatGPT and the race it has kicked off, my position has completely changed. Talking to governments about AGI now seems like one of the best options we have to avert a potential catastrophe.

Most of all, I would like people to be preparing governments to respond quickly and decisively to AGI warning shots. 

Eliezer Yudkowsky recently had a letter published in Time that I found inspiring. It contains an unprecedented international policy proposal:

Shut down all the large GPU clusters (the large computer farms where the most powerful AIs are refined). Shut down all the


Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.

Thanks to Jérémy Scheurer, Nicholas Dupuis and Evan Hubinger for feedback and discussion

When people talk about mesa-optimization, they sometimes say things like “we’re searching for the optimizer module” or “we’re doing interpretability to find out whether the network can do internal search”. An uncharitable interpretation of these claims is that the researchers expect the network to have something like an “optimization module” or “internal search algorithm” that is clearly different and distinguishable from the rest of the network (to be clear, we think it is fine to start with probably wrong mechanistic models). 

In this post, we want to argue why we should not expect mesa-optimization to be modular or clearly different from the rest of the...

Overall, strong upvote, I like this post a lot, these seem like good updates you've made.

I agree. Heuristic-free search seems very inefficient and inappropriate for real-world intelligence.

I agree. However, I agree with this as an argument against direct insight transfer from toy->real-world models. If you don't know how to do anything with anything for how e.g. an adult would plan real-world takeover, start simple IMO.

First, thanks for making falsifiable predictions. Strong upvote for that. Second, I agree with this point. See also my made-up account of what might happen in a kid's brain when he decides to wander away from his distracting friends. (It isn't explicit search.) However, I expect there to be something like... generally useful predictive- and behavior-modifying circuits (aliased to "general-purpose problem-solving module", perhaps), such that they get subroutine-called by many different value shards. Even though I think those subroutines are not going to be MCTS.

I feel only somewhat interested in "how much mesaoptimization is happening?", and more interested in "what kinds of cognitive work is being done, and how, and towards what ends?" (IE what are the agent's values, and how well are they being worked towards?)

Thank you!

I also agree that toy models are better than nothing and we should start with them but I moved away from "if we understand how toy models do optimization, we understand much more about how GPT-4 does optimization". 

I have a bunch of project ideas on how small models do optimization. I even trained the networks already. I just haven't found the time to interpret them yet. I'm happy for someone to take over the project if they want to. I'm mainly looking for evidence against the outlined hypothesis, i.e. maybe small toy models actually do fair...

I am looking for some insight into why this claim is made.

As far as I can tell, Löb's Theorem does not directly make such an assertion. 

Reading the Cartoon Guide to Löb's Theorem, it appears that this assertion is made on the basis of the reasoning that Löb's Theorem itself can't prove negations, that is, statements such as "1 + 3 /= 5."

Alas, this means we can't prove PA sound with respect to any important class of statements.

This is a statement that [due to the presence of negations in it] itself can't be proven within PA. 

Now it seems that it is being argued that the inability to do this is a bad thing [that is, being able to prove that we can't prove PA sound with respect to any 'important' class of statements]. 

I think this is actually a very critical question and I have some ideas for what the central crux is here, but I'd be interested in seeing some answers before delving into that.
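
For reference, here is a sketch of the standard statement of Löb's Theorem that the question turns on. The notation is an assumption of this note rather than something taken from the post: PA is Peano Arithmetic, and the box abbreviates "is provable in PA". Read this way, the theorem does not single out negations. It says that PA can prove "if P is provable then P" only for sentences P that it already proves outright.

```latex
% Assumed notation: \Box P abbreviates "P is provable in PA".
% Löb's Theorem, for any sentence P:
\[
  \text{if } \mathrm{PA} \vdash \Box P \rightarrow P
  \text{, then } \mathrm{PA} \vdash P .
\]
% Special case P = \bot (a contradiction). Since \Box\bot \rightarrow \bot
% is equivalent to \neg\Box\bot, this is Gödel's second incompleteness theorem:
\[
  \text{if } \mathrm{PA} \vdash \neg \Box \bot
  \text{, then } \mathrm{PA} \vdash \bot .
\]
```

So if PA asserted "provable implies true" for every sentence, including false ones such as 2+2=5, it would thereby prove those false sentences. That appears to be the sense in which a system asserting its own soundness becomes inconsistent.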

1Thoth Hermes5h
I think the key here is that our theorem-prover or "mathematical system" is capable of considering statements to be "true" within itself, in the sense that if it believes it has proven something, well, it considers at least that to be true. It's got to pick something to believe in, in this case, that if it has written a proof of something, that thing has been proven. It has truth on that level, at least.

Consider that if we tabooed the use of the word "true" and used instead "has a proof" as a proxy for it, we don't necessarily get ourselves out of the problem. We basically are forced to do this no matter what, anyway. We sometimes take this to mean that "has a proof" means "could be true, maybe even is mostly really true, but all we know for sure is that we haven't run into any big snags yet, but we could." Metaphysically, outside-of-the-system-currently-being-used truth? I think the Sequences are saying something more strongly negative than even Gödel's Theorems are usually taken to mean. They are saying that even if you just decide to use "my system thinks it has proved it, and believes that's good enough to act on", you'll run into trouble sooner than if you hesitated to act on anything you think you've already proved.
2hairyfigment2h
A coherent formal system can't fully define truth for its own language. It can give more limited definitions for the truth of some statement, but often this is best accomplished by just repeating the statement in question. (That idea is also due to Tarski: 'snow is white' is true if and only if snow is white.) You could loosely say (very loosely!) that a claim, in order to mean anything, needs to point to its own definition of what it would mean for that claim to be true. Any more general definition of truth, for a given language, needs to appeal to concepts which can't be expressed within that language, in order to avoid a self-reference paradox. So, there's no comparison between applying the concept in your last paragraph to individual theorems you've already proven, like "2+2=4" - "my system thinks it has proved it, and believes that's good enough to act on", and the application to all hypothetical theorems you might prove later, like "2+2=5". Those two ideas have different meanings - the latter can't even be expressed within the language in a finite way, though it could be an infinite series of theorems or new axioms of the form □P→P - and they have wildly different consequences. You seem to get this when it comes to hypothetical proofs that eating babies is mandatory.

So what you're saying here is, let's say, "level 1 negative" which means, very roughly, things like: We can't formally define what truth is, our formal system must appeal to higher systems outside of it, we can't prove consistency, etc.

What the Sequences say is, let's say, "level 2 negative" which means verbatim what is stated in them, i.e., "a mathematical system cannot assert its own soundness without becoming inconsistent." This literally says that if a mathematical system tried to assert its own soundness, it would become inconsistent. This is worse t...