Book 2 of the Sequences Highlights

A major theme of the Sequences is the ways in which human reasoning goes astray. This sample of essays describes a number of failure modes and invokes us to do better.

First Post: The Bottom Line

Recent Discussion

I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems. [] I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies. 

It seems like just 4 months ago you still endorsed your second power-seeking paper:

This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.

Why are you now "fantasizing" about retracting it?

I think a healthy alignment community would have rebuked me for that line of research, but s

... (read more)
4Aryeh Englander4h
You should make this a top level post so it gets visibility. I think it's important for people to know the caveats attached to your results and the limits on its implications in real-world dynamics.

Today, the AI Extinction Statement was released by the Center for AI Safetya one-sentence statement jointly signed by a historic coalition of AI experts, professors, and tech leaders.

Geoffrey Hinton and Yoshua Bengio have signed, as have the CEOs of the major AGI labs–Sam Altman, Demis Hassabis, and Dario Amodei–as well as executives from Microsoft and Google (but notably not Meta).

The statement reads: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”

We hope this statement will bring AI x-risk further into the overton window and open up discussion around AI’s most severe risks. Given the growing number of experts and public figures who take risks from advanced AI seriously, we hope to improve epistemics by encouraging discussion and focusing public and international attention toward this issue.

Not every system of values places extinction on its own special pedestal [...] in terms of expected loss of life AI could be even with those other things

Well this is wrong, and I'm not feeling any sympathy for a view that it's not. An eternity of posthuman growth after recovering from a civilization-spanning catastrophe really is much better than lights out, for everyone, forever.

I agree that there are a lot of people who don't see this, and will dismiss a claim that expresses this kind of thing clearly. In mainstream comments to the statement, I've see... (read more)

I honestly think that's for the best because I don't believe super fast takeoff FOOM scenarios are actually realistic. And in any slower takeoff existential risk looks more like a muddled mix of human and AI driven processes than "nanomachines, everyone dies".
1 is an obvious one that many would deny out of sheer copium. Though of course "not dying" has to go hand in hand with "not aging" or it would rightly be seen as torture. 2 seems vague enough that I don't think people would vehemently disagree. If you specify, such as suggesting that there are absolutely correct or wrong answers to ethical questions, for example, then you'll get disagreement (including mine, for that matter, on that specific hypothetical claim).
Yeah, beyond that honestly I would worry that his politics in general might do even more to polarize the issue in an undesirable way. I think it's not necessarily a bad call in the current atmosphere.

I think that relatively simple alignment techniques can go a long way. In particular, I want to tell a plausible-to-me story about how simple techniques can align a proto-AGI so that it makes lots of diamonds.

But why is it interesting to get an AI which makes lots of diamonds? Because we avoid the complexity of human value and thinking about what kind of future we want, while still testing our ability to align an AI. Since diamond-production is our goal for training, it’s actually okay (in this story) if the AI kills everyone. The real goal is to ensure the AI ends up acquiring and producing lots of diamonds, instead of just optimizing some weird proxies that didn’t have anything to do with diamonds. It’s also OK...

(Huh, I never saw this -- maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.)

I really appreciate these thoughts!

But you then propose an RL scheme. It seems to me like it's still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.

I would say "that isn't how on-policy RL works; it doesn't just intelligently find increasingly high-reinforcement policies; which reinforcement events get 'exploited' depends on the explorat... (read more)


To me, it is obvious that veganism introduces challenges to most people. Solving the challenges is possible for most but not all people, and often requires trade-offs that may or may not be worth it.  I’ve seen effective altruist vegan advocates deny outright that trade-offs exist, or more often imply it while making technically true statements. This got to the point that a generation of EAs went vegan without health research, some of whom are already paying health costs for it, and I tentatively believe it’s harming animals as well. 

Discussions about the challenges of veganism and ensuing trade-offs tend to go poorly, but I think it’s too important to ignore. I’ve created this post so I can lay out my views as legibly as possible, and invite...

I think, in full generality, "Nutrition is complicated and multidimensional, and the more you restrict along any axis for any reason the more tradeoffs you need to make and considerations you need to take to ensure you're getting what you need."

I'm also not sure if relying on traditional vegan dishes and cuisines is sufficient, if your metabolic context is very different otherwise? E.g. I suspect a farm worker who needs to eat 4000 kcal/day to do all his physical labor has to worry a lot less about specific food choices to ensure he gets enough of key micr... (read more)

I think you could add "and the more vegans there are, the easier it is". A huge part of it is adoption. Things like which foods are easily found, what restaurants offer, whether you have to bring vegan options from home if you're eating at a conference or other event, etc. will change a lot between cultures and times depending on how common the need is. I'm pescatarian, trending towards vegetarian, live in the UK, and feel like being vegetarian here would be fairly effortless - almost everywhere always has at least a few options. Veganism already would be a lot harder. Not impossible (if only thanks to all the Indian and Pakistani food around!), but noticeably harder.

In 2009, Eliezer Yudkowsky published Raising the Sanity Waterline. It was the first article in his Craft and the Community sequence about the rationality movement itself, and this first article served as something of a mission statement. The rough thesis behind this article—really, the thesis behind the entire rationalist movement—can be paraphrased as something like this:

We currently live in a world where even the smartest people believe plainly untrue things. Religion is a prime example: its supernatural claims are patently untrue, and yet a huge number of people at the top of our institutions—scholars, scientists, leaders—believe otherwise.

But religion is just a symptom. The real problem is humanity's lack of rationalist skills. We have bad epistemology, bad meta-ethics, and we don't update our beliefs based on evidence. If we don't master


And even more deeply than door-to-door conversations, political and religious beliefs spread through long-term friend and romantic relationships, even unintentionally.

I can attest to this first-hand because I converted from atheism to Catholicism (25 years ago) by the unintended example of my girlfriend-then-wife, and then I saw the pattern repeat as a volunteer in RCIA, an education program for people who have decided to become Catholic (during the months before confirmation), and pre-Cana, another program for couples who plan to be married in the church ... (read more)

The answer is to read the sequences (I'm not being facetious). They were written with the explicit goal of producing people with EY's rationality skills in order for them to go into producing Friendly AI (as it was called then). It provides a basis for people to realize why most approaches will by default lead to doom.  At the same time, it seems like a generally good thing for people to be as rational as possible, in order to avoid the myriad cognitive biases and problems that plague humanities thinking, and therefore actions. My impression is that the hope was to make the world more similar to Dath Ilan. 
So the idea is that if you get as many people in the AI business/research and as possible to read the sequences, then that will change their ideas in a way that will make them work in AI in a safer way, and that will avoid doom? I'm just trying to understand how exactly the mechanism that will lead to the desired change is supposed to work. If that is the case, I would say the critique made by OP is really on point. I don't believe the current approach is convincing many people to read the sequences, and I also think reading the sequences won't necessarily make people change their actions when business/economic/social incentives work otherwise. The latter being unavoidably a regulatory problem, and the former a communications strategy problem. Or are you telling me to read the sequences? I intend to sometime, I just have a bunch of stuff to read already and I'm not exactly good at reading a lot consistently. I don't deny having good material on the subject is not essential either.
It depends what you mean by political. If you mean something like "people should act on their convictions" then sure. But you don't have to actually go in to politics to do that, the assumption being that if everyone is sane, they will implement sane policies (with the obvious caveats of Moloch, Goodhart etc.). If you mean something like "we should get together and actively work on methods to force (or at least strongly encourage) people to be better", then very much no. Or rather it gets complicated fast. 

Epistemic status: This post is a distillation of many comments/posts. I believe that my list of problems is not the best organization of sub-problems. I would like to make it shorter, and simpler, because cool theories are generally simple unified theories, by identifying only 2 or 3 main problems without aggregating problems with different types of gear level mechanisms, but currently I am too confused to be able to do so. Note that this post is not intended to address the potential negative impact of RLHF research on the world, but rather to identify the key technical gaps that need to be addressed for an effective alignment solution. Many thanks to Walter Laurito, Fabien Roger, Ben Hayum, Justis Mills for useful feedbacks.

RLHF tldr: We need a reward function,...

The model has been shaped to maximize its reward by any means necessary[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the "playing the training game" scenario.

This conclusion seems unwarranted. What we have observed is (Paul claiming the existence of) an optimized model which ~always brings up weddings. On what basis does one infer that "the model has been shaped to maximize its reward by any means necessary"? This is likewise not weak evidence for playing the training game. 

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Subscribe to Curated posts
Log In Reset Password
...or continue with


Starting in 2008, Robin Hanson and Eliezer Yudkowsky debated the likelihood of FOOM: a rapid and localized increase in some AI's intelligence that occurs because an AI recursively improves itself.

As Yudkowsky summarizes his position:

I think that, at some point in the development of Artificial Intelligence, we are likely to see a fast, local increase in capability—“AI go FOOM.” Just to be clear on the claim, “fast” means on a timescale of weeks or hours rather than years or decades; and “FOOM” means way the hell smarter than anything else around, capable of delivering in short time periods technological advancements that would take humans decades, probably including full-scale molecular nanotechnology. (FOOM, 235)

Over the course of this debate, both Hanson and Yudkowsky made a number of incidental predictions about...

Two and a half years ago, I wrote Extrapolating GPT-N performance, trying to predict how fast scaled-up models would improve on a few benchmarks. One year ago, I added PaLM to the graphs. Another spring has come and gone, and there are new models to add to the graphs: PaLM-2 and GPT-4. (Though I only know GPT-4's performance on a small handful of benchmarks.)

Converting to Chinchilla scaling laws

In previous iterations of the graph, the x-position represented the loss on GPT-3's validation set, and the x-axis was annotated with estimates of size+data that you'd need to achieve that loss according to the Kaplan scaling laws. (When adding PaLM to the graph, I estimated its loss using those same Kaplan scaling laws.)

In these new iterations, the x-position instead represents...

30,000ft takeaway I got from this: we're ~ < 2 OOM from 95% performance. Which passes the sniff test, and is also scary/exciting

When discussing AI, we often associate technological proficiency with an understanding of AI's broader implications. This notion is as misguided as believing that a master printing press mechanic from Gutenberg's era would have been best equipped to predict the impacts of the printing press.

The AI development field, represented by experts like Geoffrey Hinton (proponent of safety measures and potential regulation) or Yann LeCun (an advocate of AI acceleration with fewer restrictions), primarily concerns itself with technical prowess and the creation of AI algorithms. In contrast, understanding the expansive implications of AI necessitates a comprehensive, multidisciplinary approach encompassing areas like warfare, game theory, sociology, philosophy, ethics, economics, and law, among others.

In this context, the debate between Yann LeCun and historian-author Yuval Noah Harari offers a glimpse into these...

Every other important technological breakthrough. The Internet and nuclear weapons are specific examples if you want any.

Are you saying that outside experts were better at understanding potential consequences in these cases? I have trouble believing it.