A major theme of the Sequences is the ways in which human reasoning goes astray. This sample of essays describes a number of failure modes and exhorts us to do better.
Today, the AI Extinction Statement was released by the Center for AI Safety, a one-sentence statement jointly signed by a historic coalition of AI experts, professors, and tech leaders.
Geoffrey Hinton and Yoshua Bengio have signed, as have the CEOs of the major AGI labs (Sam Altman, Demis Hassabis, and Dario Amodei) and executives from Microsoft and Google (though notably not Meta).
The statement reads: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
We hope this statement will bring AI x-risk further into the Overton window and open up discussion of AI’s most severe risks. Given the growing number of experts and public figures who take risks from advanced AI seriously, we hope to improve collective epistemics by encouraging discussion and focusing public and international attention on this issue.
Not every system of values places extinction on its own special pedestal [...] in terms of expected loss of life AI could be even with those other things
Well, this is wrong, and I don't have much sympathy for the view that it isn't. An eternity of posthuman growth after recovering from a civilization-spanning catastrophe really is much better than lights out, for everyone, forever.
I agree that there are a lot of people who don't see this, and will dismiss a claim that expresses this kind of thing clearly. In mainstream comments to the statement, I've see...
I think that relatively simple alignment techniques can go a long way. In particular, I want to tell a plausible-to-me story about how simple techniques can align a proto-AGI so that it makes lots of diamonds.
But why is it interesting to get an AI which makes lots of diamonds? Because we avoid the complexity of human value and thinking about what kind of future we want, while still testing our ability to align an AI. Since diamond-production is our goal for training, it’s actually okay (in this story) if the AI kills everyone. The real goal is to ensure the AI ends up acquiring and producing lots of diamonds, instead of just optimizing some weird proxies that didn’t have anything to do with diamonds. It’s also OK...
(Huh, I never saw this -- maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.)
I really appreciate these thoughts!
But you then propose an RL scheme. It still seems to me a useful form of critique to say: here are the upward errors in the proposed rewards, and here is the policy that would exploit them.
I would say "that isn't how on-policy RL works; it doesn't just intelligently find increasingly high-reinforcement policies; which reinforcement events get 'exploited' depends on the explorat...
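The point about exploration can be made concrete with a toy sketch (my own illustration, not from either commenter): an epsilon-greedy bandit learner facing a reward signal with one erroneously high arm. Whether the "upward error" gets exploited depends entirely on whether the exploration schedule ever visits it. All numbers here are made up for illustration.

```python
import random

random.seed(0)

# True rewards for 3 arms; arm 2's 0.9 is an erroneous upward spike
# in the reward signal. Whether the policy "exploits" it depends on
# whether on-policy exploration ever reaches arm 2 at all.
true_reward = [0.5, 0.6, 0.9]

def run(epsilon, steps=2000):
    """Epsilon-greedy on-policy learner; returns the arm it ends up preferring."""
    q = [0.0, 0.0, 0.0]  # running-average value estimates
    n = [0, 0, 0]        # visit counts
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(3)                      # explore
        else:
            a = max(range(3), key=lambda i: q[i])        # exploit
        n[a] += 1
        q[a] += (true_reward[a] - q[a]) / n[a]           # incremental mean
    return max(range(3), key=lambda i: q[i])

# With exploration, the learner finds and exploits the erroneous arm.
print(run(epsilon=0.1))  # typically arm 2
# With no exploration, it settles on the first arm it happened to try,
# never "discovering" the reward error at all.
print(run(epsilon=0.0))
```

The same reward misspecification produces exploitation in one run and none in the other; the difference is purely the exploration distribution, which is the crux of the "that isn't how on-policy RL works" reply.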
To me, it is obvious that veganism introduces challenges for most people. Solving those challenges is possible for most (but not all) people, and often requires trade-offs that may or may not be worth it. I’ve seen effective altruist vegan advocates deny outright that trade-offs exist, or more often imply as much while making technically true statements. This got to the point that a generation of EAs went vegan without researching the health implications, some of whom are already paying health costs for it, and I tentatively believe it’s harming animals as well.
Discussions about the challenges of veganism and ensuing trade-offs tend to go poorly, but I think it’s too important to ignore. I’ve created this post so I can lay out my views as legibly as possible, and invite...
I think, in full generality, "Nutrition is complicated and multidimensional, and the more you restrict along any axis for any reason the more tradeoffs you need to make and considerations you need to take to ensure you're getting what you need."
I'm also not sure if relying on traditional vegan dishes and cuisines is sufficient, if your metabolic context is very different otherwise? E.g. I suspect a farm worker who needs to eat 4000 kcal/day to do all his physical labor has to worry a lot less about specific food choices to ensure he gets enough of key micr...
In 2009, Eliezer Yudkowsky published Raising the Sanity Waterline. It was the first article in his Craft and the Community sequence about the rationality movement itself, and this first article served as something of a mission statement. The rough thesis behind this article—really, the thesis behind the entire rationalist movement—can be paraphrased as something like this:
We currently live in a world where even the smartest people believe plainly untrue things. Religion is a prime example: its supernatural claims are patently untrue, and yet a huge number of people at the top of our institutions—scholars, scientists, leaders—believe otherwise.
But religion is just a symptom. The real problem is humanity's lack of rationalist skills. We have bad epistemology, bad meta-ethics, and we don't update our beliefs based on evidence. If we don't master
And even more deeply than door-to-door conversations, political and religious beliefs spread through long-term friend and romantic relationships, even unintentionally.
I can attest to this first-hand: 25 years ago I converted from atheism to Catholicism through the unintended example of my girlfriend, later my wife. I then saw the pattern repeat as a volunteer in RCIA, an education program for people who have decided to become Catholic (during the months before confirmation), and in pre-Cana, another program for couples who plan to be married in the church ...
Epistemic status: This post is a distillation of many comments and posts. I don't think my list of problems is the best organization of the sub-problems. I would like to make it shorter and simpler (elegant theories are generally simple, unified theories) by identifying only two or three main problems, without aggregating problems that have different gear-level mechanisms, but I am currently too confused to do so. Note that this post is not intended to address the potential negative impact of RLHF research on the world, but rather to identify the key technical gaps that need to be addressed for an effective alignment solution. Many thanks to Walter Laurito, Fabien Roger, Ben Hayum, and Justis Mills for useful feedback.
RLHF tldr: We need a reward function,...
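The reward-function step the tldr refers to is standardly learned from pairwise human preferences with a Bradley-Terry / logistic loss: minimize −log σ(r(chosen) − r(rejected)). Here is a minimal sketch with a linear toy reward model and hypothetical made-up feature vectors (real RLHF uses a neural reward model over text, not 2-d features):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit a linear reward r(x) = w . x from (chosen, rejected) feature pairs,
    minimizing the Bradley-Terry loss -log sigmoid(r(chosen) - r(rejected))."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = sum(wi * (c - r) for wi, c, r in zip(w, chosen, rejected))
            grad_scale = sigmoid(margin) - 1.0   # d(-log sigmoid(m))/dm
            for i in range(dim):
                w[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return w

# Hypothetical comparisons: raters consistently prefer higher first-feature.
pairs = [([1.0, 0.2], [0.0, 0.8]),
         ([0.9, 0.5], [0.1, 0.4]),
         ([0.8, 0.1], [0.2, 0.9])]
w = train_reward_model(pairs, dim=2)

# The learned reward should rank every chosen sample above its rejected one.
print(all(sum(wi * c for wi, c in zip(w, ch)) >
          sum(wi * r for wi, r in zip(w, rj))
          for ch, rj in pairs))  # → True
```

The learned r then serves as the optimization target for the policy, which is exactly where the reward-misspecification problems discussed in this post enter.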
The model has been shaped to maximize its reward by any means necessary, even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the "playing the training game" scenario.
This conclusion seems unwarranted. What we have observed is (Paul claiming the existence of) an optimized model which ~always brings up weddings. On what basis does one infer that "the model has been shaped to maximize its reward by any means necessary"? This is likewise not weak evidence for playing the training game.
Starting in 2008, Robin Hanson and Eliezer Yudkowsky debated the likelihood of FOOM: a rapid and localized increase in some AI's intelligence that occurs because an AI recursively improves itself.
As Yudkowsky summarizes his position:
I think that, at some point in the development of Artificial Intelligence, we are likely to see a fast, local increase in capability—“AI go FOOM.” Just to be clear on the claim, “fast” means on a timescale of weeks or hours rather than years or decades; and “FOOM” means way the hell smarter than anything else around, capable of delivering in short time periods technological advancements that would take humans decades, probably including full-scale molecular nanotechnology. (FOOM, 235)
Over the course of this debate, both Hanson and Yudkowsky made a number of incidental predictions about...
Two and a half years ago, I wrote Extrapolating GPT-N performance, trying to predict how fast scaled-up models would improve on a few benchmarks. One year ago, I added PaLM to the graphs. Another spring has come and gone, and there are new models to add to the graphs: PaLM-2 and GPT-4. (Though I only know GPT-4's performance on a small handful of benchmarks.)
In previous iterations of the graph, the x-position represented the loss on GPT-3's validation set, and the x-axis was annotated with estimates of size+data that you'd need to achieve that loss according to the Kaplan scaling laws. (When adding PaLM to the graph, I estimated its loss using those same Kaplan scaling laws.)
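For concreteness, the Kaplan-style estimate referred to above can be sketched as follows. The joint law from Kaplan et al. (2020) is L(N, D) = ((N_c/N)^(α_N/α_D) + D_c/D)^(α_D); the constants below are the paper's published fits and should be treated as approximate, and the GPT-3-scale inputs are illustrative:

```python
# Fitted constants from Kaplan et al. (2020); approximate.
A_N = 0.076    # alpha_N: parameter-count exponent
A_D = 0.095    # alpha_D: dataset-size exponent
N_C = 8.8e13   # critical parameter count
D_C = 5.4e13   # critical token count

def kaplan_loss(n_params: float, n_tokens: float) -> float:
    """Predicted validation loss (nats/token) for a model with n_params
    non-embedding parameters trained on n_tokens tokens."""
    return ((N_C / n_params) ** (A_N / A_D) + D_C / n_tokens) ** A_D

# Illustrative query at roughly GPT-3 scale (~175B params, ~300B tokens):
print(round(kaplan_loss(175e9, 300e9), 3))
```

Estimating a model's x-position then amounts to evaluating this formula at the model's (reported or guessed) parameter and token counts, which is the procedure used for PaLM above.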
In these new iterations, the x-position instead represents...
30,000-foot takeaway: we're less than ~2 OOMs away from 95% performance. That passes the sniff test, and is also scary/exciting.
When discussing AI, we often equate technological proficiency with an understanding of AI's broader implications. This notion is as misguided as believing that a master printing-press mechanic from Gutenberg's era would have been best equipped to predict the printing press's impacts.
The AI development field, represented by experts like Geoffrey Hinton (proponent of safety measures and potential regulation) or Yann LeCun (an advocate of AI acceleration with fewer restrictions), primarily concerns itself with technical prowess and the creation of AI algorithms. In contrast, understanding the expansive implications of AI necessitates a comprehensive, multidisciplinary approach encompassing areas like warfare, game theory, sociology, philosophy, ethics, economics, and law, among others.
In this context, the debate between Yann LeCun and historian-author Yuval Noah Harari offers a glimpse into these...
Are you saying that outside experts were better at understanding potential consequences in these cases? I have trouble believing it.
It seems like just 4 months ago you still endorsed your second power-seeking paper:
Why are you now "fantasizing" about retracting it?...