The Goddess of Everything Else - The Animation
This is an animation of The Goddess of Everything Else, by Scott Alexander. I hope you enjoy it :)
This is an animation of The Goddess of Everything Else, by Scott Alexander. I hope you enjoy it :)
In this video, we explain how Anthropic trained "sleeper agent" AIs to study deception. A "sleeper agent" is an AI model that behaves normally until it encounters a specific trigger in the prompt, at which point it awakens and executes a harmful behavior. Anthropic found that they couldn't undo the...
In this video, we walk you through a plausible scenario in which AI could lead to humanity’s extinction. There are many alternative possibilities, but this time we focus on superhuman AIs developing misaligned personas, similar to how Microsoft’s Bing Chat developed the misaligned “Sydney” persona shortly after its release. This...
In the future, AIs will likely be much smarter than we are. They'll produce outputs that may be difficult for humans to evaluate, either because evaluation is too labor-intensive, or because it's qualitatively hard to judge the actions of machines smarter than us. This is the problem of “scalable oversight.”...
Rational Animations takes a look at Tom Davidson's Takeoff Speeds model (https://takeoffspeeds.com). The model uses formulas from economics to answer two questions: how long do we have until AI automates 100% of human cognitive labor, and how fast will that transition happen? The primary scriptwriter was Allen Liu (the first...
The video is about extrapolating the future of AI progress, following a timeline that starts from today’s chatbots to future AI that’s vastly smarter than all of humanity combined–with God-like capabilities. We argue that such AIs will pose a significant extinction risk to humanity. This video came out of a...
In this Rational Animations video, we look at dangerous knowledge: information hazards (infohazards) and external information hazards (exfohazards). We talk about one way they can be classified, what kinds of dangers they pose, and the dangers that come from too much secrecy. The primary scriptwriter was Allen Liu (the first...
Below is Rational Animations' new video about Goal Misgeneralization. It explores the topic through three lenses: * How humans are an example of goal misgeneralization with respect to evolution's implicit goals. * An example of goal misgeneralization in a very simple AI setting. * How deceptive alignment shares key features...
This is very late, but I want to acknowledge that the discussion about the UAT in this thread seems broadly correct to me, although the script's main author disagreed when I last pinged him about this in May. And yeah, it was an honest mistake. Internally, we try quite hard to make everything true and not misleading, and the scripts and storyboards go through multiple rounds of feedback. We absolutely do not want to be deceptive.
In this video, we explain how Anthropic trained "sleeper agent" AIs to study deception. A "sleeper agent" is an AI model that behaves normally until it encounters a specific trigger in the prompt, at which point it awakens and executes a harmful behavior. Anthropic found that they couldn't undo the sleeper agent training using standard safety training, but they could detect sleeper agents through a simple interpretability technique.
The main author of the script is John Burden. You can find it below.
Imagine an AI system tasked with governing nuclear power plants. It does this safely and reliably for many years, leading to widespread deployment all over the world. Then one day, seemingly out... (read 2094 more words →)
In this video, we walk you through a plausible scenario in which AI could lead to humanity’s extinction. There are many alternative possibilities, but this time we focus on superhuman AIs developing misaligned personas, similar to how Microsoft’s Bing Chat developed the misaligned “Sydney” persona shortly after its release. This video was inspired by this thread by @Richard_Ngo. You can find the script below.
In previous videos, we talked about how misaligned AI systems could cause catastrophes or end human civilization. This could happen in many different ways. In this video, we’ll sketch one possible scenario.
Suppose that, in the near future, human-level AIs are widely deployed across society to automate a wide range... (read 1842 more words →)
In the future, AIs will likely be much smarter than we are. They'll produce outputs that may be difficult for humans to evaluate, either because evaluation is too labor-intensive, or because it's qualitatively hard to judge the actions of machines smarter than us. This is the problem of “scalable oversight.” Proposed solutions include “debate” and iterated amplification. But how can we run experiments today to see whether these ideas actually work in practice?
In this video, we cover Ajeya Cotra’s “sandwiching” proposal: asking non-experts to align a model that is smarter than they are but less smart than a group of experts, and seeing how well they do. We then show how Sam... (read 2474 more words →)
I’m about 2/3 of the way through watching “Orb: On the Movements of the Earth.” It’s an anime about heliocentrism. It’s not the real story of the idea, but it’s not that far off, either. It has different characters and perhaps a slightly different Europe. I was somehow hesitant to start it, but it’s very good! I don’t think I’ve ever watched a series that’s as much about science as this one.
Rational Animations takes a look at Tom Davidson's Takeoff Speeds model (https://takeoffspeeds.com). The model uses formulas from economics to answer two questions: how long do we have until AI automates 100% of human cognitive labor, and how fast will that transition happen? The primary scriptwriter was Allen Liu (the first author of this post), with feedback from the second author (Writer), other members of the Rational Animations team, and external reviewers. Production credits are at the end of the video. You can find the script of the video below.
How long do we have until AI will be able to take over the world? AI technology is hurtling forward. We’ve previously argued that a... (read 1973 more words →)
That's fair, we wrote that part before DeepSeek became a "top lab" and we failed to notice there was an adjustment to make
It's true that a video ending with a general "what to do" section instead of a call-to-action to ControlAI would have been more likely to stand the test of time (it wouldn't be tied to the reputation of one specific organization or to how good a specific action seemed at one moment in time). But... did you write this because you have reservations about ControlAI in particular, or would you have written it about any other company?
Also, I want to make sure I understand what you mean by "betraying people's trust." Is it something like, "If in the future ControlAI does something bad, then, from the POV of our viewers, that means that they can't trust what they watch on the channel anymore?"
The video is about extrapolating the future of AI progress, following a timeline that starts from today’s chatbots to future AI that’s vastly smarter than all of humanity combined–with God-like capabilities. We argue that such AIs will pose a significant extinction risk to humanity.
This video came out of a partnership between Rational Animations and ControlAI. The script was written by Arthur Frost (one of Rational Animations’ writers) with Andrea Miotti as an adaptation of key points from The Compendium (thecompendium.ai), with extensive feedback and rounds of iteration from ControlAI. ControlAI is working to raise public awareness of AI extinction risk—moving the conversation forward to encourage governments to take action.
You can find the... (read 2426 more words →)
But the "unconstrained text responses" part is still about asking the model for its preferences even if the answers are unconstrained.
That just shows that the results of different ways of eliciting its values remain sorta consistent with each other, although I agree it constitutes stronger evidence.
Perhaps a more complete test would be to analyze whether its day to day responses to users are somehow consistent with its stated preferences and analyzing its actions in settings in which it can use tools to produce outcomes in very open-ended scenarios that contain stuff that could make the model act on its values.
Thanks! I already don't feel as impressed by the paper as I was while writing the shortform and I feel a little embarrassed for not thinking through things a little bit more before posting my reactions, although at least now there's some discussion under the linkpost so I don't entirely regret my comment if it prompted people to give their takes. I still feel to have updated in a non-negligible way from the paper though, so maybe I'm still not as pessimistic about it as other people. I'd definitely be interested in your thoughts if you find discourse is still lacking in a week or two.
I'd guess an important caveat might be that stated preferences being coherent doesn't immediately imply that behavior in other situations will be consistent with those preferences. Still, this should be an update towards agentic AI systems in the near future being goal-directed in the spooky consequentialist sense.
Surprised that there's no linkpost about Dan H's new paper on Utility Engineering. It looks super important, unless I'm missing something. LLMs are now utility maximisers? For real? We should talk about it: https://x.com/DanHendrycks/status/1889344074098057439
I feel weird about doing a link post since I mostly post updates about Rational Animations, but if no one does it, I'm going to make one eventually.
Also, please tell me if you think this isn't as important as it looks to me somehow.
EDIT: Ah! Here it is! https://www.lesswrong.com/posts/SFsifzfZotd3NLJax/utility-engineering-analyzing-and-controlling-emergent-value thanks @Matrice Jacobine!
In this Rational Animations video, we look at dangerous knowledge: information hazards (infohazards) and external information hazards (exfohazards). We talk about one way they can be classified, what kinds of dangers they pose, and the dangers that come from too much secrecy. The primary scriptwriter was Allen Liu (the first author of this post), with feedback from the second author (Writer), and other members of the Rational Animations team. Outside reviewers, including some authors of the cited sources, provided input as well. Production credits are at the end of the video. You can find the script of the video below.
“What you don’t know can’t hurt you”, or so the saying goes. In... (read 1383 more words →)
Below is Rational Animations' new video about Goal Misgeneralization. It explores the topic through three lenses:
You can find the script below, but first, an apology: I wanted Rational Animations to produce more technical AI safety videos in 2024, but we fell short of our initial goal. We managed only four videos about AI safety and eight videos in total. Two of them are narrative-focused, and the other two address older—though still relevant—papers. Our original plan was to publish videos on both... (read 1950 more words →)
Rational Animations' new video is an animation of The King and the Golem, by @Richard_Ngo, with minimal changes to the original text (we removed some dialogue tags). I hope you'll enjoy it!
Our new video is an adaptation of That Alien Message, by @Eliezer Yudkowsky. This time, the text has been significantly adapted, so I include it below. The author of the adaptation is Arthur Frost. Eliezer has reviewed the adaptation.
Picture a world just like ours, except the people are a fair bit smarter: in this world, Einstein isn’t one in a million, he’s one in a thousand. In fact, here he is now. He’s made all the same discoveries, but they’re not quite as unusual: there have been lots of other discoveries. Anyway, he’s out one night with a friend looking up at the stars when something odd happens. [visual: stars get... (read 2248 more words →)
This video is an adaptation of Max Roser's article, "The world is awful. The world is much better. The world can be much better." It's a simple yet important point that society at large underappreciates. Yet, it's a crucial aspect of humanity's trajectory and quite relevant to how we think about the future.
For me, perhaps the biggest takeaway from Aschenbrenner's manifesto is that even if we solve alignment, we still have an incredibly thorny coordination problem between the US and China, in which each is massively incentivized to race ahead and develop military power using superintelligence, putting them both and the rest of the world at immense risk. And I wonder if, after seeing this in advance, we can sit down and solve this coordination problem in ways that lead to a better outcome with a higher chance than the "race ahead" strategy and don't risk encountering a short period of incredibly volatile geopolitical instability in which both nations develop and possibly use never-seen-before weapons of mass destruction.
Edit: although I can see how attempts at intervening in any way and raising the salience of the issue risk making the situation worse.
Stories of AI takeover often involve some form of hacking. This seems like a pretty good reason for using (maybe relatively narrow) AI to improve software security worldwide. Luckily, the private sector should cover it in good measure for financial interests.
I also wonder if the balance of offense vs. defense favors defense here. Usually, recognizing is easier than generating, and this could apply to malicious software. We may have excellent AI antiviruses devoted to the recognizing part, while the AI attackers would have to do the generating part.
[Edit: I'm unsure about the second paragraph here. I'm feeling better about the first paragraph, especially given slow multipolar takeoff and similar, not sure about fast unipolar takeoff]
Yoshua Bengio is looking for postdocs for alignment work:
I am looking for postdocs, research engineers and research scientists who would like to join me in one form or another in figuring out AI alignment with probabilistic safety guarantees, along the lines of the research program described in my keynote (https://www.alignment-workshop.com/nola-2023) at the New Orleans December 2023 Alignment Workshop.
I am also specifically looking for a postdoc with a strong mathematical background (ideally an actual math or math+physics or math+CS degree) to take a leadership role in supervising the Mila research on probabilistic inference and GFlowNets, with applications in AI safety, system 2 deep learning, and AI for science.
Please contact me if you are interested.
Here's a new RA short about AI Safety: https://www.youtube.com/shorts/4LlGJd2OhdQ
This topic might be less relevant given today's AI industry and the fast advancements in robotics. But I also see shorts as a way to cover topics that I still think constitute fairly important context, but, for some reason, it wouldn't be the most efficient use of resources to cover in long forms.
RA has started producing shorts. Here's the first one using original animation and script: https://www.youtube.com/shorts/4xS3yykCIHU
The LW short-form feed seems like a good place for posting some of them.
Maybe obvious sci-fi idea: generative AI, but it generates human minds
Was Bing responding in Tibetan to some emojis already discussed on LW? I can't find a previous discussion about it here. I would have expected people to find this phenomenon after the SolidGoldMagikarp post, unless it's a new failure mode for some reason.
I watched the first episode of Pluto (about 1 hour long), and the second part of it is entirely about a blind old pianist and his robot butler, North N.2. I liked that part a lot and wanted to share a couple of interesting things that are in it (free of important spoilers):
1. The pianist kinda hates the robot, he's rude to it, and he's convinced the robot can't "truly" sing or play piano. Everything music-wise that comes out of the robot must be soulless.
2. The robot doesn't mind the rudeness, but it's also slightly adversarial to the pianist. It has its own goal of wanting to learn the piano. Despite the... (read more)