In the hope that it’s not too late for a course correction around AI, Nate and Eliezer have written a book making the detailed case for this unfortunate reality. The book will be available in September; you can preorder it now, or read endorsements, quotes, and reviews from scientists, national security officials, and more.
This post was written as part of AISC 2025.
Introduction
In Agents, Tools, and Simulators, we outlined several lenses for conceptualizing LLM-based AI, with the intention of defining what simulators are in contrast to their alternatives. This post will consider the alignment implications of each lens in order to establish how much we should care if LLMs are simulators or not. We conclude that agentic systems seem to have a high potential for misalignment, simulators have a mild to moderate risk, tools push the dangers elsewhere, and the potential for blended paradigms muddies this evaluation.
Alignment for Agents
The basic problem of AI alignment, under the agentic paradigm, can be summarized as follows:
Agreed: when discussing the alignment of simulators in this post, we are referring to safety from the subset of dangers related to unbounded optimization towards alien goals, which does not include everything within value alignment, let alone AI safety. But this qualification points to a subtle drift in how the word "alignment" is used in this post (towards something like "comprehension and internalization of human values"), which isn't good practice and is something I'll want to figure out how to edit or fix soon.
In Agents, Tools, and Simulators we outlined what it means to describe a system through each of these lenses and how they overlap. In the case of an agent/simulator, our central question is: which property is "driving the bus" with respect to the system's behavior, utilizing the other in its service?
Aligning Agents, Tools, and Simulators explores the implications of the above distinction, predicting different types of values (and thus behavior) from
Specifically, we expect simulator-first systems to have holistic goals...
[Crossposted on Windows On Theory]
Throughout history, technological and scientific advances have had both good and ill effects, but their overall impact has been overwhelmingly positive. Thanks to scientific progress, most people on earth live longer, healthier, and better than they did centuries or even decades ago.
I believe that AI (including AGI and ASI) can do the same and be a positive force for humanity. I also believe that it is possible to solve the “technical alignment” problem and build AIs that follow the words and intent of our instructions and report faithfully on their actions and observations.
I will not defend these two claims here. However, even granting these optimistic premises, AI’s positive impact is not guaranteed. In this essay, I will:
On the "safety underinvestment" point, I am going to say something that I think is obvious (and has probably been discussed before) but that I have not personally seen anyone advocate for:
The DoD should be conducting its own safety/alignment research, and pushing for this should be one of the primary goals of safety advocates.
There is this constant push for labs to invest more in safety. At the same time, we all acknowledge this is a zero-sum game. Every dollar/joule/flop they put into safety/alignment is a dollar/joule/flop they can't put into capabilities rese...
Conjecture: when there is regime change, the default outcome is for a faction to take over—whichever faction is best prepared to seize power by force.
One example: The Iranian Revolution of 1978-1979. In the years leading up to the revolution, there was turmoil and broad hostility towards the Shah across many sectors of the population. These hostilities ultimately combined into an escalating cycle of protest, crackdown, and further protest from more sectors (street demonstrations, worker strikes). Finally, the popular support for Khomeini as the flag-bearer of the broad-based revolution was enough to get the armed forces to defect, ending the Shah's rule.
From the Britannica article on the aftermath:
...On April 1, following overwhelming support in a national referendum, Khomeini declared Iran an Islamic republic. Elements within the clergy
Ok, thanks for the info. (For the record, these do not sound like what I would remotely call "strokes of genius".)
I'll explain my reasoning in a second, but I'll start with the conclusion:
I think it'd be healthy and good to pause and seriously reconsider the focus on doom if we get to 2028 and the situation feels basically like it does today.
I don't know how to really precisely define "basically like it does today". I'll try to offer some pointers in a bit. I'm hoping folk will chime in and suggest some details.
Also, I don't mean to challenge the doom focus right now. There seems to be some good momentum with AI 2027 and the Eliezer/Nate book. I even preordered the latter.
But I'm still guessing this whole approach is at least partly misguided. And I'm guessing that fact will show up in 2028 as "Oh, huh, looks...
What if the authors weren’t a subset of the community at all? What if they’d never heard of LessWrong, somehow?
Wouldn't that not change it very much, because the community signal-boosting a claim from outside the community still fits the pattern?
Take: Exploration hacking should not be used as a synonym for deceptive alignment.
(I have observed one such usage)
Deceptive alignment may be a very particular kind of exploration hacking, but the term exploration hacking (without further specification) should refer to models deliberately sandbagging "intermediate" capabilities during RL training to avoid learning a "full" capability.
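To make the mechanism concrete, here is a toy sketch (my own illustrative example with made-up numbers, not drawn from any particular training setup) of why sandbagging during exploration blocks learning: in on-policy RL, an action the policy essentially never samples essentially never receives a gradient update, so a withheld "full" capability is never reinforced.

```python
import numpy as np

# Toy 2-armed bandit, purely for illustration: arm 1 stands in for the "full"
# capability the model is sandbagging; arm 0 is the "safe" behaviour it shows.
rng = np.random.default_rng(0)
logits = np.array([4.0, -6.0])   # policy puts ~0.005% probability on arm 1
rewards = np.array([0.1, 1.0])   # even though arm 1 would earn far more reward
baseline, lr = 0.0, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)            # exploration is driven by the policy itself
    r = rewards[a]
    advantage = r - baseline
    baseline += 0.05 * (r - baseline)     # running-mean baseline
    grad = -probs
    grad[a] += 1.0                        # gradient of log pi(a) w.r.t. the logits
    logits += lr * advantage * grad       # REINFORCE update

print(softmax(logits))  # arm 1 is (almost) never sampled, so its higher reward is
                        # (almost) never observed and the policy barely changes
```

The point is just that on-policy RL can only reinforce what actually gets sampled; a model that reliably declines to demonstrate a capability during training denies the optimizer the signal it would need to strengthen it.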
This is a two-post series on AI “foom” (this post) and “doom” (next post).
A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today’s, a world utterly unprepared for this...
1.3.1 Existence proof: the human cortex
So unfortunately this is one of those arguments that rapidly descends into which prior you should apply and how you should update on what evidence, but.
Your entire post basically hinges on this point, and I find it unconvincing. Bionets are very strange beasts that cannot even implement backprop in the way we're used to; it's not remotely obvious that we would recognize known algorithms even if they were what the cortex amounted to. I will confess that I'm not a professional neuroscientist, but Beren Millidge is and...
Greedy-Advantage-Aware RLHF addresses the problem of negative side effects from misspecified reward functions in language-modeling domains. In a simple setting, the algorithm improves on traditional RLHF methods by producing agents with a reduced tendency to exploit misspecified reward functions. I also detect sharp parameter topology in reward-hacking agents, which suggests future research directions. The repository for the project can be found here.
In the famous short story The Monkey's Paw by W.W. Jacobs, the White family receives a well-traveled friend of theirs, Sergeant-Major Morris, who brings with him a talisman from his time in India: a mummified monkey's paw. Sergeant-Major Morris reveals that the paw has a magical ability to grant wishes, but cautions against using its power. The family does...
Is it OK to compute the advantage function as a difference of value functions?
To my understanding, the advantage in PPO is not simply the difference between value functions; it is computed with GAE, which depends on the value and reward terms of later steps in the sampled trajectory.
Shouldn't we necessarily use that estimator during PPO training?
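For concreteness, here is a minimal sketch of the standard GAE recursion as I understand it (variable names and the gamma/lam defaults are mine, not from any particular implementation):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single (non-terminating) trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]  (one extra entry for bootstrapping)
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # one-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# With lam = 0 this collapses to the one-step estimate
#   A_t = r_t + gamma * V(s_{t+1}) - V(s_t),
# whereas a bare "difference of value functions", values[1:] - values[:-1],
# would drop the observed rewards entirely.
```

If I have that right, the later value and reward terms enter through this recursion, so a plain difference of value functions is not the same thing; my question is whether that simpler estimate is acceptable in practice, or whether full GAE is effectively required for PPO training.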