The Best of LessWrong

When posts are more than a year old, the LessWrong community reviews and votes on how well they have stood the test of time. These are the posts that have ranked highest across all years since 2018 (when our annual tradition of choosing the least wrong of LessWrong began).

For the years 2018, 2019 and 2020 we also published physical books with the results of our annual vote, which you can buy and learn more about here.

Rationality

Eliezer Yudkowsky
Local Validity as a Key to Sanity and Civilization
Buck
"Other people are wrong" vs "I am right"
Mark Xu
Strong Evidence is Common
TsviBT
Please don't throw your mind away
Raemon
Noticing Frame Differences
johnswentworth
You Are Not Measuring What You Think You Are Measuring
johnswentworth
Gears-Level Models are Capital Investments
Hazard
How to Ignore Your Emotions (while also thinking you're awesome at emotions)
Scott Garrabrant
Yes Requires the Possibility of No
Ben Pace
A Sketch of Good Communication
Eliezer Yudkowsky
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases
Duncan Sabien (Deactivated)
Lies, Damn Lies, and Fabricated Options
Scott Alexander
Trapped Priors As A Basic Problem Of Rationality
Duncan Sabien (Deactivated)
Split and Commit
Duncan Sabien (Deactivated)
CFAR Participant Handbook now available to all
johnswentworth
What Are You Tracking In Your Head?
Mark Xu
The First Sample Gives the Most Information
Duncan Sabien (Deactivated)
Shoulder Advisors 101
Scott Alexander
Varieties Of Argumentative Experience
Eliezer Yudkowsky
Toolbox-thinking and Law-thinking
alkjash
Babble
Zack_M_Davis
Feature Selection
abramdemski
Mistakes with Conservation of Expected Evidence
Kaj_Sotala
The Felt Sense: What, Why and How
Duncan Sabien (Deactivated)
Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Ben Pace
The Costly Coordination Mechanism of Common Knowledge
Jacob Falkovich
Seeing the Smoke
Duncan Sabien (Deactivated)
Basics of Rationalist Discourse
alkjash
Prune
johnswentworth
Gears vs Behavior
Elizabeth
Epistemic Legibility
Daniel Kokotajlo
Taboo "Outside View"
Duncan Sabien (Deactivated)
Sazen
AnnaSalamon
Reality-Revealing and Reality-Masking Puzzles
Eliezer Yudkowsky
ProjectLawful.com: Eliezer's latest story, past 1M words
Eliezer Yudkowsky
Self-Integrity and the Drowning Child
Jacob Falkovich
The Treacherous Path to Rationality
Scott Garrabrant
Tyranny of the Epistemic Majority
alkjash
More Babble
abramdemski
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems
Raemon
Being a Robust Agent
Zack_M_Davis
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists
Benquo
Reason isn't magic
habryka
Integrity and accountability are core parts of rationality
Raemon
The Schelling Choice is "Rabbit", not "Stag"
Diffractor
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Raemon
Propagating Facts into Aesthetics
johnswentworth
Simulacrum 3 As Stag-Hunt Strategy
LoganStrohl
Catching the Spark
Jacob Falkovich
Is Rationalist Self-Improvement Real?
Benquo
Excerpts from a larger discussion about simulacra
Zvi
Simulacra Levels and their Interactions
abramdemski
Radical Probabilism
sarahconstantin
Naming the Nameless
AnnaSalamon
Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality"
Eric Raymond
Rationalism before the Sequences
Owain_Evans
The Rationalists of the 1950s (and before) also called themselves “Rationalists”
Raemon
Feedbackloop-first Rationality
LoganStrohl
Fucking Goddamn Basics of Rationalist Discourse
Raemon
Tuning your Cognitive Strategies
johnswentworth
Lessons On How To Get Things Right On The First Try

Optimization

So8res
Focus on the places where you feel shocked everyone's dropping the ball
Jameson Quinn
A voting theory primer for rationalists
sarahconstantin
The Pavlov Strategy
Zvi
Prediction Markets: When Do They Work?
johnswentworth
Being the (Pareto) Best in the World
alkjash
Is Success the Enemy of Freedom? (Full)
johnswentworth
Coordination as a Scarce Resource
AnnaSalamon
What should you change in response to an "emergency"? And AI risk
jasoncrawford
How factories were made safe
HoldenKarnofsky
All Possible Views About Humanity's Future Are Wild
jasoncrawford
Why has nuclear power been a flop?
Zvi
Simple Rules of Law
Scott Alexander
The Tails Coming Apart As Metaphor For Life
Zvi
Asymmetric Justice
Jeffrey Ladish
Nuclear war is unlikely to cause human extinction
Elizabeth
Power Buys You Distance From The Crime
Eliezer Yudkowsky
Is Clickbait Destroying Our General Intelligence?
Spiracular
Bioinfohazards
Zvi
Moloch Hasn’t Won
Zvi
Motive Ambiguity
Benquo
Can crimes be discussed literally?
johnswentworth
When Money Is Abundant, Knowledge Is The Real Wealth
GeneSmith
Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible
HoldenKarnofsky
This Can't Go On
Said Achmiz
The Real Rules Have No Exceptions
Lars Doucet
Lars Doucet's Georgism series on Astral Codex Ten
johnswentworth
Working With Monsters
jasoncrawford
Why haven't we celebrated any major achievements lately?
abramdemski
The Credit Assignment Problem
Martin Sustrik
Inadequate Equilibria vs. Governance of the Commons
Scott Alexander
Studies On Slack
KatjaGrace
Discontinuous progress in history: an update
Scott Alexander
Rule Thinkers In, Not Out
Raemon
The Amish, and Strategic Norms around Technology
Zvi
Blackmail
HoldenKarnofsky
Nonprofit Boards are Weird
Wei Dai
Beyond Astronomical Waste
johnswentworth
Making Vaccine
jefftk
Make more land
jenn
Things I Learned by Spending Five Thousand Hours In Non-EA Charities
Richard_Ngo
The ants and the grasshopper
So8res
Enemies vs Malefactors
Elizabeth
Change my mind: Veganism entails trade-offs, and health is one of the axes

World

Kaj_Sotala
Book summary: Unlocking the Emotional Brain
Ben
The Redaction Machine
Samo Burja
On the Loss and Preservation of Knowledge
Alex_Altair
Introduction to abstract entropy
Martin Sustrik
Swiss Political System: More than You ever Wanted to Know (I.)
johnswentworth
Interfaces as a Scarce Resource
eukaryote
There’s no such thing as a tree (phylogenetically)
Scott Alexander
Is Science Slowing Down?
Martin Sustrik
Anti-social Punishment
johnswentworth
Transportation as a Constraint
Martin Sustrik
Research: Rescuers during the Holocaust
GeneSmith
Toni Kurz and the Insanity of Climbing Mountains
johnswentworth
Book Review: Design Principles of Biological Circuits
Elizabeth
Literature Review: Distributed Teams
Valentine
The Intelligent Social Web
eukaryote
Spaghetti Towers
Eli Tyre
Historical mathematicians exhibit a birth order effect too
johnswentworth
What Money Cannot Buy
Bird Concept
Unconscious Economics
Scott Alexander
Book Review: The Secret Of Our Success
johnswentworth
Specializing in Problems We Don't Understand
KatjaGrace
Why did everything take so long?
Ruby
[Answer] Why wasn't science invented in China?
Scott Alexander
Mental Mountains
L Rudolf L
A Disneyland Without Children
johnswentworth
Evolution of Modularity
johnswentworth
Science in a High-Dimensional World
Kaj_Sotala
My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms
Kaj_Sotala
Building up to an Internal Family Systems model
Steven Byrnes
My computational framework for the brain
Natália
Counter-theses on Sleep
abramdemski
What makes people intellectually active?
Bucky
Birth order effect found in Nobel Laureates in Physics
zhukeepa
How uniform is the neocortex?
JackH
Anti-Aging: State of the Art
Vaniver
Steelmanning Divination
KatjaGrace
Elephant seal 2
Zvi
Book Review: Going Infinite
Rafael Harth
Why it's so hard to talk about Consciousness
Duncan Sabien (Deactivated)
Social Dark Matter
Eric Neyman
How much do you believe your results?
Malmesbury
The Talk: a brief explanation of sexual dimorphism
moridinamael
The Parable of the King and the Random Process
Henrik Karlsson
Cultivating a state of mind where new ideas are born

Practical


AI Strategy

paulfchristiano
Arguments about fast takeoff
Eliezer Yudkowsky
Six Dimensions of Operational Adequacy in AGI Projects
Ajeya Cotra
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
paulfchristiano
What failure looks like
Daniel Kokotajlo
What 2026 looks like
gwern
It Looks Like You're Trying To Take Over The World
Daniel Kokotajlo
Cortés, Pizarro, and Afonso as Precedents for Takeover
Daniel Kokotajlo
The date of AI Takeover is not the day the AI takes over
Andrew_Critch
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
paulfchristiano
Another (outer) alignment failure story
Ajeya Cotra
Draft report on AI timelines
Eliezer Yudkowsky
Biology-Inspired AGI Timelines: The Trick That Never Works
Daniel Kokotajlo
Fun with +12 OOMs of Compute
Wei Dai
AI Safety "Success Stories"
Eliezer Yudkowsky
Pausing AI Developments Isn't Enough. We Need to Shut it All Down
HoldenKarnofsky
Reply to Eliezer on Biological Anchors
Richard_Ngo
AGI safety from first principles: Introduction
johnswentworth
The Plan
Rohin Shah
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
lc
What an actually pessimistic containment strategy looks like
Eliezer Yudkowsky
MIRI announces new "Death With Dignity" strategy
KatjaGrace
Counterarguments to the basic AI x-risk case
Adam Scholl
Safetywashing
habryka
AI Timelines
evhub
Chris Olah’s views on AGI safety
So8res
Comments on Carlsmith's “Is power-seeking AI an existential risk?”
nostalgebraist
human psycholinguists: a critical appraisal
nostalgebraist
larger language models may disappoint you [or, an eternally unfinished draft]
Orpheus16
Speaking to Congressional staffers about AI risk
Tom Davidson
What a compute-centric framework says about AI takeoff speeds
abramdemski
The Parable of Predict-O-Matic
KatjaGrace
Let’s think about slowing down AI
Daniel Kokotajlo
Against GDP as a metric for timelines and takeoff speeds
Joe Carlsmith
Predictable updating about AI risk
Raemon
"Carefully Bootstrapped Alignment" is organizationally hard
KatjaGrace
We don’t trade with ants

Technical AI Safety

paulfchristiano
Where I agree and disagree with Eliezer
Eliezer Yudkowsky
Ngo and Yudkowsky on alignment difficulty
Andrew_Critch
Some AI research areas and their relevance to existential safety
1a3orn
EfficientZero: How It Works
elspood
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
So8res
Decision theory does not imply that we get to have nice things
Vika
Specification gaming examples in AI
Rafael Harth
Inner Alignment: Explain like I'm 12 Edition
evhub
An overview of 11 proposals for building safe advanced AI
TurnTrout
Reward is not the optimization target
johnswentworth
Worlds Where Iterative Design Fails
johnswentworth
Alignment By Default
johnswentworth
How To Go From Interpretability To Alignment: Just Retarget The Search
Alex Flint
Search versus design
abramdemski
Selection vs Control
Buck
AI Control: Improving Safety Despite Intentional Subversion
Eliezer Yudkowsky
The Rocket Alignment Problem
Eliezer Yudkowsky
AGI Ruin: A List of Lethalities
Mark Xu
The Solomonoff Prior is Malign
paulfchristiano
My research methodology
TurnTrout
Reframing Impact
Scott Garrabrant
Robustness to Scale
paulfchristiano
Inaccessible information
TurnTrout
Seeking Power is Often Convergently Instrumental in MDPs
So8res
A central AI alignment problem: capabilities generalization, and the sharp left turn
evhub
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
paulfchristiano
The strategy-stealing assumption
So8res
On how various plans miss the hard bits of the alignment challenge
abramdemski
Alignment Research Field Guide
johnswentworth
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
Buck
Language models seem to be much better than humans at next-token prediction
abramdemski
An Untrollable Mathematician Illustrated
abramdemski
An Orthodox Case Against Utility Functions
Veedrac
Optimality is the tiger, and agents are its teeth
Sam Ringer
Models Don't "Get Reward"
Alex Flint
The ground of optimization
johnswentworth
Selection Theorems: A Program For Understanding Agents
Rohin Shah
Coherence arguments do not entail goal-directed behavior
abramdemski
Embedded Agents
evhub
Risks from Learned Optimization: Introduction
nostalgebraist
chinchilla's wild implications
johnswentworth
Why Agent Foundations? An Overly Abstract Explanation
zhukeepa
Paul's research agenda FAQ
Eliezer Yudkowsky
Coherent decisions imply consistent utilities
paulfchristiano
Open question: are minimal circuits daemon-free?
evhub
Gradient hacking
janus
Simulators
LawrenceC
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
TurnTrout
Humans provide an untapped wealth of evidence about alignment
Neel Nanda
A Mechanistic Interpretability Analysis of Grokking
Collin
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
evhub
Understanding “Deep Double Descent”
Quintin Pope
The shard theory of human values
TurnTrout
Inner and outer alignment decompose one hard problem into two extremely hard problems
Eliezer Yudkowsky
Challenges to Christiano’s capability amplification proposal
Scott Garrabrant
Finite Factored Sets
paulfchristiano
ARC's first technical report: Eliciting Latent Knowledge
Diffractor
Introduction To The Infra-Bayesianism Sequence
TurnTrout
Towards a New Impact Measure
LawrenceC
Natural Abstractions: Key Claims, Theorems, and Critiques
Zack_M_Davis
Alignment Implications of LLM Successes: a Debate in One Act
johnswentworth
Natural Latents: The Math
TurnTrout
Steering GPT-2-XL by adding an activation vector
Jessica Rumbelow
SolidGoldMagikarp (plus, prompt generation)
So8res
Deep Deceptiveness
Charbel-Raphaël
Davidad's Bold Plan for Alignment: An In-Depth Explanation
Charbel-Raphaël
Against Almost Every Theory of Impact of Interpretability
Joe Carlsmith
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Eliezer Yudkowsky
GPTs are Predictors, not Imitators
peterbarnett
Labs should be explicit about why they are building AGI
HoldenKarnofsky
Discussion with Nate Soares on a key alignment difficulty
Jesse Hoogland
Neural networks generalize because of this one weird trick
paulfchristiano
My views on “doom”
technicalities
Shallow review of live agendas in alignment & safety
Vanessa Kosoy
The Learning-Theoretic Agenda: Status 2023
ryan_greenblatt
Improving the Welfare of AIs: A Nearcasted Proposal
#1

As LLMs become more powerful, it'll be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, but have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. This paper develops and evaluates pipelines of safety protocols that are robust to intentional subversion.
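
To make "pipelines of safety protocols" concrete, here is a toy sketch of the trusted-monitoring pattern at the heart of this line of work. All names and thresholds are hypothetical stand-ins; the protocols in the paper (trusted editing, defer-to-trusted, auditing budgets, and the later defer-to-resample variant) are considerably more involved.

```python
# Toy sketch (hypothetical stand-ins, not the paper's implementation) of a
# trusted-monitoring control protocol: a capable but untrusted model proposes a
# solution, a weaker trusted model scores how suspicious it looks, and the most
# suspicious outputs are audited by humans or replaced by the trusted model.
from dataclasses import dataclass

SUSPICION_THRESHOLD = 8.0   # monitor score above which we stop trusting the output (assumed)

@dataclass
class Decision:
    solution: str
    route: str  # "accept", "audited", or "defer-to-trusted"

def control_step(task: str, untrusted, trusted, monitor, human_audit,
                 audits_remaining: int) -> tuple[Decision, int]:
    candidate = untrusted(task)            # possibly subverted output
    suspicion = monitor(task, candidate)   # trusted model's suspicion score
    if suspicion < SUSPICION_THRESHOLD:
        return Decision(candidate, "accept"), audits_remaining
    if audits_remaining > 0:               # spend scarce human oversight on the worst cases
        safe = human_audit(task, candidate)
        chosen = candidate if safe else trusted(task)
        return Decision(chosen, "audited"), audits_remaining - 1
    # Audit budget exhausted: fall back to the weaker but trusted model.
    return Decision(trusted(task), "defer-to-trusted"), audits_remaining
```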

63 · Buck
I think this paper was great. I'm very proud of it. It's a bit hard to separate out this paper from the follow-up post arguing for control, but I'll try to. This paper led to a bunch of research by us and other people; it helped AI control become (IMO correctly) one of the main strategies discussed for ensuring safety from scheming AIs. It was accepted as an oral at ICML 2024. AI companies and other researchers have since built on this work (Anthropic’s “Sabotage Evaluations”, Mathew et al “Hidden in Plain Text”; I collaborated on Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats; Redwood has also been working on a big follow up paper that should be out soon), and AI control has been framed by (Grosse, Buhl, Balesni, Clymer) as a bedrock approach for mitigating catastrophic risk from misaligned AI.  My main regret about this paper is that we didn't use defer-to-resample, a technique where you replace suspicious actions with a resample from the untrusted model (as discussed e.g. here). This probably would have been better than the other defer techniques we tried. I have more regrets about the follow-up post ("The case for ensuring...") than about this post; this post was more straightforward and less ambitious, and so gave us fewer opportunities to stick our necks out making arguments or introducing concepts that we'd later regret. I'm very excited for more follow-up work on this paper, and I'm working on mentoring such projects and sourcing funding for them.
58 · johnswentworth
I think control research has relatively little impact on X-risk in general, and wrote up the case against here. Basic argument: scheming of early transformative AGI is not a very large chunk of doom probability. The real problem is getting early AGI to actually solve the problems of aligning superintelligences, before building those superintelligences. That's a problem for which verification is hard, and solving the problem itself seems pretty hard too, so it's a particularly difficult type of problem to outsource to AI - and a particularly easy type of problem to trick oneself into thinking the AI has solved, when it hasn't.
#13

Lawrence, Erik, and Leon attempt to summarize the key claims of John Wentworth's natural abstractions agenda, formalize some of the mathematical proofs, outline how it aims to help with AI alignment, and critique gaps in the theory, relevance to alignment, and research methodology.

12 · Vanessa Kosoy
This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results and the applications to alignment. There's also reasonable criticism.

To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning, and see what kind of abstractions can or cannot be learned (both in terms of sample complexity and in terms of computational complexity).

Some thoughts about natural abstractions inspired by this post:
* The concept of natural abstractions seems closely related to my informally conjectured agreement theorem for infra-Bayesian physicalism. In a nutshell, two physicalist agents in the same universe with access to "similar" information should asymptotically arrive at similar beliefs (notably this is false for cartesian agents because of the different biases resulting from the different physical points of view).
* A possible formalization of the agreement theorem inspired by my richness of mathematics conjecture: Given two beliefs Ψ and Φ, we say that Ψ ⪯ Φ when some conditioning of Ψ on a finite set of observations produces a refinement of some conditioning of Φ on a finite set of observations (see linked shortform for mathematical details). This relation is a preorder. In general, we can expect an agent to learn a sequence of beliefs of the form Ψ_0 ≺ Ψ_1 ≺ Ψ_2 ≺ … Here, the sequence can be over physical time, or over time discount, or over a parameter such as "availability of computing resources" or "how much time the world allows you for thinking between decisions": the latter is the natural asymptotic for metacognitive agents (see also logical time). Given two agents, we get two such sequences {Ψ_i} and {Φ_i}. The agreement theorem can then state that for all i ∈ ℕ, there exists j
#14

Having become frustrated with the state of the discourse about AI catastrophe, Zack Davis writes both sides of the debate, with back-and-forth takes between Simplicia and Doomimir that aim to spell out stronger arguments from both sides.

45 · Zack_M_Davis
(Self-review.) I'm as proud of this post as I am disappointed that it was necessary. As I explained to my prereaders on 19 October 2023: I think the dialogue format works particularly well in cases like this where the author or the audience is supposed to find both viewpoints broadly credible, rather than an author avatar beating up on a strawman. (I did have some fun with Doomimir's characterization, but that shouldn't affect the arguments.) This is a complicated topic. To the extent that I was having my own doubts about the "orthodox" pessimist story in the GPT-4 era, it was liberating to be able to explore those doubts in public by putting them in the mouth of a character with the designated idiot character name without staking my reputation on Simplicia's counterarguments necessarily being correct. Giving both characters pejorative names makes it fair. In an earlier draft, Doomimir was "Doomer", but I was already using the "Optimistovna" and "Doomovitch" patronymics (I had been consuming fiction about the Soviet Union recently) and decided it should sound more Slavic. (Plus, "-mir" (мир) can mean "world".)
12 · Seth Herd
This post skillfully addressed IMO the most urgent issue in alignment: bridging the gap between doomers and optimists. If half of alignment thinkers think alignment is very difficult, while half think it's pretty achievable, decision-makers will be prone to just choose whichever expert opinion supports what they want to do anyway. This and its following acts are the best work I know of in refining the key cruxes. And they do so in a compact, readable, and even fun form.
#15

Evan et al argue for developing "model organisms of misalignment" - AI systems deliberately designed to exhibit concerning behaviors like deception or reward hacking. This would provide concrete examples to study potential AI safety issues and test mitigation strategies. The authors believe this research is timely and could help build scientific consensus around AI risks to inform policy discussions.

10 · Fabien Roger
This post describes a class of experiments that has proved very fruitful since it was released. I think this post is not amazing at describing the wide range of possibilities in this space (and in fact my initial comment on this post somewhat misunderstood what the authors meant by model organisms), but I think this post is valuable to understand the broader roadmap behind papers like Sleeper Agents or Sycophancy to Subterfuge (among many others).
#16

John Wentworth explains natural latents – a key mathematical concept in his approach to natural abstraction. Natural latents capture the "shared information" between different parts of a system in a provably optimal way. This post lays out the formal definitions and key theorems.
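
For readers who want the shape of the formalism before clicking through, here is a compressed paraphrase of the two defining conditions in the simplest two-variable case. This is my summary in standard information-theoretic notation, not the post's exact statement; the post works with diagrams and proves approximate versions of these conditions.

```latex
% My loose paraphrase of an exact natural latent $\Lambda$ over observables $X_1, X_2$:
%   Mediation:  the observables are independent given the latent.
%   Redundancy: the latent can be computed from either observable alone.
\begin{align}
  \text{(mediation)}  \quad & X_1 \perp X_2 \mid \Lambda \\
  \text{(redundancy)} \quad & H(\Lambda \mid X_1) = H(\Lambda \mid X_2) = 0
\end{align}
% The key results (again, stated loosely) say that any two latents satisfying both
% conditions over the same $X_1, X_2$ carry the same information, so a natural
% latent is essentially unique; the approximate theorems replace these exact
% conditions with error bounds measured in bits.
```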

38 · Jeremy Gillen
This post deserves to be remembered as a LessWrong classic.
1. It directly tries to solve a difficult and important cluster of problems (whether it succeeds is yet to be seen).
2. It uses a new diagrammatic method of manipulating sets of independence relations.
3. It's a technical result! These feel like they're getting rarer on LessWrong and should be encouraged.

There are several problems that are fundamentally about attaching very different world models together and transferring information from one to the other.
* Ontology identification involves taking a goal defined in an old ontology[1] and accurately translating it into a new ontology.
* High-level models and low-level models need to interact in a bounded agent. I.e. learning a high-level fact should influence your knowledge about low-level facts and vice versa.
* Value identification is the problem of translating values from a human to an AI. This is much like ontology identification, with the added difficulty that we don't get as much detailed access or control over the human world model.
* Interpretability is about finding recognisable concepts and algorithms in trained neural networks.

In general, we can solve these problems using shared variables and shared sub-structures that are present in both models.
* We can stitch together very different world models along shared variables. E.g. if you have two models of molecular dynamics, one faster and simpler than the other. You want to simulate in the fast one, then switch to the slow one when particular interactions happen. To transfer the state from one to the other you identify variables present in both models (probably atom locations, velocities, some others), then just copy these values to the other model. Under-specified variables must be inferred from priors.
* If you want to transfer a new concept from WM1 to a less knowledgeable WM2, you can do so by identifying the lower-level concepts that both WMs share, then constructing an "expla
#17

Alex Turner and collaborators show that you can modify GPT-2's behavior in surprising and interesting ways by just adding activation vectors to its forward pass. This technique requires no fine-tuning and allows fast, targeted modifications to model behavior. 
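
Since the summary is terse, here is a minimal sketch of what "adding an activation vector to the forward pass" can look like in practice. This is not the authors' code: it assumes Hugging Face transformers, an arbitrary layer choice, a simple "Love"-minus-"Hate" contrast pair, and only steers the prompt positions.

```python
# Minimal sketch (not the authors' implementation) of activation addition:
# compute a steering vector as the difference of residual-stream activations
# for two contrasting prompts, then add it into a chosen GPT-2 block's output.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which transformer block to steer at (illustrative choice)
COEFF = 4.0  # steering strength (illustrative choice)

def residual_after_block(prompt: str) -> torch.Tensor:
    """Residual-stream activations output by block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1]  # hidden_states[0] is the embedding output

# Steering vector: difference of activations for a contrast pair, truncated to a
# common length in case the prompts tokenize to different numbers of tokens.
a, b = residual_after_block("Love"), residual_after_block("Hate")
n = min(a.shape[1], b.shape[1])
steer = COEFF * (a[:, :n, :] - b[:, :n, :])

def hook(module, inputs, output):
    h = output[0]
    if h.shape[1] >= steer.shape[1]:      # only modify the full prompt pass
        h = h.clone()
        h[:, : steer.shape[1], :] += steer
    return (h,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I hate you because", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```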

#18

Researchers have discovered a set of "glitch tokens" that cause ChatGPT and other language models to produce bizarre, erratic, and sometimes inappropriate outputs. These tokens seem to break the models in unpredictable ways, leading to hallucinations, evasions, and other strange behaviors when the AI is asked to repeat them.
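
As a rough illustration of how such tokens were hunted for (my sketch, not the authors' code): one heuristic reported in this line of work is to look for tokens whose embeddings sit anomalously close to the centroid of the embedding matrix, a signature of tokens that were rarely or never updated during training.

```python
# Rough sketch (not the authors' code): rank GPT-2 tokens by the distance of
# their embedding from the centroid of the embedding matrix; under-trained
# "glitch" candidates tend to cluster unusually close to the centroid.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

emb = model.transformer.wte.weight.detach()        # (vocab_size, d_model) token embeddings
centroid = emb.mean(dim=0)
dist = torch.linalg.norm(emb - centroid, dim=1)    # per-token distance to the centroid

candidates = torch.argsort(dist)[:20]              # the 20 closest tokens
for idx in candidates.tolist():
    print(idx, repr(tok.decode([idx])))
```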

#21

There are some obvious ways you might try to train deceptiveness out of AIs. But deceptiveness can emerge from the recombination of non-deceptive cognitive patterns. As AI systems become more capable, they may find novel ways to be deceptive that weren't anticipated or trained against. The problem is that, in the underlying territory, "deceive the humans" is just very useful for accomplishing goals.

16 · Daniel Murfet
I like the emphasis in this post on the role of patterns in the world in shaping behaviour, the fact that some of those patterns incentivise misaligned behaviour such as deception, and further that our best efforts at alignment and control are themselves patterns that could have this effect. I also like the idea that our control systems (even if obscured from the agent) can present as "errors" with respect to which the agent is therefore motivated to learn to "error correct". This post and the sharp left turn are among the most important high-level takes on the alignment problem for shaping my own views on where the deep roots of the problem are. Although to be honest I had forgotten about this post, and therefore underestimated its influence on me, until performing this review (which caused me to update a recent article I wrote, the Queen's Dilemma, which is clearly a kind of retelling of one aspect of this story, with an appropriate reference). I assess it to be a substantial influence on me even so. I think this whole line of thought could be substantially developed, and with less reliance on stories, and that this would be useful.
#23

Charbel-Raphaël summarizes Davidad's plan: Use near-AGIs to build a detailed world simulation, then train and formally verify an AI that follows coarse preferences and avoids catastrophic outcomes.

33 · ryan_greenblatt
At the time when I first heard this agenda proposed, I was skeptical. I remain skeptical, especially about the technical work that has been done thus far on the agenda[1]. I think this post does a reasonable job of laying out the agenda and the key difficulties. However, when talking to Davidad in person, I've found that he often has more specific tricks and proposals than what was laid out in this post. I didn't find these tricks moved me very far, but I think they were helpful for understanding what is going on.

This post and Davidad's agenda overall would benefit from having concrete examples of how the approach might work in various cases, or more discussion of what would be out of scope (and why this could be acceptable). For instance, how would you make a superhumanly efficient (ASI-designed) factory that produces robots while proving safety? How would you allow for AIs piloting household robots to do chores (or is this out of scope)? How would you allow for the AIs to produce software that people run on their computers or to design physical objects that get manufactured? Given that this proposal doesn't allow for safely automating safety research, my understanding is that it is supposed to be a stable end state. Correspondingly, it is important to know what Davidad thinks can and can't be done with this approach.

My core disagreements are on the "Scientific Sufficiency Hypothesis" (particularly when considering computational constraints), "Model-Checking Feasibility Hypothesis" (and more generally on proving the relevant properties), and on the political feasibility of paying the needed tax even if the other components work out. It seems very implausible to me that making a sufficiently good simulation is as easy as building the Large Hadron Collider. I think the objection in this comment holds up (my understanding is Davidad would require that we formally verify everything on the computer).[2] As a concrete example, I found it quite implausible that you
10 · Charbel-Raphaël
Ok, time to review this post and assess the overall status of the project.

Review of the post

What I still appreciate about the post: I continue to appreciate its pedagogy, structure, and the general philosophy of taking a complex, lesser-known plan and helping it gain broader recognition. I'm still quite satisfied with the construction of the post—it's progressive and clearly distinguishes between what's important and what's not. I remember the first time I met Davidad. He sent me his previous post. I skimmed it for 15 minutes, didn't really understand it, and thought, "There's no way this is going to work." Then I reconsidered, thought about it more deeply, and realized there was something important here. Hopefully, this post succeeded in showing that there is indeed something worth exploring! I think such distillation and analysis are really important. I'm especially happy about the fact that we tried to elicit as much as we could from Davidad's model during our interactions, including his roadmap and some ideas of easy projects to get early empirical feedback on this proposal.

Current status of the agenda (I'm not the best person to write this; see it as an informal personal opinion)

Overall, Davidad performed much better than expected with his new job as program director at ARIA and got funded $74M over 4 years. And I still think this is the only plan that could enable the creation of a very powerful AI capable of performing a true pivotal act to end the acute risk period, and I think this last part is the added value of this plan, especially in the sense that it could be done in a somewhat ethical/democratic way compared to other forms of pivotal acts. However, it's probably not going to happen in time.

Are we on track? Weirdly, yes for the non-technical aspects, no for the technical ones. The post includes a roadmap with 4 stages, and we can check if we are on track. It seems to me that Davidad jumped directly to stage 3, without going through sta
#25

Charbel-Raphaël argues that interpretability research has poor theories of impact. It's not good for predicting future AI systems, can't actually audit for deception, lacks a clear end goal, and may be more harmful than helpful. He suggests other technical agendas that could be more impactful for reducing AI risk. 

79 · Charbel-Raphaël
Tldr: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of the SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have.

First, I believe the post's general motivation of red-teaming a big, established research agenda remains crucial. It's too easy to say, "This research agenda will help," without critically assessing how. I appreciate the post's general energy in asserting that if we're in trouble or not making progress, we need to discuss it. I still want everyone working on interpretability to read it and engage with its arguments.

Acknowledgments: Thanks to Epiphanie Gédéon, Fabien Roger, and Clément Dumas for helpful discussions.

Updates on my views

Legend: on the left of the arrow, a citation from the OP → on the right, my review, which generally begins with an emoji:
* ✅ - yes, I think I was correct (>90%)
* ❓✅ - I would lean towards yes (70%-90%)
* ❓ - unsure (between 30%-70%)
* ❓❌ - I would lean towards no (10%-30%)
* ❌ - no, I think I was basically wrong (<10%)
* ⭐ - important, you can skip the other sections

Here's my review section by section:

⭐ The Overall Theory of Impact is Quite Poor?
* "Whenever you want to do something with interpretability, it is probably better to do it without it" → ❓ I still think this is basically right, even if I'm not confident this will still be the case in the future; but as of today, I can't name a single mech-interpretability technique that does a better job at some non-intrinsic interpretability goal than the other more classical techniques, on a non-toy model task.
* "Interpretability is Not a Good Predictor of Future Systems" →
#27

Joe summarizes his new report on "scheming AIs" - advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (i.e. distinguishing "alignment faking" from "power-seeking"), and asks what the prerequisites for scheming are and by which paths they might arise.

17 · Fabien Roger
I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and was full of pretty bad arguments. This paper fixed that by bringing together most (all?) main considerations for and against expecting scheming to emerge. I found this helpful to clarify my thinking around the topic, which makes me more confident in my focus on AI control and made me less confused when I worked on the Alignment faking paper. It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over- or understatements. I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but I think that it currently is the best resource about the conceptual arguments for and against scheming propensity. I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.
#35

GPTs are being trained to predict text, not imitate humans. This task is actually harder than being human in many ways. You need to be smarter than whatever generated the text to perfectly predict its output, and some text is the result of complex processes (e.g. scientific results, news) that even humans couldn't predict.

GPTs are solving a fundamentally different and often harder problem than just "be human-like". This means we shouldn't expect them to think like humans.

#36

Some AI labs claim to care about AI safety, but continue trying to build AGI anyway. Peter argues they should explicitly state why they think this is the right course of action, given the risks. He suggests they should say something like "We're building AGI because [specific reasons]. If those reasons no longer held, we would stop."

#38

Nate Soares argues that there's a deep tension between training an AI to do useful tasks (like alignment research) and training it to avoid dangerous actions. Holden is less convinced of this tension. They discuss a hypothetical training process and analyze potential risks.

10 · Vanessa Kosoy
This post attempts to describe a key disagreement between Karnofsky and Soares (written by Karnofsky) pertaining to the alignment protocol "train an AI to simulate an AI alignment researcher". The topic is quite important, since this is a fairly popular approach.

Here is how I view this question: The first unknown is how accurate the simulation is. This is not really discussed in the OP. On the one hand, one might imagine that with more data, compute and other improvements, the AI should ultimately converge on an almost perfect simulation of an AI alignment researcher, which is arguably safe. On the other hand, there are two problems with this. First, such a simulation might be vulnerable to attacks from counterfactuals. Second, the prior is malign, i.e. the simulation might converge to representing a "malign simulation hypothesis" universe rather than the intended null hypothesis / ordinary reality.

Instead, we can imagine a simulation that's not extremely accurate, but that's modified to be good enough by fine-tuning with reinforcement learning. This is essentially the approach in contemporary AI and is also the assumption of the OP. Although Karnofsky says "a small amount of RL", I don't know why he believes a small amount is sufficient. Perhaps RL seemed less obviously important then than it does now, with the recent successes of o1 and o3.

The danger (as explained in the OP by Soares, paraphrased by Karnofsky) is that it's much easier to converge in this manner on an arbitrary agent that has the capabilities of the imaginary AI alignment researcher (which probably have to be a lot greater than the capabilities of human researchers to make it useful), but doesn't have values that are truly aligned. This is because "agency" is (i) a relatively simple concept and (ii) a robust attractor, in the sense that any agent would behave similarly when faced with particular instrumental incentives, and it's mainly this behavior that the training process rewards. On t
#41

Neural networks generalize unexpectedly well. Jesse argues this is because of singularities in the loss surface, which reduce the effective number of parameters. These singularities arise from symmetries in the network. More complex singularities lead to simpler functions, which generalize better. This is the core insight of singular learning theory.
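
For context on the "effective number of parameters" claim, here is the standard free-energy asymptotic from singular learning theory (Watanabe) that this framing rests on, stated from memory rather than from the post:

```latex
% Standard SLT result (my restatement): for $n$ samples, the Bayesian free
% energy of a model expands asymptotically as
\begin{equation}
  F_n \;\approx\; n L_n(w_0) \;+\; \lambda \log n ,
\end{equation}
% where $L_n(w_0)$ is the training loss at an optimal parameter and $\lambda$
% is the real log canonical threshold (RLCT). For regular models $\lambda = d/2$,
% half the parameter count; at singularities $\lambda < d/2$, so the complexity
% penalty behaves as if the model had fewer parameters, which is the sense in
% which singularities aid generalization.
```
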
12 · Vanessa Kosoy
This post is a solid introduction to the application of Singular Learning Theory to generalization in deep learning. This is a topic that I believe to be quite important. One nitpick: The OP says that it "seems unimportant" that ReLU networks are not analytic. I'm not so sure. On the one hand, yes, we can apply SLT to (say) GELU networks instead. But GELUs seem mathematically more complicated, which probably translates to extra difficulties in computing the RLCT and hence makes applying SLT harder. Alternatively, we can consider a series of analytical response functions that converges to ReLU, but that probably also comes with extra complexity. Also, ReLUs have an additional symmetry (the scaling symmetry mentioned in the OP), and SLT kinda thrives on symmetries, so throwing that out might be bad! It seems to me like a fascinating possibility that there is some kind of tropical geometry version of SLT which would allow analyzing generalization in ReLU networks directly and perhaps somewhat more easily. But, at this point it's merely a wild speculation of mine.
#42

Paul Christiano lays out how he frames various questions of "will AI cause a really bad outcome?", and gives some probabilities.

#45

A comprehensive overview of current technical research agendas in AI alignment and safety (as of 2023). The post categorizes work into understanding existing models, controlling models, using AI to solve alignment, theoretical approaches, and miscellaneous efforts by major labs. 

#47

We might soon be creating morally relevant AI systems with real welfare concerns. How can we help ensure good lives for AIs, especially if we don't have that many resources to allocate to it? 

12 · ryan_greenblatt
My views remain similar to when I wrote this post, and the state of nearcasted interventions still looks reasonably similar to me. I have some slightly different thoughts on how we should relate to interventions around communication, but relatively prioritizing communication still seems reasonable to me. One change in my perspective is that I'm now somewhat less excited about allocating larger fractions of resources toward specifically AI welfare. (I now think 0.2% seems better than 1%.) I've updated toward thinking safety concerns will get a smaller fraction of resources than I was previously expecting (due to more pessimism and shorter timelines), and I think safety and welfare resource usage might trade off. Another change is that I'm relatively more excited about making deals with AIs as a safety intervention (as well as a welfare intervention). This would include things like paying them to reveal misalignment or promising later compensation if they don't cause issues for us (and if we're still in control). I have some forthcoming empirical work related to this, and work discussing the conceptual aspects of this is hopefully forthcoming.