I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Another anecdote: I put off a couple posts from November into December because I happened to care about these particular posts having visibility on the lesswrong frontpage, and the lesswrong frontpage has been unusually gummed up by high-karma low-effort posts during November.
In phase space, our system might evolve into very complex states that require fine-grained knowledge to keep track of. Instead of neat squares, we wind up with fractal, space-filling shapes. But the thermodynamic model involves far coarser evolution, resulting in all the little gaps between thin "strands" of our phase space volume getting filled in, because we can't keep track of the fine details of where our volume is/isn't.
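To make the coarse-graining point concrete, here's a toy numerical sketch (my own illustration, with made-up parameters, not from the comment above): points that all start inside one coarse cell get stretched into thin strands by a chaotic map, and the Shannon entropy of the coarse-grained (binned) distribution climbs toward its maximum, even though no fine-grained information is added.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 32  # number of coarse-grained cells partitioning [0, 1)
x = rng.uniform(0.0, 1.0 / K, 100_000)  # every point starts in the first cell

def coarse_entropy(points, n_bins):
    """Shannon entropy (in bits) of the coarse-grained bin occupancy."""
    counts, _ = np.histogram(points, bins=n_bins, range=(0.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

entropies = []
for _ in range(9):
    entropies.append(coarse_entropy(x, K))
    x = (2.0 * x) % 1.0  # chaotic "stretch and fold" doubling map

print(entropies[0], entropies[-1])  # ~0 bits at start, ~log2(32) = 5 bits at end
```

The fine details below the bin scale are exactly what the coarse thermodynamic model discards, which is the "gaps between strands get filled in" effect.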
I usually put it slightly differently: information converts from “actionable” to “not actionable”, where “actionable” means “you have some method that allows you to do something useful (e.g. charge a battery) with that information”.
…And if you have non-actionable information about a system, then you might as well just forget it. Hence entropy goes up.
For example, complicated correlations between the 17th decimal digits of air molecule positions are never actionable. Knowing that information does not help you charge a battery.
On the other hand, if you happen to know that all the air molecules are on the left side of the box and none on the right, that is actionable, as long as you can quickly slide a separator down the middle of the box and attach it to a piston. You can charge a battery using that information.
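For concreteness, here's the standard textbook accounting of that case (my addition, assuming an ideal gas expanding isothermally at temperature $T$): letting the $N$ molecules, initially all on the left, push the piston from volume $V/2$ out to $V$ extracts

```latex
W = \int_{V/2}^{V} p\,\mathrm{d}V' = \int_{V/2}^{V} \frac{N k_B T}{V'}\,\mathrm{d}V' = N k_B T \ln 2 ,
```

i.e. $k_B T \ln 2$ of work per molecule's worth of one-bit positional information — which is why the "all molecules on the left" fact is actionable while the 17th-decimal-digit correlations are not.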
There are fun edge cases where “information is lost and entropy went up”, or else it didn’t, depending on whether you have special equipment that allows you to act on that particular type of information.
Maybe you should just do the opposite of all these things so your writing becomes popular? Well, I don’t know. Maybe, maybe not.
Right on! I think you should have emphasized that part more, right from the start. (As written, you’re kinda connoting that the quick post properties are better and the effortpost properties are worse, until readers get to the last section.)
If I publish a scrupulously-researched 80-page review article on bacterial chemoreceptors, it’s sure not gonna go viral on Hacker News, but that’s still a very valuable thing I did.
See also: related comment of mine.
We can also imagine that the number of traders increases over time as well, with the th trader appearing on day and starting with .
…
The total amount of money in the economy is bounded to as well, so everyone's net worth is in cents.
Aren’t these two things contradictory? Sorry if I’m confused.
It’s definitely possible to “get stuck” not doing X because you’ve never tried X in your life, so you don’t know what you’re missing. Sometimes it can take people many years before they try X. And sometimes they just never do, even though they would totally “take to it” if they did.
I feel like you want to say that the never-trying-X failure mode is “the rule” and I want to say that it’s “the exception” … But if so, that might not be a real disagreement, and instead we’re just thinking about different kinds of X.
I definitely agree that it can be a thing, and brought it up in Heritability: Five Battles multiple times. My examples included X = “living in Churubusco, Indiana” (§2.2.2), or X = “becoming a Soil Conservation Technician” (§2.3), or X = “joining a niche online community like rationalism” (§2.2.3).
(Or sorry if I’m still missing your point.)
I'm very curious about which things fall into the "then-get-stuck" bucket and why. Are you confident there isn't a range of lower-level drives and innate reactions that get fixed similar to accents?
I think the way that rewards (pleasure or displeasure) turn into desires (motivation) is one thing that is definitely continuous learning, not “get stuck”. If you’ve done something over and over in childhood, and then you do it today and it’s 100% miserable and embarrassing, and then you try it again the next day and it’s again 100% miserable and embarrassing, then you’re probably not going to try it a 3rd time, and certainly not a 10th time.
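As a toy sketch of that continuous rewards-to-desires updating (my own illustrative model with made-up numbers, not something from the comment): treat "desire" as an exponentially weighted running average of past rewards, with a learning rate that stays high in adulthood rather than getting stuck.

```python
desire = 1.0   # learned from many rewarding childhood repetitions
alpha = 0.5    # learning rate stays high: continuous learning, no "get stuck"

for reward in (-1.0, -1.0):        # two 100%-miserable adult attempts
    desire += alpha * (reward - desire)

print(desire)  # negative after just two bad experiences: no 3rd attempt
```

The point of the sketch is just that with a nonzero learning rate, even a desire built up over years flips sign after a couple of strongly negative experiences.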
We see this when people get clinical depression as adults. They lose motivation to do all the things they used to like, no matter how much they liked them in childhood, or even more recently than that.
I think personality surveys mostly catch stuff like that. Even introverts go out sometimes, and even extroverts stay in sometimes. If it’s pleasant vs unpleasant, they’ll do it more. So a decision-to-socialize keeps getting steered towards the innate ground truth reward function. Likewise the decision to think about things versus people, to be honest or not, and ≈every other question on personality surveys. (Empirically, adult personality is approximately independent of childhood upbringing, in the population at large.)
Childhood regional accents are not “rewards turning into desires”, but rather something different—I think it involves the learning rate of a certain part of the cortex dropping towards zero after childhood. I don’t think “idiosyncratic cognition” is in that category, my impression is that PFC learning rates remain high throughout life.
(You’re an unusual case RE your accent … hmm, random question, did you watch a lot of American TV / movies as a kid?)
Childhood phobias don’t always last into adulthood but sometimes they do. The trick is that it’s a case where the reward itself can change (I call it “upstream generalization”). So the rewards→desires pathway continues to produce updates, but that doesn’t help. Separately, the reward itself does keep updating in response to new data, but only in narrower circumstances, by and large. I can’t immediately think of other things besides phobias that “get stuck” in that way.
I agree that “every thought we think, we’re thinking it because it’s higher-reward than other thoughts we might be thinking instead” is a great starting point.
---
I kinda disagree with your emphasis on childhood. See my post Heritability, Behaviorism, and Within-Lifetime RL, where I (dismissively) called that school of thought “RL learn-then-get-stuck”. Of course, “RL learn-then-get-stuck” is true for a few things, like regional accents, but I think those are the exception not the rule. (See also §2 of “Heritability: Five Battles”.)
---
I think you’re right about the person-to-person variation along a bunch of axes, but the way I think about it is generally at a lower level than the kind of “traits” you list. I think there are dozens of innate drives / innate reactions (some more important than others) in the hypothalamus & brainstem, and their relative strengths differ, and most of the things you list are mostly emergent consequences of the drive / reaction strengths vector. (Note that the map from the vector to behaviors is often nonlinear, and also depends on the options and consequences available in an environment / culture.)
Going through some examples from your list.
“Do you focus on "things" vs "people"” is I think related to an “innate drive to think about and interact with other people” that I briefly discuss in §5 here.
“Are words about reality or are words just rallying cries for your team?” is downstream of that, along with many other things, like how strongly one feels Approval Reward, which in turn depends on a bunch of things including how easily nearby people trigger an involuntary orienting reaction in you.
“How much emphasis do you place on wordless felt gut feelings?” is probably partly that those gut feelings come along with stronger involuntary attention and (the interoceptive equivalent of) orienting reactions in some people than others, making those feelings more or less salient versus easily-ignorable. (Presumably there are other contributing factors too.)
Etc. etc. I don’t have great theories for everything, just trying to give a hint of how I think about those kinds of things, in case it matters (and probably it doesn’t matter for your points here).
Yeah from my perspective EAG is a place where a lot of people interested in technical alignment go, to talk to other people interested in technical alignment, about technical alignment stuff.
Meanwhile there are other things happening at EAG too, but you can ignore them. You don’t have to attend the talks, you don’t have to talk to anyone you don’t want to talk to. And it’s not terribly expensive, and the location is (often) down the street from you (OP, John).
I wonder whether you’re thinking harder about countersignaling than about what would be object-level good things to do?
I guess I’d say “the thing we intended for the AGI to be trying to do” can be vague, or described at a meta-level, as opposed to very specific.
I didn’t mean for that sentence to be making a specific controversial claim about alignment targets. I generally see “alignment targets” as an open question (see a footnote in post 10).