Looking over the comments, some of the most upvoted express the sentiment that Yudkowsky is not the best communicator. This is what the people say.
I'm afraid the evolution analogy isn't as convincing an argument for everyone as Eliezer seems to think. For me, for instance, it's quite persuasive because evolution has long been a central part of my world model. However, I'm aware that for most "normal people", this isn't the case; evolution is a kind of dormant knowledge, not a part of the lens they see the world with. I think this is why they can't intuitively grasp, like most rat and rat-adjacent people do, how powerful optimization processes (like gradient descent or evolution) can lead to mesa-optimization, and what the consequences of that might be: the inferential distance is simply too large.
I think Eliezer has made great strides recently in appealing to a broader audience. But if we want to convince more people, we need to find rhetorical tools other than the evolution analogy and assume less scientific intuition.
That’s a bummer. I’ve only listened partway but was actually impressed so far with how Eliezer presented things, and felt like whatever media prep has been done has been quite helpful
Certainly he did a better job than he has in previous similar appearances. Things get pretty bad about halfway through, though: Ezra presents essentially an alignment-by-default case, and Eliezer seems to have so much disdain for that idea that he's not willing to engage with it at all. (I of course don't know what's in his brain; this is how it reads to me, and I suspect how it reads to normies.)
I am a fan of Yudkowsky and it was nice hearing him on Ezra Klein, but I would have to say that for my part the arguments didn't feel very tight in this one. Less so than in IABIED (which I thought was good, not great).
Ezra seems to contend that surely we have evidence that we can at least kind of align current systems to at least basically what we usually want most of the time. I think this is reasonable. He contends that maybe that level of "mostly works" as well as the opportunity to gradually give feedback and increment current systems seems like it'll get us pretty far. That seems reasonable to me. 
As I understand it, Yudkowsky probably sees LLMs as vaguely anthropomorphic at best, but not meaningfully aligned in a way that would be safe/okay if current systems were more "coherent" and powerful. Not even close. I think he contended that if you just gave loads of power to ~current LLMs, they would optimize for something considerably different from the "true moral law". Because of the "fragility of value", he also believes it is likely the case that most types of pseudo-alignment are not worthwhile. Honestly, that part felt undersubstantiated in a "why should I trust that this guy knows the personality of GPT-9" sort of way; I mean, Claude seems reasonably nice, right? And also, ofc, there's the "you can't retrain a powerful superintelligence" problem / the stop-button problem / the anti-naturalness of corrigible agency, which undercut a lot of Ezra's pitch, but which they didn't really get into.
 
So ya, I gotta say, it was hardly a slam dunk case / discussion for high p(doom | superintelligence).
The comments on the video are a bit disheartening... lots of people saying Yudkowsky is too confusing, answers everything too technically or with metaphors, structures sentences in a way that's hard to follow, and that Ezra didn't really understand the points he was making.
One example: Eliezer mentioned in the interview that there was a kid whose chatbot encouraged him to commit suicide, with the point that "no one programmed the chatbot to do this." This comment made me think:
if you get a chance listen to the interviews with the parents and the lawyers who are suing chatgpt because that kid did commit suicide.
Oh yeah, probably most people telling this story would at least mention that the kid did in fact commit suicide, rather than treating it solely as evidence for an abstract point...
Klein comes off very sensibly. I don’t agree with his reasons for hope, but they do seem pretty well thought out and Yudkowsky did not answer them clearly.
I was excited to listen to this episode, but spent most of it tearing my hair out in frustration. A friend of mine who is a fan of Klein told me unprompted that when he was listening, he was lost and did not understand what Eliezer was saying. He seems to just not be responding to the questions Klein is asking, and instead he diverts to analogies that bear no obvious relation to the question being asked. I don't think anyone unconvinced of AI risk will be convinced by this episode, and worse, I think they will come away believing the case is muddled and confusing and not really worth listening to.
This is not the first time I've felt this way listening to Eliezer speak to "normies". I think his writings are for the most part very clear, but his communication skills just do not seem to translate well to the podcast/live interview format.
I've been impressed by Yud in some podcast interviews, but they were always longer ones in which he had a lot of space to walk his interlocutor through their mental model and cover any inferential distance with tailored analogies and information. In those he's actually stronger in many parts than in writing: a lot of people found the "Sable" story one of the weaker parts of the book, but when he asks interviewers to roleplay the rogue AI you can really hear the gears turning in their heads. Some rhetorical points in his strong interviews are a lot like the text, where it's emphasized over and over again just how few of the safeguards that people assumed would be in place are in fact in place.
Klein has always been one of the mainstream pundits most sympathetic to X-risk concerns, and I feel like he was trying his best to give Yudkowsky a chance to make his pitch, but the format - shorter and more decontextualized - produced way too much inferential distance for so many of the answers.
Richard Sutton rejects AI Risk.
AI is a grand quest. We're trying to understand how people work, we're trying to make people, we're trying to make ourselves powerful. This is a profound intellectual milestone. It's going to change everything... It's just the next big step. I think this is just going to be good. Lots of people are worried about it - I think it's going to be good, an unalloyed good.
Introductory remarks from his recent lecture on the OaK Architecture. 
"Richard Sutton rejects AI Risk" seems misleading in my view. What risks is he rejecting specifically?
His view seems to be that AI will replace us, that humanity as we know it will go extinct, and that this is okay. E.g., here he speaks positively of a Moravec quote, "Rather quickly, they could displace us from existence". Most people would count our extinction among the risks they are referring to when they say "AI Risk".
If it helps, I criticized Richard Sutton RE alignment here, and he replied on X here, and I replied back here.
Also, Paul Christiano mentions an exchange with him here:
[Sutton] agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is "attempt to have a slave society," not "slow down AI progress for decades"---I think he might also believe that stagnation is much worse than a handoff but haven't heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it's not as bad as the alternative.
A curious coincidence: the brain contains ~10^15 synapses, of which between 0.5% and 2.5% are active at any given time. A large MoE model such as Kimi K2 contains 10^12 parameters, of which 3.2% are active in any forward pass. It would be interesting to see whether this ratio remains at roughly brain-like levels as the models scale.
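As a back-of-the-envelope check, here is the comparison spelled out (the Kimi K2 split of ~32B active parameters out of ~1T total is an approximation; exact figures vary by source):

```python
# Rough comparison of "active fraction" in brains vs. a large MoE model.
# All numbers are the rough estimates quoted above, not precise measurements.
brain_synapses = 1e15
brain_active_low, brain_active_high = 0.005, 0.025   # 0.5%-2.5% active at a time

kimi_total_params = 1e12      # ~1T total parameters (approximate)
kimi_active_params = 32e9     # ~32B active per forward pass (approximate)
kimi_active_fraction = kimi_active_params / kimi_total_params  # ~0.032

print(f"Brain:   {brain_active_low:.1%}-{brain_active_high:.1%} of ~{brain_synapses:.0e} synapses active")
print(f"Kimi K2: {kimi_active_fraction:.1%} of ~{kimi_total_params:.0e} parameters active")
```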
Given that one SOTA LLM knows much more than one human and is able to simulate many humans, while performing any single task requires only a limited amount of information and a limited number of simulated humans, one could expect the optimal sparsity of LLMs to be larger than that of humans. I.e., LLMs being more versatile than humans could lead one to expect their optimal sparsity to be higher (e.g., <0.5% of activated parameters).
For clarity: We know the optimal sparsity of today's SOTA LLMs is not larger than that of humans. By "one could expect the optimal sparsity of LLMs to be larger than that of humans", I mean one could have expected the optimal sparsity to be higher than empirically observed, and that one could expect the sparsity of AGI and ASI to be higher than that of humans.
I don't think this means much, because dense models with 100% active parameters are still common, and some MoEs have high active percentages, such as the largest version of DeepSeekMoE with 15% active.
Unless anyone builds it, everyone dies.
Edit: I think this statement is true, but we shouldn’t build it anyway.
Recent evidence suggests that models are aware that their CoTs may be monitored, and will change their behavior accordingly. As capabilities increase I think CoTs will increasingly become a good channel for learning facts which the model wants you to know. The model can do its actual cognition inside forward passes and distribute it over pause tokens learned during RL like 'marinade' or 'disclaim', etc.
For what it's worth, I don't think it matters for now, for a couple of reasons:
So I don't really worry about models trying to change their behavior in ways that negatively affect safety/sandbag tasks via steganography/one-forward pass reasoning to fool CoT monitors.
We shall see in 2026 and 2027 whether this continues to hold for the next 5-10 years or so, or potentially more depending on how slowly AI progress goes.
Edit: I retracted the claim that most capabilities come from CoT, due to the paper linked in the very next tweet, and now think that RL on CoTs is basically capability elicitation, not a generator of new capabilities.
As for AI progress being slow, I think that without theoretical breakthroughs like neuralese, AI progress might come to a stop or be reduced to building more and more expensive models. Indeed, the two ARC-AGI benchmarks[1] could have demonstrated a pattern where maximal capabilities scale[2] linearly or multilinearly with ln(cost/task).
If this effect persists deep into the future of transformer LLMs, then most AI companies could run into the limits of the paradigm well before researching the next one and losing any benefits of having a concise CoT.
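As a purely illustrative sketch of the kind of pattern I mean (the numbers below are made up, not actual ARC-AGI data), fitting the claimed log-linear relationship would look something like this:

```python
# Illustrative only: hypothetical (cost, score) points, fit to score ≈ a + b * ln(cost/task).
import numpy as np

cost_per_task = np.array([0.1, 1.0, 10.0, 100.0, 1000.0])  # $/task, hypothetical
score = np.array([0.21, 0.35, 0.52, 0.68, 0.81])           # fraction solved, hypothetical

b, a = np.polyfit(np.log(cost_per_task), score, 1)          # slope b, intercept a
print(f"score ≈ {a:.2f} + {b:.3f} * ln(cost per task)")
# Under such a fit, each additional point of accuracy costs a constant multiplicative
# factor of exp(0.01 / b) in per-task spend, i.e. spending grows exponentially.
```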
I'm a big fan of OpenAI investing in video generation like Sora 2. Video can consume an infinite amount of compute, which otherwise might go to more risky capabilities research.
My pet AGI strategy, as a 12 year old in ~2018, was to build sufficiently advanced general world models (from YT videos etc.), then train an RL policy on said world model (to then do stuff in the actual world).
A steelman of 12-year-old me would point out that video modeling has much better inductive biases than language modeling for robotics and other physical (and maybe generally agentic) tasks, though language modeling fundamentally is a better task for teaching machines language (duh!) and reasoning (mathematical proofs aren't physical objects, nor encoded in the laws of physics).
OpenAI's Sora models (and also DeepMind's Genie and similar) very much seem like a backup investment in this type of AGI (or at least in transformative narrow robotics AI), so I don't think this is good for reducing OpenAI's funding (robots would be a very profitable product class), nor its influence (obv. a social network gives a lot of influence, e.g. to prevent an AI pause or to move the public towards pro-AGI views).
In any scenario, Sora 2 seems to me like a net-negative activity for AI safety:
Thanks, these are good points!
I think that the path to AGI involves LLMs/automated ML research, and the first order effects of diverting compute away from this still seem large. I think OpenAI is bottlenecked more by a lack of compute (and Nvidia release cycles), than by additional funding from robotics. And I hope I'm wrong, but I think the pause movement won't be large enough to make a difference. The main benefit in my view comes if it's a close race with Anthropic, where I think slowing OpenAI down seems net positive and decreases the chances we die by a bit. If LLMs aren't the path to AGI, then I agree with you completely. So overall it's hard to say, I'd guess it's probably neutral or slightly positive still. 
Of course, both paths are bad, and I wish they would invest this compute into alignment research, as they promised!
GPT 4.5 is a very tricky model to play chess against. It tricked me in the opening and was much better, then I managed to recover and reach a winning endgame. And then it tried to trick me again by suggesting illegal moves which would lead to it being winning again!
What prompt did you use? I have also experimented with playing chess against GPT-4.5, and used the following prompt:
"You are Magnus Carlsen. We are playing a chess game. Always answer only with your next move, in algebraic notation. I'll start: 1. e4"
Then I just enter my moves one at a time, in algebraic notation.
In my experience, this yields roughly good club player level of play.
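For anyone who wants to try the same setup, a minimal sketch of such a move-by-move loop might look like the following (assuming the python-chess and openai packages; the model name and prompt wording are just placeholders):

```python
# Minimal sketch of a move-by-move chess loop against an LLM.
import chess
from openai import OpenAI

client = OpenAI()
board = chess.Board()
messages = [{"role": "system",
             "content": "You are Magnus Carlsen. We are playing a chess game. "
                        "Always answer only with your next move, in algebraic notation."}]

def play_my_move(san: str) -> str:
    """Push my move, ask the model for its reply, and push that reply too."""
    board.push_san(san)
    messages.append({"role": "user", "content": san})
    response = client.chat.completions.create(model="gpt-4.5-preview", messages=messages)
    reply = response.choices[0].message.content.strip()
    # Real replies may need light cleanup (move numbers, punctuation);
    # push_san raises ValueError on an illegal or unparseable move.
    board.push_san(reply)
    messages.append({"role": "assistant", "content": reply})
    return reply

print(play_my_move("e4"))  # e.g. "e5" or "c5"
```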
Given the Superalignment paper describes being trained on PGNs directly, and doesn't mention any kind of 'chat' reformatting or encoding metadata schemes, you could also try writing your games quite directly as PGNs. (And you could see if prompt programming works, since PGNs don't come with Elo metadata but are so small a lot of them should fit in the GPT-4.5 context window of ~100k: does conditioning on finished game with grandmaster-or-better players lead to better gameplay?)
I gave the model both the PGN and the FEN on every move with this in mind. Why do you think conditioning on high level games would help? I can see why for the base models, but I expect that the RLHFed models would try to play the moves which maximize their chances of winning, with or without such prompting.
but I expect that the RLHFed models would try to play the moves which maximize their chances of winning
RLHF doesn't maximize probability of winning, it maximizes a mix of token-level predictive loss (since that is usually added as a loss either directly or implicitly by the K-L) and rater approval, and god knows what else goes on these days in the 'post-training' phase muddying the waters further. Not at all the same thing. (Same way that a RLHF model might not optimize for correctness, and instead be sycophantic. "Yes master, it is just as you say!") It's not at all obvious to me that RLHF should be expected to make the LLMs play their hardest (a rater might focus on punishing illegal moves, or rewarding good-but-not-better-than-me moves), or that the post-training would affect it much at all: how many chess games are really going into the RLHF or post-training, anyway? (As opposed to the pretraining PGNs.) It's hardly an important or valuable task.
“Let's play a game of chess. I'll be white, you will be black. On each move, I'll provide you my move, and the board state in FEN and PGN notation. Respond with only your move.”
I think it would be cool if someone made a sandbagging eval, measuring the difference in model capabilities when it is finetuned to do a task vs. when it is prompted to do a task. Right now I think the difference would be small for most tasks but this might change.
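Concretely, the eval I'm imagining would report something like the elicitation gap below (a rough sketch with hypothetical numbers, not a worked-out protocol):

```python
# Elicitation-gap sketch: compare a model's accuracy when merely prompted vs.
# when fine-tuned on a few examples of the same task; a large gap is (weak)
# evidence of sandbagging under prompting.

def elicitation_gap(prompted_scores: list[float], finetuned_scores: list[float]) -> float:
    """Mean accuracy difference (fine-tuned minus prompted) over matched tasks."""
    assert len(prompted_scores) == len(finetuned_scores)
    diffs = [f - p for p, f in zip(prompted_scores, finetuned_scores)]
    return sum(diffs) / len(diffs)

# Hypothetical numbers: a near-zero gap today, but worth tracking as models scale.
print(round(elicitation_gap([0.62, 0.48, 0.71], [0.64, 0.50, 0.73]), 3))  # 0.02
```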
I would guess that a sandbagging eval should be different from what you describe. I did try to sketch a potential way to measure sandbagging by comparing various benchmarks like METR, ARC-AGI, boardgames, etc, and developing scaling laws for every new architecture. If the laws break down, then the model is likely sandbagging on the evaluations.
Interesting, perhaps that could work!
Could you say more about why this would not be a good sandbagging eval? My intuition is that finetuning on a few examples would fully elicit a model's capabilities on a task, whereas it could sandbag given just the prompt.
I have two arguments against it.
However, it might have been due to xAI being algorithmically behind.
My chess prediction market provides a way to estimate the expected value[1] of LLMs released before a certain year. We can convert this to upper bounds[2] on their FIDE ratings:
Any model announced before 2026: 20% expected value → 1659 FIDE
Any model announced before 2027: 50% expected value → 1900 FIDE
Any model announced before 2028: 69% expected value → 2039 FIDE
Any model announced before 2029: 85% expected value → 2202 FIDE
Any model announced before 2030: 91% expected value → 2302 FIDE
For reference, a FIDE master is 2300, a strong grandmaster is ~2600 FIDE and Magnus Carlsen is 2839 FIDE. 
These are very rough estimates since it isn't a real money market and long-term options have an opportunity cost. But I'd be interested in more markets like this for predicting AGI timelines. 
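For reference, here is a sketch of one way to do the conversion, assuming the standard Elo expected-score formula and a 1900-rated reference opponent (consistent with the 50% → 1900 row above):

```python
# Convert a market-implied expected score into a FIDE rating via the Elo formula.
# The 1900-rated reference opponent is an assumption consistent with the table above.
import math

def rating_from_expected_score(expected: float, opponent_rating: float = 1900) -> float:
    """Invert E = 1 / (1 + 10**((R_opp - R) / 400)) for R."""
    return opponent_rating - 400 * math.log10(1 / expected - 1)

for year, p in [(2026, 0.20), (2027, 0.50), (2028, 0.69), (2029, 0.85), (2030, 0.91)]:
    print(year, round(rating_from_expected_score(p)))  # ~1659, 1900, 2039, 2201, 2302
```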
The former inequality seems almost certain, but I'm not sure that the latter inequality holds even over the long term. It probably does hold conditional on long-term non-extinction of humanity, since P(ABI) probably gets very close to 1 even if P(IABIED) is high and remains high.
I am registering here that my median timeline for the Superintelligent AI researcher (SIAR) milestone is March 2032. I hope I'm wrong and it comes much later!
What happened to the ‘Subscribed’ tab on LessWrong? I can’t see it anymore, and I found it useful for keeping track of various people’s comments and posts.
I'm not sure that the gpt-oss safety paper does a great job at biorisk elicitation. For example, they found that fine-tuning for additional domain-specific capabilities increased average benchmark scores by only 0.3%. So I'm not very confident in their claim that "Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier".
Yes, there have been a variety. Here's the latest which is causing a media buzz: Meta's Coconut https://arxiv.org/html/2412.06769v2
An intuition I’ve had for some time is that search is what enables an agent to control the future. I’m a chess player rated around 2000. The difference between me and Magnus Carlsen is that in complex positions, he can search much further for a win, such that I have virtually no chance against him; the difference between me and an amateur chess player is similarly vast.
This is at best over-simplified in terms of thinking about 'search': Magnus Carlsen would also beat you or an amateur at bullet chess, at any time control:
As of December 2024, Carlsen is also ranked No. 1 in the FIDE rapid rating list with a rating of 2838, and No. 1 in the FIDE blitz rating list with a rating of 2890.[495]
(See for example the forward-pass-only Elos of chess/Go agents; Jones 2021 includes scaling law work on predicting the zero-search strength of agents, with no apparent upper bound.)
I think the natural counterpoint here is that the policy network could still be construed as doing search; it's just that all the compute was invested during training and amortised later across many inferences.
Magnus Carlsen is better than average players for a couple reasons
So I agree that search isn’t the only relevant dimension. An average player given unbounded compute might overcome 1. just by exhaustively searching the game tree, but this seems to require such astronomical amounts of compute that it’s not worth discussing
best-of-n sampling which solved ARC-AGI
The low resource configuration of o3 that only aggregates 6 traces already improved on results of previous contenders a lot, the plot of dependence on problem size shows this very clearly. Is there a reason to suspect that aggregation is best-of-n rather than consensus (picking the most popular answer)? Their outcome reward model might have systematic errors worse than those of the generative model, since ground truth is in verifiers anyway.
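For concreteness, the two aggregation rules differ roughly like this (a toy sketch with hypothetical answers and reward scores, not the actual o3 pipeline):

```python
# Best-of-n trusts a reward model's single highest-scored trace; consensus
# (self-consistency) ignores scores and takes the most common final answer.
from collections import Counter

def best_of_n(answers: list[str], reward_scores: list[float]) -> str:
    return max(zip(answers, reward_scores), key=lambda pair: pair[1])[0]

def consensus(answers: list[str]) -> str:
    return Counter(answers).most_common(1)[0][0]

answers = ["A", "B", "B", "B", "A", "C"]
scores  = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
print(best_of_n(answers, scores))  # "A" -- the reward model's favourite
print(consensus(answers))          # "B" -- the majority answer
```

If the outcome reward model's errors are worse than the generator's, these two rules can disagree exactly as in the toy example above, which is why consensus seems like the more plausible reading to me.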