We are in a strange situation: the big labs take AI risk more seriously than the government, and are doing a better job of preparing for it. There has been surprising progress on alignment over the last few years, and some strong work figuring out how to shape model behavior. But society—and the government—aren’t remotely ready for what is coming. LLMs are making alarming progress toward automated discovery and exploitation of critical security vulnerabilities, but there’s no evidence of government leadership in formulating a response, and no sign that the rest of the world is paying attention.
Top pick
Boaz Barak on the state of AI safety
Boaz Barak shares four graph sketches that summarize the state of AI safety in early 2026. I largely agree with all four points.
Agents!
Claude Code auto mode
The Claude Code team continues to ship new features at a brisk pace. I’m particularly excited about their latest feature: auto mode uses Sonnet to identify risky operations, requesting explicit user permission for only a small set of operations that are most likely to be dangerous.
This isn’t magic: it’ll still ask for some permissions, and it will sometimes fail to ask for permission when it should. But it sounds like they’ve done a very impressive job of maintaining user oversight without a barrage of mostly useless requests. There’s a very delicate balance here: if you request permission for too many innocuous actions, your users learn to mechanically approve every request without carefully considering it, which means you might as well not ask at all.
There’s some very cool engineering behind this—I highly recommend checking it out.
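To make the pattern concrete, here is a minimal sketch of model-gated permissions. This is my own illustration of the general idea, not Anthropic's implementation; classify_risk is a hypothetical stand-in for the Sonnet-based classifier.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def classify_risk(call: ToolCall) -> float:
    """Hypothetical stand-in for the model-based classifier (Sonnet, in
    Claude Code's case). Returns an estimated probability of danger."""
    risky_tokens = {"rm -rf", "curl", "push --force"}
    command = str(call.args.get("command", ""))
    return 0.9 if any(tok in command for tok in risky_tokens) else 0.05

def run_with_oversight(call: ToolCall, threshold: float = 0.5) -> bool:
    """Interrupt the user only for calls the classifier flags as risky.
    Ask too often and users learn to rubber-stamp every request."""
    if classify_risk(call) >= threshold:
        answer = input(f"Allow {call.name} {call.args}? [y/N] ")
        return answer.strip().lower() == "y"
    return True  # low-risk: auto-approve silently
```

The interesting design question is where to set the threshold: every false alarm erodes the user attention that the real alarms depend on.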
Cybersecurity
I now feel strongly that cybersecurity is the biggest short-term AI risk. This isn’t an extinction-level risk, but major disruptions to multiple critical systems are a real possibility sometime this year.
Black-hat LLMs
Anthropic’s Nicholas Carlini is alarmed:
Basic lesson I hope you take away from this talk is relatively simple: today it is true that language models can autonomously, and without fancy scaffolding, find and exploit 0 day vulnerabilities in very important pieces of software. This is not something that was true even, let’s say three or four months ago.
[…] they’re getting really really good really fast, and this means that the nice balance we had between attackers and defenders over the last twenty years or so seems like it’s probably coming to an end.
Autonomous jailbreak development
We talk a lot about agents getting good enough to meaningfully assist with AI research. Those same capabilities can be pointed at all sorts of problems: “Claudini” uses a pipeline similar to Karpathy’s autoresearch to develop new (and highly effective) jailbreak and prompt injection attacks.
First programmers, now Pliny the Elder—is nobody’s job safe?
Claude Mythos
A misconfigured Anthropic CMS (yes, there’s some irony here) leaked details about an upcoming new model:
Although Mythos is currently far ahead of any other AI model in cyber capabilities, it presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders.
That’s why our release plan for Mythos focuses on cyber defenders: we’re releasing it in early access to organizations, giving them a headstart in improving the robustness of their codebases against the impending wave of AI-driven exploits.
I’m already pretty hardcore about cybersecurity but I’ve become even more paranoid in recent months. Security is always a tradeoff between convenience and protection, but the right tradeoff shifts as the threat level rises.
Capabilities and trajectories
Progress on FrontierMath: Open Problems
That didn’t take long: the first problem from FrontierMath: Open Problems has fallen. The problem is a conjecture from a published paper that the authors had failed several times to solve—this is a significant accomplishment.
He He: What research looks like with agents
He He gives a first-hand account of using Codex for ML research:
This is not a toy problem; it is not some grand challenge either. But it represents typical empirical ML research. It is the kind of problem you might give to a junior PhD student. My takeaway is that this kind of problem can be automated to a large extent today.
Over the last few months we’ve seen a clear pattern across coding, math, and cybersecurity: AI can’t replace professionals, but it can automate significant amounts of routine research.
ARC-AGI-3
ARC-AGI-3 is the latest iteration of one of the most interesting AI benchmarks. Rather than targeting useful real-world tasks, the team deliberately focuses on what they see as the most important deficits in frontier AI:
The benchmarks target the residual gap between what's hard for AI and what's easy for humans. It's meant to be a tool to measure AGI progress and to drive researchers towards the most important open problems on the way to AGI.
The mini games are playable by humans and I recommend choosing one and playing a few levels. They’re fun to play and require a different kind of intelligence than traditional benchmarks.
Ryan Greenblatt on crystallized vs fluid intelligence
Ryan Greenblatt has a helpful analogy for thinking about AI capabilities: current LLMs have immense crystallized intelligence but very limited fluid intelligence. The models are currently great at accomplishing a growing set of tasks, but very limited in their ability to learn genuinely new skills.
(Note the parallels to ARC-AGI-3’s focus on figuring out the goals and rules of each game).
Alignment and interpretability
OpenAI’s approach to the Model Spec
OpenAI explains how they approach the Model Spec: what it’s for, how they build it, and why it’s structured the way it is. It’s a strong document that engages directly with the considerable complexity of trying to shape model behavior. A few things stand out to me:
The very first sentence states that AI should be “fair, safe, and freely available”. Reasonable, but it feels defensive—perhaps a response to criticism about ChatGPT showing ads?
They emphasize that it’s a working document that is in some places aspirational, and it will sometimes get ahead of the capabilities of their publicly released models. Yes, exactly.
Their use of decision rubrics and concrete examples as tools for clarifying ambiguity makes a lot of sense, both as a training technique and as a way of illustrating to civilians the types of tradeoffs involved in steering model behavior. Many questions about model behavior are easy in a vacuum, but much harder in the context of tradeoffs against other desirable behaviors. Both OpenAI and Anthropic are being smart about publicly discussing those tradeoffs.
Using AI
Why LLMs Are Bad Writers But Good Editors
Jasmine Sun advocates for using LLMs as an editor (but not a writer) on Substack (partial paywall) and The Atlantic (paywall).
I strongly endorse this: I’ve seen nothing that makes me want to let an AI write for me, but I find Claude Code to be a great editor. The key, in my experience, is to have a very clear idea of what you want. “How can I make this essay better?” is a silly question that will get you mediocre feedback. But if you can articulate in detail what kind of writer you want to be and what is best and worst about your current writing, AI is quite good at helping you realize that vision.
Is it as useful as a professional human editor? No, and yes. A good human will give you better feedback, but an AI can give you instant feedback, any time you want it, over and over again.
Are we dead yet?
Why Dean Ball isn’t a doomer
Dean Ball takes a strong stand against the argument that ASI would doom humanity:
The implicit, and sometimes even explicit, argument of “the doomers” is that intelligence is the sole bottleneck on capability (because any other bottlenecks can be resolved with more intelligence), and that everything else follows instantly once that bottleneck is removed. I believe this is just flatly untrue, and thus I doubt many “AI doom” scenarios. Intelligence is neither omniscience nor omnipotence.
What all of this means is that I am doubtful about the ability of an AI system—no matter how smart—to eradicate or enslave humanity in the ways imagined by the doomers.
It’s worth reading, but this is one of the rare times when I strongly disagree with Dean. A misaligned superintelligence wouldn’t be able to eradicate humanity instantly, but it’s only a matter of time before it would find a way.
As it happens, Alon Torres also disagrees with him:
I agree with these points in principle - superintelligence is not omniscience. But I believe Dean uses these valid observations to reach a conclusion that dramatically underestimates how capable ASI might be in practice.
Both pieces are worth reading, especially as two thoughtful sides of a very important debate.
Jobs and the economy
Plentiful, high-paying jobs in the age of AI
The idea of comparative advantage comes up regularly in discussions about AI’s impact on jobs. It’s often cited as one mechanism by which humans might still have high-paying jobs even if AI can do everything better than us. It’s a very elegant concept, but extremely counter-intuitive until you get your head around it. Noah Smith does a great job of explaining how it works and how it applies to AI-related job loss.
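A tiny worked example helps get your head around it (the numbers are invented for illustration): even when the AI is absolutely better at everything, the human still has a comparative advantage in whichever task they are least bad at.

```python
# Invented productivities in units per hour. The AI beats the human at both
# tasks in absolute terms.
ai    = {"code": 100, "essays": 10}
human = {"code": 2,   "essays": 1}

# Opportunity cost of writing one essay, measured in forgone code:
ai_cost    = ai["code"] / ai["essays"]        # 10.0 units of code per essay
human_cost = human["code"] / human["essays"]  #  2.0 units of code per essay

# The human gives up less code per essay, so total output is highest when the
# AI specializes in code and the human writes the essays, even though the AI
# is better at both. That is comparative advantage.
print(f"AI: {ai_cost} code/essay, human: {human_cost} code/essay")
```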
It’s an important concept to understand, but I’m deeply skeptical that it’ll play a meaningful role in employment. It might be relevant if AI capabilities max out at near-human levels, but in a world of truly superhuman AI, the resources you need to live can be more efficiently allocated to an AI.
Strategy and politics
Government and the private sector
Dean Ball points out a rather inconvenient fact about the politics of AI safety:
The roles are totally reversed from the logic that Pause AI and frankly other AI safety advocates confidently assumed for years. It is industry that is in favor of alignment and at least somewhat measured deployment risks, and government whose actions seem much closer to reckless.
Obviously it is still the case that certain coordination problems can only be solved by government. But if your plan relies on an idealized government that doesn’t actually exist, you don’t have a plan.
Cognitive Revolution interviews Zvi
Zvi tells Cognitive Revolution why he believes we’re now shifting from the beginning of the AI story to the middle (he considers the endgame to begin when humans are no longer in control). This is a good overview of his worldview—highly recommended, even though it’s brutally long: 3.5 hours of audio, or a 38,000-word transcript.
The Rise and Reckoning of AI
Neil deGrasse Tyson moderates a debate about AI for the 2026 Isaac Asimov Memorial Debate. On the one hand: you probably don’t need to watch this because it’s extremely bad. On the other hand, it’s a useful reality check about the quality of the discourse about AI even among relatively knowledgeable people. The number of “AI experts” who can’t predict the present is just staggering.
Concrete projects to prepare for superintelligence
What projects would be most useful to help prepare for superintelligence? Forethought has an interesting list of the potential projects they see as most important. Even if you aren’t looking to start a new organization, there are some useful ideas here.
Automated macrostrategy is a good idea I haven’t seen explicitly articulated before. We talk about the implications of AI being involved in strategy debates, but I haven’t seen previous discussion of systematically training it to do that well.
What Is Anthropic Thinking?
Jack Clark will be leading the newly announced Anthropic Institute, which “exists to understand and shape the consequences of powerful AI systems”. Derek Thompson interviews him about the role of government in AI, job loss and economic impact, and current capabilities.
Jack’s great, and Anthropic does more than any other lab to help humanity prepare for what is coming. But that only goes so far if humanity doesn’t use that information to make sensible preparations.
Technical
Quantization from the ground up
Sam Rose has an excellent interactive article explaining model quantization. I thought I understood it pretty well, but I learned a lot from this piece—it’s much more complicated than just “chop off some bits of precision”.
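For contrast, here is the naive baseline the article goes far beyond: a minimal absmax int8 sketch of my own, not code from the piece.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Toy absmax quantization: scale floats onto the int8 range [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map back to floats; the rounding error is what quantization costs you."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

Real schemes layer per-block scales, outlier handling, and careful rounding on top of this, which is the kind of complexity the article digs into.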
Briefly
AI safety papers
Fabien Roger shares a list of his favorite AI safety papers from 2025. It’s a great list—now the challenge is finding time to read them.
Final training runs account for a minority of R&D compute spending
Epoch parses the limited available data to estimate that final training accounts for a small fraction of R&D compute, with the majority used for experiments and synthetic data generation.
HBM capacity is growing at about 4x per year
High bandwidth memory doesn’t get as much press as GPUs, but it’s an equally important constraint on AI compute. Epoch reports on HBM capacity and production.