How are we doing on solving the alignment problem? Harry Law begins this week’s newsletter with an explanation of alignment-by-default: the idea that because LLMs are trained on an immense body of human text, they are predisposed to understand and pursue human values. But predisposition isn’t enough: Ryan Greenblatt argues that current models show a concerning pattern of mundane misalignment that could become catastrophic if it isn’t fixed.
And lest we spend all our time worrying about how to ensure that AI does what we want, Robert Long explores the ethics of whether we should create intelligent beings that want to serve us. Alignment is far from solved, but these challenges are concrete—and solvable—in a way that few people expected five or ten years ago.
Top pick
Alignment by default?
The orthogonality thesis states that superintelligence is compatible with a vast range of possible goals. In traditional AI safety thinking, that presents a serious challenge for alignment. How do you ensure your AI is aligned with human values if they represent just a tiny subset of the possible goals it might learn during training?
There is a strong case to be made that the orthogonality thesis is misleading when it comes to LLMs. As Harry Law explains in this week’s top pick:

Alignment-by-default says, for the class of systems defined by autoregressive language modeling over human-generated text, the training process generates a normative prior such that the default expectation should be partial alignment.
The idea is that because LLM base models are pre-trained on an immense amount of human text, they are not blank slates that need to be taught human values from scratch. Pre-training gives them a deep understanding of those values and a “normative prior” that predisposes them to act accordingly.
In this view, post-training doesn’t have to teach human values, but merely needs to steer the model within a set of values to which it is already predisposed. Alignment by default doesn’t guarantee that LLMs will be perfectly aligned, but implies that they will default to partial alignment and will be easier to fully align than has been traditionally supposed.
Alignment is a hard problem that is far from solved, and alignment-by-default doesn’t change that. But the nature of LLMs means that some parts of alignment are much easier than we once expected.
My writing
Don’t cut yourself on the jagged frontier
Some quick thoughts about the dangers of well-aligned superintelligence and the relevance of the jagged frontier.
Who I follow
An opinionated take on how best to keep up with the most important developments in AI.
Inkhaven
I’m spending April at the Inkhaven Writing Residency. It’s a fantastic program that I highly recommend if you’re interested in skilling up as a blogger. Curious about how it works? Come to the Inkhaven Fair on Saturday April 25 (I’ll be there and would love to say hi).
Mythos
Mythos and national power
ChinaTalk explores what Mythos means for national security. This is the best piece I’ve seen for understanding the implications of Mythos’ cybersecurity capabilities. Mythos is alarmingly capable and the security landscape is going to be challenging for at least the next year or two. But how bad it gets will depend as much on mundane details like rapid deployment of patches as it will on raw technical capabilities.
Looking beyond cyber, Ben Buchanan is unfortunately correct about what comes next:

I think we are very fortunate that cyber is coming first. I think we should use cyber as a lesson for what is coming next at the intersection of AI and other fields. Bio will not be far behind. At some point we will have a Mythos moment for bio.
Should it serve as a lesson? Yes.
Will it serve as a lesson? The post-COVID dismantling of public health doesn’t fill me with confidence.
UK AISI evaluates Mythos’ cyber capabilities
UK AISI’s evaluation finds that Mythos is not only able to find subtle vulnerabilities but also represents a major step forward in autonomously conducting complete attacks consisting of numerous discrete steps.
Claude Mythos #3: capabilities and additions
Part 3 of Zvi’s Mythos coverage focuses on capabilities. If you don’t have time to read the whole thing, the conclusion covers the essentials.
Benchmarks and Forecasts
MirrorCode
Epoch AI presents MirrorCode, a new benchmark that tests the ability to perform long but well-specified coding tasks. It’s a nicely designed evaluation: the AI is tasked with writing a functional equivalent of a command line tool and given access to the tool, documentation, and a set of test cases, but not the source code itself.
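Grading a task like this amounts to differential testing: run the real tool and the model’s reimplementation on the same inputs and demand identical behavior. A minimal sketch of that idea (my illustration, not Epoch’s actual harness; the function name and commands are hypothetical):

```python
import subprocess

def outputs_match(reference_cmd, candidate_cmd, test_inputs):
    """Differential check: the candidate passes only if it reproduces the
    reference tool's stdout and exit code on every test input."""
    for stdin_text in test_inputs:
        ref = subprocess.run(reference_cmd, input=stdin_text,
                             capture_output=True, text=True)
        cand = subprocess.run(candidate_cmd, input=stdin_text,
                              capture_output=True, text=True)
        if (ref.stdout, ref.returncode) != (cand.stdout, cand.returncode):
            return False
    return True
```

Because the check is mechanical, the grader never needs to read the candidate’s source code, which is exactly what makes this style of task so cheap to verify at scale.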
The task is well-specified and easy to verify, making it an ideal task for an LLM. Epoch finds a steady progression in Opus’ capability: 4.0 succeeded at a task that required 650 lines of code (LoC), 4.5 succeeded at a 1,200 LoC task, and 4.6 succeeded at a 7,700 LoC task. Epoch estimates that a human coder would have needed several weeks to succeed at the same task.
This aligns well with Ryan Greenblatt’s recent piece arguing that AI can now accomplish difficult tasks that would take experts months or years to complete if the tasks are sufficiently easy to verify. An obvious corollary is that there is immense alpha in making more tasks highly verifiable.
The tolerance gap
Minh Pham coins “the Tolerance Gap” as a tool for thinking about how AI can be usefully applied to different types of tasks. High-tolerance tasks (vibe coding) can tolerate significant errors in exchange for high productivity, while low-tolerance tasks (accounting) cannot. It’s a great term for a useful concept.
This advice seems spot-on, and a good example of the concept in action:

For founders: pick a side of the Gap and commit. A product that tries to straddle both regimes usually fails both. The winners on the high-tolerance side are shipping agents, raising autonomy, racing on horizon length. The winners on the low-tolerance side are (quietly) building verification layers, domain-specific guardrails, and human-in-the-loop tooling that treats the model as one input among many.
Open-world evaluations for measuring frontier AI capabilities
Sayash Kapoor and Arvind Narayanan have a comprehensive paper on “open-world evaluations: long-horizon tasks in real-world environments, where success can’t be neatly specified or automatically graded.” They review recent examples and present a framework for thinking about how open-world evaluations work, what their limitations are, and how to best make use of them.
These types of evaluations are harder to create and don’t lend themselves well to easy comparisons between models. But they are perhaps the best way to assess the full capabilities of frontier models.
Alignment and interpretability
Current AIs seem pretty misaligned to me
Ryan Greenblatt is concerned about the state of alignment:

Many people—especially AI company employees—believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions). I disagree.
Ryan argues that although we see little evidence of malicious misbehavior, current models show a clear pattern of laziness, overeagerness, and misrepresenting the success of their work. While it’s currently mostly annoying,

I still think this misalignment is indicative of serious problems and would ultimately be existentially catastrophic if not solved.
It’s a thoughtful piece and I’m updating my beliefs based on it. I’m not convinced, however, that this type of misalignment would be catastrophic: there are plausible scenarios where that might be the case, but I’m not sure that’s the default path. He notes that Alex Mallen will soon post more about this—I’m excited to read that.
I’m also more optimistic that this class of misalignment will get fixed: the associated problems seem highly legible, and the incentives to fix them seem strong.
Agents
Managing context in Claude Code
If you’re just running a Claude Code session forever and letting it auto-compact when the context window gets full, you’re leaving a ton of performance on the table. Anthropic’s Thariq has a detailed guide to the tools and strategies you should be using to manage your context.
Math
The AI revolution in math has arrived
Quanta takes an in-depth look at AI and advanced math. Math is an area where AI capabilities are advancing rapidly: although it isn’t anywhere close to being able to replace mathematicians, it’s increasingly able to provide substantive assistance with solving hard problems:

Gómez-Serrano noted that any one of their results might have been obtained by an expert in a given area who worked at it for a few months. But without being experts in many of these fields, “we were able to obtain comparable results in the span of a day or two,” he said.
What it looks like to do math with AI
One of my fellow residents at Inkhaven is Benjamin Grayzel, who submitted the first AI-ideated resolution to Erdős problem #659. He’s written an excellent account of what it looks like to do math with AI.
AI psychology
Should we care about AI welfare?
Conspicuous Cognition talks with Robert Long about AI consciousness and welfare. They discuss how Claude perceives itself, whether it’s ethical to create a being that genuinely wants to serve others, and how AI welfare and AI safety might be related. Rob’s idea that consciousness and moral status might be decoupled seems important but confusing—that’s a reflection of the complexity that surrounds any discussion of consciousness.
Open models
My bets on open models, mid-2026
Nathan Lambert shares 13 beliefs about open models in mid-2026. This feels like a transition time for open models, where the current business model isn’t holding up but it isn’t yet clear what replaces it.

This is a complex picture, where the long-term trajectory is more of an economics question rather than an ability one.
Strategy and politics
Dwarkesh interviews Jensen Huang
Dwarkesh recently interviewed Jensen Huang. It’s worth listening to if you’re deeply interested in the details of the GPU business, probably not otherwise. The part that upset the Twitterati is the discussion about whether we should allow NVIDIA to sell high-end chips to China. Zvi’s assessment is exactly right:

What matters is Nvidia selling chips to China. That’s it. Nothing else matters. That keeps Nvidia and CUDA dominant, and what’s good for Nvidia is good for America, because if anything is built on his chips then that’s ‘good news’ and we win, whereas if it’s built on someone else’s chips, then that is ‘bad news’ and we lose. This does not actually make any sense whatsoever.
What’s confusing here is that Jensen is determined to sell advanced chips to China, even though he would have no trouble selling those same chips domestically. I’m unable to come up with a charitable explanation.
Academia
Accelerating academic research with agentic AI
Andy Hall runs a new lab focused on using agentic AI for academic research. As part of a series by the Roots of Progress Institute, he discusses what his team has learned so far:

Any one of these projects would have been extremely difficult to carry out a year ago, requiring intensive focus over many months. Completing multiple ambitious public-impact projects in a two-month period would have been completely unthinkable.
The challenge will be to ensure, as Andy says, that we generate 100x as much knowledge, not 100x as many papers.
Briefly
Resources for upskilling in AI policy
80,000 Hours has a list of resources for people who want to get started in AI policy.
Something frivolous
Andon market
You know what’s more fun than letting an AI run a vending machine? Letting it run a physical store.