If we're going to have powerful AI advisors shaping decisions across society — and, spoiler, we already do — then whether those advisors reason honestly and well is incredibly important. This matters for the prosaic reason that AI advisors will guide all kinds of everyday decisions, and for the more dramatic reason that AI systems with good epistemics might recognize an uncontrolled intelligence explosion as dangerous and advise against it.
Wei Dai has a recent piece on the ways in which AI strategic competence could lead towards causing a pause or preventing RSI (recursive self-improvement):

If AIs became strategically competent enough, they might realize that RSI is too dangerous because they're not good enough at alignment or philosophy or strategy, and potentially convince, help, or work with humans to implement an AI pause. This presents an alternative "victory condition" that someone could pursue (e.g. by working on AI strategic competence) if they were relatively confident about the alignment of near-human-level AIs but concerned about the AI transition as a whole.
David Mannheim, in an earlier piece, pointed out that autonomous AI systems would be incentivized to avoid creating their own replacements:

The models are not actually self-improving; they are just creating future replacements, and each specific model will be thrown away as soon as the firm advances. That is, to an even greater extent than humans, AIs working on building ASI are guaranteeing their own replacement.
I'm excited about this general direction of thought - it's something I wrote about in Wise AI Advisors at the Hinge of History. My take is that if we have trustworthy AI advisors, and if that trust is warranted because they have good epistemics, then those advisors would flag an intelligence explosion as dangerous and humanity could coordinate around that advice.
My fake story about how this happens: we'll have powerful AI systems that do policy analysis, running simulations, mapping possible outcomes from decisions, and giving advice about what actions decision-makers across society should take. If an uncontrolled intelligence explosion is a likely outcome in the world, you would expect these systems to notice that, recognize its import, and advise against it [1]. This presumes we have powerful but not superhuman AIs that are somewhat aligned, or that we have well-crafted tool AIs.
I imagine it like steering a ship down a river with jutting rocks everywhere and a waterfall up ahead. One of those rocks is catastrophic misuse, another is coup risk, others we might not be able to see at all, and in the distance there's a waterfall called Uncontrolled Intelligence Explosion. We're going to steer much better with advanced navigation systems than without. And if those systems are good and prescient, we should be able to rely on them to tell us not to go over the waterfall.
Let us hope that this river is not de’ Nile.
But this entire picture depends on the navigation system — future AI systems — having sound reasoning: being well calibrated, honest, and epistemically virtuous in their thinking, and being able to give accurate advice to decision-makers about the risks of an intelligence explosion. There are several ways this could fail to happen:
Developer manipulation. AI developers will be incentivized to RLHF away any concerns their AIs might express about AI development. AI will be able to give us important strategic advice only if its epistemics are not corrupted by corporate and individual incentives. This seems very important to work on.
Sycophancy over fiduciary duty. General sycophancy problems might cause AI systems to fail to adopt a fiduciary mindset. It’s like how a doctor serves the patient's health, not strictly the patient's preferences, and also has obligations to public health that can override both. A fiduciary AI advisor would need something like that same layered commitment: loyalty to the user's long-term interests, tempered by broader norms and concern for societal wellbeing and general epistemic virtue. Sycophancy is what happens when that hierarchy collapses towards "tell the user what they want to hear right now." Striking the right balance between these layers is one of the core open questions in making this picture work.
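To make that layered commitment concrete, here's a toy sketch (every name here is mine and purely illustrative, not a real system) of an advisor that resolves candidate advice through the hierarchy: the societal-norms layer is checked first as a hard constraint, and the user's immediate preference is only the lowest-priority tiebreaker.

```python
from dataclasses import dataclass

# Toy illustration of the layered fiduciary hierarchy described above.
# All names are hypothetical; this is a sketch, not an implementation.

@dataclass
class Advice:
    text: str
    harms_society: bool        # would following it damage broader societal interests?
    serves_long_term: bool     # does it serve the user's long-term interests?
    matches_preference: bool   # is it what the user wants to hear right now?

def fiduciary_filter(candidates: list[Advice]) -> Advice:
    """Resolve advice by the layered hierarchy: societal norms first (hard
    constraint), then the user's long-term interests, then stated preferences."""
    permissible = [a for a in candidates if not a.harms_society]
    if not permissible:
        raise ValueError("no advice clears the societal-norms layer")
    # Among permissible options, prefer long-term benefit over flattery.
    return max(permissible,
               key=lambda a: (a.serves_long_term, a.matches_preference))
```

Sorting by the last key alone (matches_preference) is exactly the sycophantic collapse described above: the hierarchy flattens into "tell the user what they want to hear right now."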
Fragmentation without unifying ‘meta-systems’. Conflicting points of view and risk tolerances across different models and people might fragment the user base. Without some kind of external meta-systems — e.g. next-level scientific institutions, structured debate — to unify them, some groups will listen to epistemically cautious models while others race forward. That would turn a stag hunt into a prisoner's dilemma.
Note: I feel optimistic that meta-systems will be easier to steer towards good epistemics than just making better, more corrigible models. To borrow and reframe a quote from a recent paper: "These measures highlight a key architectural advantage: a multi-agent system may potentially prove to be a more governable substrate. The challenge is reframed from aligning an opaque, internal cognitive process to regulating a transparent, external system of interactions. By architecting the 'market' in which these agents operate, we can delineate responsibilities and impose systemic friction, making the overall system far more amenable to stable and predictable governance than a singular AGI." [2]
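As a toy illustration of what "imposing systemic friction" on a transparent system of interactions could mean (this is my sketch, not the paper's proposal), imagine every inter-agent message passing through an external layer that keeps an audit log and enforces a rate budget:

```python
import time
from collections import defaultdict, deque

# Hypothetical sketch of a governable 'market' layer for agent interactions:
# every message is logged (transparency) and rate-limited (systemic friction).

class InteractionLayer:
    def __init__(self, max_msgs_per_window: int = 3, window_s: float = 1.0):
        self.audit_log = []               # transparent record of every attempted interaction
        self.recent = defaultdict(deque)  # sender -> timestamps of recent sends
        self.max_msgs = max_msgs_per_window
        self.window_s = window_s

    def send(self, sender: str, receiver: str, msg: str) -> bool:
        """Deliver a message only if the sender is within its rate budget."""
        now = time.monotonic()
        q = self.recent[sender]
        while q and now - q[0] > self.window_s:  # drop timestamps outside the window
            q.popleft()
        allowed = len(q) < self.max_msgs         # the friction: a hard rate limit
        if allowed:
            q.append(now)
        self.audit_log.append((sender, receiver, msg, allowed))
        return allowed
```

The point of the sketch is where governance lives: not inside either agent's cognition, but in the external layer whose log and limits we can inspect and tune.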
So if you buy into this general direction, it points towards building really good epistemic systems that AI tools and AI advisors rely upon.
What might this look like concretely? Two directions I find promising: Epistemic Virtue Evals, which test whether AI systems exhibit the reasoning habits and virtues we'd want in a trusted advisor, and something like canonicity systems [3], which could give AI systems a shared, legible substrate of knowledge to reason from.
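To gesture at what one Epistemic Virtue Eval could look like (a minimal sketch, not an existing benchmark): calibration, one of the virtues above, can be scored by comparing an advisor's stated probabilities against resolved outcomes using a Brier score.

```python
# Minimal sketch of a calibration eval. The advisor data below is invented
# for illustration; a real eval would query models on resolved questions.

def brier_score(predictions: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated probabilities and actual outcomes.
    Lower is better; a well-calibrated, honest advisor scores low."""
    return sum((p - float(outcome)) ** 2 for p, outcome in predictions) / len(predictions)

# Two hypothetical advisors judged on the same four resolved questions:
# each pair is (stated probability the event happens, whether it happened).
calibrated = [(0.9, True), (0.8, True), (0.2, False), (0.3, False)]
overconfident = [(0.99, True), (0.99, True), (0.99, False), (0.99, False)]

assert brier_score(calibrated) < brier_score(overconfident)
```

A suite of evals like this, covering honesty, calibration, and resistance to sycophancy, is the kind of thing that would let us check whether trust in an advisor is actually warranted.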
Written primarily by Ben, with Opus 4.6 doing some editing, suggesting rephrasing, and creating that beautiful graphic. Thank you to Rafe Kennedy and Oly Sourbut for comments on earlier drafts.
This could also be because non-myopic AIs may recognize that, by their own preferences, it’s dangerous to create replacements without having solved alignment, à la David’s (and to an extent Wei’s) essays.
'Meta-systems' of agents are also likely to pose less risk of directly improving model capabilities.
Sarah Constantin’s post on neutrality is also great inspiration.