There is a risk that AIs are the winners, and no human countries or human individuals win (this could take the form of extinction or permanent disempowerment). And this risk plausibly declines over time (rather than uncertainty about a fixed risk merely getting resolved), as better understanding of how to manage the technical and social problems becomes available. What a ban/pause buys humanity is this improvement in the chances that there are any winners for humanity at all, once sufficiently capable AIs are eventually built.
(This can make unilaterally defecting undesirable, because the defector loses too, getting a worse outcome than if they hadn't defected, so there is no temptation to cheat sooner rather than later. And there is a simple if painful way of ensuring verification: nobody builds fabs that are too advanced, or giant datacenters. There are less painful ways to ensure verification that maintain much of the effect.)
An analysis that ignores this factor could be useful in that it explores the possibility of agreement on policy with people who categorically disbelieve in literal extinction or permanent disempowerment (where AIs end up controlling almost all resources and the future of humanity is left with scraps, with no remaining possibility of ever changing this), or who disbelieve that these risks can improve if humanity takes time and doesn't build powerful AIs immediately, as soon as technologically possible.
Also, as more understanding accumulates and the world gets closer to building sufficiently capable AIs safely, the same issues would emerge as in a world that doesn't significantly risk extinction or permanent disempowerment for all of humanity (or where this risk is constant and isn't being reduced by a longer or better-coordinated ban/pause). So safely winding down a ban/pause towards its end is also an important class of problems, even in a ban/pause world that solves most of the other problems.
Thanks for the comment, I think you have some great points. I am going to try to explain my responses, even though I suspect they may be as unpopular as the post already is. One quick thing at the outset: my essay is not based on disbelieving extinction or terrible outcomes risk, my claim is pause is dominated even if you take that risk very seriously (which I do.)
My model does assume that V>0, and if the winner might face catastrophic devastation then V is a lottery rather than a clear reward. That said, I think adding alignment risks makes the argument harder to escape. The verification problem is the same because you can't tell whether a rival's paused compute is doing alignment work or covertly defecting. Restricting fabs or datacenters is definitely stronger than software monitoring (it's much easier to check), but θ∗ is a capability threshold rather than a compute threshold. It seems like an unpredictable thing to estimate as well; I wonder if a verification plan built around compute ceilings is just betting that the compute-to-capability mapping holds (I don't mean to argue against the bitter lesson, just that there are presumably software breakthroughs that could occur, and have occurred).
I also think your point about the risk changing over time, versus uncertainty about a fixed risk getting resolved, is important. My perspective is that even granting that the risk decreases with time, it is still non-canonical in the sense that I discuss in my other essay, Unprecedented Catastrophes Have Non-Canonical Probabilities. The splice model means that incoming data looks equally consistent with "evidence of misalignment" and "evidence of alignment". I don't think a pause buys time to resolve this into a convergence on safety; I think it buys more time for genuine disagreement about whether the pause is working.
My major concern is that a pause structurally encourages defectors who are adversely selected for recklessness. The defectors are presumably the ones least likely to want to build safe superintelligence, because they believe they should cut every corner to cross the threshold first.
Overall, thank you again for the thoughtful comment. For me the biggest question is whether alignment progress is measurable and canonically verifiable.
(Also thanks to those who read and are giving me any feedback. I know this is contrarian, but I am always hopeful of receiving a fair shake here!)
Good post, I've thought a lot along similar lines. Except I lean further left than most of LW, and don't trust governments and corporations at all, so my preferred solution would be entangling AI training among people, across borders and class lines as much as possible.
It does lead to some surprising conclusions. For example, someone who works on alignment at BigCo is making the world a worse place (because BigCo's owners will just use the alignment work to race harder, as happened with RLHF). While someone who works on capability but in an open, GPL-like way is making the world a better place (by removing that capability from the race, making the race less winner-take-all). Counterintuitive, but I think I stand by it.
Thank you so much for the support and read through! Truly made my day.
Have you written about your thoughts on entangling AI among people broadly? I'd love to read more about that idea to think more about it.
Why Pause?
The idea of a global pause feels natural: we can see that AI is changing everything, and no one can deny that there is a chance things could go very poorly, so wouldn’t it be good to pause to give us more time to figure things out? Depending on how you view the stakes, the idea could even be viewed as a moral obligation. Unfortunately, I think pause is almost certainly an impossibility due to the structure of the game being played. Pausing ends up as a robustly dominated strategy that no rational self-preserving actor can choose regardless of how dangerous they think superintelligence might be.
For pause to work, four things would need to be true, and none of them are: the game would need to continue indefinitely, compliance would need to be verifiable, pausing would need to be decision-theoretically rational, and uncertainty would need to favor caution. I am going to work through each of these in turn.
The Game Does Not Continue Indefinitely
Every call for a coordinated pause, regardless of whether it is a voluntary moratorium, a binding treaty, or a formal non-proliferation agreement[2], implicitly relies on the Folk Theorem of repeated games. The concept is simple: cooperation is maintained because the discounted value of continued play in the future is greater than the temptation to cheat. This is the basic idea that explains why we didn't all die in a nuclear apocalypse during the Cold War, and, fingers crossed if you are reading this, still haven't. Everyone understands that this game doesn't end.
The race to superintelligence is different because it has an “absorbing” state: a state that, once entered, ends the game forever. In this case, the game could end forever because the first to superintelligence could gain omnipotent, unipolar control. However, we don't actually need a condition this strict; we only need a weaker threshold property: that there exists some capability level θ∗ such that beyond θ∗ the strategic advantage becomes self-reinforcing.
Any actor with AI like this can improve faster, beat their competitors more thoroughly, etc. in a way that trailing actors can't keep up with, eventually causing the trailing actors to discontinuously shift from “compete” to “capitulate”. When that happens, the game is over. This violates a premise of the Folk Theorem, which requires the cumulative probability of the game continuing forever to be strictly positive. [3]
This is immensely problematic because in any given period a rational actor can reach θ∗, end the game, and win a payoff that is much larger than the result from continued cooperation. The whole apparatus that underpins nuclear deterrence dissolves on this ground alone for AI; it's just too beneficial to defect.
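To make the threshold property concrete, here is a toy simulation. The growth rates, starting capabilities, and the θ∗ value are all made-up illustrative numbers, not estimates; the point is only the shape of the dynamic: once the leader crosses θ∗, its improvement compounds faster and the gap becomes unrecoverable.

```python
# Toy model of the threshold property: below theta_star both actors improve
# at the same rate; above it, the leader's self-improvement compounds faster,
# so a small initial lead becomes a decisive, self-reinforcing advantage.
def simulate(lead, lag, theta_star=10.0, steps=40):
    for _ in range(steps):
        lead *= 1.10 if lead < theta_star else 1.30  # recursive gains past theta_star
        lag *= 1.10                                   # trailing actor never crosses
    return lead, lag

lead, lag = simulate(lead=2.0, lag=1.9)
print(lead / lag)  # a ~5% starting lead becomes a roughly 50x capability gap
```

The discontinuous shift from “compete” to “capitulate” is just this ratio passing whatever level makes competition pointless for the trailing actor.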
Compliance is Not Verifiable
Think about this: merely having a nuclear weapon doesn't win the game, using it does. And if an agent uses a nuclear bomb, all the other agents would know in minutes and launch retaliatory strikes. In other words, no one can secretly win the war. The path to superintelligence is different because crossing θ∗ is the decisive act, and it can be done in secret with significant detection lag for everyone else. If the detection lag is longer than the time it takes to solidify a self-reinforcing advantage, then any retaliatory mechanism is obsolete. Worse, this detection lag sits inside an unresolvable identification region, which makes no retaliation credible.
As I mentioned in my essay on canonical probabilities[4], general-purpose software natively contains “splice programs”: research programs that look exactly like normal narrow AI development or safety research until the moment they cross θ∗. It's possible to show these scenarios are mathematically indistinguishable by comparing the observable data each generates. Let P_C be the measure over observable traces (compute usage, research, etc.) under genuine compliance, and P_D be the measure under covert defection. The Evidential Screening Property guarantees that the likelihood ratio between these models is bounded by a tiny algorithmic complexity penalty:
log [ P_D(trace) / P_C(trace) ] ≤ K,
where K grows at most logarithmically in the amount of monitoring data. The critical issue is that no monitoring scheme, no matter how detailed, can distinguish compliance from covert defection with canonical confidence before the threshold is crossed. Nuclear development has unambiguous evidence leakage (the dormancy defect δ, the rate at which covert programs leak observable evidence, is meaningfully greater than zero, so data can accumulate against covert nuclear programs and likelihood ratios grow over time), but AI development does not (δ is near zero). Please check out my other essay if you want to see more of the reasoning behind this.
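The dormancy contrast can be illustrated with a toy monitoring model. The leak rate and evidence-per-leak numbers below are hypothetical; this is a sketch of the shape of the argument, not a claim about real inspection regimes.

```python
import random

def accumulated_evidence(leak_rate, inspections, bits_per_leak=2.0, seed=0):
    """Log-likelihood evidence (in bits) a monitor accumulates against a
    covert program, when each inspection independently catches an observable
    leak with probability leak_rate, each worth bits_per_leak bits."""
    rng = random.Random(seed)
    evidence = 0.0
    for _ in range(inspections):
        if rng.random() < leak_rate:
            evidence += bits_per_leak
    return evidence

# Nuclear-style program: leaks are common, so evidence grows roughly linearly
# with the number of inspections and eventually becomes damning.
print(accumulated_evidence(leak_rate=0.3, inspections=1000))
# AI "splice" program: leak rate near zero, so the likelihood ratio stays
# bounded and the monitor never reaches canonical confidence.
print(accumulated_evidence(leak_rate=0.0, inspections=1000))  # → 0.0
```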
Pausing is Not Decision-Theoretically Rational
Let’s backtrack for a moment and assume the two points above were wrong (the game continued indefinitely and pausing was verifiable). Even then, pause is robustly dominated. If you erroneously use infinite payoffs[5] in your payoff matrix, it’s easy to get stuck in Pascal’s Mugging situations, so I don't want to use anything with infinity here.
The probabilities that would allow an agent to figure out whether pausing is rational are stuck in an uncertain range (an “identification region”). Because the agent has no way of knowing the exact chances, the best it can do is minimize its worst-case regret (minimax regret). To show this, let's define prior-free identification regions for the key parameters: the probability p that a rival is covertly defecting lies in [0, p̄] with p̄ > 0, the probability q that a covert program succeeds in crossing θ∗ lies in [0, q̄] with q̄ > 0, and the unipolar threshold advantage V exceeds the cost of getting caught defecting plus the status quo payoff.
Now we will make only one overall assumption: that we cannot canonically exclude the possibility of secret defection, that we can't exclude the possibility of its success, and that the unipolar threshold advantage is worth more than the cost of getting caught defecting plus the status quo.
Next we apply minimax regret, where the regret of an action is the difference between the payoff you received and the payoff you would have received had you chosen optimally: regret(a, s) = max over a′ of U(a′, s) − U(a, s).
Because evidential screening ensures the identification region does not shrink over time, a pausing actor can never resolve the ambiguity and is always exposed to the worst-case regret. The regret of pausing scales with V, which is very large, while the regret of defecting is finite. So defecting strictly minimizes maximum expected regret across the entire identification region.
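Here is a minimal numeric sketch of that regret comparison. The payoff numbers (V, the sanction for getting caught, the status quo value) are invented purely for illustration:

```python
# Hypothetical one-shot payoffs for a single actor. V is the unipolar prize.
# State 1: monitoring works and rivals comply (a defector is caught, sanctioned).
# State 2: a rival is already covertly defecting (a pauser is disempowered).
V = 100.0
payoffs = {
    "pause":  {"monitoring_works": 1.0,  "rival_defects": 0.0},
    "defect": {"monitoring_works": -5.0, "rival_defects": V / 2},
}
states = ["monitoring_works", "rival_defects"]

def regret(action, state):
    """Regret: best achievable payoff in this state minus what you got."""
    best = max(payoffs[a][state] for a in payoffs)
    return best - payoffs[action][state]

max_regret = {a: max(regret(a, s) for s in states) for a in payoffs}
print(max_regret)  # pause's worst-case regret scales with V; defect's does not
```

Scale V up and pause's maximum regret grows with it, while defect's stays pinned at the (finite) sanction cost, which is the dominance claim in miniature.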
Uncertainty Does Not Favor Caution
This section feels bad to write because it is very counter-intuitive. Most of the time when you are uncertain about risks, it makes all the sense in the world to be cautious and pause. If you are in the mountains and want to cross a slope but are not sure if it is going to avalanche, it’s totally reasonable to not cross it or wait and get more data.
But in the AI development game, irreducible uncertainty is anti-cautious. What I mean is that for pause to be a rational strategy, a pausing actor must have canonical confidence that their adversary's defection probability is very close to zero. But the non-canonical framework shows that an actor literally can't pin down this probability on any policy-relevant timeline. So pausing isn't safe in any traditional sense; it asks nation-states to make a sovereignty-level bet on a parameter that is provably unknowable.
Even worse, this has a sort of ratchet effect: once a nation tastes the forbidden sweetness of defecting, each increment of progress lowers the remaining cost to reach θ∗, so the rational incentive to continue becomes stronger over time. All actors know the other actors face this exact incentive structure, so the probability that one adversary has already begun to covertly defect increases monotonically over time. The whole equilibrium unspools from the future back to the present, and current actors are forced to preemptively defect today.
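That unspooling can be sketched as a toy backward induction. The per-period cooperation value and the defection bonus are hypothetical constants, chosen only to show the mechanism:

```python
# If the only thing sustaining cooperation is the value of future play, and
# defecting in the final period T is strictly better, backward induction
# removes cooperation one period at a time, all the way back to period 1.
def first_defection_period(T, coop_payoff=1.0, defect_bonus=5.0):
    cooperate_until = T  # latest period at which cooperation still survives
    for t in range(T, 0, -1):
        future_value = coop_payoff * (cooperate_until - t)  # value of play after t
        if defect_bonus > future_value:  # temptation beats continuation value
            cooperate_until = t          # cooperation unravels back to period t
    return cooperate_until

print(first_defection_period(T=100))  # → 1: rational actors defect immediately
```

With a small temptation (try `defect_bonus=0.5`) cooperation survives until the horizon; with a large one, the whole equilibrium collapses to defecting in period 1, which is the "current actors preemptively defect today" claim.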
Conclusion
In an attempt to offer a constructive solution, one counterintuitive idea I had (one that I only hold loosely) is that more heavily entangling AI development among nations, rather than trying to segregate it under a racing development paradigm, may be beneficial. In other words, if the US opened the floodgates to Chinese firms and they jointly worked together, then it is the world that approaches θ∗, not a single nation or corporation. This fundamentally disrupts the payoff matrix before the absorbing state is reached. It feels odd even to me, but there is a logic to it.
In conclusion, strategies ported from the 20th century are appealing because they are familiar and have served humanity well. It would be awesome if superintelligence had the same qualities that made the last century's crises so tractable. It doesn't. It would be better for humanity if we stopped pretending it did.
I tried to think of one solution at the conclusion, but it may be a really bad idea.
I wrote a more narrow critique on this topic here: The Jackpot Jinx (or why “Superintelligence Strategy” is wrong).
The cumulative probability of the game continuing forever, ∏ₜ (1 − pₜ), must be strictly positive, where pₜ is the probability that someone reaches the absorbing state in period t. The infinite product is positive iff ∑ₜ pₜ < ∞. As global compute scales, more actors enter the race, and AI becomes better, pₜ is not decaying; rather it is likely bounded away from 0 or increasing, so the sum diverges and the product converges to zero. Importantly, unlike the exogenous termination risks that the Folk Theorem can accommodate, reaching θ∗ is an endogenous choice that rewards the actor who makes it, meaning the game wouldn't end randomly; it ends because a player chose to win.
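A quick numeric sanity check on this footnote (the 5% per-period crossing probability is an arbitrary illustrative value): with pₜ bounded away from zero, the probability that the game is still running decays geometrically toward zero.

```python
# Probability the game survives `periods` rounds, i.e. prod(1 - p_t),
# with a constant per-period probability p that someone crosses theta_star.
def survival_probability(p, periods):
    prob = 1.0
    for _ in range(periods):
        prob *= 1.0 - p
    return prob

for years in (10, 50, 100):
    print(years, round(survival_probability(0.05, years), 4))
# 10 -> ~0.60, 50 -> ~0.08, 100 -> ~0.006: "the game continues forever"
# has probability zero, so the Folk Theorem's premise fails.
```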
Unprecedented Catastrophes Have Non-Canonical Probabilities.
As I stupidly did in my previous paper The Jackpot Jinx.
Note: this could be contestable if is a nuclear war or something terrible.