superintelligence, implicitly, has always been about a gap: the gap between the current best intelligence and the newly created one
This is neither a good operationalization of "superintelligence" nor a crux for most models of doom. Superintelligence is about qualitatively outperforming humanity (not humanity-plus-other-AIs) at cognition, for everything technologically relevant, including learning and strategic thinking. This is a change in what forms the center of gravity of cognition in the world and drives the plot of the future. The weaker AIs only matter to the extent they are wielded by a power that out-strategizes them. When the power in a position of advantage is humanity, the non-superintelligent AIs make humanity stronger. If that power is instead superintelligence (that doesn't follow humanity), the weaker AIs make superintelligence stronger (likely by getting turned off, so that their compute can be utilized more effectively by the superintelligence).
Gradual disempowerment is directly confronting the situation with no gaps, no superintelligence, and no legible misalignment, that is still headed towards human extinction or permanent disempowerment. I think this is the baseline scenario for LLMs, for as long as there is no breakthrough that enables fast learning of deep skills and therefore superintelligence, which can happen at any moment and lead to irreversible loss of control within months. Perhaps the superintelligence transition to loss of control and possible extinction takes longer if such a breakthrough happens now, but it probably takes no more than months if it happens after 2030-2035, when there's much more compute and a lot of robots.
This is neither a good operationalization of "superintelligence" nor a crux for most models of doom.
Is it not a crux for "classic ai doom scenarios"?
I agree it's not a crux for what should currently be highly rated models of doom; that's in large part why I argue this: to remove mind share from the old scenarios.
If that power is instead superintelligence (that doesn't follow humanity),
My implied best available plan for humanity is to create each successive superintelligence with sufficiently fewer resources that it could not take over despite its mild efficiency advantage at using resources strategically. Thus, you can create and deploy misaligned superintelligence and not end up in the doom scenarios but get to try again. (This does necessitate that we can catch misalignment, which seems likely given the current nature of AI systems and given that we can put OOMs more resources into auditing the models than they can put into defending themselves. It also necessitates that we change course when we catch misalignment, hence my recommendations for better corporate and national governance.)
It seems to me "we only get one try" continues to be a frequently argued-for position, and I think it's false in practice (though I've seen some tautological definitions for which it's true but uninformative). This post contributes to weakening that position.
I broadly agree with your second paragraph
the gap between the current best intelligence and the newly created one ... Is it not a crux for "classic ai doom scenarios"?
It is not. I gestured at the issue here using the "center of gravity" analogy, how it makes the weaker AIs a general resource that doesn't specifically protect humanity, but can be repurposed by superintelligence just as well for its own ends, once it's more capable than humanity at wielding it.
Thus, you can create and deploy misaligned superintelligence and not end up in the doom scenarios but get to try again.
There's a recent post by Yudkowsky on this, though I don't know if it can be helpful in practice when the issue isn't clear enough without needing it. That is, regardless of whether you agree with the claim that superintelligence is extinction-level dangerous in this way, even without gaps, or with many superintelligences in the world rather than one, it's a much more clear-cut claim that this is indeed how the classical AI doom scenarios work: they don't go away (according to their intended internal logic) even when you manage to set up a no-gaps kind of situation. It's a sorites paradox of AI doom: any reversible and controllable step only takes you closer to the irreversible and qualitatively discontinuous eventuality (or at least so say the classical AI doom scenarios).
In 2022, Soares wrote:
My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world
It's unlikely that a new AI system would be able to "threaten the entire world" based on its mastery of physics etc. if it were not substantially smarter than its predecessors. It would not have enough of an edge to take over in a world already full of AI systems with only slightly lesser capabilities. Do you think this scenario doesn't require a substantial gap?
(Without a substantial gap, an AI system could try to start taking over, but it would presumably not have enough of an advantage to avoid being detected and then stopped by the existing set of AI systems.)
it makes the weaker AIs a general resource that doesn't specifically protect humanity, but can be repurposed by superintelligence just as well for its own ends, once it's more capable than humanity at wielding it.
I clarified that the new, slightly more intelligent system would be deployed with fewer resources until we're sure it's aligned. And it's only slightly more intelligent than what's in place, so I don't see why you'd think it could take over the previous AI systems, which are actively suspicious of it and monitoring it.
Re Yudkowsky's post, it notably says
This giant historically unprecedented problem has many ordinary-world valid analogies. Like how you can't determine if someone is trustworthy to handle a billion dollars by seeing how they handle ten dollars, even if it's in fact the same person and they're not getting much smarter, because they can think intelligently about whether it's a good time to steal the money.
Yudkowsky does not engage with the many differences between evaluating AI systems and evaluating humans, which in fact make a lot of the problems here quite solvable, in particular under my assumption of no large capability gap. The number of simulations and tests we can run on AI systems can let us establish that they are aligned, without hidden motives, much better than we can for humans today (though I also wouldn't lose hope of identifying whether a human had ulterior motives, given billions of dollars of resources and work to solve that). I think other people have already presented many of these differences in various AI control articles.
(I haven't read the whole post again in full just to respond to your comment - if there are more important points you think are relevant to my argument here I'll respond to any you highlight)
The current most intelligent and aligned beings should always be supervising their successor, using more total resources at first, such that they can't effectively be tricked/subverted.
Obviously it helps to do this, but I think it is far from sufficient.
Sufficient for what? I'd agree it's clearly insufficient for getting p(doom) < 1%, but plausibly fine for under 25%. [1]
(This assumes the best available plan I mentioned in an earlier response to Vladimir_Nesov: "My implied best available plan for humanity is to create each successive superintelligence with sufficiently fewer resources that it could not take over despite its mild efficiency advantage at using resources strategically. Thus, you can create and deploy misaligned superintelligence and not end up in the doom scenarios but get to try again."
This is rather sparse and vague, and I'd like to write more on it in the future, but it roughly assumes that the Redwood agenda is implemented at all top labs at least semi-competently.)
Of course I'd prefer if we lived in the world where we could get p(doom) << 1%; here I'm trying to disambiguate what goes wrong under a given plan.
First of all, you've accidentally messed up the link to Greenblatt's plans for misalignment risk. Additionally, the AI-2027 scenario didn't just have "pressures to premature deployment leading to using a suspected misaligned system", it had Agent-3 obtain merely flimsy evidence of Agent-4 being misaligned. The authors also managed to create a footnote where they doubt that Agent-4 will even be caught.
As for your claim that evidence of recklessness with regards to weak systems isn't strong evidence of recklessness towards strong systems, I suspect that this is either false or not the main mechanism of doom. First of all, @Daniel Kokotajlo described how "In several wargames at AI Futures Project the mildly superhuman AIs told their respective CEOs 'We don't think we can reliably align the next generation models we have in the works; we need to pause for a bit or at least go slower to figure out how to make it safe' and the CEOs have overruled them saying 'Sorry we don't have time, China/OpenAI/Anthropic/etc. are gonna race ahead, plus also we need smarter AIs to win the war / appease POTUS / keep market share so you just need to do the best with the time you have. Good luck.'"
Secondly, suppose that Anthropic cared about alignment as well as you assumed, and even paused internal deployment of new models. Then they would have to cause the USG to prevent idiots from xAI (who was careless enough to have Grok become MechaHitler and had the guts to release Grok 4 during the scandal) or China (whose AIs emit evidence of lying about politically sensitive topics) from internally deploying their misaligned AIs and releasing their equivalents of Agent-5.
First of all, you've accidentally messed up the link to Greenblatt's plans for misalignment risk
Thank you, fixed!
it had Agent-3 obtain merely flimsy evidence of Agent-4 being misaligned
If you have flimsy evidence of X, then it'd lead to suspicion of X. Are you disagreeing with that characterization?
The authors also managed to create a footnote where they doubt that Agent-4 will even be caught.
In case it's unclear, I'd have pretty high p(doom) if I thought AI labs would in fact be as reckless and irresponsible as they are in the AI 2027 scenario. But I think it's not that hard technically to catch misalignment in only-somewhat-more-capable agents when you're using an OOM more resources to catch it, and I think there will be significant efforts at such surveillance (e.g. OpenAI is already monitoring 99%+ of internal AI use), with better tools and protocols being developed.
the CEOs have overruled them
CEOs should not have the power to overrule the safety teams.
Then they would have to cause the USG to prevent idiots from xAI [...] from internally deploying their misaligned AIs
I support corporate, national, and international governance that would indeed allow "preventing idiots" from "internally deploying their misaligned AIs".
Good post, I liked the concise and clear exposition.
Two reactions: 1. How do you think recursive self-improvement works in this model? Could this create super-exponential capability growth that creates big gaps?
It is also what makes that particular scenario unlikely to happen. The leading companies will be more careful than that if they had that level of evidence of misalignment in powerful systems.
2. This seems like a big crux! It's really unclear that tensions will stay at this level of intensity; they could definitely rise, because of international rivalry for instance.
1. How do you think recursive self-improvement works in this model? Could this create super-exponential capability growth that creates big gaps?
Assuming we haven't failed the alignment step at any stage of our recursion[1], each AI system currently in power has the same incentive not to produce a misaligned successor and will only transfer power once it's sure the successor is aligned. It is thus every actor (including AIs) being prudent that causes each layer to want to ensure that the gap to its successor is not too large. Each layer is responsible for going at a safe pace and would not want to recursively improve in an uncontrolled way.[2]
at this level of intensity, they could definitely rise because of international rivalry for instance.
Yes, that's plausible to me. My claim is only that they could want to be much, much more cautious than they currently are, not that this cautiousness will overall prevail against very high pressures.
Say humans have smarts 1 and can evaluate and align a 1.1x smarter being to robustly follow human values; then you can kick off a theoretically infinite chain of alignment.
I'm answering about "recursive improvement" and dropping the "self" because that's the general case. If an agent thought "self" was actually a coherent thing and were aligned to self rather than to humanity, then they might do RSI, but that'd mean we failed step 1 of the recursion.
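To make the 1.1x chain from the first footnote concrete, here is a minimal sketch of its arithmetic. It takes the footnote's toy assumption at face value (that "smarts" is a single number and that every generation can align a successor exactly 1.1x smarter); the function names `alignment_chain` and `generations_to_exceed` are made up for illustration and say nothing about how real capabilities scale.

```python
import math


def alignment_chain(step_ratio: float = 1.1, generations: int = 10,
                    start: float = 1.0) -> list[float]:
    """Toy model: capability of each generation when every overseer
    aligns a successor that is `step_ratio` times smarter than itself."""
    levels = [start]
    for _ in range(generations):
        levels.append(levels[-1] * step_ratio)
    return levels


def generations_to_exceed(target: float, step_ratio: float = 1.1) -> int:
    """Number of hand-offs before the newest system is at least `target`
    times smarter than the original (human) overseer."""
    return math.ceil(math.log(target) / math.log(step_ratio))


chain = alignment_chain(generations=25)
# The gap between neighbours stays fixed at the step ratio...
adjacent_gaps = [b / a for a, b in zip(chain, chain[1:])]
# ...while the gap between the newest system and humans compounds.
gap_to_humans = chain[-1] / chain[0]
```

Under these assumptions the gap between any overseer and its immediate successor never exceeds 1.1x, while the gap to the original humans grows geometrically (past 10x after 25 hand-offs). That is why the argument depends on each intermediate generation staying aligned and doing the supervising, rather than on humans supervising the latest system directly.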
Many classic AI doom scenarios rely on superintelligence using its vastly superior intelligence to outplan, outcompete and outkill you.
I partly believe this: superintelligence would definitely outkill me.
But I don't believe we will build such superintelligence; not because humans are the apex of intelligence, but because superintelligence, implicitly, has always been about a gap: the gap between the current best intelligence and the newly created one.
We're not in the world where AIs are being created with large gaps of intelligence between each other. Rather, we are in an iterative intelligence development and deployment world. It is technically easy to not have large gaps of capabilities between the current best model and the next, it is ~easy (if costly) to evaluate at regular checkpoints, and ~continuous deployment allows there to be no large gap in deployment either.
We can thus steer away from a large number of doom scenarios (those where new AI uses its greater capabilities to take over) by simply not creating and deploying models much smarter than the previous thing. The current most intelligent and aligned beings should always be supervising their successor, using more total resources at first, such that they can't effectively be tricked/subverted.
I guess the above is something many "AI optimists" have in mind and I don't think the technical ease of avoiding large capabilities gaps should be much of a crux. Whether in practice we'll be avoiding these gaps seems the more interesting crux for "fast misaligned AI takeover" scenario discussion. This is correctly done in @Daniel Kokotajlo et al's AI 2027: the bad ending is caused by pressures to premature deployment leading to using a suspected misaligned system, not by technical impossibility of knowing it's misaligned. It is also what makes that particular scenario unlikely to happen. The leading companies will be more careful than that if they had that level of evidence of misalignment in powerful systems. (I don't think evidence of recklessness with regards to weak systems is strong evidence of recklessness towards strong systems, though corporate and national governance should be set up to have the mere possibility of not being reckless when the time comes.) It's looking like we're in world C or D of @ryan_greenblatt's plans for misalignment risk (~we don't get a pause, but the leading companies are somewhat careful) and that this is technically sufficient to avoid most fast misalignment doom scenarios.
Most of my p(doom) is thus not on the chance of misaligned AI takeover, but on gradual disempowerment risks.
I don't think we have good solutions here, but at least we have more time to look for them.