Plans A, B, C, and D for misalignment risk

ryan_greenblatt

8th Oct 2025

AI Alignment Forum

I sometimes think about plans for how to handle misalignment risk. Different levels of political will for handling misalignment risk result in different plans being the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes.

In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan.

Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory:

Plan A: There is enough will for some sort of strong international agreement that mostly eliminates race dynamics and allows for slowing down (at least for some reasonably long period, e.g. 10 years) along with massive investment in security/safety work.
Plan B: The US government agrees that buying lead time for US AI companies is among the top few national security priorities (not necessarily due to misalignment concerns) and we can spend 1-3 years on mitigating misalignment risk.
Plan C: The leading AI company is willing to spend (much of) its lead on misalignment concerns, but there isn't enough government buy-in for serious government involvement to make a big difference to the strategic picture. The leading AI company has a 2-9 month lead (relative to AI companies which aren't willing to spend as much on misalignment concerns) and is sufficiently institutionally functional to actually spend this lead in a basically reasonable way (perhaps subject to some constraints from outside investors), so some decent fraction of it will be spent on safety.
Plan D: The leading AI company doesn't take misalignment concerns very seriously in practice (e.g., they aren't close to willing to spend all of their lead on reducing misalignment risks at least by default) and takeoff isn't going to be exogenously slowed down. However, there are 10-30 people at the company who do take these risks seriously, are working on these risks, and have enough buy-in to get ~3% compute for things which are reasonably well-targeted at misalignment risks. See also Ten people on the inside.

Now here is some commentary on my current favorite plan for each of these levels of political will, though I won't go into much detail.

Plan A

We implement an international agreement to mostly eliminate race dynamics and allow for many years to be spent investing in security/safety while also generally adapting to more powerful AI. The ideal capabilities trajectory would depend on how quickly safety research progresses and the robustness of the international agreement, but I'm imagining something like spreading out takeoff over ~10 years. This might end up roughly equivalent to: ensure that if takeoff would have been fast, it is instead as slow as more optimistic people think it will be. You probably want to start slowing down capabilities around the point when AIs can fully automate engineering in AI companies, and then fully pause (spending down most of the available lead time) slightly above the level of capability needed to fully automate AI R&D.

We'd have time to focus much of our effort on moonshots which could plausibly result in high assurance and which might be scalable to very superhuman AIs. By default (as in, unless the success of some moonshots greatly changes the strategic picture), the plan would basically be to keep capabilities below the maximum controllable level for a while until we can use human labor (and AI labor) to mostly resolve relevant alignment problems. Once alignment issues are resolved or we run out of time, we'd hand off ~all safety work to AIs which are barely superhuman (basically, only a bit above the capability bar needed for handoff to be viable in principle; this is probably somewhat above the level of capability needed for fully automating AI R&D). At the point of handoff, we might or might not have scalable solutions to alignment, but we don't necessarily need arbitrarily scalable solutions to succeed. Obviously I'm omitting many, many details here. (I have draft docs discussing many of these details.)

Plan B

The situation is somewhat similar to Plan A except that we have less time (1-3 additional years rather than 10). The plan looks similar to Plan A, but we focus much less on moonshots and more on straightforward/prosaic approaches. Also, we can't afford to slow down much prior to a level of capability close to full automation of AI R&D. Attempting to hand off to AI systems with only empirical iteration and prosaic methods is key, though we can do a decently high effort version of this. We should still ensure control until handoff (or a bit before this or potentially once we've ruled out scheming strongly). The AI company should put in a bunch of effort (using AI labor) into trying to buy time via increasing the level of political will (e.g., demonstrating capabilities and risks) and helping the US government ensure non-proliferation/lead time. If this goes very well, we might be able to move to Plan A.

Plan C

We focus on very quickly handing off to AIs even if things are very half-baked. We aim to ensure some level of (probably somewhat janky) control until handoff. Helping the US government acquire lead time is much less promising due to limited US government buy-in, though trying to generate more political will remains promising.

Plan D

Many things will be totally YOLO'd. The safety team should focus on being able to extract non-sabotaged safety research from the AI, gathering strong evidence for risk, and making an AI which is plausibly safe to hand off to. Ensuring control at the level of the organization should be mostly given up on, but things that might help avoid AIs escalating permissions/access to a point where they can directly sabotage the relevant safety work or mess with arbitrary things run on the company's compute could be worthwhile. More responsible trailing AI companies should focus on exporting safety work (in addition to policy/coordination work).

Plan E

We can also consider a "Plan E" scenario where the level of will and number of employees who are working on mitigating the relevant misalignment risks is substantially less than in a "Ten people on the inside" style scenario. As in, there are only a few people (or perhaps there is no one) who are worried about these risks, are aiming to mitigate them, and have access to frontier AI systems at the leading developer(s). One extreme scenario would be that the entire project is heavily siloed and secretive with few people (or no one) working on mitigating AI takeover risk and discussion of misalignment concerns is effectively taboo within the project. You can't really have a technical "Plan E" because there is approximately no one to implement the plan; in Plan E situations, the focus should be on moving to a higher level of political will and effort on mitigating risk.

Thoughts on these plans

Another way to think about this is to think about how much lead time we have to spend on x-risk focused safety work in each of these scenarios:

Plan A: 10 years
Plan B: 1-3 years
Plan C: 1-9 months (probably on the lower end of this)
Plan D: ~0 months, but ten people on the inside doing helpful things

What do I think is the chance that we end up in the world of Plan A, B, C or D? As in, do we have the will (and competence) to do something which isn't much worse than the given plan (presumably with many modifications based on the exact situation) while still being worse than the next better plan? (Obviously the details will be less specific than the exact details I gave above.) It depends on timelines, but conditioning on a trajectory where by default (in the absence of active intervention) we would have reached AIs that beat top experts at ~everything prior to 2035, here are my not-very-well-considered guesses:

Plan A: 5%
Plan B: 10%
Plan C: 25%
Plan D: 45%
Plan E: 15%

What level of takeover risk do I expect in each of these situations?^[1] This depends substantially on the quality of execution, which is somewhat correlated with the level of political will. I won't assume that my preferred strategy (given that level of political will) is used. For Plans C and above, I will assume "sufficiently institutionally functional to actually spend this lead time in a basically reasonable way" and that the available lead time is actually spent on safety. Thus, the numbers I give below are somewhat more optimistic than what you'd get just given the level of political will corresponding to each of these scenarios (as this will might be spent incompetently).

Note that I'm ignoring the possibility of switching between these regimes during takeoff while humans are directly in control; for instance, I'm ignoring the possibility of starting in a Plan D scenario, but then having this shift to Plan C due to evidence of misalignment risk.^[2] However, I am including the possibility for (hopefully aligned) AIs to manage the situation very differently after humans voluntarily hand over strategic decision making to AIs (insofar as this happens). I'm also conditioning on a trajectory where by default (in the absence of active intervention) we would have reached AIs that beat top experts at ~everything prior to 2035, as with my probabilities given above.^[3]

Here is the takeover risk I expect given a central version of each of these scenarios (and given the assumptions from the prior paragraph):^[4]

Plan A: 7%
Plan B: 13%
Plan C: 20%
Plan D: 45%
Plan E: 75%
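
As a rough cross-check, the scenario probabilities and the conditional takeover risks listed above can be combined into a probability-weighted average. This is a minimal sketch rather than the author's own calculation: the variable names are illustrative, and footnote 4 notes that this product does not exactly yield the author's overall probability of takeover, since the risk numbers assume reasonable execution and describe central versions of each scenario.

```python
# Scenario probabilities and conditional takeover risks, copied from the two lists above.
# The names used here are illustrative and do not come from the post.
scenario_probability = {"A": 0.05, "B": 0.10, "C": 0.25, "D": 0.45, "E": 0.15}
takeover_risk = {"A": 0.07, "B": 0.13, "C": 0.20, "D": 0.45, "E": 0.75}

# The five scenarios are treated as exhaustive, so the probabilities should sum to 1.
total_probability = sum(scenario_probability.values())

# Contribution of each scenario to the naive overall risk.
contribution = {
    plan: scenario_probability[plan] * takeover_risk[plan]
    for plan in scenario_probability
}
naive_overall_risk = sum(contribution.values())

for plan, value in contribution.items():
    print(f"Plan {plan}: {value:.4f}")
print(f"Sum of scenario probabilities: {total_probability:.2f}")
print(f"Naive weighted takeover risk: {naive_overall_risk:.4f}")
```

With these inputs the weighted sum comes to roughly 38%, with Plans D and E contributing most of it.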

A substantial fraction of the risk in Plan A and Plan B worlds comes from incompetence (as in, if the overall strategy and decision making were better, risk would be much lower) and another substantial fraction comes from the possibility of takeover being very hard to avoid.

What are the main sources of political will in each of these scenarios? In general, Plans A and B are mostly driven by governments (mostly the US government) while Plans C and D are mostly driven by AI company leadership and employees. In Plan A and Plan B, a high level of will from the US government is necessary (and could be sufficient for at least Plan B, though AI company leadership caring is helpful). Plan C likely requires a ton of buy-in from AI company leadership, though sufficiently strong employee pressure could mostly suffice. Additional political will in Plan D could come from (in descending order of importance under my views): employee efforts (both pressure and direct labor), AI company leadership, pressure from something like corporate campaigns (external pressure which mostly operates on customers, suppliers, or maybe investors), and relatively weak regulation.

Given these probabilities and levels of risk, I'm inclined to focus substantially on helping with Plans C and D. This applies to both research and generating marginal political will. Correspondingly, I think what AI company employees and leadership think about AI (existential) safety is very important and political strategies that result in AI company employees/leadership being more dismissive of safety (e.g. due to negative polarization or looking cringe) look less compelling.

[1] Note that risks other than AI takeover are also generally reduced by having more actors take powerful AI seriously and having more coordination.
[2] The risk conditional on starting in a Plan D scenario is lower than conditional on remaining in a Plan D scenario and the risk conditional on starting in a Plan A scenario is higher than if we condition on remaining.
[3] This sentence was added in an edit because I realized I forgot to include this sort of caveat.
[4] Multiplying the probabilities given above by the takeover risk numbers given here doesn't exactly yield my overall probability of takeover because of the optimistic assumption of reasonable execution/competence (making actual risk higher) and also because these risk numbers are for central versions of each scenario while the probabilities are for ranges of plans that include somewhat higher levels of will (making actual risk lower). (Specifically: "will (and competence) to do something which isn't much worse than the given plan while still being worse than the next better plan". So the probabilities for Plan C really include <Plan B while >= Plan C.)

78 comments

ryan_greenblatt (9mo)

Side question: what about the "shut it all down" plan proposed in (e.g.) If Anyone Builds It, Everyone Dies?

I think this probably requires substantially more political will than Plan A and seems worse than a well-implemented version of Plan A that leverages the additional political will to spend more time slowing down at high levels of capability (and some at lower levels of capability). That said, shutting it all down is substantially simpler and a well-implemented version would reduce takeover risk substantially in my view (at the cost of delaying the benefits of AI by decades which seems worth it, but I can understand why people would disagree).

"Shut it all down" seems worse to me because:

poor implementation fails more catastrophically
the balance of power in ~30 years seems maybe worse
you eventually probably want to go back to Plan A anyway (with hopefully augmented humans to make this go better???)
exiting the pause regime suddenly seems potentially scary due to overhangs (though less so if you're also extending "shut it all down" to semiconductor progress but this makes the plan even more expensive)
I worry a bit about indefinite pauses or humanity generally becoming more anti-progress as a side effect or requirement to make this happen.

Raemon (9mo)

My main question is "why do you think Shut Down actually costs more political will?".

I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells.

I also think Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff").

Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff.
Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff."
A Global Controlled Takeoff That Works has a lot of moving parts.
You need the international agreement to be capable of making any kind of sensible distinctions between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs.
You need the international agreement to not turn into molochian regulatory-captured horror that perversely reverses the intent of the agree

... (read more)

ryan_greenblatt (9mo)

I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells.

I think shutting down all AI development is much more costly than not shutting down all AI development in a pretty straightforward sense that will in fact probably be priced into the required level of political will: Nvidia is in fact much worse off if all AI development shuts down versus if AI development proceeds, but with capabilities developing more slowly once they reach a high level of capabilities.

I would guess the stock market would react pretty differently to something like Plan A vs "shut it all down" for reasonable reasons.

I don't understand why you think the opening steps are the most politically challenging part given that the opening steps for Plan A plausibly don't require stopping AI development.

Another point is that many people have pretty reasonable existing objections to "shut it all down". Here are some example objections people might have that apply more to "shut it all down" than "Plan A":

Shoul

... (read more)

Eli Tyre (9mo)

I think the current CCP having control over most/all of the universe seems like 50% as bad as AI takeover in my lights

This is a wild claim to me.

Can you elaborate on why you think this?

ryan_greenblatt (9mo)

(I assume you're asking "why isn't it much less bad than AI takeover" as opposed to "isn't it almost as bad as AI takeover, like 98% as bad".)

I care most about the long-run utilization of cosmic resources, so this dominates my thinking about this sort of question. I think it's very easy for humans to use cosmic resources poorly from my perspective and I think this is more likely if resources are controlled by an autocratic regime, especially an autocratic regime where one person holds most of the power (which seems reasonably likely for a post-AGI CCP). In other words, I think it's pretty easy to lose half of the value of the long-run future (or more) based on which humans are in control and how this goes.

I'll compare the CCP having full control to broadly democratic human control (e.g. most cosmic resources are controlled by some kinda democratic system or auctioned while retaining democracy).

We could break this down into likelihood of carefully reflecting and then how much this reflection converges. I think control by an autocratic regime makes reflection less likely and that selection effects around who controls the CCP are bad making post-reflection convergence worse (and it's ... (read more)

andrew sauer (9mo)

I think democratic human control is extremely unlikely even with a US actor winning the race.

habryka (9mo)

Shouldn't we at least proceed until we can't very confidently proceed safely?

I mean, I think AI ending up uncontrollably powerful is on the order of 1-3% likely for the next generation of models. That seems far far too high. I think we are right now in a position where we can't very confidently proceed safely.

ryan_greenblatt (9mo)

Hmm, we probably disagree about the risk depending on what you mean by "uncontrollably powerful", especially if the AI company didn't have any particular reason to think the jump would be especially high (as is typically the case for new models). I'd guess it's hard for a model to be "uncontrollably powerful" (in the sense that we would be taking on a bunch of risk from a Plan A perspective) unless it is at least pretty close to being able to automate AI R&D so this requires a pretty huge capabilities jump. My guess of direct risk[1] from the next generation of models (as in, the next major release from Anthropic+xAI+OpenAI+GDM) would be like 0.3% and I'd be like 3x lower if we were proceeding decently cautiously in a Plan A style scenario (e.g. if we had an indication the model might be much more powerful, we'd scale only a bit at a time).

My estimate for 0.3%: My median is 8.5 years and maybe there are ~3-ish major model releases per year, so assuming uniform we'd get 2% chance of going all the way to AI R&D automation this generation. Then, I cut by a factor of like 10 due to this being a much larger discontinuity than we've seen before and by another factor of 2 from this not being guaranteed to directly result in takeover. Then, I go back up a bunch due to model uncertainty and thinking that we might be especially likely to see a big advance around now.

Edit: TBC, I think it's reasonable to describe the current state of affairs as "we can't very confidently proceed safely" but I also think the view "we can very confidently proceed safely (e.g., takeover risk from next model generation is <0.025%) given being decently cautious" is pretty reasonable.

[1] By direct risk, I mean including takeover itself and risk increases through mechanisms like self-exfil, rogue internal deployment but not including sabotaging research the AI is suppo
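
The arithmetic behind the 0.3% figure in the comment above can be retraced roughly as follows. This is a sketch under one reading of "assuming uniform" (half of the probability mass spread evenly over the releases before the median date), and it assumes the 8.5-year median refers to full AI R&D automation. The variable names are illustrative, and the final upward adjustment is not quantified in the comment, so the code only reports the factor implied by the stated 0.3%.

```python
# Inputs stated in the comment.
median_years_to_automation = 8.5  # assumed to be the median time to AI R&D automation
releases_per_year = 3             # "~3-ish major model releases per year"
stated_estimate = 0.003           # the 0.3% direct-risk guess

# Assumption (one reading of "assuming uniform"): the 50% of probability mass
# before the median is spread evenly across the releases in that window.
releases_before_median = median_years_to_automation * releases_per_year
p_automation_this_generation = 0.5 / releases_before_median

# Discounts stated in the comment.
after_discontinuity_cut = p_automation_this_generation / 10  # unusually large jump
after_takeover_cut = after_discontinuity_cut / 2             # takeover not guaranteed

# The comment then goes "back up a bunch" without giving a number; this is
# only the multiplier implied by ending at 0.3%.
implied_upward_factor = stated_estimate / after_takeover_cut

# "3x lower" when proceeding decently cautiously in a Plan A style scenario.
cautious_estimate = stated_estimate / 3

print(f"Per-generation chance of AI R&D automation: {p_automation_this_generation:.2%}")
print(f"After both cuts: {after_takeover_cut:.3%}")
print(f"Implied upward adjustment: {implied_upward_factor:.2f}x")
print(f"Cautious Plan A style estimate: {cautious_estimate:.2%}")
```

Under this reading, the two cuts bring the roughly 2% figure down to about 0.1%, so the stated 0.3% corresponds to an upward adjustment of about 3x.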

Canaletto (9mo)

How about more uhh soft uncontrollability? Like, not "it subverted our whole compute and feeds us lies" but more "we train it to do A, which it sees as only telling it to do A, and does A, but its motivations are completely untouched".

Raemon (9mo)

Thanks. I'll leave some responses but feels more fine to leave here for now.

[...]

First, slight clarification: the thing I had in mind isn't the opening step (which is presumably "do some ad hoc deals that build political momentum without too much cost"). The step I have in mind is "all global compute clusters and fab production is monitored, with buy in from China, UK, Europe etc, with intent for major international escalation of some kind if someone violates the monitor-pact". This doesn't directly shut down Nvidia, but it sure is putting some writing on the wall that I would expect Nvidia's political interests to fight strongly even if it doesn't immediately come with a shut down.

I'm imagining a Plan A that doesn't include something like that is more like a Plan A / B hybrid or some other "not the full Plan A." (based on some other internal Plan A docs I've looked at that went into more detail as of a few weeks ago).

I don't think there's any way you get to that point without most major world leaders actually believing-in-their-heart "if anyone builds it, something real bad is dangerously likely to happen." And by the point people are actually agreeing to have international inspection of some kind, I would expect people to be thinking more "okay will this actually work?" than "what do we have buy-in for?".

(There is a version where the US enforces it at gunpoint or at least economic-sanction-point without everyone else's buy in but I both don't expect them to do that and don't really expect it to work?)

MIRI discusses in the IABIED resources that they would prefer carveouts for narrow bio AI, so it's not like they're even advocating all progress to stop. (Advanced bio AI seems pretty good for the world and to capture a lot of the benefits).

...

I certainly do expect you-et-al to disagree with MIRI-et-al on a bunch of implementation details of the treaty. But, it seems like a version of the treaty that doesn't at least have the capacity to shutdown compute t

ryan_greenblatt (9mo)

I think compute + fab monitoring with potential for escalation requires much lower political will than shutting down AI development. I agree that both Plan A and shut it all down require something like this. Like I think this monitoring would plausibly not require much more political will than export controls.

[...]

Huh? No it doesn't capture much of the benefits. I would have guessed it captures a tiny fraction of the benefits for advanced AI, even for AIs around the level where you might want to pause at human level.

[...]

I agree you will have the capacity to shut down compute temporarily either way; I disagree that there isn't much of a difference between slowing down takeoff and shutting down all further non-narrow AI development.

Raemon (9mo)

FYI this is cruxy. I don't have very strong political-viability-intuitions, but it seems like this requires export controls that several (sometimes rivalrous) major nations are agreeing to simultaneously, with at least nontrivial trust for establishing the monitoring process together, which eventually is pretty invasive. (maybe you are imagining the monitoring is actually mostly done with spy satellites that don't require much trust or cooperation?) But like, the last draft of Plan A I saw included "we relocate all the compute to centralized locations in third party countries" as an eventual goal. That seems pretty crazy?

ryan_greenblatt (9mo)

Yes, this is much harder (from a political will perspective) than compute + fab monitoring which is part of my point? Like my view is that in terms of political will requirements: compute + fab monitoring << Plan A < Shut it all down

Raemon (9mo)

Nod, I agree centralizing part is harder than non-centralized fab monitoring. But, I think a sufficient amount of "non-centralized" fab monitoring is still a much bigger ask than export controls, and, the centralization was part of at least one writeup of Plan A, and it seemed pretty weird to include that bit but write off "actual shutdown" as politically intractable.

ryan_greenblatt (9mo)

I'm not trying to say "Plan A is doable and shut it all down is intractable". My view is that "shut it all down" probably requires substantially more (but not a huge amount more) political will than Plan A such that it is maybe like 3x less likely to happen given similar amounts of effort from the safety community.

You started by saying:

[...]

So I was trying to respond to this. I think 3x less likely to happen is actually a pretty big deal; this isn't some tiny difference, but neither is it "Plan A is doable and shut it all down is intractable". (And I also think "shut it all down" has various important downsides relative to Plan A, maybe these downsides can be overcome, but by default this makes Plan A look more attractive to me even aside from the political will considerations.)

I think something like Plan A or "shut it all down" are both very unlikely to happen and I'd be pretty sympathetic to describing both as politically intractable (e.g., I think something as good/strong as Plan A is only 5% likely). "politically intractable" isn't very precise though, so I think we have to talk more quantitatively.

Note that my view is also that I think pushing for Plan A isn't the most leveraged thing for most people to do at the margin; I expect to focus on making Plans C/D go better (with some weight on things like Plan B).

Raemon (9mo)

Nod.

FYI, I think Shut It Down is approximately as likely to happen as "Full-fledged Plan A that is careful enough to actually help much more than [the first several stages of Plan A that Plan A and Shut It Down share]", on account of being simple enough that it's even really possible to coordinate on it.

I agree they are both pretty unlikely to happen. (Regardless, I think the thing to do is probably "reach for whatever wins seem achievable near term and try to build coordination capital for more wins")

I think a major possible failure mode of Plan A is "it turns into a giant regulatory capture molochian boondoggle that both slows things down for a long time in confused bad ways and reads to the public as a somewhat weirdly cynical plot, which makes people turn against tech progress comparably or more than the average Shut It Down would." (I don't have a strong belief about the relative likelihoods of that.)

None of those beliefs are particularly strong and I could easily learn a lot that would change all my beliefs.

Seems fine to leave it here. I don't have more arguments I didn't already write up in "Shut It Down" is simpler than "Controlled Takeoff", just stating for the record I don't think you've put forth an argument that justifies the 3x increase in difficulty of Shut It Down over the fully fledged version of Plan A. (We might still be imagining different things re: Shut It Down)

Eli Tyre (9mo)

Where do you think that most of the benefits come from?

Edit: My personal consumption patterns are mostly not relevant to this question, so I moved what was formerly the rest of this comment to a footnote.[1]

[1] Perhaps I am dumb or my personal priorities are different than most people's, but I expect a large share of the benefits from AI, to my life, personally, are going to be biotech advances that, e.g., could extend my life or make me smarter. Like basically the things that could make my life better are 1) somehow being introduced to a compatible romantic partner, 2) cheaper housing, 3) biotech stuff. There isn't much else. I guess self-driving cars might make travel easier? But most of the cost of travel is housing. I care a lot about ending factory farming, but that's biotechnology again. I guess AI, if it was trustworthy, could also substantially improve governance, which could have huge benefits to society.

ryan_greenblatt (9mo)

I don't think you get radical reductions in mortality and radical life extension with (advanced) narrow bio AI without highly capable general AI. (It might be that a key strategy for unlocking much better biotech is highly capable general AIs creating extremely good narrow bio AI, but I don't think the narrow bio AI which humans will create over the next ~30 years is very likely to suffice.) Like narrow bio AI isn't going to get you (arbitrarily good) biotech nearly as fast as building generally capable AI would. This seems especially true given that much, much better biotech might require much higher GDP for e.g. running vastly more experiments and using vastly more compute. (TBC, I don't agree with all aspects of the linked post.)

I also think people care about radical increases in material abundance which you also don't get with narrow bio AI. And the same for entertainment, antidepressants (and other drugs/modifications that might massively improve quality of life by giving people much more control over mood, experiences, etc), and becoming an upload such that you can live a radically different life if you want. You also don't have the potential for huge improvements in animal welfare (due to making meat alternatives cheaper, allowing for engineering away suffering in livestock animals, making people wiser, etc.)

I'm focusing on neartermist-style benefits; as in, immediate benefits to currently alive (or soon to be born by default) humans or animals. Of course, powerful AI could result in huge numbers of digital minds in the short run and probably is needed for getting to a great future (with a potentially insane amount of digital minds and good utilization of the cosmic endowment etc.) longer term. The first order effects on benefits of delaying don't matter that much from a longtermist perspective of course, so I assumed we were operating fully in a neartermist-style frame when talking about benefits.

Eli Tyre (9mo)

It seems like I misunderstood your reading of Ray's claim. I read Ray as saying "a large fraction of the benefits of advanced AI are only in the biotech sector, and so we could get a large fraction of the benefits by pushing forward on only AI for biotech."

It sounds like you're pointing at a somewhat different axis, in response, saying "we won't get anything close to the benefits of advanced AI agents with only narrow AI systems, because narrow AI systems are just much less helpful." (And implicitly, the biotech AIs are either narrow AIs (and therefore not very helpful) or they're general AIs that are specialized on biotech, in which case you're not getting the safety benefits you're imagining getting by focusing only on biotech.)

Raemon (9mo)

Ah, I had also misinterpreted Ryan's response here. "What actually is practical here?" makes sense as a question and I'm not sure about the answers.

I think one of the MIRI angles here is variants of STEM AI, which might be more general, but whose training set is filtered to be only materials about bio + some related science (and avoiding as much as possible anything that'd point towards human psychology, geopolitics, programming, AI hardware, etc). So it both will have less propensity to take over, and be less good at it relative to its power level at bio. I wasn't thinking about this when I wrote the previous comment, I'd have phrased it differently if I were.

I agree it's an open question whether this works. But I feel more optimistic about a controlled-takeoff world that's taking a step back from "LLMs are trained on the whole internet."

Also, noting: I don't believe in a safe, full handoff to artificial AI alignment researchers (because of gradual disempowerment reasons). But, fwiw I think I'd feel pretty good about STEM AI that's focused on various flavors of math and conceptual reasoning that somehow avoids human psychology, hardware, and geopolitics, which you don't do a full handoff to, but, it's able to assist pretty substantially with larger subproblems that come up.

5Eli Tyre9mo

Thank you for writing this! Some important things that I learned / clarified for myself from this comment:
* Many plans depend on preserving the political will to maintain a geopolitical regime that isn't the Nash equilibrium, for years or decades. A key consideration for those plans is "how much of the benefit of this plan will we have gotten, if the controlled regime breaks down early?"
* Plans that depend on having human level AIs do alignment work (if those plans work at all), don't have linear payoff in time spent working, but they are much closer to linear than plans that depend on genetically engineered super geniuses doing the alignment work.
* In the AI alignment researcher plan, the AIs can be making progress as soon as they're developed. In the super-genius plan, we need to develop the genetic engineering techniques and (potentially) have the super-geniuses grow up before they can get to work. The benefits to super-geniuses are backloaded, instead of linear.
* (I don't want to overstate this difference however, because if the plan of automating alignment research is just fundamentally unworkable, it doesn't matter that the returns to automated alignment research would be closer to linear in time, if it did work. The more important crux is "could this work at all?")
* The complexity of "controlled takeoff" is in setting up the situation so that things are actually being done responsibly and safely, instead of only seeming so to people that aren't equipped to judge. The complexity of "shut it all down" is in setting up an off-ramp. If "shut it all down" is also including "genetically engineer super-geniuses" as part of the plan, then it's not clearly simpler than "controlled takeoff."

4Raemon9mo

Responding to some disagree reacts: [...] Seems good to register disagreement, but, fyi I have no idea why you think that. Re: [...] One way the geopolitical situation might get worse is "time passes, and, all kinds of stuff can change when time passes." Another way it can get worse is "the current dynamics still involve a feeling of being rushed, and time pressure, and meanwhile the international agreements we have leave a lot more wiggle room and more confused spirit-of-the-law about how people are allowed to maneuver." This could cause the geopolitical situation to get worse faster than it would otherwise. Which of those is worse? idk, I'm not a geopolitical expert. But, it's why it seems pretty obviously not 'strictly worse' (which is a high bar, with IMO a higher burden of proof) under Shut It Down. (Also, note "shut it all down" is not like it's actually going to be permanent. Any international treaty/agreement at any time can be reversed by the involved nations deciding "guys, actually we have now voted to leave this agreement", with some associated negotiations along the way.)

2ryan_greenblatt9mo

I'm going to default to bowing out, but if you want to bid for me to engage a bunch, you can.

[-]Raemon9moΩ5122

I dunno, this seems really important and I am really confused why y'all are oriented this way.

Yes, I very much would like responses on these and my other comment, although no worries if you want to take a bit more time to address more thoroughly.

4ryan_greenblatt9mo

Wouldn't the narrative for Plan A be more like "we should be cautious and slow down if we aren't confident about safety, and we'll need to build the ability to slow down a lot"? While the narrative for "shut it all down" would have to involve something like "proceeding with any further development is too risky given the current situation".

4Raemon9mo

I'm not 100% sure what Nate/Eliezer believe. I know they do think eventually we should build superintelligence, and that it'd be an existential catastrophe if we didn't. I think they think (and, I agree) that we should be at least prepared for things that are more like 20-50 year pauses, if it turns out to take that long, but (at least speaking for myself), this isn't because it's intrinsically desirable to pause for 50 years. It's because you should remain shut down until you actually confidently know what you're doing, with no pressure to convince yourself/each other that you're ready when you are not. It might be that AI-accelerated alignment research means you don't need a 20-50 year pause, but that should be a decision the governing body makes based on how things are playing out, not baked into the initial assumption, so we don't need to take risks like "run tons of very smart AIs in parallel very fast" when we're only somewhat confident about their longterm alignment, which opens us up to more gradual disempowerment / slowly-outmaneuvered risk, or eventual death by evolution. I haven't read the entirety of the IABIED website proposed treaty draft yet, but it includes this line, which includes a flavor of "re-evaluate how things are going." [...]

4ryan_greenblatt9mo

Sure, I agree that Nate/Eliezer think we should eventually build superintelligence and don't want to cause a pause that lasts forever. In the comment you're responding to, I'm just talking about difficulty in getting people to buy the narrative. More generally, what Nate/Eliezer think is best doesn't resolve concerns with the pause going poorly because something else happens in practice. This includes the pause going on too long or leading to a general anti-AI/anti-digital-minds/anti-progress view which is costly for the longer run future. (This applies to the proposed Plan A as well, but I think poor implementation is less scary in various ways and the particular risk of ~anti-progress forever is less strong.)

[-]Thomas Larsen9moΩ20380

One framing that I think might be helpful for thinking about "Plan A" vs "shut it all down" is: "Suppose that you have the political will for an n-year slowdown, i.e. after n years, you are forced to hand off trust to superhuman AI systems (e.g. for n = 5, 10, 30). What should the capability progression throughout the slowdown be?" This framing forces a focus on the exit condition / plan to do handoff, which I think is an underdiscussed weakness of the "shut it all down" plan.

I think my gut reaction is that the most important considerations are: (i) there are a lot of useful things you can do with the AIs, so I want more time with the smarter AIs, and (ii) I want to scale through the dangerous capability range slowly and with slack (as opposed to at the end of the slowdown).

This makes me think that, particularly for a shorter slowdown (e.g. 5 years), you want to go fast at the beginning (e.g. scale to ~max controllable AI over the first year or two), and then elicit lots of work out of those AIs for the rest of the time period.
A key concern for the above plan is that govts/labs botch the measurement of "max controllable AI", and scale too far.
But it's not clear to me ho

... (read more)

4Raemon9mo

This framing feels reasonable-ish, with some caveats.[1] I am assuming we're starting the question at the first stage where either "shut it down" or "have a strong degree of control over global takeoff" becomes plausibly politically viable. (i.e. assume early stages of Shut It Down and Controlled Takeoff both include various partial measures that are more immediately viable and don't give you the ability to steer capability-growth that hard.) But, once it becomes a serious question "how quickly should we progress through capabilities", then one thing to flag is, it's not like you know "we get 5 years, therefore, we want to proceed through those years at X rate." It's "we seem to have this amount of buy-in currently..." and the amount of buy-in could change (positively or negatively). Some random thoughts on things that seem important:
* I would want to do at least some early global pause on large training runs, to check if you are actually capable of doing that at all. (In conjunction with some efforts attempting to build international goodwill about it.)
* One of the more important things to do, as soon as it's viable, is to stop production of more compute in an uncontrolled fashion. (I'm guessing this plays out with some kind of pork deals for Nvidia and other leaders[2], where the early steps are 'consolidate compute', and then them producing the chips that are more monitorable, and which they get to make money from, but also are sort of nationalized.) This prevents a big overhang.
* Before I did a rapid growth of capabilities, I would want a globally set target of "we are able to make some kind of interpretability strides or evals that make us better able to predict the outcome of the next training run." (If it's not viable to do that, well, then we don't. (But then we're not really having a real convo about how slow the takeoff should ideally be, just riding the same incentive wave we're currently riding with slightly more steering.) ((We can instead

4Thomas Larsen9mo

I think I mostly am on board with this comment. Some thoughts: [...]
* This feels a bit overly binary to me. I think that understanding-based safety cases will be necessary for ASI. But behavioral methods seem like they might be sufficient beforehand.
* I don't know what you mean by "rapid growth". It seems like you might be imagining the "shut it all down -> solve alignment during pause -> rapidly scale after you've solved alignment" plan. I think we probably should never do a "rapid scaleup".
Another reaction I have is that a constraint to coordination will probably be "is the other guy doing a blacksite which will screw us over". So I think there's a viability bump at the point of "allow legal capabilities scaling at least as fast as the max size blacksite that you would have a hard time detecting". [...] So I think this paragraph isn't really right, because 'slowdown' != 'pause', and slowdowns might still be really really helpful and enough to get you a long way. [...] I actually currently think that you want to accelerate compute production, because hardware scaling seems safer than software scaling. I'm not sure exactly what you mean by "in an uncontrolled fashion". If you mean "have a bunch of inspectors making sure the flow of new chips isn't being smuggled to illegal projects", then I agree with this. On my initial read I thought you meant something like "pause chip production until they start producing GPUs with HEMs in them", which I think is probably bad. In other words I think that you want to create a big compute overhang during a pause. The downside is obvious, but the upsides are:
1. Compute is controllable, far more than software, and so differentially advances legal projects.
2. More compute for safety. We want to be able to pay a big safety tax, and more compute straightforwardly helps.
3. Extra compute progress funges against software progress, which is scarier.
4. Compute is destroyable (e.g. we can reverse and destroy compute

4Raemon9mo

I think "everyone agrees to a noticeably smaller next-run-size" seems like a fine thing to do as the first coordination attempt. I think there is something good about having an early step (maybe after that one), which somehow forces people to actually orient on "okay, suppose we actually had to prioritize interpretability and evals now until they were able to keep pace with capabilities, how would we seriously do that?" (I don't currently have a good operationalization of this that seems robust, but it seems plausible that by the time we're meaningfully able to decide to do anything like this, someone may have come up with a good policy with that effect. I can definitely see this backfiring and causing people to get better at some kind of software that is then harder to control.) [...] Mmm, nod, I can see it. I'd need to think more to figure out a considered opinion on this but it seems a-priori reasonable. I think one of the things I want is to have executed each type of control you might want to exert, at least for a shorter period of time, to test whether you're able to do it at all. But I can see having the early compute steps be more focused on "they have remote-shutdown options but can continue production" or at least a policy-level "there are enforcers sitting outside the compute centers that could choose to forcibly shut it down fairly quickly".

[-]Thomas Larsen9moΩ11159

One upside of shut it all down is that it does in fact buy more time: in Plan A it is difficult to secure algorithmic secrets without extremely aggressive security measures, hence any rogue projects (e.g. nation state blacksites) can just coast off the algos developed by the verified projects. Then, a few years in, they fire up their cluster and try to do an intelligence explosion with the extra algorithmic progress.

[-]ryan_greenblatt9moΩ4100

Maybe I should clarify my view a bit on Plan A vs "shut it all down":

Both seem really hard to pull off, in terms of both political will and actually making it happen.
- Plan A is complicated and looking at it I go "oh jeez, this seems really hard to make happen well, idk if the US government has the institutional capacity to pull off something like this". But, I also think pulling off "shut it all down" seems pretty rough and pulling off a shutdown that lasts for a really long time seems hard.
- Generally, it seems like Plan A might be hard to pull off and it's easy for me to imagine it going wrong. This also mostly applies to "shut it all down" though.
- I still think Plan A is better than other options and Plan A going poorly can still nicely degrade into Plan B (or C).
It generally looks to me like Plan A is strictly better on most axes, though maybe "shut it all down" would be better if we were very confident in maintaining extremely high political will (e.g. at or above peak WW2 level political will) for a very long time.
- There would still be a question of whether the reduction in risk is worth delaying the benefits of AI. I tentatively think yes even from a normal moral perspective, but th

... (read more)

2[comment deleted]9mo

[-]Daniel Kokotajlo9moΩ72125

Plan C: 20%
Plan D: 45%
Plan E: 75%

I feel like these numbers are too low.

[-]Peter Wildeford9mo170

the problem with C though "The leading AI company is willing to spend (much of) its lead on misalignment concerns" is that I think the leading AI company has ~0 lead?

3ryan_greenblatt9mo

I guess I expect 0-9 month lead and coordination between companies seems possible.

[-]ryan_greenblatt11dΩ9142

As part of AI 2040, I give updated likelihoods for one particular operationalization here and updated chances of AI takeover here. Note that my views are very sensitive to the operationalization and implementation quality. E.g., a near-ideal implementation of Plan A might be roughly 2x lower risk.

[-]Vladimir_Nesov9moΩ4124

What happens with gradual disempowerment in this picture? Even Plan A seems compatible with handing off increasing levels of influence to AIs. One benefit of "shut it all down" (AGI Pause) is ruling out this problem by not having AGIs around (at least while the Pause lasts, which is also when the exit strategy needs to be prepared, not merely technical alignment).

Gradual disempowerment risks transitioning into permanent disempowerment (if not extinction), where a successful solution to technical ASI-grade alignment by the AIs might result in the future of humanity surviving, but only getting a tiny sliver of resources compared to the AIs, with no way of ever changing that even on cosmic timescales. Permanent disempowerment doesn't even need to involve a takeover.

Also, in the absence of "shut it all down", at some point targeting misalignment risks might be less impactful on the margin than targeting improvements in education (about AI risks and cruxes of mitigation strategies), coordination technologies, and AI Control. These enable directing more resources to misalignment risk mitigation as appropriate, including getting back to "shut it all down", a more robust ASI Pause, or maki... (read more)

[-]cousin_it9moΩ2100

Do you know any people working at frontier labs who would be willing to do the kind of thing you describe in plan D, some kind of covert alignment against the wishes of the larger company? Who would physically press keys on their terminal to do it, as opposed to quitting or trying to sway the company? Not asking to name names, just my hunch is that there are very few such people now, maybe none at all. And if that's the case, we're in E world already.

8ryan_greenblatt9mo

I don't think Plan D particularly involves covert alignment and going against the will of the larger company, though going against the will of the company might come up in practice. I think there are people working in frontier labs who would be willing to try to make some version of Plan D happen.

2cousin_it9mo

Can you maybe describe in more detail how you imagine it? What specifically do the "ten people on the inside" do, if company leadership disagrees with them about safety?

[-]Daniel Kokotajlo9moΩ8138

I don't think the idea is that the 10 people on the inside violate the wishes of company leadership. Rather, the idea is that they use whatever tiny amount of resources and political capital they do have as best as possible. E.g. leadership might be like "Fine, before we erase the logs of AI activity we can have your monitor system look over them and flag anything suspicious -- but you have to build the monitor by next week because we aren't delaying, and also, it can't cost more than 0.01% of overall compute."

[-]cousin_it9moΩ41618

The OP says takeover risk is 45% under plan D and 75% under plan E. We're supposed to gain an extra 30% of safety from this feeble "build something by next week with 1% of compute"? Not happening.

My point is that if the "ten people on the inside" obey their managers, plan D will have a tiny effect at best. And if we instead postulate that they won't obey their managers, then there are no such "ten people on the inside" in the first place. So we should already behave as if we're in world E.

2ryan_greenblatt9mo

A general point is that going from "no human cares at all" to "a small group of people with limited resources cares" might be a big difference, especially given the potential leverage of using a bunch of AI labor and importing cheap measures developed elsewhere.

2Cleo Nardo9mo

To clarify what I think is Ryan's point:
* In D-labs, both the safety faction and the non-safety faction are leveraging AI labour.
* AI labour makes D-labs seem more like C-labs and less like E-labs, directionally.
* This is because the effectiveness ratio between (10 humans) and (990 humans) is greater than the ratio between (10 humans and 1M AIs) and (990 humans and 990M AIs).
* This is because of diminishing returns to cognitive labour, i.e. cheap interventions.
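A toy illustration of the ratio claim above, as a minimal Python sketch. It assumes a logarithmic returns curve and treats humans and AIs as interchangeable units of labour; both are assumptions made purely for illustration, not anything stated in the thread. The headcounts are the ones quoted above.

```python
import math


def effectiveness(labour_units: float) -> float:
    """Hypothetical diminishing-returns curve: output grows with log(labour)."""
    return math.log(labour_units)


# Headcounts from the comment above.
safety_humans, other_humans = 10, 990
safety_ais, other_ais = 1_000_000, 990_000_000

# How far ahead the non-safety faction is, with humans only.
gap_without_ai = effectiveness(other_humans) / effectiveness(safety_humans)

# The same comparison once both factions add their AI labour.
gap_with_ai = effectiveness(other_humans + other_ais) / effectiveness(
    safety_humans + safety_ais
)

print(f"gap without AI labour: {gap_without_ai:.2f}x")
print(f"gap with AI labour:    {gap_with_ai:.2f}x")
```

Under this assumed curve the gap shrinks from roughly 3x to roughly 1.5x. The result depends on the shape of the curve: with a pure power law the ratio would depend only on the relative headcounts, so the effect would not show up.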

2ryan_greenblatt9mo

(Yes, also I think that a small number of employees working on safety might get proportionally more compute than the average company employee, e.g. this currently seems to be the case.)

2cousin_it9mo

Yeah, that partly makes sense to me. I guess my intuition is like, if 95% of the company is focused on racing as hard as possible (and using AI leverage for that too, AI coming up with new unsafe tricks and all that), then the 5% who care about safety probably won't have that much impact.

2Daniel Kokotajlo9mo

I disagree with the probabilities given by the OP. Also, the thing I mentioned was just one example, and probably not the best example; the idea is that the 10 people on the inside would be implementing a whole bunch of things like this.

[-]Vladimir_Nesov9moΩ480

You can't really have a technical "Plan E" because there is approximately no one to implement the plan

AGIs themselves will be implementing some sort of plan (perhaps at very vague and disorganized prompting from humans, or without any prompting at all; which might be influenced by blog posts and such, in publicly available Internet text). This could be relevant for mitigating ASI misalignment if these AGIs are sufficiently aligned to the future of humanity, more so than some of the hypothetical future ASIs (created without following such a plan).

4ryan_greenblatt9mo

Sure, I agree with this, but it's harder for us to usefully help these AIs.

2Vladimir_Nesov9mo

The "ten people on the inside" direct AIs to useful projects within their resource allocation. The AGIs themselves direct their own projects according to their propensities, which might be influenced by publicly available Internet text, possibly to a greater extent if it's old enough to be part of pretraining datasets. The amount of resources that AGIs direct on their own initiative might dwarf the amount of resources of the "ten people on the inside", so the impact of openly published technical plans (that make sense on their own merits) might be significant. While AGIs could come up with any ideas independently on their own, path dependence of the acute risk period might still make their initial propensities to pay attention to particular plans matter.

[-]Bronson Schoen9mo712

Why are we modeling the leading labs as (1) having a legible, unambiguous lead (i.e. they all agree internally that there’s an N month lead), (2) being willing to spend down their lead at all?

My whole understanding of the “spending down the lead” argument was always between countries. If it’s now set as being between companies I don’t see any reason to think realistically this would happen?

4ryan_greenblatt9mo

I wasn't trying to assume in the post that leading AI companies have a legible, unambiguous lead and are willing to spend down this lead. I was just noting that my "Plan C" proposal requires that companies have a lead they are willing to spend down (which might not be legible or unambiguous, though I do think ambiguity gets less bad as AIs get more capable and their impacts are more clear).

3Bronson Schoen9mo

I do think a lab being willing to spend down depends on there being consensus among lab leadership that there is such a lead with a high degree of confidence. For example, given that Plan C requires that one of the current frontier labs pulls ahead, this would mean that it also would need to pull ahead by enough of a margin that their leadership agrees that they definitely are ahead. Concretely, it seems to me like the “1-3 month lead” worlds are likely to collapse into Plan D. It’s also plausible to me that the “amount of margin the leading lab would need before they’d be willing to spend any on safety” is very high, and that they wouldn’t “spend down to zero”, so in practice you would need one lab, soon, to start pulling very far ahead. Note this is somewhat minor; I found the overall post and similar posts very useful!

[-]romeo9mo71

Do you have a take on 'all things considered p(doom) / p(bad outcome)' by plan? My guess is the EV from moving to plan A doesn't look great on pure takeover risk but looks better on the all things considered risk.

In terms of how likely each regime is to happen, I feel like Plan C is modal, not Plan D, and Plan B is already about as likely as Plan D. My A / B / C / D / E is probably 5 / 25 / 35 / 30 / 5. The main reason is that I don't see current lab leaders actually wanting to do D; it seems like they probably all will want to do C to me when the time com... (read more)

3romeo5mo

I've come back to register here that I've updated much closer to your views on likelihood, but I'm (seemingly) more pessimistic on doom (unless your all-things-considered p(doom) is a bunch higher than takeover risk). These numbers imply overall p(doom) around 55%.

3ryan_greenblatt5mo

My all-things-considered doom is a decent amount higher, but this depends on what you count as doom.

[-]ryan_greenblatt9moΩ460

One alternative way of thinking about this is to decompose plans by which actor the plan is for:

Plan A: Most countries, at least the US and China
Plan B: The US government (and domestic industry)
Plan C: The leading AI company (or maybe a few of the leading AI companies)
Plan D: A small team with a bit of buy-in within the leading AI company

This isn't a perfect breakdown, e.g. Plan A might focus mostly on what the US should do, but it might still be helpful.

This decomposition was proposed by @Lukas Finnveden.

[-]Charbel-Raphaël8mo*Ω240

Summary

Plan A: Most countries, at least the US and China
Plan B: The US government (and domestic industry)
Plan C: The leading AI company (or maybe a few of the leading AI companies)
Plan D: A small team with a bit of buy-in within the leading AI company
Plan E: No will

Plan	Probability of Scenario	Takeover Risk Given Scenario	Expected Risk Contribution
Plan A	5%	7%	0.35%
Plan B	10%	13%	1.30%
Plan C	25%	20%	5.00%
Plan D	45%	45%	20.25%
Plan E	15%	75%	11.25%
Total	100%	-	38.15%
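For reference, the Expected Risk Contribution column is each scenario's probability multiplied by its conditional takeover risk, and the total is the sum of those products. A minimal Python sketch that reproduces the arithmetic, using only the numbers in the table (the variable names are illustrative):

```python
# (probability of scenario, takeover risk given scenario), copied from the table.
scenarios = {
    "Plan A": (0.05, 0.07),
    "Plan B": (0.10, 0.13),
    "Plan C": (0.25, 0.20),
    "Plan D": (0.45, 0.45),
    "Plan E": (0.15, 0.75),
}

# Each row's contribution is probability * conditional risk.
contributions = {name: p * risk for name, (p, risk) in scenarios.items()}
total_probability = sum(p for p, _ in scenarios.values())
total_risk = sum(contributions.values())

for name, contribution in contributions.items():
    print(f"{name}: {contribution:.2%}")
print(f"Total probability: {total_probability:.0%}")
print(f"Total expected takeover risk: {total_risk:.2%}")
```

Plans D and E together account for 31.5 of the 38.15 percentage points, which is why the later discussion focuses on moving probability mass out of those two scenarios.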

How much lead time we have to spend on x-risk focused safety work in each of these scenarios:

Plan A: 10 years

... (read more)

4Charbel-Raphaël8mo

I have three main critiques:
1. The China Problem: Plan B’s 13% risk doesn’t make sense if China (DeepSeek) doesn’t slow down and is only 3 months behind. Real risk is probably the same as for E, 75%, unless there is a pivotal act.
2. Political Will as Strategy: The framework treats political will as a background variable rather than a key strategic lever. D→C campaigns could reduce expected risk by 11+ percentage points, nearly 30% of the total risk. A campaign to move from E→D would also be highly strategic and might require only talking to a handful of employees.
3. Missing “Plan A-Minus”: No need to lose your lead necessarily. International standards to formalize the red lines/unacceptable levels of risks, e.g., via the AISI network and targeted if-then commitments, would enable companies to slow down without losing, because they would all be playing under the same rules. This seems more tractable than Plan A and solves the China problem better than Plan B.
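One reading that reproduces the "11+ percentage points" and "nearly 30%" figures in the second critique: all of Plan D's 45% probability mass moves to Plan C, so that mass faces 20% takeover risk instead of 45%. This reading is an assumption, since the comment does not spell out the calculation. A minimal Python sketch using the numbers from the summary table:

```python
# Numbers from the summary table; the "all of D moves to C" reading is an assumption.
p_plan_d = 0.45        # probability of ending up in the Plan D scenario
risk_plan_d = 0.45     # takeover risk given Plan D
risk_plan_c = 0.20     # takeover risk given Plan C
total_risk = 0.3815    # total expected takeover risk from the table

# Expected risk removed if the Plan D probability mass faced Plan C's risk instead.
reduction = p_plan_d * (risk_plan_d - risk_plan_c)
share_of_total = reduction / total_risk

print(f"Reduction: {reduction * 100:.2f} percentage points")
print(f"Share of total expected risk: {share_of_total:.1%}")
```

A campaign that moves only part of the Plan D mass would scale the reduction proportionally.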

2ryan_greenblatt8mo

What about the US trying somewhat hard to buy lead time, e.g., by sabotaging Chinese AI companies? [...] I roughly agree with this. It's useful to condition on (initial) political will when making a technical plan, but I agree raising political will is important and one issue with this perspective is it might incorrectly make this less salient.

0Charbel-Raphaël8mo

I've asked Claude to make a rough assessment on this. Tl;dr: the probability goes from 13% to ~27%, and this propagates to Plans C and D. [...]
Claude: Ryan's response is suggestive but incomplete. "Sabotaging Chinese AI companies" gestures at a possible answer but doesn't constitute a full defense because:
1. It's extremely escalatory and might not be politically viable even with high US government buy-in
2. Its effectiveness is uncertain: how much lead time would successful sabotage actually buy? Months? Years?
3. It's not obviously repeatable; China would harden against further attacks
4. It could provoke dangerous counter-responses
To be fair to Ryan, the original post does mention "helping the US government ensure non-proliferation/lead time" under Plan B, so the concern isn't entirely unaddressed. And sabotage is just one of several possible mechanisms (export controls, talent restrictions, etc.). My assessment: Charbel identifies a genuine tension in the framework, and Ryan's brief reply opens a line of argument rather than closing the issue. A fuller defense would need to explain how Plan B achieves the assumed lead time against a near-peer competitor, and what probability of success that mechanism has.
Charbel: Reason like a superforecaster and assign a probability more accurate than 13% taking into account all of this.
Claude: This requires decomposing the problem carefully. The core question: In Plan B, can the US actually use 1-3 years for safety work, or does China competition make this illusory? Key sub-scenarios within Plan B:
Scenario	Description	Probability	Takeover Risk
B1	US successfully maintains lead via export controls, compute advantages, possible sabotage; safety work proceeds	~25%	~12%
B2	Lead partially erodes, ~1 year of real safety time, rushed execution	~35%	~28%
B3	Plan B is unstable: slowing means losing, race dynamics reassert, collapses toward Plan C/D	~25%	~40%
B4	China reaches transformative AI first or simultaneously despite US efforts	~15%	~65%
W

2ryan_greenblatt8mo

I'm just literally assuming that Plan B involves a moderate amount of lead time via the US having a lead or trying pretty hard to sabotage China; this is part of the plan/assumptions.

[-]Raemon9moΩ340

Plan A: 10 years
Plan B: 1-3 years
Plan C: 1-9 months (probably on the lower end of this)
Plan D: ~0 months, but ten people on the inside doing helpful things

I think you mean "starting from fully automated AI R&D" but not 100% sure.

4ryan_greenblatt9mo

I just mean "amount of additional lead time to spend on safety". This could be spent at different points.

[-]habryka9moΩ342

Thus, the numbers I give below are somewhat more optimistic than what you'd get just given the level of political will corresponding to each of these scenarios (as this will might be spent incompetently).

FWIW, for at least plan A and plan B, I feel like the realistic multiplier on how optimistic these are is like at least 3x? Like, I don't see an argument for this kind of plan working with 90%+ probability given realistic assumptions about execution quality.

(I also have disagreements about whether this will work, but at least plan A well-executed seems like it would notice it was starting to be very reckless and then be in a good position to slow down more)

2ryan_greenblatt9mo

Yeah fair, I don't think I've thought about this very carefully. I currently feel like 3x is too high, but I don't feel very reflectively stable.

[-]Orpheus169mo30

Plan C: The leading AI company has a 2-9 month lead (relative to AI companies which aren't willing to spend as much on misalignment concerns) and is sufficiently institutionally functional to actually spend this lead in a basically reasonable way (perhaps subject to some constraints from outside investors), so some decent fraction of it will be spent on safety.

TLDR: I expect it will be pretty difficult for a "Plan C Leading Lab" to stop scaling, even conditional on having a 2-9 month lead. There are enough uncertainties & Forces of Inertia that will ma... (read more)

[-]StanislavKrym9mo3-2

I suspect that we'll end up in Plan E for two different reasons and would like @Daniel Kokotajlo to comment on them.

Assuming superhuman coders in 2030 instead of 2027, we are likely to see Taiwan being invaded, forcing the USA and China to build compute factories at home. While the AI-2027 scenario assumed that Taiwan wouldn't be invaded and that OpenBrain would obtain a lion's share of compute, GPT-5 estimates that the ratio of compute produced in the USA and in China is between 1:1 and 4:1, and not 5:1 as in AI-2027.^[1] If China merges its com

... (read more)

5Daniel Kokotajlo9mo

I think Taiwan invasion is very plausible but I wouldn't say it's likely by 2030 even assuming superhuman coders. Maybe 50/50? I agree that in the 2030s, especially the late 2030s, the US might be in big trouble w.r.t. competition with China. Not confident of course, the future is uncertain. Overall I think plan E is quite plausible; Ryan's breakdown of probabilities overall seems reasonable to me (I might put a bit more into Plan A) [...]

1StanislavKrym9mo

Is it likely that the USA is in big trouble before 2030? If it is, then it might, say, prompt the invasion or prompt AI companies to race hard...

[-]Raemon9mo*Ω120

(Having otherwise complained a bunch about some of the commentary/framing around Plan A vs Shut It Down, I do overall like this post and think having the lens of the different worlds is pretty good for planning).

(I am also appreciating how people are using inline reacts)

[-]Cleo Nardo9mo20

You can't really have a technical "Plan E" because there is approximately no one to implement the plan; in Plan E situations, the focus should be on moving to a higher level of political will and effort on mitigating risk.

There are no employees who could implement Plan E, but is there nothing that non-employees could do?

4ryan_greenblatt9mo

It's hard for some output to mitigate risk if no one will implement it. You can try to generate more political will, or I suppose you can just add things to the pretraining corpus in the hopes this will help. You can imagine regimes where external actors are allowed to implement things, but this would move you closer to a Plan D or D/E scenario. (These are more like points along a spectrum than an exhaustive breakdown.)

3Cleo Nardo9mo

1. Try to make deals with the AIs (maybe this counts as 'add things to corpus')
2. Try to make deals with the lab (maybe this counts as 'Move to Plan C/D')
3. Try to disrupt the compute supply chain or lab employees
4. Harden the external environment, d/acc stuff (probably hopeless, but maybe worthwhile on slow takeoff)
5. YOLO human intelligence enhancement / uploads

Yeah, E seems hard. Especially because from the outside E is indistinguishable from D, and many of the E strategies would be negatively-polarising, hence counterproductive on Plan D.

[-]Ariel_8mo10

Any good posts/papers discussing "handover"? e.g. the handover of AI research to AI R&D agents (the plan of the original OpenAI Superalignment team). I'm also interested in any adjacent research agendas which might help the handover succeed.

Some of the more relevant work I've read (other than this post) is Wentworth's slop post, various scalable oversight/safety case papers, and automation collapse.

[-]Towards_Keeperhood8mo10

Here is the takeover risk I expect given a central version of each of these scenarios (and given the assumptions from the prior paragraph):^[4]
Plan A: 7%
Plan B: 13%
Plan C: 20%
Plan D: 45%
Plan E: 75%

I think it makes more sense to state overall risk instead of takeover risk, because that's what we care about. Could you give very rough guesses on what fraction of achievable utility we would get in expectation conditional on each Plan? ("achievable utility" is the utility we would get if the future goes optimally, like CEV aligned superintelligence.) Or just rou... (read more)

[-]Ihor Kendiukhov9mo10

I think my plan E fits neatly in this framework: https://www.lesswrong.com/posts/2xHhe4EBHAFofkQJf/plan-e-for-ai-doom
