Side question: what about the "shut it all down" plan proposed in (e.g.) If Anyone Builds It, Everyone Dies?
I think this probably requires substantially more political will than Plan A and seems worse than a well-implemented version of Plan A that leverages the additional political will to spend more time slowing down at high levels of capability (and some at lower levels of capability). That said, shutting it all down is substantially simpler and a well-implemented version would reduce takeover risk substantially in my view (at the cost of delaying the benefits of AI by decades, which seems worth it, but I can understand why people would disagree).
"Shut it all down" seems worse to me because:
My main question is "why do you think Shut Down actually costs more political will?".
I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells.
I also think Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff").
...Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff.
Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff."
A Global Controlled Takeoff That Works has a lot of moving parts.
You need the international agreement to be capable of making any kind of sensible distinctions between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs.
You need the international agreement to not turn into molochian regulatory-captured horror that perversely reverses the intent of the agree
I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells.
I think shutting down all AI development is much more costly than not shutting down all AI development in a pretty straightforward sense that will in fact probably be priced into the required level of political will: Nvidia is in fact much worse off if all AI development shuts down versus if AI development proceeds, but with capabilities developing more slowly once they reach a high level of capabilities.
I would guess the stock market would react pretty differently to something like Plan A vs "shut it all down", for reasonable reasons.
I don't understand why you think the opening steps are the most politically challenging part given that the opening steps for Plan A plausibly don't require stopping AI development.
Another point is that many people have pretty reasonable existing objections to "shut it all down". Here are some example objections people might have that apply more to "shut it all down" than "Plan A":
I think the current CCP having control over most/all of the universe seems like 50% as bad as AI takeover by my lights
This is a wild claim to me.
Can you elaborate on why you think this?
(I assume you're asking "why isn't it much less bad than AI takeover" as opposed to "isn't it almost as bad as AI takeover, like 98% as bad".)
I care most about the long-run utilization of cosmic resources, so this dominates my thinking about this sort of question. I think it's very easy for humans to use cosmic resources poorly from my perspective and I think this is more likely if resources are controlled by an autocratic regime, especially an autocratic regime where one person holds most of the power (which seems reasonably likely for a post-AGI CCP). In other words, I think it's pretty easy to lose half of the value of the long-run future (or more) based on which humans are in control and how this goes.
I'll compare the CCP having full control to broadly democratic human control (e.g. most cosmic resources are controlled by some kinda democratic system or auctioned while retaining democracy).
We could break this down into likelihood of carefully reflecting and then how much this reflection converges. I think control by an autocratic regime makes reflection less likely and that selection effects around who controls the CCP are bad making post-reflection convergence worse (and it's ...
Shouldn't we at least proceed until we can't very confidently proceed safely?
I mean, I think the chance of AI ending up uncontrollably powerful is on the order of 1-3% for the next generation of models. That seems far, far too high. I think we are right now in a position where we can't very confidently proceed safely.
I dunno, this seems really important and I am really confused why y'all are oriented this way.
Yes, I very much would like responses on these and my other comment, although no worries if you want to take a bit more time to address more thoroughly.
One framing that I think might be helpful for thinking about "Plan A" vs "shut it all down" is: "Suppose that you have the political will for an n-year slowdown, i.e. after n years, you are forced to hand off trust to superhuman AI systems (e.g. for n = 5, 10, 30). What should the capability progression throughout the slowdown be?" This framing forces a focus on the exit condition / plan to do handoff, which I think is an underdiscussed weakness of the "shut it all down" plan.
I think my gut reaction is that the most important considerations are: (i) there are a lot of useful things you can do with the AIs, so I want more time with the smarter AIs, and (ii) I want to scale through the dangerous capability range slowly and with slack (as opposed to at the end of the slowdown).
One upside of shut it all down is that it does in fact buy more time: in Plan A it is difficult to secure algorithmic secrets without extremely aggressive security measures, hence any rogue projects (e.g. nation state blacksites) can just coast off the algos developed by the verified projects. Then, a few years in, they fire up their cluster and try to do an intelligence explosion with the extra algorithmic progress.
Maybe I should clarify my view a bit on Plan A vs "shut it all down":
The problem with C, though ("The leading AI company is willing to spend (much of) its lead on misalignment concerns"), is that I think the leading AI company has ~0 lead?
What happens with gradual disempowerment in this picture? Even Plan A seems compatible with handing off increasing levels of influence to AIs. One benefit of "shut it all down" (AGI Pause) is ruling out this problem by not having AGIs around (at least while the Pause lasts, which is also when the exit strategy needs to be prepared, not merely technical alignment).
Gradual disempowerment risks transitioning into permanent disempowerment (if not extinction), where a successful solution to technical ASI-grade alignment by the AIs might result in the future of humanity surviving, but only getting a tiny sliver of resources compared to the AIs, with no way of ever changing that even on cosmic timescales. Permanent disempowerment doesn't even need to involve a takeover.
Also, in the absence of "shut it all down", at some point targeting misalignment risks might be less impactful on the margin than targeting improvements in education (about AI risks and cruxes of mitigation strategies), coordination technologies, and AI Control. These enable directing more resources to misalignment risk mitigation as appropriate, including getting back to "shut it all down", a more robust ASI Pause, or maki...
Do you know any people working at frontier labs who would be willing to do the kind of thing you describe in plan D, some kind of covert alignment against the wishes of the larger company? Who would physically press keys on their terminal to do it, as opposed to quitting or trying to sway the company? Not asking to name names, just my hunch is that there are very few such people now, maybe none at all. And if that's the case, we're in E world already.
I don't think the idea is that the 10 people on the inside violate the wishes of company leadership. Rather, the idea is that they use whatever tiny amount of resources and political capital they do have as best as possible. E.g. leadership might be like "Fine, before we erase the logs of AI activity we can have your monitor system look over them and flag anything suspicious -- but you have to build the monitor by next week because we aren't delaying, and also, it can't cost more than 0.01% of overall compute."
The OP says takeover risk is 45% under plan D and 75% under plan E. We're supposed to gain an extra 30 percentage points of safety from this feeble "build something by next week with 0.01% of compute"? Not happening.
My point is that if the "ten people on the inside" obey their managers, plan D will have a tiny effect at best. And if we instead postulate that they won't obey their managers, then there are no such "ten people on the inside" in the first place. So we should already behave as if we're in world E.
Do you have a take on 'all things considered p(doom) / p(bad outcome)' by plan? My guess is the EV from moving to plan A doesn't look great on pure takeover risk but looks better on the all things considered risk.
In terms of how likely each regime is to happen, I feel like Plan C is modal, not Plan D, and Plan B is already similarly likely to Plan D. My A / B / C / D / E is probably 5 / 25 / 35 / 30 / 5. The main reason is that I don't see current lab leaders actually wanting to do D; it seems to me like they probably all will want to do C when the time com...
One alternative way of thinking about this is to decompose plans by which actor the plan is for:
This isn't a perfect breakdown, e.g. Plan A might focus mostly on what the US should do, but it might still be helpful.
This decomposition was proposed by @Lukas Finnveden.
You can't really have a technical "Plan E" because there is approximately no one to implement the plan
AGIs themselves will be implementing some sort of plan (perhaps at very vague and disorganized prompting from humans, or without any prompting at all; which might be influenced by blog posts and such, in publicly available Internet text). This could be relevant for mitigating ASI misalignment if these AGIs are sufficiently aligned to the future of humanity, more so than some of the hypothetical future ASIs (created without following such a plan).
Why are we modeling the leading labs as (1) having a legible, unambiguous lead (i.e. they all agree internally that there’s an N month lead), (2) being willing to spend down their lead at all?
My whole understanding of the “spending down the lead” argument was always between countries. If it’s now set as being between companies I don’t see any reason to think realistically this would happen?
Plan A: 10 years
Plan B: 1-3 years
Plan C: 1-9 months (probably on the lower end of this)
Plan D: ~0 months, but ten people on the inside doing helpful things
I think you mean "starting from fully automated AI R&D" but not 100% sure.
Thus, the numbers I give below are somewhat more optimistic than what you'd get just given the level of political will corresponding to each of these scenarios (as this will might be spent incompetently).
FWIW, for at least plan A and plan B, I feel like the realistic multiplier on how optimistic these are is like at least 3x? Like, I don't see an argument for this kind of plan working with 90%+ probability given realistic assumptions about execution quality.
(I also have disagreements about whether this will work, but at least plan A well-executed seems like it would notice it was starting to be very reckless and then be in a good position to slow down more)
Plan C: The leading AI company has a 2-9 month lead (relative to AI companies which aren't willing to spend as much on misalignment concerns) and is sufficiently institutionally functional to actually spend this lead in a basically reasonable way (perhaps subject to some constraints from outside investors), so some decent fraction of it will be spent on safety.
TLDR: I expect it will be pretty difficult for a "Plan C Leading Lab" to stop scaling, even conditional on having a 2-9 month lead. There are enough uncertainties & Forces of Inertia that will ma...
I suspect that we'll end up in Plan E for two different reasons and would like @Daniel Kokotajlo to comment on them.
how much lead time we have to spend on x-risk focused safety work in each of these scenarios:
| Plan | Probability of Scenario | Takeover Risk Given Scenario | Expected Risk Contribution |
|---|---|---|---|
| Plan A | 5% | 7% | 0.35% |
| Plan B | 10% | 13% | 1.30% |
| Plan C | 25% | 20% | 5.00% |
| Plan D | 45% | 45% | 20.25% |
| Plan E | 15% | 75% | 11.25% |
| Total | 100% | - | 38.15% |
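The arithmetic behind the table above can be checked with a minimal sketch. The numbers are copied from the table; the names `scenarios` and `expected_risk_contributions` are hypothetical, introduced only for illustration. Note that the OP's own footnote cautions that multiplying these probabilities by the risk numbers does not exactly yield the OP's overall probability of takeover, so the total should be read as a rough aggregate rather than a headline estimate.

```python
# Point estimates copied from the table above:
# name -> (probability of scenario, takeover risk given scenario).
scenarios = {
    "Plan A": (0.05, 0.07),
    "Plan B": (0.10, 0.13),
    "Plan C": (0.25, 0.20),
    "Plan D": (0.45, 0.45),
    "Plan E": (0.15, 0.75),
}


def expected_risk_contributions(scenarios):
    """Return P(scenario) * P(takeover | scenario) for each scenario."""
    return {name: p * risk for name, (p, risk) in scenarios.items()}


contributions = expected_risk_contributions(scenarios)
total_probability = sum(p for p, _ in scenarios.values())
total_risk = sum(contributions.values())

# Print each row's contribution and the total as percentages.
for name, contribution in contributions.items():
    print(f"{name}: {contribution:.2%}")
print(f"Total probability: {total_probability:.0%}")
print(f"Total expected risk: {total_risk:.2%}")
```

Swapping in a different distribution over scenarios (for example, one proposed by a commenter) while keeping the same conditional risks gives a quick way to see how much of the disagreement comes from the scenario probabilities versus the per-scenario risk numbers.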
(Having otherwise complained a bunch about some of the commentary/framing around Plan A vs Shut It Down, I do overall like this post and think having the lens of the different worlds is pretty good for planning).
(I am also appreciating how people are using inline reacts)
You can't really have a technical "Plan E" because there is approximately no one to implement the plan; in Plan E situations, the focus should be on moving to a higher level of political will and effort on mitigating risk.
There is no employee who could implement Plan E, but is there nothing that non-employees could do?
Any good posts/papers discussing "handover"? e.g. the handover of AI research to AI R&D agents (the plan of the original OpenAI Superalignment team). I'm also interested in any adjacent research agendas which might help the handover succeed.
Some of the more relevant work I've read (other than this post): Wentworth's slop post, various scalable oversight/safety case papers, and automation collapse.
Here is the takeover risk I expect given a central version of each of these scenarios (and given the assumptions from the prior paragraph):[4]
- Plan A: 7%
- Plan B: 13%
- Plan C: 20%
- Plan D: 45%
- Plan E: 75%
I think it makes more sense to state overall risk instead of takeover risk, because that's what we care about. Could you give very rough guesses on what fraction of achievable utility we would get in expectation conditional on each Plan? ("achievable utility" is the utility we would get if the future goes optimally, like CEV aligned superintelligence.) Or just rou...
I think my plan E fits neatly in this framework: https://www.lesswrong.com/posts/2xHhe4EBHAFofkQJf/plan-e-for-ai-doom
I sometimes think about plans for how to handle misalignment risk. Different levels of political will for handling misalignment risk result in different plans being the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes.
In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan.
Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory:
Now here is some commentary on my current favorite plan for each of these levels of political will, though I won't go into much detail.
We implement an international agreement to mostly eliminate race dynamics and allow for many years to be spent investing in security/safety while also generally adapting to more powerful AI. The ideal capabilities trajectory would depend on how quickly safety research progresses and the robustness of the international agreement, but I'm imagining something like spreading out takeoff over ~10 years. This might end up roughly equivalent to: ensure that if takeoff would have been fast, it is instead as slow as more optimistic people think it will be. You probably want to start slowing down capabilities around the point when AIs can fully automate engineering in AI companies, and to fully pause (spending down most of the available lead time) slightly above the level of capability needed to fully automate AI R&D.
We'd have time to focus much of our effort on moonshots which could plausibly result in high assurance and which might be scalable to very superhuman AIs. By default—as in, unless the success of some moonshots greatly changes the strategic picture—the plan would basically be to keep capabilities below the maximum controllable level for a while until we can use human labor (and AI labor) to mostly resolve relevant alignment problems. Once alignment issues are resolved or we run out of time, we'd hand off ~all safety work to AIs which are barely superhuman (basically, only a bit above the capability bar needed for handoff to be viable in principle, this is probably somewhat above the level of capability needed for fully automating AI R&D). At the point of handoff, we might or might not have scalable solutions to alignment, but we don't necessarily need arbitrarily scalable solutions to succeed. Obviously I'm omitting many, many details here. (I have draft docs discussing many of these details.)
The situation is somewhat similar to Plan A except that we have less time (1-3 additional years rather than 10). The plan looks similar to Plan A, but we focus much less on moonshots and more on straightforward/prosaic approaches. Also, we can't afford to slow down much prior to a level of capability close to full automation of AI R&D. Attempting to hand off to AI systems with only empirical iteration and prosaic methods is key, though we can do a decently high effort version of this. We should still ensure control until handoff (or a bit before this or potentially once we've ruled out scheming strongly). The AI company should put in a bunch of effort (using AI labor) into trying to buy time via increasing the level of political will (e.g., demonstrating capabilities and risks) and helping the US government ensure non-proliferation/lead time. If this goes very well, we might be able to move to Plan A.
We focus on very quickly handing off to AIs even if things are very half-baked. We aim to ensure some level of (probably somewhat janky) control until handoff. Helping the US government acquire lead time is much less promising due to limited US government buy-in, though trying to generate more political will remains promising.
Many things will be totally YOLO'd. The safety team should focus on being able to extract non-sabotaged safety research from the AI, gathering strong evidence for risk, and making an AI which is plausibly safe to hand off to. Ensuring control at the level of the organization should be mostly given up on, but things that might help avoid AIs escalating permissions/access to a point where they can directly sabotage the relevant safety work or mess with arbitrary things run on the company's compute could be worthwhile. More responsible trailing AI companies should focus on exporting safety work (in addition to policy/coordination work).
We can also consider a "Plan E" scenario where the level of will and number of employees who are working on mitigating the relevant misalignment risks is substantially less than in a "Ten people on the inside" style scenario. As in, there are only a few people (or perhaps there is no one) who are worried about these risks, are aiming to mitigate them, and have access to frontier AI systems at the leading developer(s). One extreme scenario would be that the entire project is heavily siloed and secretive with few people (or no one) working on mitigating AI takeover risk and discussion of misalignment concerns is effectively taboo within the project. You can't really have a technical "Plan E" because there is approximately no one to implement the plan; in Plan E situations, the focus should be on moving to a higher level of political will and effort on mitigating risk.
Another way to think about this is to think about how much lead time we have to spend on x-risk focused safety work in each of these scenarios:
What do I think is the chance that we end up in the world of Plan A, B, C or D? As in, do we have the will (and competence) to do something which isn't much worse than the given plan (presumably with many modifications based on the exact situation) while still being worse than the next better plan? (Obviously the details will be less specific than the exact details I gave above.) It depends on timelines, but conditioning on a trajectory where by default (in the absence of active intervention) we would have reached AIs that beat top experts at ~everything prior to 2035, here are my not-very-well-considered guesses:
What level of takeover risk do I expect in each of these situations?[1] This depends substantially on the quality of execution, which is somewhat correlated with the level of political will. I won't assume that my preferred strategy (given that level of political will) is used. For Plans C and above, I will assume "sufficiently institutionally functional to actually spend this lead time in a basically reasonable way" and that the available lead time is actually spent on safety. Thus, the numbers I give below are somewhat more optimistic than what you'd get just given the level of political will corresponding to each of these scenarios (as this will might be spent incompetently).
Note that I'm ignoring the possibility of switching between these regimes during takeoff while humans are directly in control; for instance, I'm ignoring the possibility of starting in a Plan D scenario, but then having this shift to Plan C due to evidence of misalignment risk.[2] However, I am including the possibility for (hopefully aligned) AIs to manage the situation very differently after humans voluntarily hand over strategic decision making to AIs (insofar as this happens). As with my probabilities given above, I'm also conditioning on a trajectory where by default (in the absence of active intervention) we would have reached AIs that beat top experts at ~everything prior to 2035.[3]
Here is the takeover risk I expect given a central version of each of these scenarios (and given the assumptions from the prior paragraph):[4]
A substantial fraction of the risk in Plan A and Plan B worlds comes from incompetence (as in, if the overall strategy and decision making were better, risk would be much lower) and another substantial fraction comes from the possibility of takeover being very hard to avoid.
What are the main sources of political will in each of these scenarios? In general, Plans A and B are mostly driven by governments (mostly the US government) while Plans C and D are mostly driven by AI company leadership and employees. In Plan A and Plan B, a high level of will from the US government is necessary (and could be sufficient for at least Plan B, though AI company leadership caring is helpful). Plan C likely requires a ton of buy-in from AI company leadership, though sufficiently strong employee pressure could mostly suffice. Additional political will in Plan D could come from (in descending order of importance under my views): employee efforts (both pressure and direct labor), AI company leadership, pressure from something like corporate campaigns (external pressure which mostly operates on customers, suppliers, or maybe investors), and relatively weak regulation.
Given these probabilities and levels of risk, I'm inclined to focus substantially on helping with Plans C and D. This applies to both research and generating marginal political will. Correspondingly, I think what AI company employees and leadership think about AI (existential) safety is very important and political strategies that result in AI company employees/leadership being more dismissive of safety (e.g. due to negative polarization or looking cringe) look less compelling.
Note that risks other than AI takeover are also generally reduced by having more actors take powerful AI seriously and having more coordination. ↩︎
The risk conditional on starting in a Plan D scenario is lower than conditional on remaining in a Plan D scenario and the risk conditional on starting in a Plan A scenario is higher than if we condition on remaining. ↩︎
This sentence was added in an edit because I realized I forgot to include this sort of caveat. ↩︎
Multiplying the probabilities given above by the takeover risk numbers given here doesn't exactly yield my overall probability of takeover because of the optimistic assumption of reasonable execution/competence (making actual risk higher) and also because these risk numbers are for central versions of each scenario while the probabilities are for ranges of plans that include somewhat higher levels of will (making actual risk lower). (Specifically: "will (and competence) to do something which isn't much worse than the given plan while still being worse than the next better plan". So the probabilities for Plan C really include <Plan B while >= Plan C.) ↩︎