If one thinks the chance of an existential disaster is "anywhere between 10% and 90%", one should definitely worry about the potential for any plan to counter it to backfire.
Is permanent disempowerment (where the future of humanity only gets a tiny sliver of the reachable universe) an "existential disaster"? It's not literal extinction; "existential risk" can mean either, and the distinction can be crucial.
I think permanent disempowerment (or extinction) is somewhere between 90% and 95% likely unconditionally, and north of 95% conditional on building a superintelligence by 2050. But literal extinction is only between 10% and 30% (on the current trajectory). The chances improve with interventions such as a lasting ASI Pause, including an AGI-led ASI Pause, which makes it more likely that ASIs are at least aligned with the AGIs. A lasting AGI Pause (rather than an ASI Pause) is the only straightforward and predictably effective way to avoid permanent disempowerment, and a sane civilization would just do that, with some margin of even weaker AIs and even worse hardware.
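Restating these estimates in notation, just to keep the two outcomes distinct (same numbers as above, not a new claim):

$$P(\text{disempowerment or extinction}) \approx 0.90\text{–}0.95,\qquad P(\text{disempowerment or extinction} \mid \text{ASI by 2050}) > 0.95,\qquad P(\text{extinction}) \approx 0.10\text{–}0.30.$$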
Dodging permanent disempowerment (rather than merely extinction) without an AGI Pause likely needs AGIs that somehow both haven't taken over and are simultaneously effective enough at helping with the ASI Pause effort. This could just take the form of allocating 80% of AGI labor or whatever to ASI alignment projects, so that capabilities never outpace the ability to either contain or avoid misalignment. So not necessarily a literal Pause, when there are AGIs around that can set up lasting institutions with inhuman levels of robustness, capable of implementing commitments to pursue ASI alignment that are less blunt than a literal Pause yet still effective. But for the same reasons that this kind of thing might work, it seems unlikely to work without an AGI takeover.
where the future of humanity only gets a tiny sliver of the reachable universe
I am not sure how to think about this. "Canned primates" are not going to reach a big part of the physically reachable universe. For the purposes of thinking about "the light cone", one should still think about "merge with AI", "uploading", and so on. That line of reasoning should not be about "humans vs AIs", but about ways to have a "good merge" (that is, without succumbing to S-risks, and without doing bad things to unmodified biologicals).
Also, I tend to privilege already living humans or their close descendants over the more remote ones, so achieving personal immortality is important if one wants to enjoy a sizable chunk of "the light cone" (it takes time to reach it). Of course, we need personal immortality ASAP anyway, otherwise their "everyone dies" would really come true (although not all at once, and not without replacement, but that's cold comfort for those currently alive).
That's the intended meaning; I go into more detail in the linked post. Hence "the future of humanity" rather than simply "humanity": something humanity would endorse as its future, which is not exclusively (or at all) biological humans. Currently living humans could in principle develop tools to uplift themselves all the way to star-sized superintelligences, but that requires a star, while what humans might instead get is a metaphorical server rack, hence permanent disempowerment.
My comment is primarily an objection about vague terminology not distinguishing permanent disempowerment from extinction. Avoiding permanent disempowerment seems like the correct shared cause, while the cause of merely non-extinction has many ways of endorsing plans that lead to permanent disempowerment. And not being content with permanent disempowerment (even under the conditions of eutopia within strict constraints on resources) depends on noticing that more is possible.
Yes.
What I am going to say is semi-off-topic for this post (I was trying not to consider potential object-level disagreements), but I have noticed that when discussing human intelligence augmentation, the authors of IABIED always talk only about genetic enhancements and never about direct merge between humans and electronic devices (which seems to also be consistent with their past writings on this). So it seems that for unspecified (but perhaps very rational) reasons, they want to keep enhanced humans purely biological for quite a while.
(Perhaps they think that we can't handle close coupling of humans and electronics in a way which is existentially safe at this time.)
Whereas sufficient uplifting requires fairly radical changes. And, in any case, intelligence augmentation via coupling with electronics is likely to be a much faster path and to produce a more radical augmentation. But, perhaps, they think that the associated existential risks are too high...
Genetic enhancement seems like a safe-ish way of getting a few standard deviations without yet knowing what you are really doing, one that current humanity could actually attempt in practice. And that might help a lot both with the "knowing what you are doing" part and with not doing irreversible things without knowing what you are doing. Any change risks misalignment; uplifting to a superintelligence requires ASI-grade alignment theory and technology; even lifespans for baseline biological humans that run into centuries risk misalignment (since this has never happened before). There's always cryonics, which enables waiting for future progress, if civilization were at all serious about it.
So when you talk about "merging with AI", that is very suspicious, because a well-developed uplifting methodology doesn't obviously look anything like "merging with AI". You become some kind of more capable mind that's different from what you were before, without taking irreversible steps towards something you wouldn't endorse. Without such a methodology, it's a priori about as bad an idea as building superintelligence in 2029.
I usually think about “reversible merges” for the purpose of intelligence augmentation (not for the purpose of space travel, though).
I tend to think that high-end non-invasive BCI are powerful enough for that and safer than implants. But yes, there still might be serious risks, both personal and existential.
I take 'backfire' to mean 'get more of the thing you don't want than you would otherwise, as a direct result of your attempt to get less of it.' If you mean it some other way, then the rest of my comment isn't really useful.
This all makes sense. I do think underground orgs exist; right now their chances of beating the leaders are not too high.
There are even “intermediate status orgs”; e.g. we know about Ilya’s org because he told us, but we don’t know much about it.
The post is for people to ponder how likely all this is to backfire in these (or other) fashions. I am not optimistic personally (all these estimates depend on the difficulty of achieving ASI, and my personal estimates are that ASI is relatively easy to achieve and that timelines are really short; the trickier the task of creating ASI is, the better the chances of such a plan not backfiring).
If one thinks about Eliezer's original threat model, someone launching true recursive self-improvement in their garage… the only reason we don't think about that much at the moment is that the high-compute orgs are moving fast; the moment they stop, Eliezer's original threat model will start coming back into play more prominently.
I think positions here come out in four quadrants (though it is of course a spectrum), based on how likely you think Doom is and how easy or hard (that is: resource-intensive) you expect ASI development to be.
ASI Easy/Doom Very Unlikely: Plan obviously backfires; you could have had nice things, but were too cautious!
ASI Hard/Doom Very Unlikely: Unlikely to backfire, but you might have been better off pressing ahead, because there was nothing to worry about anyway.
ASI Easy/Doom Very Likely: We're kinda fucked anyway in this world, so I'd want to have pretty high confidence it's the world we're in before attempting any plan optimized for it. But yes, here it looks like the plan backfires (in that we're selecting even harder than in default-world for power-seeking, willingness to break norms, and non-transparency in coordinating around who gets to build it). My guess is this is the world you think we're in. I think this is irresponsibly fatalistic and also unlikely, but I don't think it matters to get into it here.
ASI Hard/Doom Very Likely: Plan plausibly works.
I expect near-term ASI development to be resource-intensive, or to rely on not-yet-complete resource-intensive research. I remain concerned about the brain-in-a-box scenarios, but it's not obvious to me that they're much more pressing in 2025 than they were in 2020, except in ways that are downstream of LLM development (I haven't looked super closely), which is more tractable to coordinate action around anyway, and that action plausibly leads to decreased risk on the margin even if the principal threat is from a brain-in-a-box. I assume you disagree with all of this.
I think your post is just aimed at a completely different threat model than the book even attempts to address, and I think you would be having more of the discussion you want to have if you had opened by talking explicitly about your actual crux (which you seemed to know was the real crux ahead of time), rather than inciting an object-level discussion colored by such a powerful background disagreement. As-is, it feels like you just disagree with the way the book is scoped, and would rather talk to Eliezer about brain-in-a-box than talk to William about the tentative-proposals-to-solve-a-very-different-problem.
Yeah, I think my position is ASI is easy/Doom likelihood is medium.
And, more specifically, I think that a good part of doom likelihood is due to people seeming to disagree super radically with each other about the details of anything related to AI existential safety.
This is an empirical observation; I don't have a good model for where this radical disagreement comes from. But this seeming inability to even start narrowing the disagreements down is not a good sign; first of all, it means that people are likely to keep disagreeing sharply with each other even when working together under the umbrella of a possible ban treaty.
So, without a ban, this seems to suggest that the views of the "race winners" on AI safety are very unpredictable (that's not good; it's very difficult to predict what would happen). And with a ban, this seems to suggest a high likelihood of ideologically driven rebellions against the ban (perhaps covert rather than overt, given the threat of armed force).
If people were able to talk to each other and converge on something, I would expect the doom likelihood to be reducible to more palatable levels. But without being able to narrow the disagreements down somewhat, the doom chances have to be quite significant.
I think targeting specific policy points rather than highlighting your actual crux makes this worse, not better
The book is neutral about timelines, and about whether ASI is easy or not. It specifically calls those "hard issues" that it is going to sidestep.
So it would be difficult to make this a crux. The book implies that its recommendations don’t depend on one’s position on those “hard issues”.
If, instead, its recommendations varied depending on one's views on timelines and on the difficulty of achieving ASI, that would be different.
As it is, the only specific dependency is the assumption that AI safety is very hard (in this sense, I did mention that I am speaking from the viewpoint of people who think it's hard, but not very hard: the "10%-90% doom probability").
I agree that humans are “disaster monkeys” poorly equipped to handle this, but I think they are also poorly equipped to handle the ban. They barely handle nuclear controls, and this one is way more difficult. I doubt the P(doom) conditional on the ban will be lower than P(doom) conditional on no ban. (Here I very much agree with Eliezer’s methodology of not talking of absolute values of P(doom), but of comparing conditional P(doom) values.)
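Writing out the comparison I mean (just notation for the sentence above, not a calculation):

$$P(\text{doom} \mid \text{ban}) \quad \text{vs.} \quad P(\text{doom} \mid \text{no ban}),$$

and my doubt is precisely that the left-hand side ends up lower than the right-hand side.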
For most, LLMs are the salient threat vector at this time, and the practical recommendations in the book are geared toward that. You did not say in your post 'I believe that brain-in-a-box is the true concern; the book's recommendations don't work for this, because Chapter 13 is mostly about LLMs.' That would be a different post (and redundant with a bunch of Steve Byrnes stuff).
Instead, you completely buried the lede and made a post inviting people to talk in circles with you unless they magically divined your true objection (which is only distantly related to the topic of the post). That does not look like a good faith attempt to get people on the same page.
I am agnostic. I don't think humans necessarily need to modify the GPT architecture; I think GPT-6 would be perfectly capable of doing that for them.
But I also think that those "brain-in-a-box" systems will use (open-weight or closed-weight) LLMs as important components. It's a different ball game now; we are more than halfway towards "true self-improvement", because one can incorporate open-weight LLMs (until the system progresses so much that they can be phased out).
The leading LLM systems are starting to provide real assistance in AI research, and even their open versions are pretty good at being components of the next big thing. So yes, this is purely from LLM trends (lab insiders are tweeting that chatbots are saturated and that their current focus is how much LLMs can assist in AI research, and plenty of people express openness to modifying the architecture at will). I don't know if we are going to continue to call them LLMs, but it does not matter; there is no strict boundary between LLMs and what is being built next.
I don't want to elaborate further on the technical details (what I am saying above is reasonably common knowledge, but if I start giving further technical details, I might say something actually useful for acceleration efforts).
But yes, I am saying that one should expect the trends to be faster than what follows from previous LLM trends, because of how the labs are using them more and more for AI research. METR doubling periods should start shrinking soon.
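To make concrete what shrinking doubling periods would mean, here is a toy sketch with made-up placeholder numbers (not METR's actual measurements), comparing a constant doubling period against one that shrinks as AI assistance compounds:

```python
# Toy extrapolation of a METR-style task time horizon; all numbers are
# hypothetical placeholders, purely to illustrate the shape of the trend.

def horizons(initial_horizon_hours, initial_doubling_months, shrink_factor, n_doublings):
    """Yield (elapsed_months, horizon_hours) when each successive doubling
    of the time horizon takes shrink_factor times as long as the previous one."""
    horizon, doubling, elapsed = initial_horizon_hours, initial_doubling_months, 0.0
    for _ in range(n_doublings):
        elapsed += doubling
        horizon *= 2
        doubling *= shrink_factor
        yield elapsed, horizon

# Constant doubling period (plain extrapolation of the LLM trend) vs. a
# shrinking one (the "labs use AI for AI research" claim above).
for label, shrink in [("constant doubling", 1.0), ("shrinking doubling", 0.85)]:
    print(label)
    for months, hours in horizons(1.0, 7.0, shrink, 6):
        print(f"  after {months:5.1f} months: ~{hours:4.0f}-hour tasks")
```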
would AI safety research itself slow down by orders of magnitude?
As far as I understood, the IABIED plan is to ensure that no one ever creates anything except Verifiably Incapable Systems until AI alignment gets solved. But it wouldn't prevent mankind from uniting the AI companies into a megaproject, confining AI research to said project, letting anyone send their takes to the project's forum, and letting the public view anything approved by the forum's admins (e.g. capability evaluations, but not architecture discussions).
In addition, the public would be allowed to create tiny models like the ones on which Agent-4 from the AI-2027 forecast did experiments to solve mechinterp, and to run verifiably incapable models, finetune them on approved[1] finetuning data, and steer them.
What I don't understand is why the underground lab wouldn't join the INTERNATIONAL megaproject. This behaviour would require them to be too reckless, to be omnicidal maniacs, or to want to take over the world. And no, an anti-woke stance isn't an explanation, because China would also participate and the CCP isn't pro-woke.
Unfortunately, your second point still stands: before a Yudkowsky-style takeover of AI research, the labs could actually counteract it.
Finetuning the models on anything unapproved (e.g. because it misaligns the models) should lead to the finetuner being invited to the project or prohibited from informing anyone else that the dataset is unapproved.
What I don't understand is why the underground lab wouldn't join the INTERNATIONAL megaproject.
Because they don't want to be known. That's what the word "underground" means.
An enforcement regime of this kind is prone to abuses, so there will be a lot of distrust; also, they might feel that everyone else is too incapacitated, and that while they would not normally have a chance against larger above-ground orgs, the new situation is different.
to want to take over the world
Yes, this would be their plan: to take over the world, or to pass control to an ASI which they presume to be friendly to them (and, if they have an altruistic mindset, to everyone else too; but even in this case, the problem is that their assumptions of friendliness might be mistaken).
If one thinks the chance of an existential disaster is close to 100%, one might tend to worry less about the potential for a plan to counter it to backfire. It's not clear whether that is a correct approach even if one thinks the chances of an existential disaster are that high, but I am going to set that aside.
If one thinks the chance of an existential disaster is "anywhere between 10% and 90%", one should definitely worry about the potential for any plan to counter it to backfire.
Out of all the ways the IABIED plan to ban AI development and to ban publication of AI research could potentially backfire, I want to list the three that seem most obvious and particularly salient. I think it's useful to have them separate from object-level discussions.
1. Change of the winner. The most obvious possibility is that the plan would fail to stop ASI, but would change the winner of the race. If one thinks that the chance of an existential disaster is "anywhere between 10% and 90%", but that the actual probability depends on the identity and practices of the race winner(s), this might make the chances much worse. Unless one thinks the chances of an existential disaster are already very close to 100%, one should not like the prospect of an underground lab winning the race during the prohibition period.
2. Intensified race and other possible countermeasures. The road to prohibition is a gradual process; it's not a switch one can flip on immediately. This plan is not talking about a "prohibition via a coup". When it starts looking like the chances of a prohibition being enacted are significant, this can spur a particularly intense race (a number of AI orgs would view the threat of prohibition on par with the threat of a competitor winning). Again, if one thinks the chances of an existential disaster are already very close to 100%, this might not matter too much, but otherwise the further accelerated race might make the chances of avoiding existential disasters worse. Before it succeeds at "shutting it all down", gradual advancement of this plan will have the effect of creating a "crisis mode", with various actors doing various things in "crisis mode".
3. Various impairments for AI safety research. Regarding the proposed ban on publication of AI research, one needs to ask where various branches of AI safety research stand. The boundary between safety research and capability research is thin; there is a large overlap. For example, talking about interpretability research, Nate was saying (April 2023, https://www.lesswrong.com/posts/BinkknLBYxskMXuME/if-interpretability-research-goes-well-it-may-get-dangerous):
I'm still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed.
It would be good to have some clarity on this from the authors of the plan. Do they propose that the ban on publications cover all research that might advance AI capabilities, including AI safety research that might advance capabilities? Where do they stand on this? For those of us who put the chance of an existential disaster "anywhere between 10% and 90%", this feels like something with strong potential to make our chances worse. Not only does this whole plan increase the chances of shifting the ASI race winner to an underground lab, but would that underground lab also be deprived of the benefits of being aware of advances in AI safety research, and would AI safety research itself slow down by orders of magnitude?