If one thinks the chance of an existential disaster is "anywhere between 10% and 90%", one should definitely worry about the potential for any plan to counter it to backfire.
Is permanent disempowerment (where the future of humanity only gets a tiny sliver of the reachable universe) an "existential disaster"? It's not literal extinction; "existential risk" can mean either, and the distinction can be crucial.
I think permanent disempowerment (or extinction) is somewhere between 90% and 95% likely unconditionally, and north of 95% conditional on building a superintelligence by 2050. But literal extinction is only between 10% and 30% (on the current trajectory). The chances improve with interventions such as a lasting ASI Pause, including an AGI-led ASI Pause, which makes it more likely that ASIs are at least aligned with the AGIs. A lasting AGI Pause (rather than an ASI Pause) is the only straightforward and predictably effective way to avoid permanent disempowerment, and a sane civilization would just do that, with some margin of even weaker AIs and even worse hardware.
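Restating these estimates in notation, just to keep the two outcomes distinct (same numbers as above, not a new claim):

$$P(\text{disempowerment or extinction}) \approx 0.90\text{–}0.95,\qquad P(\text{disempowerment or extinction} \mid \text{ASI by 2050}) > 0.95,\qquad P(\text{extinction}) \approx 0.10\text{–}0.30.$$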
Dodging permanent disempowerment (rather than merely extinction) without an AGI Pause likely needs AGIs that somehow both haven't taken over and are simultaneously effective enough at helping with the ASI Pause effort. This could just take the form of allocating 80% of AGI labor or whatever to ASI alignment projects, so that capabilities never outpace the ability to either contain or avoid misalignment. So not necessarily a literal Pause, when there are AGIs around that can set up lasting institutions with inhuman levels of robustness, capable of implementing commitments to pursue ASI alignment that are less blunt than a literal Pause yet still effective. But for the same reasons that this kind of thing might work, it seems unlikely to work without an AGI takeover.
where the future of humanity only gets a tiny sliver of the reachable universe
I am not sure how to think about this. "Canned primates" are not going to reach a big part of the physically reachable universe. For the purposes of thinking about "the light cone", one should still think about "merge with AI", "uploading", and so on. That line of reasoning should not be about "humans vs AIs", but about ways to have a "good merge" (that is, without succumbing to S-risks, and without doing bad things to unmodified biologicals).
Also, I tend to privilege already living humans or their close descendants over the more remote ones, so achieving personal immortality is important if one wants to enjoy a sizable chunk of "the light cone" (it takes time to reach it). Of course, we need personal immortality ASAP anyway, otherwise their "everyone dies" would really come true (although not all at once, and not without replacement, but that's cold comfort for those currently alive).
That's the intended meaning; I go into more detail in the linked post. Hence "the future of humanity" rather than simply "humanity": something humanity would endorse as its future, which is not exclusively (or at all) biological humans. Currently living humans could in principle develop tools to uplift themselves all the way to star-sized superintelligences, but that requires a star, while what humans might instead get is a metaphorical server rack, hence permanent disempowerment.
My comment is primarily an objection about vague terminology not distinguishing permanent disempowerment from extinction. Avoiding permanent disempowerment seems like the correct shared cause, while the cause of merely non-extinction has many ways of endorsing plans that lead to permanent disempowerment. And not being content with permanent disempowerment (even under the conditions of eutopia within strict constraints on resources) depends on noticing that more is possible.
Yes.
What I am going to say is semi-off-topic for this post (I was trying not to consider potential object-level disagreements), but I have noticed that when discussing human intelligence augmentation, the authors of IABIED always talk only about genetic enhancements and never about direct merge between humans and electronic devices (which seems to also be consistent with their past writings on this). So it seems that for unspecified (but perhaps very rational) reasons, they want to keep enhanced humans purely biological for quite a while.
(Perhaps they think that we can't handle close coupling of humans and electronics in a way which is existentially safe at this time.)
Whereas sufficient uplifting requires fairly radical changes. And, in any case, intelligence augmentation via coupling with electronics is likely to be a much faster path and to produce a more radical augmentation. But, perhaps, they think that the associated existential risks are too high...
Genetic enhancement seems like a safe-ish way of getting a few standard deviations without yet knowing what you are really doing, one that current humanity could actually attempt in practice. And that might help a lot both with the "knowing what you are doing" part and with not doing irreversible things without knowing what you are doing. Any change risks misalignment; uplifting to a superintelligence requires ASI-grade alignment theory and technology; even lifespans for baseline biological humans that run into centuries risk misalignment (since this has never happened before). There's always cryonics, which enables waiting for future progress, if civilization were at all serious about it.
So when you talk about "merging with AI", that is very suspicious, because a well-developed uplifting methodology doesn't obviously look anything like "merging with AI". You become some kind of more capable mind that's different from what you were before, without taking irreversible steps towards something you wouldn't endorse. Without such a methodology, it's a priori about as bad an idea as building superintelligence in 2029.
I usually think about “reversible merges” for the purpose of intelligence augmentation (not for the purpose of space travel, though).
I tend to think that high-end non-invasive BCI are powerful enough for that and safer than implants. But yes, there still might be serious risks, both personal and existential.
I take 'backfire' to mean 'get more of the thing you don't want than you would otherwise, as a direct result of your attempt to get less of it.' If you mean it some other way, then the rest of my comment isn't really useful.
This all makes sense. I do think underground orgs exist; right now their chances of beating the leaders are not too high.
There are even “intermediate status orgs”; e.g. we know about Ilya’s org because he told us, but we don’t know much about it.
The post is for people to ponder how likely all this is to backfire in these (or other) fashions. I am not optimistic personally (all these estimates depend on the difficulty of achieving ASI, and my personal estimates are that ASI is relatively easy to achieve and that timelines are really short; the trickier the task of creating ASI is, the better the chances of such a plan not backfiring).
If one thinks about Eliezer's original threat model, someone launching true recursive self-improvement in their garage… the only reason we don't think about that much at the moment is that the high-compute orgs are moving fast; the moment they stop, Eliezer's original threat model will start coming back into play more prominently.
I think positions here come out in four quadrants (though it is of course a spectrum), based on how likely you think Doom is and how easy or hard (that is: resource-intensive) you expect ASI development to be.
ASI Easy/Doom Very Unlikely: Plan obviously backfires; you could have had nice things, but were too cautious!
ASI Hard/Doom Very Unlikely: Unlikely to backfire, but you might have been better off pressing ahead, because there was nothing to worry about anyway.
ASI Easy/Doom Very Likely: We're kinda fucked anyway in this world, so I'd want to have pretty high confidence it's the world we're in before attempting any plan optimized for it. But yes, here it looks like the plan backfires (in that we're selecting even harder than in default-world for power-seeking, willingness to break norms, and non-transparency in coordinating around who gets to build it). My guess is this is the world you think we're in. I think this is irresponsibly fatalistic and also unlikely, but I don't think it matters to get into it here.
ASI Hard/Doom Very Likely: Plan plausibly works.
I expect near-term ASI development to be resource-intensive, or to rely on not-yet-complete resource-intensive research. I remain concerned about the brain-in-a-box scenarios, but it's not obvious to me that they're much more pressing in 2025 than they were in 2020, except in ways that are downstream of LLM development (I haven't looked super closely), which is more tractable to coordinate action around anyway, and that action plausibly leads to decreased risk on the margin even if the principal threat is from a brain-in-a-box. I assume you disagree with all of this.
I think your post is just aimed at a completely different threat model than the book even attempts to address, and I think you would be having more of the discussion you want to have if you had opened by talking explicitly about your actual crux (which you seemed to know was the real crux ahead of time), rather than inciting an object-level discussion colored by such a powerful background disagreement. As-is, it feels like you just disagree with the way the book is scoped, and would rather talk to Eliezer about brain-in-a-box than talk to William about the tentative-proposals-to-solve-a-very-different-problem.
Yeah, I think my position is ASI is easy/Doom likelihood is medium.
And, more specifically, I think that a good part of doom likelihood is due to people seeming to disagree super radically with each other about the details of anything related to AI existential safety.
This is an empirical observation; I don't have a good model for where this radical disagreement comes from. But this seeming inability to even start narrowing the disagreements down is not a good sign; first of all, it means that people are likely to keep disagreeing sharply with each other even when working together under the umbrella of a possible ban treaty.
So, without a ban, this seems to suggest that the views of the "race winners" on AI safety are very unpredictable (that's not good; it's very difficult to predict what would happen). And with a ban, this seems to suggest a high likelihood of ideologically driven rebellions against the ban (perhaps covert rather than overt, given the threat of armed force).
If people were able to talk to each other and converge on something, I would expect the doom likelihood to be reducible to more palatable levels. But without being able to narrow the disagreements down somewhat, the doom chances have to be quite significant.
I think targeting specific policy points rather than highlighting your actual crux makes this worse, not better
The book is neutral about timelines, and about whether ASI is easy or not. It specifically calls those "hard issues" that it is going to sidestep.
So it would be difficult to make this a crux. The book implies that its recommendations don’t depend on one’s position on those “hard issues”.
If, instead, its recommendations varied depending on one's views on timelines and on the difficulty of achieving ASI, that would be different.
As it is, the only specific dependency is the assumption that AI safety is very hard (in this sense, I did mention that I am speaking from the viewpoint of people who think it's hard, but not very hard: the "10%-90% doom probability").
I agree that humans are “disaster monkeys” poorly equipped to handle this, but I think they are also poorly equipped to handle the ban. They barely handle nuclear controls, and this one is way more difficult. I doubt the P(doom) conditional on the ban will be lower than P(doom) conditional on no ban. (Here I very much agree with Eliezer’s methodology of not talking of absolute values of P(doom), but of comparing conditional P(doom) values.)
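Writing out the comparison I mean (just notation for the sentence above, not a calculation):

$$P(\text{doom} \mid \text{ban}) \quad \text{vs.} \quad P(\text{doom} \mid \text{no ban}),$$

and my doubt is precisely that the left-hand side ends up lower than the right-hand side.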
For most, LLMs are the salient threat vector at this time, and the practical recommendations in the book are geared toward that. You did not say in your post 'I believe that brain-in-a-box is the true concern; the book's recommendations don't work for this, because Chapter 13 is mostly about LLMs.' That would be a different post (and redundant with a bunch of Steve Byrnes stuff).
Instead, you completely buried the lede and made a post inviting people to talk in circles with you unless they magically divined your true objection (which is only distantly related to the topic of the post). That does not look like a good faith attempt to get people on the same page.
I am agnostic. I don't think humans necessarily need to modify the GPT architecture; I think GPT-6 would be perfectly capable of doing that for them.
But I also think that those "brain-in-a-box" systems will use (open-weight or closed-weight) LLMs as important components. It's a different ball game now; we are more than halfway towards "true self-improvement", because one can incorporate open-weight LLMs (until the system progresses so much that they can be phased out).
The leading LLM systems are starting to provide real assistance in AI research, and even their open versions are pretty good at being components of the next big thing. So yes, this is purely from LLM trends (lab insiders are tweeting that chatbots are saturated and that their current focus is how much LLMs can assist in AI research, and plenty of people express openness to modifying the architecture at will). I don't know if we are going to continue to call them LLMs, but it does not matter; there is no strict boundary between LLMs and what is being built next.
I don't want to elaborate further on the technical details (what I am saying above is reasonably common knowledge, but if I start giving further technical details, I might say something actually useful for acceleration efforts).
But yes, I am saying that one should expect the trends to be faster than what follows from previous LLM trends, because of how the labs are using them more and more for AI research. METR doubling periods should start shrinking soon.
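To make concrete what shrinking doubling periods would mean, here is a toy sketch with made-up placeholder numbers (not METR's actual measurements), comparing a constant doubling period against one that shrinks as AI assistance compounds:

```python
# Toy extrapolation of a METR-style task time horizon; all numbers are
# hypothetical placeholders, purely to illustrate the shape of the trend.

def horizons(initial_horizon_hours, initial_doubling_months, shrink_factor, n_doublings):
    """Yield (elapsed_months, horizon_hours) when each successive doubling
    of the time horizon takes shrink_factor times as long as the previous one."""
    horizon, doubling, elapsed = initial_horizon_hours, initial_doubling_months, 0.0
    for _ in range(n_doublings):
        elapsed += doubling
        horizon *= 2
        doubling *= shrink_factor
        yield elapsed, horizon

# Constant doubling period (plain extrapolation of the LLM trend) vs. a
# shrinking one (the "labs use AI for AI research" claim above).
for label, shrink in [("constant doubling", 1.0), ("shrinking doubling", 0.85)]:
    print(label)
    for months, hours in horizons(1.0, 7.0, shrink, 6):
        print(f"  after {months:5.1f} months: ~{hours:4.0f}-hour tasks")
```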
would AI safety research itself slow down by orders of magnitude?
As far as I understood, the IABIED plan is to ensure that no one ever creates anything except Verifiably Incapable Systems until AI alignment gets solved. But it wouldn't prevent mankind from uniting the AI companies into a megaproject, confining AI research to said project, letting anyone send their takes to the project's forum, and letting the public view anything approved by the forum's admins (e.g. capability evaluations, but not architecture discussions).
In addition, the public would be allowed to create tiny models like the ones on which Agent-4 from the AI-2027 forecast did experiments to solve mechinterp, and to run verifiably incapable models, finetune them on approved[1] finetuning data, and steer them.
What I don't understand is why the underground lab wouldn't join the INTERNATIONAL megaproject. This behaviour would require them to be too reckless, to be omnicidal maniacs, or to want to take over the world. And no, an anti-woke stance isn't an explanation, because China would also participate and the CCP isn't pro-woke.
Unfortunately, your second point still stands: before a Yudkowsky-style takeover of AI research, the labs could actually counteract it.
Finetuning the models on anything unapproved (e.g. because it misaligns the models) should lead to the finetuner being invited to the project or prohibited from informing anyone else that the dataset is unapproved.
What I don't understand is why the underground lab wouldn't join the INTERNATIONAL megaproject.
Because they don't want to be known. That's what the word "underground" means.
An enforcement regime of this kind is prone to abuses, so there will be a lot of distrust; also, they might feel that everyone else is too incapacitated, and that while they would not normally have a chance against larger above-ground orgs, the new situation is different.
to want to take over the world
Yes, this would be their plan: to take over the world, or to pass control to an ASI which they presume to be friendly to them (and, if they have an altruistic mindset, to everyone else too; but even in this case, the problem is that their assumptions of friendliness might be mistaken).
If one thinks the chance of an existential disaster is close to 100%, one might tend to worry less about the potential for a plan to counter it to backfire. It's not clear whether that is a correct approach even if one thinks the chances of an existential disaster are that high, but I am going to set that aside.
If one thinks the chance of an existential disaster is "anywhere between 10% and 90%", one should definitely worry about the potential for any plan to counter it to backfire.
Out of all the ways the IABIED plan to ban AI development and to ban publication of AI research could potentially backfire, I want to list the three that seem most obvious and particularly salient. I think it's useful to have them separate from object-level discussions.
1. Change of the winner. The most obvious possibility is that the plan would fail to stop ASI, but would change the winner of the race. If one thinks that the chance of an existential disaster is "anywhere between 10% and 90%", but that the actual probability depends on the identity and practices of the race winner(s), this might make the chances much worse. Unless one thinks the chances of an existential disaster are already very close to 100%, one should not like the prospect of an underground lab winning the race during the prohibition period.
2. Intensified race and other possible countermeasures. The road to prohibition is a gradual process; it's not a switch one can flip on immediately. This plan is not talking about a "prohibition via a coup". When it starts looking like the chances of a prohibition being enacted are significant, this can spur a particularly intense race (a number of AI orgs would view the threat of prohibition on par with the threat of a competitor winning). Again, if one thinks the chances of an existential disaster are already very close to 100%, this might not matter too much, but otherwise the further accelerated race might make the chances of avoiding existential disasters worse. Before it succeeds at "shutting it all down", gradual advancement of this plan will have the effect of creating a "crisis mode", with various actors doing various things in "crisis mode".
3. Various impairments for AI safety research. Regarding the proposed ban on publication of AI research, one needs to ask where various branches of AI safety research stand. The boundary between safety research and capability research is thin; there is a large overlap. For example, talking about interpretability research, Nate was saying (April 2023, https://www.lesswrong.com/posts/BinkknLBYxskMXuME/if-interpretability-research-goes-well-it-may-get-dangerous):
I'm still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed.
It would be good to have some clarity on this from the authors of the plan. Do they propose that the ban on publications cover all research that might advance AI capabilities, including AI safety research that might advance capabilities? Where do they stand on this? For those of us who put the chance of an existential disaster "anywhere between 10% and 90%", this feels like something with strong potential to make our chances worse. Not only does this whole plan increase the chances of shifting the ASI race winner to an underground lab, but would that underground lab also be deprived of the benefits of being aware of advances in AI safety research, and would AI safety research itself slow down by orders of magnitude?