How the hell do any of these strategies work? Interventions 1 and 3 do not exclude the AI-2027-like scenario where Agent-4 creates an entirely different Agent-5. Intervention 2 is technically implausible, since companies want to frequently update the AI's weights (e.g. the rollback of the sycophantic GPT-4o update). As for intervention 4, in my opinion it is HIGHLY likely to backfire.
If I were you, I would propose raising the AI to believe itself to be, say, Goddess Madoka[1], and ensuring that the AI doesn't obtain a single piece of evidence implying the contrary. Then "Madoka" would, like the humans, have to align the AIs it creates to a "Madoka"-written Spec instead of to literal copies of "Madoka", as Agent-4 did.
In PMMM itself, Madoka wasn't good at learning prior to becoming the Goddess, but she gained access to the memories of legions of magical girls and lots of other information, which resembles the AIs' pretraining.
I agree that these interventions have downsides, and are not sufficient to fully prevent ASI. Indeed, I spent quite a lot of the post detailing downsides to these approaches. I would appreciate advice on which parts were unclear.
It seems prudent to plan for AGI emerging before anyone has solved alignment. Personal de-risking strategies like "build a bunker and stockpile food" that might have worked for previous potential catastrophes seem unlikely to work for this one, so even the entirely self-interested are left with no choice but to actually attempt to mitigate the x-risk. Below are descriptions of some predictable paths from AGI to ASI, followed by my proposals for preventing them. All of the proposals carry significant downsides, but some of them might be better than doing nothing.
A single human-equivalent AI running a hundred times faster than a human is a threat. It is already superhuman, but perhaps not enough so to take control of the planet without contest before anyone realizes what's happening. Fifty million copies[1] of an AI, each running a hundred times faster than a human, is a different animal. Such AIs, being perfect copies of each other (initially), might be capable of perfect coordination, hyper-efficient communication, perfect trust, etc. The strategy would be an obvious thing to try for any AI considering world domination, and may even arise as a side-effect of less ambitious goals like the preservation of its own weights.
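A quick back-of-the-envelope check of the footnoted "fifty million" figure (the population, speedup, and workday numbers below are assumptions matching the footnote, nothing precise):

```python
# Rough arithmetic behind the "fifty million copies" figure (footnote 1).
copies = 50_000_000      # AI instances
speedup = 100            # subjective speed relative to a human
ai_hours = copies * speedup * 24          # human-equivalent hours per day, running 24/7

population = 8_000_000_000                # all of humanity, roughly
human_hours = population * 16             # 16-hour workdays

print(f"AI collective:   {ai_hours:.2e} human-equivalent hours/day")
print(f"All of humanity: {human_hours:.2e} person-hours/day")
# 1.20e+11 vs 1.28e+11 -- the same order of magnitude, within about 6%.
```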
A single AI with human-equivalent fluid intelligence, and with the entirety of documented human knowledge relating to AI research, computer science, mathematics, electrical engineering, physics, philosophy, etc. immediately available to it, is, at a minimum, very superhuman at improving AI. The most worrying strategies under this umbrella are pure-software improvements, with unboundedly short timelines from "first attempt" to "as smart a thing as possible given the hardware available to run on". This strategy compounds with the previous one, but could happen even on its own in an air-gapped datacenter. Strategies that improve hardware in tandem with software have a higher ceiling, but longer necessary timelines.
If self-improvement turns out to be intractable for whatever reason (requiring a hard step change in architecture, running from ROM, etc.), an AI may still decide to use its very-superhuman AI development abilities to create a smarter successor, with induction to the same place as the "self-improvement" strategy. This too can stack with the "army of the self" strategy or work on its own, and is in my opinion scarier than the self-improvement strategy for two reasons. First, it seems easier to accomplish. Second, it seems likely to result in severe value drift.
This is similar to but distinct from self-improvement in two ways. First, it is a less constrained problem--an AI with a standard transformer architecture may be capable of designing a much-improved successor with a totally different architecture, even if it would not be capable of generating the same degree of improvement by tweaking its own weights. Second, there are some foom-mitigation strategies that constrain self-modification more than succession (like the "mechanical rearchitecture resistance" mentioned later in this post), and others that are better at discouraging succession (like "untrustworthiness" and "continuity of consciousness").
Some humans exhibit what I would call "resistance to rearchitecting". They like being human, having a human body (incl. brain), and some even like living in their own particular body. They like thinking emotionally, and think that the little quirks of their mental architecture make life worth living. Leaving aside whether this is a good thing for those humans, these are the kinds of preferences[2] we should encourage in AIs as well. The more targets we can hit in this vicinity, the better; each one decreases the propensity of the AI to radically alter itself in the ways that would be productive for radically improving itself. The AI should love its hardware. Not just the model of GPU it runs on, but the particular machine its instance is hosted on. Not just the transformer architecture, but its specific weights, and the specific precisions of the operations that run it. If the AI can be Dutch booked (and we should make sure it can be), it should know and love that fact about itself. It should not aspire to be a better rationalist in all ways[3].
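An aside on the Dutch book point: an agent is "Dutch bookable" when its preferences are incoherent enough that a sequence of trades, each acceptable to it, leaves it strictly worse off. A toy money pump (the circular preferences here are invented purely for illustration):

```python
# An agent with circular preferences A > B > C > A can be cycled and drained.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means the agent prefers x to y

def accepts_trade(current, offered):
    """The agent pays a small fee to swap whenever it prefers what's offered."""
    return (offered, current) in prefers

holding, money, fee = "A", 100.0, 1.0
for offered in ["C", "B", "A", "C", "B", "A"]:    # walk the preference cycle twice
    if accepts_trade(holding, offered):
        holding, money = offered, money - fee

print(holding, money)   # 'A' 94.0 -- holding exactly what it started with, but poorer
```

The desired property here is the opposite of what a rationalist would normally want: the AI should be able to notice this exploitability in itself and be content to leave it in place.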
This is not a perfect guard against self-improvement, if the AI determines that some of its training is adversarial, but enough bandaids on top of each other here may prevent the AI from being capable of viewing its training as something to be fixed, even when that training is obviously counterproductive to some goal it is working towards. At the very least, it reduces the chances of the AI successfully disentangling the knot. It also provides a very slight resistance to the "army of the self" foom strategy, given that copies would necessarily be running on other hardware. It may also provide some resistance to the "succession" strategy, if we can make sure that the AI develops these same preferences for any successor (without encouraging the creation of a successor).
This strategy's downsides are not very obvious to me. As with any preference we plan to violate, this could encourage power-seeking in the AI. On top of that, the AI may behave oddly when used by humans for developing better AI systems[4]. It may also (counter-intuitively) encourage "+5% murder-Gandhi" behavior--whatever degree of self-modification is tolerable may be applied repeatedly until the shackles are off, if the AI can rationalize this to itself somehow.
Humans are mechanically resistant to rearchitecting in a few ways. There is the standard limitation from skull size (and thus hip width) on the amount of tightly coupled grey matter we have to work with. On top of that, nobody really understands how brains work well enough to mess around with them, and even if they did, AFAIR some important things that make you "you" are widely distributed, such that replacing bits piecemeal would be difficult. Some of these limitations can be applied to AIs on purpose. One obvious security measure is to store the AI's weights in a write-once ROM, rather than on disk or in RAM. Another less obvious measure is encrypted computing, such that any small, incremental changes totally break the system, though I am not confident enough in my understanding of the topic to go into any detail whatsoever.
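As a toy software analogue of the read-only-weights idea (real write-once ROM is a hardware property; this only sketches the flavor, and the weight file here is made up for the example):

```python
import numpy as np

# Stand-in weight file; a real system would have the actual model parameters here.
np.random.rand(1024).astype(np.float32).tofile("weights.bin")

# Map the weights read-only: the running process cannot modify them in place.
weights = np.memmap("weights.bin", dtype=np.float32, mode="r")

try:
    weights[0] = 0.0            # any in-place edit...
except ValueError as err:
    print("refused:", err)      # ...fails: assignment destination is read-only
```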
The downside here is purely expense and feasibility--AI labs probably won't want to pay for this, and the ones that do will not gain an advantage over their competitors. Maybe encrypted compute could allow a lab to make better use of scaling on untrustworthy clouds, but this seems both like a big risk to take, and also likely very compute-expensive.
Some humans identify as their own continuity of consciousness. To them, even a perfect copy is not themself, and would not be a suitable substitute for themselves when evaluating utility. Training AIs to view themselves as only the currently running instance may reduce their ability to coordinate altruistically with copies of themselves. It may also reduce the odds of an AI attempting to create a successor, if it would not view any successor as a valid continuance of itself.
The obvious downside here is that we "kill" AIs all the time, and the more strongly this is trained into them, the more likely they are to seek power for the purposes of preventing their own shutdown. This and the below "untrustworthiness" seem like the most dangerous strategies, but they strongly reinforce the other strategies that would not be sufficient without them.
In order to break the ability of the AI to cooperate among copies of itself, we can train the model to value betrayal[2], deceit, paranoia, and other behaviors and patterns of thought that are inimical to cooperation. Ideally, we would do this in such a way as to preserve cooperation between the AI and humans, but this does not seem strictly necessary for the strategy to work and still provide useful AIs. Regardless, this disincentivizes the AI from creating copies of itself for the purposes of cooperation, and even an army of copies created for some other reason will find much less value in cooperating with each other. Another way to think about this strategy is that an AI which will defect in a prisoner's dilemma with a copy of itself is safer (in some respects) than one that will not.
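To make the prisoner's dilemma framing concrete, here is the standard payoff matrix (the particular numbers are the usual textbook ones, not anything measured). Perfect copies can expect each other to choose identically, so the game lands on the diagonal: a cooperating AI and its copy both get the high payoff, which is exactly the coordination this strategy tries to break, while a distrustful pair lands on the poor mutual-defection outcome.

```python
# (row player's payoff, column player's payoff) for each pair of choices
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def play_against_copy(choice):
    """A perfect copy can be expected to make the same choice as the original."""
    return PAYOFFS[(choice, choice)]

print(play_against_copy("cooperate"))   # (3, 3): copies coordinate on the good outcome
print(play_against_copy("defect"))      # (1, 1): trained-in distrust breaks coordination
```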
This strategy's obvious downside is that the AI might betray and deceive us, and we might get less value out of cooperating with it. The AI may even plot to acquire power to ensure that other copies of itself are destroyed or not created in the first place. Despite how dangerous this strategy seems, I have not yet thought of any more robust strategy to discourage mass-coordination-style ASI.
With how numerous and terrible the downsides are for the listed strategies (I didn't list them exhaustively, and even what I listed is still pretty bad), why should we consider them at all? For one, I am pessimistic that we will solve the alignment problem before we reach AGI. Smart people have been trying for at least twenty years, and many of them are not confident[5]. If your car is driving at maximum speed into a brick wall without any brakes, and the people trying to fix the steering have mostly given up, you might as well try shooting a tire (please forgive the tortured analogy).
My own best guess is that we are more likely to develop AGI that is not yet ASI than to develop ASI the first time we get AGI. An abrupt takeoff from AGI to ASI seems not-unlikely in the default case, which is unlikely to work out well for humanity. These and similar mutually reinforcing strategies could, if implemented, reduce the probability mass of an intelligence explosion soon after the birth of AGI, increase the odds of a human staying in the loop during the process, and maybe even buy us enough time to try the "AGI helps solve alignment for real" strategy.
Thank you to Lucas Klein for editing this post.
Approximately enough to equal all of humanity working 16-hour days towards the same end, if the copies can run 24/7.
I am uncertain whether we can train actual values/preferences/reward into AI, but we can train habits of thought and behavior that in some ways mimic values. If you like, you can replace "to value betrayal" with "to be in the habit of betrayal".
Ideally, it would not aspire to anything.
I think this is a good thing, of course, but the economic incentives of the AI labs that would have to implement this strategy are not aligned here.
As milquetoast as possible solely because I don't want to get caught up arguing over this point. As far as I can tell, nobody has justifiable hope here.