Idea: Safe Fallback Regulations for Widely Deployed AI Systems

Aaron_Scher

In brief

When told that misaligned artificial intelligence might destroy all of humanity, normal people sometimes react by asking “why can’t we just unplug the misaligned AI?” This intuitively appealing solution is unfortunately not available by default — the simple off switch does not exist. Additionally, society may depend on widely deployed AI systems, and shutting them down would cause large disruptions. In this blog post I discuss “safe fallback”, a framework wherein we shut down dangerous AIs and switch to safer, weaker, AI models temporarily. Ideally, regulators should require safe fallback systems for AI service providers so as to mitigate societal disruption while enabling pulling the plug on dangerous AIs in an emergency.

There is precedent for such regulations on critical infrastructure, e.g., the need for hospitals to have a backup power supply. Safe fallback is not hugely different from other fail-safe mechanisms we have for critical infrastructure. The two main differences seem to be: we may purposefully shut down our AI systems to trigger safe fallback, and it is not immediately obvious what should be fallen back to in the case of AI (a question I largely leave to future research, while suggesting weaker and trustworthy AI systems as an initial answer).

Full text:

Author’s note: I’ve only thought about this for ~15 hours, so treat these ideas as tentative. The meta-status is roughly “it might be good if AI regulators did this,” but I have no particular plans to continue work on this, and I do not know how it stacks up against other AI governance priorities. If it seems worth pursuing, let me know.

Context

The framework of safe fallback targets a future world which does not exist yet; let’s briefly discuss key aspects of that world.

AI systems are widely deployed, including carrying out high stakes tasks. For instance, they may be monitoring credit card fraud, reviewing security camera footage, controlling medication dosage for hospital patients, coordinating food supply chains, controlling nuclear power plants, driving autonomous vehicles, and much more.^[1] For the most part, these AIs are running on servers controlled and monitored by humans, e.g., via an API, rather than being autonomous AI agents on random servers.

Shutting down these systems would be very costly, potentially millions of dollars or human lives lost.^[2] However, in some cases, shutting these systems down reduces the risk of catastrophic harm. For instance, these AIs could be used to aid in AI self-exfiltration or with a cyberattack.

AI developers and the market may fail to properly price in the catastrophic failures that could come from not shutting down AI systems, e.g., because large scale catastrophe is externalized across society whereas profits from standard operation are relatively internalized to AI developers and direct customers. The same logic applies at the level of international cooperation.

Argument

Given that backdrop, the argument for having safe fallback requirements is:

We should make it easier for AI developers and governments to shut down AI systems when there is elevated risk of extreme harm.
We can do this by having AI developers pay the upfront cost of creating and testing safe fallback systems. This is similar to how other critical societal infrastructure has backup systems.
To ensure these requirements are effective without being overly burdensome, regulators and researchers should build a comprehensive subfield of socio-technical AI safety focused on studying fallback mechanisms. I discuss the types of questions this field needs to address below.

What sort of threats might trigger fallback

Misaligned AIs (running on controlled servers) coordinating to take over or self-exfiltrate
A jailbreak or backdoor attack that is broadly infectious and causes major failures
A non-malicious bug that significantly hurts performance of a particular model
AI systems being used by malicious actors to cause harm in a way that is difficult to localize and target (e.g., a human-orchestrated cyberattack utilizing AI systems; or weaker autonomous AIs using stronger, API-based AIs to attempt an AI takeover, such as an autonomous scaffolded Llama 2 making calls to the GPT-4 API)

What safe fallback might look like

Safe fallback applies to a particular AI system in a particular use case: it means having a plan for how to quickly take that AI system offline without significantly damaging the downstream task being accomplished. This likely requires switching to another, weaker but more reliable, AI system, or switching to non-AI systems. In some cases, this could be similar to a standard software reversion, undoing the most recent update. In other cases, this could require complex changes throughout a system.

Let’s take the following example case. GPT-5 is being used to monitor for cybersecurity threats to company X via calls to the OpenAI API. OpenAI receives credible reports that instances of GPT-5 are colluding to launch a cyberattack and likely cannot be trusted to assist with cyberdefense. OpenAI issues a fallback notice to customers that are using GPT-5 for cyberdefense, including company X, and immediately routes all of these GPT-5 calls to GPT-4, changing prompt formatting to fit the old model’s best-performance settings for this task. Both OpenAI and company X take additional precautions given the heightened risk, such as OpenAI lowering an internal threshold for jailbreak-attempt detection and company X having on-call human security experts provide additional monitoring on top of GPT-4. Depending on the threat, OpenAI might route other non-cybersecurity-related GPT-5 calls to GPT-4. When the threat has been stopped and analyzed and fixes have been made, GPT-5 instances are deployed again. These response measures were determined and tested in advance.
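A minimal sketch of the routing step in this example, under stated assumptions: the `FallbackRouter` class, the model labels, and the task tags are all hypothetical names invented for illustration, and nothing here reflects how any real provider's API works. It only shows the shape of the idea: a fallback notice for a task category reroutes those calls to the weaker model and applies that model's pre-tested prompt format, while other traffic is untouched unless a global fallback is triggered.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass
class FallbackRouter:
    """Hypothetical router that picks a model per request (illustrative only)."""

    primary: str   # label of the stronger, default model
    fallback: str  # label of the weaker, more trusted model
    # Per-model prompt formatters, assumed to be prepared and tested in advance.
    formatters: Dict[str, Callable[[str], str]]
    # Task categories currently under a fallback notice.
    fallback_tasks: Set[str] = field(default_factory=set)
    # When True, every task is routed to the fallback model.
    fallback_all: bool = False

    def trigger(self, task: str) -> None:
        """Issue a fallback notice for one task category."""
        self.fallback_tasks.add(task)

    def trigger_all(self) -> None:
        """Route every task category to the fallback model."""
        self.fallback_all = True

    def restore(self, task: Optional[str] = None) -> None:
        """Lift the notice for one task, or lift all notices if task is None."""
        if task is None:
            self.fallback_tasks.clear()
            self.fallback_all = False
        else:
            self.fallback_tasks.discard(task)

    def route(self, task: str, prompt: str) -> Tuple[str, str]:
        """Return (model label, prompt formatted for that model)."""
        use_fallback = self.fallback_all or task in self.fallback_tasks
        model = self.fallback if use_fallback else self.primary
        return model, self.formatters[model](prompt)


# Usage: only cyberdefense traffic is rerouted after the notice.
router = FallbackRouter(
    primary="strong-model",
    fallback="weak-model",
    formatters={
        "strong-model": lambda p: p,
        "weak-model": lambda p: "[legacy format] " + p,
    },
)
router.trigger("cyberdefense")
print(router.route("cyberdefense", "scan these logs"))
print(router.route("tutoring", "explain fractions"))
```

The sketch leaves out everything that makes the real problem hard, such as capacity for the fallback model, customer notification, and a router that a misaligned system cannot tamper with.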

Study of fallback mechanisms: outline of the research agenda

High level questions

How scalable is this approach? What does safe fallback look like if we are 5 years into potentially-misaligned AI systems, such that switching to definitely-not-misaligned AI systems would mean reverting many generations and massively hurting performance? Can we build better fallback options, like narrow AI systems or non-AI solutions?
Which types of AI systems are at risk? There will be many types of AI systems in the future, and they will vary in the risks they pose. For instance, airplane autopilot AIs may be very different from AIs managing a warehouse. Smart regulations should be targeted and specific to the AIs being used in particular cases.
- A very successful research field here could include specializing AIs to specific use cases in a way that makes them worse at causing catastrophic failures (e.g., generally intelligent AIs are not allowed in any critical infrastructure, only narrow systems which are provably-not-misaligned are).
Will this intervention mechanism be useful for reducing catastrophic misalignment risks?
- Will we have widespread AI deployment at the critical times for AI misalignment risk? Alternatively, we could see AI capabilities greatly outpace deployment such that widespread deployment only happens after misalignment risk has been mitigated. (related blog post forthcoming)
- How does Safe Fallback affect misaligned AIs running autonomously on non-controlled servers?
  - This proposal mainly affects AIs deployed and contained via human-controlled services. At some point, I expect we will be in a post-containment regime where there are powerful models running autonomously and not on explicitly controlled servers — this will happen either because these models self-exfiltrate, or because people set up AIs to run autonomously. Safe Fallback is mainly useful in the containment regime, e.g., to reduce self-exfiltration risk, but it could also be useful post-containment to limit malicious AIs’ access to more powerful models. How likely are we to be in the post-containment regime quickly and by default? E.g., if open source model capabilities continue to grow quickly, the entire containment paradigm could end up being approximately useless (absent strong success in compute governance).
  - Safe Fallback could make it easier to shut down entire data centers or take otherwise drastic actions to limit autonomous AI agents, because critical processes can fall back without significant harm, limiting collateral damage.
  - Autonomous misaligned AIs could be propagating themselves through API-based AIs, e.g., via jailbreaks. How likely is this particular threat, as opposed to just misaligned autonomous AI agents running on uncontrolled servers and delegating subtasks to API-based models?
How costly will these measures be to implement? Would they significantly slow down cautious and law-abiding actors? What proportion of the relevant actors/models would be subject to these regulations?

Implementation details

How should fallback systems relate to normal deployment systems? Is it necessary for fallback systems to be air gapped from normal deployment systems to avoid certain threats? Can we build fallback AI models in secret, such that misaligned AIs are unlikely to understand how the fallback procedures and systems will work?
How can we design safe fallback mechanisms that are successful even if highly capable AI systems are attempting to not be turned off (e.g., as an instrumental goal)? Can we empirically evaluate the dangerous capability of shutdown resistance with regard to the safe fallback mechanisms we have?
How likely is it that weaker systems will be susceptible to particular problems affecting the strong systems? That is, if you fall back to prevent use of a jailbreak, will the weaker models still be susceptible?
Categorization of services and what shutdown/fallback looks like for them
- What sort of damages would arise from shutting down various services for how long? What is the performance hit for switching to weaker systems?
- To what extent is partial shutdown possible? Which threats will require which services to fall back?
- What is the maximum permissible downtime for different services? Downtime of seconds could be catastrophic in some situations (operating delicate machinery or self-driving cars) and trivial in others. What fraction of services need to be immediately replaceable (and thus likely have idle GPUs storing fallback models), and in what order should services be switched to backup models?
- How will we know if a threat has been neutralized and the necessary fixes made, such that the most capable models can be deployed again?
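As a toy illustration of the downtime and ordering questions above, here is a small sketch. It assumes each service can be summarized by a single maximum permissible downtime, which is a strong simplification; the `Service` type, the function names, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Service:
    name: str
    max_downtime_s: float  # hypothetical maximum permissible downtime, in seconds


def switch_order(services: List[Service]) -> List[Service]:
    """Order services so the least downtime-tolerant are switched first."""
    return sorted(services, key=lambda s: s.max_downtime_s)


def needs_warm_standby(service: Service, switch_latency_s: float) -> bool:
    """True if a cold switch would exceed the service's permissible downtime,
    so a fallback model would need to be loaded and idle in advance."""
    return service.max_downtime_s < switch_latency_s


# Usage with made-up numbers.
services = [
    Service("fraud monitoring", 3600.0),
    Service("vehicle control", 0.1),
    Service("tutoring", 86400.0),
]
for s in switch_order(services):
    print(s.name, needs_warm_standby(s, switch_latency_s=30.0))
```

A real categorization would need more than one number per service, for example the harm per unit of downtime and the performance hit from the weaker model, but even this crude version makes the cost of idle standby capacity explicit.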

Notes and other things

Improving society’s ability to slow or pause AI development is also good outside moments of acute risk. Broadly available safe fallback measures might enable coordination between AI labs. In particular, AI developers may want to possess remote shutdown ability for each other in order to disincentivize dangerous racing and allow a project with sufficient lead time to solve safety problems unencumbered. I expect that having safe fallback as a widely adopted standard would make developers more likely to use such tools, and this might improve the situation. I do not know if this would actually be desirable, or what the state of thinking about this is.

This blog post was partially motivated by the incident a few weeks ago where GPT-4 started producing nonsense, and it took 7 hours for OpenAI to fix the problem. That’s fine when you’re in start-up land, but it’s completely unacceptable if critical services are relying on your technology. This was a bit of a jolt in terms of “wow the AI developers totally won’t do the obviously good thing by default.”

I don’t know if this idea is worth somebody championing or who that person would be. If somebody with more policy expertise thinks it would be good for me to pursue this further, and/or is willing to hire me to do so, please let me know! By default I will not work more on this.

^[1] AI systems will also be widely deployed in many ways for which temporarily shutting them down does not risk major harm, such as students using ChatGPT as a tutor.
^[2] The downstream effects could also be large due to the opportunity cost of delayed technological development.