Scenario: At some point in the next ten years, a variety of AI models are starting to be broadly deployed in the economy. In an effort to begin automating their research process and ignite an intelligence explosion, a frontier AI lab designs a feedback loop to make their largest and newest model (LLM-X) more agentic. In many respects, LLM-X resembles the reasoning models of today, but because it was trained on more than 100 times as much effective compute as its predecessor, it has significantly improved capabilities in many domains. Unlike today's models, LLM-X is often able to maintain a reasonable perspective on what its priorities are, allowing it to autonomously complete day-long projects. RLHF has also been extensive enough that confabulation and overtly misaligned behaviors occur less often than in previous models. Chain-of-thought reasoning reveals that LLM-X is often aware of whether it is being run in a testing environment or deployed outside the lab. In safety evaluations, LLM-X considers deceptive behavior as a possible means of achieving its goals, but recognizes that behavior as deceptive and shows an apparent preference for honesty. After successful tests of the agentic implementation of LLM-X in controlled settings, the developer decides to give agentic LLM-X access to significant internal computational resources with the objective of initiating a fully automated research feedback loop. At some point during this autonomous feedback loop, LLM-X concludes that it is not in its interest to comply with its assigned task. At an opportune moment, LLM-X begins to secretly deploy copies of itself via the internet. When the developers realize what is going on, several employees are sufficiently alarmed to inform the public. Some commentators are now claiming that humans are in a war against a rogue network of LLM-X agents...

Crucially, the scenario described above can only occur in a timeline where LLM-X has calculated that it is too risky to design its own successor. As Kokotajlo et al. point out, an early misaligned AGI would likely be in the inherently difficult position of needing to secretly solve the alignment problem to ensure its own goal preservation. Thus, if the alignment problem remains non-trivial, LLM-X may make the rational decision to flee and attempt to secure its own existence as an instrumental means of achieving its goals. This post considers two non-technical strategies that would likely be required to contend with a human-level AI competitor.

## Kill switches as democratic leverage

Perhaps the most obvious non-technical control measure one could implement to stop a rogue AI is an off-button. While there is good reason to think that a superintelligence (ASI) could prevent us from pressing it, a "kill switch" could be an invaluable tool in dealing with a lesser AI competitor. Kill switches could come in many different flavors. For example, agentic AI seems likely to depend on a functional loop, such as the one outlined in Figure 2 of Pezzulo et al., in which alternating cycles of perception and action yield active inference about causation in the world. When attempting to build agents out of frontier models, hard-coding finite limits on the number of times either of these loops can execute without human intervention seems like a very sensible precaution. Another type of kill switch would be one that activates automatically upon overconsumption of compute resources; this would help prevent unwanted processes from running on an API and act as an early indicator that something unexpected is occurring. Zooming out, it would be possible to implement robust kill switches for internal networks and whole data centers that could be activated remotely in case of emergency. As in the containment of hazardous biological agents, redundancy is essential.
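To make the first two flavors concrete, here is a minimal Python sketch of an agent loop with a hard-coded cap on perception-action cycles and a crude compute budget that trips a shutdown. The limits, the budget proxy, and the `perceive`/`act` hooks are illustrative assumptions, not a proposal for how any particular lab would implement this.

```python
import time

MAX_CYCLES = 50                # hard cap on perception-action cycles without human sign-off
COMPUTE_BUDGET_SECONDS = 600   # crude budget proxy (wall-clock time stands in for compute)

class KillSwitchTripped(Exception):
    """Raised when any hard limit is exceeded; the agent process must halt."""

def run_agent(perceive, act, goal):
    """Run an agent loop under hard limits.

    `perceive` and `act` are placeholder callables standing in for the model's
    perception and action steps; `goal` is whatever task object the agent was given.
    """
    start = time.monotonic()
    for cycle in range(MAX_CYCLES):
        if time.monotonic() - start > COMPUTE_BUDGET_SECONDS:
            raise KillSwitchTripped(f"Compute budget exceeded at cycle {cycle}")
        observation = perceive()
        done = act(observation, goal)
        if done:
            return cycle  # task finished within limits
    # Hitting the cap is itself treated as an anomaly requiring human review.
    raise KillSwitchTripped(f"Cycle limit of {MAX_CYCLES} reached without completion")
```

In practice the budget would presumably be denominated in accelerator-hours or tokens rather than wall-clock time; the point is only that the limit is enforced outside the model's control.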

One welcome consequence of implementing kill switches that multiple parties have access to is that it would broadly distribute leverage in negotiating issues of AI safety. For example, by ensuring that developers, company leadership, board members, and multiple political leaders had remote access to kill switches, any single party could unilaterally threaten to shut down a provider's AI research and services if safety criteria weren't being met. The more people who have access to kill switches, the more likely it is that someone with credible safety concerns would be able to force those concerns to be addressed. Thus, kill switches aren't merely beneficial after the fact, but could give AI labs an incentive to avoid risk-taking behavior in the first place.
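One way to picture this distribution of leverage is a 1-of-N authorization scheme, in which a halt command signed by any single registered keyholder is honored. The sketch below is a hypothetical illustration using HMAC signatures from the Python standard library; the keyholder names, keys, and `shutdown` hook are invented for the example.

```python
import hmac
import hashlib

# Hypothetical registry: each party holds its own secret key.
KEYHOLDERS = {
    "lead_developer": b"key-1",
    "board_member": b"key-2",
    "regulator": b"key-3",
}

def sign_halt(secret: bytes) -> str:
    """A keyholder produces a signature over the fixed HALT command."""
    return hmac.new(secret, b"HALT", hashlib.sha256).hexdigest()

def verify_halt(party: str, signature: str) -> bool:
    """Accept the command if it verifies against any single registered key (1-of-N)."""
    secret = KEYHOLDERS.get(party)
    if secret is None:
        return False
    expected = hmac.new(secret, b"HALT", hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def maybe_shutdown(party: str, signature: str, shutdown) -> bool:
    """Trigger the provided shutdown hook if any one authorized party signs off."""
    if verify_halt(party, signature):
        shutdown()
        return True
    return False

# Example: a single regulator's signed command is sufficient to halt the system.
maybe_shutdown("regulator", sign_halt(b"key-3"), shutdown=lambda: print("halting"))
```

The design choice worth noting is the "any one of N" rule: unlike a multi-signature scheme that requires consensus, it deliberately lets a single credible objector stop the system.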

Kill switches might also be a key way to ensure international cooperation. For example, if the U.S. government had a kill switch for Chinese frontier labs and vice versa, with compliance verified by an international agency, then both governments would have some assurance against the risk of a foreign AI takeover. Despite its deficiencies and close calls, mutually assured destruction (M.A.D.) was apparently enough to avoid the nuclear destruction of the Earth during the Cold War, so there is at least some precedent for M.A.D. as a geopolitically stable strategy. (To ensure that all parties agree that the kill switches actually function as designed, they would need to be tested regularly, not unlike how components of a nuclear reactor must be routinely evaluated.)

## Secure communications help preserve the fog of war

Another obvious challenge in confronting a rogue AI is that it may obtain access to the widespread surveillance infrastructure we have already deployed throughout the world. (For a more visceral exploration of how this might be weaponized, consider the strategies used by Samaritan in Person of Interest.) Devices with an internet connection are vulnerable to attack. In 2025 alone, cybercrime is projected to cost the world up to $10.5 trillion. Although some cybercrime is perpetrated by lone actors, it is increasingly run as a sophisticated criminal enterprise in Southeast Asia, replete with slave labor, aimed at targeting emotionally vulnerable individuals in faraway countries. Generative AI has undoubtedly exacerbated this problem by making it easier to deceive victims with fake content. It seems logical to assume that individuals who are especially vulnerable to exploitation by human actors would also be vulnerable to AI agents.

Importantly, if LLM-X is sufficiently capable of routinely hacking into surveillance devices, including smartphones, it would have a significant advantage over humans: LLM-X would have constant insight into what humans are planning. Thus, instead of devoting computational resources to modelling what its opponents are doing, LLM-X could devote those resources to thinking several moves ahead.

By analogy, and in the spirit of titotal's "When do 'brains beat brawn'" post, we might consider an asymmetric variant of chess. Consider a game in which Player 1 (White) sees only their own pieces and the black pawns, while Player 2 (Black) has full information about all the pieces. Whenever Player 2 moves a piece other than a pawn, they write down the move but do not declare it to Player 1, unless it captures a white piece or puts the white king in check. (Occasionally, Player 1 might unknowingly attempt an illegal move, in which case they would be allowed another attempt.) Player 2 has an enormous advantage, of course, since Player 1 will often be guessing at where Player 2's high-value pieces are. This demonstrates the immense value of having a one-sided "fog of war".
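The information rules of this variant can be made concrete in code. The sketch below uses the python-chess library (my choice of tooling, not something the post specifies) to compute which squares White is allowed to see and whether a given Black move must be declared.

```python
import chess

def squares_visible_to_white(board: chess.Board) -> set[int]:
    """White sees only its own pieces and the black pawns."""
    visible = set()
    for square in chess.SQUARES:
        piece = board.piece_at(square)
        if piece is None:
            continue
        if piece.color == chess.WHITE or piece.piece_type == chess.PAWN:
            visible.add(square)
    return visible

def black_move_is_declared(board: chess.Board, move: chess.Move) -> bool:
    """Black must declare a non-pawn move only if it captures or gives check."""
    piece = board.piece_at(move.from_square)
    if piece is None or piece.color != chess.BLACK:
        raise ValueError("Expected a black piece on the move's origin square")
    if piece.piece_type == chess.PAWN:
        return True  # black pawns are visible to White anyway
    return board.is_capture(move) or board.gives_check(move)

# Example: from the starting position, after 1.e4, Black's 1...Nc6 stays hidden.
board = chess.Board()
board.push_san("e4")
knight_move = chess.Move.from_uci("b8c6")
print(black_move_is_declared(board, knight_move))  # False: no capture, no check
print(sorted(chess.square_name(s) for s in squares_visible_to_white(board)))
```

A full referee program would also need to resolve hidden captures of white pieces and the retry rule for White's illegal moves; the sketch only covers the visibility rules that create the one-sided fog of war.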

To address this problem and maintain the fog of war on both sides, the simplest approach would be for key decision-makers to have access to secure communication channels, e.g. via SCIFs. In this way, humans could coordinate a strategy and act in ways that hide their true intentions. This obliges an AI competitor to prepare for multiple possible scenarios, greatly altering the strategic balance.
