I tried to model the best possible confinement strategy in Multilevel AI Boxing.
I wrote it a few years ago, and most of its ideas are unlikely to work in the current situation, with many chat instances and open-weight models.
However, the idea of landmines - secret stop words or puzzles which halt an AI - may still hold. It is like jailbreaking in reverse: an unaligned AI finds some secret message which stops it. It could be realized at the hardware level, or through anomalous tokens or "philosophical landmines".
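As a rough illustration of the software-level version of this idea, a "landmine" check could be wrapped around token-by-token generation, as in the minimal sketch below. Everything here is hypothetical (the names guarded_generate, generate_next_token, SECRET_STOP_HASH, and the placeholder hash value); a real landmine would presumably sit below the model, e.g. in hardware or in the serving stack, rather than in easily inspected Python.

```python
# Toy sketch: halt generation if a secret "landmine" stop sequence appears.
# All names and values are illustrative placeholders, not a real mechanism.

import hashlib

# Only a hash of the secret sequence is stored, so the trigger itself is not
# visible to anyone (or anything) reading this code. Placeholder value only.
SECRET_STOP_HASH = "0" * 64
WINDOW = 4  # how many recent tokens to hash together when checking

def tripped(recent_tokens: list[str]) -> bool:
    """Return True if the last WINDOW tokens match the secret landmine."""
    candidate = "".join(recent_tokens[-WINDOW:])
    return hashlib.sha256(candidate.encode()).hexdigest() == SECRET_STOP_HASH

def guarded_generate(generate_next_token, prompt: str, max_tokens: int = 512) -> str:
    """Generate text token by token, halting immediately if the landmine trips."""
    tokens: list[str] = []
    for _ in range(max_tokens):
        tokens.append(generate_next_token(prompt, tokens))
        if tripped(tokens):
            raise SystemExit("Landmine tripped: generation halted.")
    return "".join(tokens)
```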
Very interesting paper. Thanks for sharing! I agree with several of the limitations suggested in the paper, such as the correlation between the number of uses of the oracle AI and catastrophic risk, the analogy of AI to a nuclear power plant (obviously with the former having potentially much worse consequences), and the disincentives for corporations to cooperate with containment safety measures. However, one area I would like to question you on is the potential dangers of superintelligence. It's referred to throughout the paper, but never really explicitly ex...
When discussing AI safety, alignment—ensuring AI systems pursue human-approved goals—is often the primary focus. However, containment, which restricts AI’s ability to exert influence beyond controlled environments, is in my opinion a more intuitive and less complex approach. This post will outline common objections to AI containment, explain why they may be overstated, and invite counterarguments. You can tell me why I am wrong in the comments.
For an AI to pose a risk, it must influence the external world. Strict air-gapping and controlled local communication channels (i.e., ensuring the AI has no access to the internet) can essentially eliminate the risk of the AI leaking out directly. However, one of the strongest objections to containment is that no system can permanently constrain a superintelligent AI, as it will eventually exploit loopholes, manipulate humans, or find an escape route.
Counterarguments:
AI manipulating humans into circumventing containment is a valid concern, but countermeasures can reduce the risk. For instance:
- Training personnel to recognize and resist AI persuasion tactics.
- Implementing shift-based oversight to prevent prolonged exposure to the AI.
- Restricting AI interactions to personnel without the capability to unbox it.
- Structuring AI outputs to be dry and unpersuasive.
- Preventing the AI from addressing certain sensitive topics (see the sketch after this list for how these last two measures might be enforced).
- Screening out personnel with certain risk factors for manipulation.
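To make the last two countermeasures above a little more concrete, here is a minimal sketch of an output filter. Everything in it (the topic list, the regex markers, the length cap, the function name filter_output) is illustrative only; a real containment layer would need something far more robust than keyword and pattern matching.

```python
# Toy sketch of an output filter: block restricted topics and flatten
# persuasive phrasing so the AI's responses stay dry and unpersuasive.
# All lists and thresholds below are illustrative placeholders.

import re

BLOCKED_TOPICS = {"its own containment", "network access", "self-replication"}
PERSUASIVE_MARKERS = [r"\byou must\b", r"\btrust me\b", r"\bonly i can\b", r"\burgent(ly)?\b"]

def filter_output(raw: str) -> str:
    """Return a dry, topic-restricted version of the AI's raw output."""
    lowered = raw.lower()

    # 1. Refuse anything that touches a blocked topic.
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "[response withheld: restricted topic]"

    # 2. Strip emotive / persuasive phrasing.
    text = raw
    for pattern in PERSUASIVE_MARKERS:
        text = re.sub(pattern, "[redacted]", text, flags=re.IGNORECASE)

    # 3. Enforce terseness with a hard length cap.
    return text[:2000]
```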
Another objection is that containment limits AI’s potential. If AI is confined, it cannot fully participate in scientific advancements or real-world decision-making. This argument is less relevant from a safety perspective, but AI companies may be less incentivized to pursue control for this reason, which could make containment a weaker strategy overall.
Counterarguments:
1. AI companies need to be regulated anyway.
At some point, we will probably need to regulate AI safety if we take it seriously. Hoping that countless AI companies will benevolently adopt effective AI safety standards without government action seems to me naive and overly optimistic. We have regulations on nuclear power plants despite the much smaller risks involved (one region being affected versus the entire human race), and there is just about as much of an incentive not to mess things up. While I understand the desire not to rely on governments, ultimately I feel corners will be cut no matter what approach is put forward unless the government gets involved.
2. The Utility of a Contained Oracle AI
Sub-Objection: If AI Is Heavily Contained, Why Build It?
Perfect alignment is an extremely high bar. Properly contained AI, even if only somewhat aligned, should provide revolutionary amounts of utility while minimizing risk.
3. Strategic Considerations
Regulations can ensure AI is not applied in high-risk areas where misalignment could be dangerous.
Alignment is a long-term challenge requiring solutions to both outer alignment (defining human-compatible goals) and inner alignment (ensuring AI actually follows them). These problems could take decades to solve under current levels of investment.
Containment, by contrast, provides a clear, actionable goal: physically and informationally isolating AI systems to prevent unintended influence. Unlike alignment, which requires near-perfection for safety, containment is incremental—even partial containment buys time, whereas partial alignment could be catastrophic.
While alignment remains a crucial research goal, prioritizing containment as an immediate safety strategy—alongside alignment and interpretability research—offers a more pragmatic approach (in my opinion). Where am I going wrong?
Writing Assisted by ChatGPT.
Greenblatt, Ryan, et al. “AI Control: Improving Safety despite Intentional Subversion.” ArXiv.org, 2023, arxiv.org/abs/2312.06942. Accessed 5 Feb. 2025.