tl;dr - Cryptographic containment of AI (using techniques like Fully Homomorphic Encryption to mathematically guarantee a model's behaviour before decrypting it) is theoretically elegant. But the computational overhead (1,000x to 1,000,000x relative to standard operations) makes it practically infeasible, likely for the foreseeable future. Even setting aside the cost, the economic incentives are against it and successful implementation would concentrate dangerous amounts of power. This post estimates the magnitude of that alignment tax and argues that, even accounting for hardware breakthroughs, cryptographic boxing is unlikely to arrive in time to matter.
Introduction
Most prior discussion of cryptographic AI containment has focused on the in-principle question: could it solve alignment?[1] Christiano's 2010 proposal showed that FHE-based boxing could, in principle, allow us to decrypt an AI only once it has passed safety evaluations. Others have argued that AI boxing may become necessary if alignment research continues to lag behind capabilities, and that social engineering would likely be the primary failure mode in practice.[2]
What has received less attention is the magnitude of the alignment tax - the concrete computational cost of cryptographic containment - and what that cost implies for feasibility. That gap is what this post attempts to fill.
My central claim:
The computational overhead inherent to the cryptographic techniques required to contain a powerful AI is so enormous as to make the implementation infeasible, even accounting for plausible breakthroughs in hardware.
I develop this argument in four steps:
I define cryptographic boxing clearly and show what it would look like in practice.
I estimate the alignment tax and argue it is prohibitively large.
I examine why the economic incentives of AI development work against containment even in more optimistic scenarios.
I consider what success would actually mean, and find that it introduces its own serious risks through centralisation of power. I also describe how Secure Multi-Party Computation could, in principle, address that centralisation problem, while making the underlying computational infeasibility worse.
1. What Cryptographic AI Boxing Actually Means
AI Boxing refers broadly to limiting the information channels available to an AI - restricting what it can communicate to humans, and under what conditions. For a sufficiently advanced system, including any that plausibly reaches or exceeds human-level intelligence, this must extend to all possible leakage channels, even side channels like electromagnetic emissions or thermal patterns.
The cryptographic version of this idea makes containment mathematically enforceable. The most natural approach would use Fully Homomorphic Encryption (FHE), which allows arbitrary computation on encrypted data. The idea is to encrypt the entire AI development pipeline - training, alignment, and evaluation - such that the model is only ever decrypted if it satisfies predefined safety criteria. If it fails, it is deleted and we receive only a confirmation of failure.
Basic FHE boxing pipeline:
Encrypt the training, alignment, and evaluation code under FHE.
Run the entire pipeline homomorphically, producing an encrypted model and an encrypted evaluation verdict.
Decrypt the model only if the verdict shows the safety criteria were met; otherwise delete it, learning nothing beyond the fact of failure.
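This conditional-decryption logic can be sketched as a toy simulation. Plain Python stands in for the homomorphic steps here - every function name is illustrative, not a real FHE API - but the control flow is the point: the verdict gates any decryption.

```python
# Toy simulation of the FHE boxing pipeline. In a real scheme, training and
# evaluation would run entirely on ciphertexts, and "decrypt" would require a
# secret key held outside the box.

def train_encrypted(config):
    """Stand-in for training a model under FHE (illustrative, not real crypto)."""
    return {"weights": [0.1, 0.2], "passes_safety_eval": config["safe_seed"]}

def evaluate_encrypted(model):
    """Stand-in for the encrypted safety evaluation."""
    return model["passes_safety_eval"]

def boxed_pipeline(config):
    model = train_encrypted(config)
    if evaluate_encrypted(model):      # verdict is checked before any decryption
        return ("decrypt", model)      # the model is released only on a pass
    return ("delete", None)            # on failure, only this one bit leaks out

assert boxed_pipeline({"safe_seed": True})[0] == "decrypt"
assert boxed_pipeline({"safe_seed": False}) == ("delete", None)
```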
A more ambitious variant, inspired by Christiano's original proposal, requires the AI to generate a verified aligned copy of itself before being deleted. This could in principle be combined with the GS AI agenda, where a formal safety specification and world model are verified before decryption is permitted. The full sequence: train under encryption, have the AI produce a candidate aligned successor, verify that successor against the formal safety specification, and decrypt only the verified successor while the original is deleted.
In principle, this would bring us close to a mathematical guarantee of solving the alignment problem safely, assuming the evaluator is capable of assessing alignment correctly. But the obstacle here is not conceptual; it is computational.
2. The Alignment Tax Is Enormous
FHE is the cryptographic primitive that makes all of this possible, and it is extraordinarily expensive. It has not yet reached commercial viability[3] for any application, let alone for something as computationally intensive as training a frontier AI model.
The computational overhead of FHE relative to plaintext operations ranges from 1,000x to 1,000,000x. To put this in concrete terms: multiplying two encrypted 32-bit integers using state-of-the-art FHE currently takes around 9 seconds, compared to nanoseconds for the unencrypted equivalent. The time complexity of FHE can be approximated as O(λ⁴ · 2^L), where λ is the security parameter and L is the circuit depth - in the AI training context, the number of sequential operations required.[4]
The depth of modern transformer training is staggering. Expecting FHE to be applied to one of the most computationally intensive endeavours humanity has undertaken is, to say the least, unrealistic.
What About Hardware Acceleration?
The natural response is: can't we build ASICs or other specialised hardware to bring that overhead down?
Hardware acceleration helps when the bottleneck is processor inefficiency at the same workload. FHE's overhead is different in character: encryption inherently increases the amount of data and the number of operations required, typically by two to three orders of magnitude. This is a fundamental property of the scheme, not a property of the hardware executing it.
That said, specialised hardware could still meaningfully reduce the overhead. Under a highly optimistic scenario - the kind that requires multiple breakthrough developments in both algorithm design and manufacturing - FHE overhead might eventually reach the 10x–100x range. Even then, several problems remain.
First, the gap between FHE hardware and the hardware used to train frontier AI would still be wide. Training currently runs on TPUs and purpose-built GPU clusters optimised for dense matrix operations - hardware that will continue to improve over time. The efficiency gap between FHE and plaintext AI compute is likely to widen in absolute terms, not narrow.
Second, even under optimistic hardware timelines, the chips would arrive too late relative to the current pace of AI progress. Getting specialised FHE chips to production scale would take years, during which AI capabilities would continue advancing.
In short: hardware acceleration might improve the situation at the margins, but there is no credible scenario in which it reduces a million-fold overhead to something practically manageable within any relevant near-term timeframe.
3. The Economic Incentives Work Against Containment
Even if the computational obstacle were resolved, there is a structural reason to doubt that cryptographic boxing would be adopted.
AI capability development is motivated primarily by the prospect of deployment. Containment that restricts deployment is in direct tension with the investment thesis behind building advanced AI in the first place. The labs most capable of implementing cryptographic boxing are also the ones with the strongest financial disincentives to do so.
More concretely: the cost of training state-of-the-art AI is measured in tens or hundreds of millions of dollars. The dollar-to-FLOP ratio is one of the most closely watched metrics in the field. An alignment tax of 10x the training cost (which would require heroic breakthroughs to achieve from today's baseline) would mean that a lab would need to allocate ten times its current compute budget just to cover the overhead of boxing. That is not a plausible ask for any organisation operating under real-world competitive pressures.
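A back-of-envelope version of this arithmetic (the dollar figure and overhead factors are assumptions for illustration, chosen to be consistent with the ranges quoted above):

```python
# Rough alignment-tax arithmetic. All figures are illustrative assumptions.
training_cost = 100e6        # assumed frontier training run: $100M
optimistic_overhead = 10     # FHE overhead after heroic hardware breakthroughs
current_overhead = 1e6       # upper end of today's FHE overhead range

print(f"boxed cost, optimistic: ${training_cost * optimistic_overhead:,.0f}")
# -> boxed cost, optimistic: $1,000,000,000
print(f"boxed cost, today:      ${training_cost * current_overhead:,.0f}")
# -> boxed cost, today:      $100,000,000,000,000
```

Even the optimistic case turns a $100M run into a $1B one; today's overhead puts it beyond any plausible budget.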
This is not an argument that safety doesn't matter to AI labs. It is an argument that the form of safety offered by cryptographic boxing imposes costs so large that no voluntary adoption is likely, and that mandating it would require a level of regulatory coordination that does not currently exist.
4. If It Worked, What Would Success Look Like?
Suppose we set aside the computational and economic obstacles entirely. What would a world that successfully implemented cryptographic AI boxing actually look like?
The answer is: a world with a significant and dangerous concentration of power.
Cryptographic containment, by design, constrains who can access or deploy the model. The more effective the containment, the more centralised the remaining access becomes. Whoever controls the decryption keys and the evaluation infrastructure controls whether and when a superintelligent system is released into the world. The risks associated with that concentration may be comparable to the risks of a misaligned AI - not because the people in control are necessarily bad actors, but because concentrating that much leverage in any single entity is itself a systemic risk. [5]
A Decentralised Alternative
Modern cryptography does offer tools that could in principle address this. Secure Multi-Party Computation (MPC) allows a computation to be performed across multiple parties such that no single party sees the full computation or its inputs. Threshold cryptography allows decryption only when a minimum number of key-holders m out of n agree to participate.
A decentralised boxing scheme might work roughly as follows:
FHE keys are distributed across n trusted parties (e.g., representatives of different nations or institutions).
The training, alignment, and evaluation codebase is encrypted, with all inputs split across parties.
MPC is used to execute the computation, with no single party observing the full process.
The result is decrypted conditionally on passing safety requirements, requiring agreement from at least m parties.
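The conditional-decryption step above - release only when at least m of n key-holders cooperate - rests on threshold secret sharing. A minimal Shamir m-of-n sketch follows (toy parameters and a toy field; a real deployment would use a vetted threshold-cryptography library, not hand-rolled code):

```python
# Minimal Shamir m-of-n secret sharing: the FHE secret key is reconstructible
# only when at least m key-holders contribute their shares.
import random

P = 2**127 - 1  # a Mersenne prime; all share arithmetic is done mod P

def split(secret, m, n):
    """Split `secret` into n shares, any m of which reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(m - 1)]
    def f(x):  # degree-(m-1) polynomial with the secret as constant term
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the shares provided."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

key = 123456789
shares = split(key, m=3, n=5)
assert reconstruct(shares[:3]) == key   # any 3 of the 5 shares suffice
assert reconstruct(shares[2:]) == key   # a different subset of 3 also works
# Fewer than m shares reveal essentially nothing about the key.
```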
Two configurations are possible: one where the same group of people holds both keys and compute, and another where these roles are split across separate groups (the latter being harder to coordinate but more resistant to single-party control).
This approach would distribute power rather than concentrating it, and in theory could represent a genuine form of global cooperative governance of superintelligent AI. If AGI is coming regardless, having its development distributed across a broad coalition of humanity's representatives seems strictly preferable to having it controlled by a single lab or government.
The problem is that MPC compounds the computational overhead rather than alleviating it. The additional cost of MPC - depending on protocol, number of parties, and network bandwidth - is roughly 100x or more on top of the underlying FHE cost[6]. An already infeasible scheme becomes significantly more infeasible.
Conclusion
The idea of mathematically guaranteeing that a superintelligent AI can only be decrypted after satisfying formal safety criteria is genuinely compelling. If the obstacles were solvable, it would represent one of the most robust safety mechanisms imaginable. The problem is that the primary obstacle - the computational cost of Fully Homomorphic Encryption - is enormous, structural, and unlikely to be overcome within any timeframe relevant to the current pace of AI development.
This post's argument is not that cryptographic boxing is theoretically flawed. It is that the alignment tax associated with it is so large as to make practical implementation effectively impossible under foreseeable conditions, even in generous scenarios where hardware breakthroughs occur. Add to this the economic disincentives of the labs most capable of building such systems, and the centralisation risks that would follow from success, and the picture is bleak.
There may be ways to achieve comparable safety guarantees through cryptographic means that do not require encrypting the entire training process. That seems like a promising direction for future work, and one where I would not be surprised to find approaches I have missed.
Appendix: Computational Complexity of FHE
The time complexity of FHE (including state-of-the-art bootstrapping for noise management) can be estimated as O(λ⁴ · 2^L), where:
λ is the security parameter (typically ~128 bits)
L is the circuit depth (the number of sequential operations in the computation)
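Evaluated numerically, this estimate makes the scaling concrete (constants are omitted, so these are growth-rate comparisons, not wall-clock predictions):

```python
# Illustrative evaluation of the O(λ⁴ · 2^L) estimate for a typical security
# parameter. Each +10 to circuit depth L multiplies the cost by 2^10 = 1024.
lam = 128  # typical security parameter

for L in (10, 20, 30):
    print(f"L={L:>2}: ~{lam**4 * 2**L:.2e} operations")
```

For context, the sequential depth of frontier model training is vastly larger than these toy values of L, which is why the exponential term dominates.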
For a full breakdown of this estimate, see this document. For further reading on MPC overhead, see Rosulek (2021) and Ishai et al. (2010).
[1] Famously argued by Yudkowsky via the AI-box experiment. Also argued in this post.
[2] Argued generally in this post: How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)
[3] A great breakdown of the shortcomings of the applications of FHE can be found in this paper.
[4] This includes SOTA noise management (bootstrapping). The full breakdown of my estimate of the computational complexity of FHE can be found in the table on page 3 of this document.
[5] It is not impossible that some social schemes, like mandatory government oversight or other "anti-centralisation of access to ASI" practices, could be implemented; however, their effectiveness and practicality are highly speculative.
[6] For garbled circuits, the complexity is typically O(|C|·n), while for secret-sharing-based protocols it ranges from O(n²·|C|) to O(n³·|C|) depending on the security model. I recommend this paper, and this paper, if you want to read more.