Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a brief post arguing that, although "side-channels are inevitable" is pretty good common advice, actually, you can prevent attackers inside a computation from learning about what's outside.

We can prevent a task-specific AI from learning any particular facts about, say, human psychology, virology, or biochemistry—if:

  1. we are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI was built by octopus minds made of organosilicon where atoms obey the Bohr model
  2. we use relatively elementary sandboxing (no clock access, no networking APIs, no randomness, none of these sources of nondeterminism, error-correcting RAM, and that’s about it)

I don't think either of these happens by default and if you are in an AGI lab I suggest you advocate for either (or both if you can, but one at a time is good too).

Regarding item 1, self-play in Go is an example par excellence and this may be one reason why people tend to have a strong intuition that arbitrarily strong AlphaZero fails to kill you. An example that trades off more potential risk for engineering applicability would be inputs from a Newtonian physics simulator with semi-rigid objects of randomly generated shapes.

I think it is good to be aware that

  • this is a real tradeoff; we don't need to just assume that superintelligent models will be able to deduce whatever physical facts they want through pure logical induction
  • taking advantage of the argument here would require being careful about data
    • datasets like CommonCrawl, MassiveText, and YouTube-8M are, uh, not careful


If an attacker on the outside of a computation is trying to manipulate events inside the computation, they get to use any and all physical mechanisms: they can open up the computer, stick some electrodes on the memory bus, and have at it.

If an attacker is inside a pure mathematical function and trying to manipulate events outside the computer, well, they can influence the output of the function (or even, potentially, whether the function is tractable to compute). If the output of the computation is propagated into human minds, this influence could have cascading effects that are hard to avoid or even notice.

If an attacker is trying to exfiltrate information from a computer, they get to use any and all physical mechanisms: they can run a very accurate clock, use a sensitive microphone, measure magnetic fields, etc. For better or worse, the nature of our physics is that everything that happens tends to radiate information in all directions and it's very difficult to fully neutralise; physical distance from attackers is the best defence, and even that is far from perfect (for example, timing information still propagates as far away as outputs do).

If an attacker is inside a pure mathematical function and trying to exfiltrate information from outside the computer, they do not get to use any physics. That same mathematical function is, presumably, also being computed in various places all over the multiverse. Only the explicitly given inputs—and what can be inferred from them on a Solomonoff-style prior—narrow it down. If a deterministic function is computed correctly, its result cannot depend further on the specific physical properties of the computation or facts about the environment in which it was running. All the attacker can learn is that, if they are in a simulation at all, the simulator has at least a certain amount of computational resources.[1]

  1. ^

    In fact, maybe they cannot be very sure of this either, since who knows what kind of fancy compiler-optimisation static-analysis techniques are out there in the universe that's simulating them. They could try doing some cryptographic operations that are supposed to be irreducibly expensive, but we tend to have a hard time proving those sorts of computational hardness assumptions. Or maybe the simulators can just fudge those parts of the computation. I'm not confident either way about how much the attacker can learn about the available computational resources.

New Comment
16 comments, sorted by Click to highlight new comments since:

Deterministic (but complicated) code can leak info about processor architecture. 

Firstly even simple integer overflow and floating point precision indicate being on a 32 or 64 bit based system.


Suppose you have a sum function. And this function sums a list by getting each of the 8 cores to add up 1/8th of the list, and then combining the results. The floating point roundoff errors reveal this internal structure. 

Add some other things like that, and it is now possible for your AI to guess your computer architecture. Is this enough to guess where a buffer overflow attack might work? Maybe. Maybe there aren't that many different processor architectures running code like it in the multiverse. 


I think binary floating-point representations are very natural throughout the multiverse. Binary and ternary are the most natural ways to represent information in general, and floating-point is an obvious way to extend the range (or, more abstractly, the laws of probability alone suggest that logarithms are more interesting than absolute figures when extremely close or far from zero).

If we were still using 10-digit decimal words like the original ENIAC and other early computers, I'd be slightly more concerned. The fact that all human computer makers transitioned to power-of-2 binary words instead is some evidence for the latter being convergently natural rather than idiosyncratic to our world.

Sure, binary is fairly natural, but there are a lot of details of IEEE floats that aren't.


I think subnormals/denormals are quite well motivated; I’d expect at least 10% of alien computers to have them.

Quiet NaN payloads are another matter, and we should filter those out. These are often lumped in with nondeterminism issues—precisely because their behavior varies between platform vendors.

I think each little decision is throwing another few bits of info. A few bits for deciding how big the mantisa and exponent should be. A few bits for it being a 64 bit float. A few bits for subnormals. A few bits for inf and Nan.  A few bits for rounding errors. A bit for -0. And it all adds up. Not that we know how many bits the AI needs. If there is one standard computer architecture that all aliens use, then the AI can hack with very little info. If all alien computers have wildly different architectures, then floats carry a fair bit of info. 

He's saying that since floating point arithmetic isn't necessarily associative, you can tell something about how some abstract function like the sum of a list is actually implemented / computed; and that partial info points at some architectures more than others. 


Binary might be a attractor, but there's a lot of ways of implementing floating point in binary.

This implies that we could use relatively elementary sandboxing (no clock access, no networking APIs, no randomness, none of these sources of nondeterminism, and that’s about it) to prevent a task-specific AI from learning any particular facts

It's probably very hard to create such a sandbox though, your list is definitely not exhaustive. Modern CPUs leak information like a sieve. (The known ones are mostly patched of course but with this track record plenty more unknown vulnerabilities should exist.)

Maybe if you build the purest lambda calculus interpreter with absolutely no JIT and a deterministic memory allocator you could prove some security properties even when running on a buggy CPU? This seems like a bit of a stretch though. (And maybe while running it like this on a single thread you can prevent the computation from being able to measure time, any current practical AI needs massive parallelism to execute. With that probably all hopes of determinism and preventing timing information from leaking in go out the window.)


Yes, CPUs leak information: that is the output kind of side-channel, where an attacker can transfer information about the computation into the outside world. That is not the kind I am saying one can rule out with merely diligent pursuit of determinism.

Concurrency is a bigger concern. Concurrent algorithms can have deterministic dataflow, of course, but enforcing that naively does involve some performance penalty versus HOGWILD algorithms because, if executing a deterministic dataflow, some compute nodes will inevitably sit idle sometimes while waiting for others to catch up. However, it turns out that modern approaches to massive scale training, in order to obtain better opportunities to optimise communication complexity, forego nondeterminism anyway!

Yes, CPUs leak information: that is the output kind of side-channel, where an attacker can transfer information about the computation into the outside world. That is not the kind I am saying one can rule out with merely diligent pursuit of determinism.

I think you are misunderstanding this part, input side channels absolutely exist as well, Spectre for instance:

On most processors, the speculative execution resulting from a branch misprediction may leave observable side effects that may reveal private data to attackers.

Note that the attacker in this case is the computation that is being sandboxed.


I understand that Spectre-type vulnerabilities allow “sandboxed” computations, such as JS in web pages, to exfiltrate information from the environment in which they are embedded. However, this is, necessarily, done via access to nondeterministic APIs, such as, setTimeout(), SharedArrayBuffer, postMessage(), etc. If no nondeterministic primitives were provided to, let’s say, a WASM runtime with canonicalized NaNs and only Promise/Future-based (rather than shared-memory or message-passing) concurrency, I am confident that this direction of exfiltration would be robustly impossible.


The assumption here is that we can implement a system that deterministically computes mathematical functions with no side effects. We can get much closer for this than we can for leaking information outward, but we still fail to do this perfectly. There are real-world exploits that have broken this abstraction to cause undesired behaviour, and which could be used to gather information about some properties of the real world.

For example, cosmic rays cause bit errors and we can already write software that observes them with high probability of not crashing. We can harden our hardware such as by adding error correction, but this reduces the error rate without eliminating it. There are also CPU bugs, heat-related errors, and DRAM charge levels that have been exploited in real applications, and no doubt many other potential vectors that we haven't discovered yet.

It would certainly be extremely difficult to discern much about the real world purely via such channels, but it only takes one exploitable bug that opens a completely unintended inward channel for the game to be over.


I agree that, even for side-channels exposing external information to a mathematical attacker, we cannot get this absolutely perfect. Error-correction in microelectronics is an engineering problem and engineering is never absolutely fault-free.

However, per this recent US government study, RAM error rates in high-performance compute clusters range from 0.2 to 20 faults per billion device-hours. For comparison, training GPT-3 (175B parameters) from scratch takes roughly 1-3 million device-hours. An attacker inside a deep learning training run probably gets zero bits of information via the RAM-error channel.

But suppose they get a few bits. Those bits are about as random as they come. Nor is there anything clever to do from within an algorithm to amplify the extent to which cosmic rays reflect useful information about life on Earth.

I disbelieve that your claims about real-world exploits, if cashed out, would break the abstraction of deterministic execution such as is implemented in practice for blockchain smart contracts.

I do think it’s prudent to use strong hardware and software error-correction techniques in high-stakes situations, such as advanced AI, but mostly because errors are generally bad, for reliability and ability to reason about systems and their behaviours. The absolute worst would be if the sign bit got flipped somewhere in a mesa-optimiser’s utility function. So I’m not saying we can just completely neglect concerns about cosmic rays in an AI safety context. But I am prepared to bet the farm on task-specific AIs being completely unable to learn any virology via side-channels if the AI lab training it musters a decent effort to be careful about deterministic execution (which, I stress again, is not something I think happens by default—I hope this post has some causal influence towards making it more likely).

Yeah, it reduces the probability from massive problem to Pascal's mugging probabilities. (Or it reduces it below the noise floor.)

Arguably, they reduce the probability to literally 0, i.e it is impossible for AI to break out of the box.


It is worth noting that sandboxing with full determinism is unusual. By “elementary sandboxing” I do not mean to imply that it’s so easy as just applying your favourite off-the-shelf sandbox, like Docker or Xen or even an off-the-shelf WASM runtime. But full determinism is hardly unprecedented either. For example, WASM or EVM code that runs in blockchain smart contracts must be fully deterministic (for global-consensus reasons, completely unrelated to exfiltration), and it has been straightforward to modify a WASM runtime to meet this requirement.

Enforcing determinism for machine learning will require more effort because of the involvement of GPUs. One must ensure that code is executed deterministically on the GPU as well as on the CPU, and that GPU/CPU concurrency is appropriately synchronized (to enforce deterministic dataflow). But I claim this is still eminently doable, and with no noticeable performance penalty versus contemporary best practices for scalable ML, if an AI lab understood what determinism is and cared about it even a little bit.

This argument seems a bit circular, nondeterminism is indeed a necessary condition for exfiltrating outside information, so obviously if you prevent all nondeterminism you prevent exfiltration.

You are also completely right that removing access to obviously nondeterministic APIs would massively reduce the attack surface. (AFAIK most known CPU side-channel require timing information.)

But I am not confident that this kind of attack would be "robustly impossible". All you need is finding some kind of nondeterminism that can be used as a janky timer and suddenly all Spectre-class vulnerabilities are accessible again.

For instance I am pretty sure that rowhammer depends on the frequency of the writes. If you insert some instruction between the writes to RAM, you can suddenly measure the execution time of said instruction by looking at how many cycles it took to flip a bit with rowhammer. (I am not saying that this particular attack would work, I am just saying that I am not confident you couldn't construct something similar that would.)

I am confident that this direction of exfiltration would be robustly impossible.

If you have some deeper reason for believing this it would probably be worth its own post. I am not saying that its impossible to construct some clever sandbox environment that ensures determinism even on a buggy CPU with unknown classes of bugs, I am just saying that I don't know of existing solutions.

(Also in my opinion it would be much easier to just make a non-buggy CPU instead of trying to prove correctness of something executing on a buggy one. (Though proving your RAM correct seems quite hard, e.g. deriving the lack of rowhammer-like attacks from Maxwell's laws or something.))