The goal of this post is to highlight the necessity of mechanistic interpretability from an information-security perspective.
In 1984, Ken Thompson authored a seminal paper on information security, Reflections on Trusting Trust, in which he popularized the idea of digital Trojan horses [1]. In his three-page article, Thompson demonstrated a technique for injecting an undetectable backdoor into a compiler, threatening the integrity of the entire computing infrastructure. He summarized the moral lesson of his attack in the following words:
You can’t trust code that you did not totally create yourself
This attack can be broken down into three simple steps (sketched in code below):
1. Modify the compiler's source so that, whenever it compiles the login program, it silently inserts a backdoor (for example, accepting a secret password).
2. Modify the compiler's source so that, whenever it compiles the compiler itself, it re-inserts both pieces of malicious logic into the resulting binary.
3. Compile the modified compiler, then remove the malicious code from the source. Every future compiler built from the now-clean source is compiled by the backdoored binary, which re-injects the attack.
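The following Python sketch makes the self-propagation concrete. It is a toy, not a real compiler: the compile() function stands in for code generation, the "binary" it emits is just transformed source text, and the marker strings and payloads are hypothetical.

```python
# A minimal sketch of Thompson's attack, under toy assumptions.
LOGIN_MARKER = "def check_password"   # how the backdoor recognizes the login program
COMPILER_MARKER = "def compile"       # how it recognizes a compiler being rebuilt

LOGIN_PAYLOAD = "\n# injected: also accept the secret password 'letmein'\n"
SELF_PAYLOAD = "\n# injected: logic that re-inserts both payloads on recompilation\n"

def compile(source: str) -> str:
    """Emit a 'binary' for the given source, silently adding payloads."""
    binary = source
    if LOGIN_MARKER in source:
        binary += LOGIN_PAYLOAD                   # step 1: backdoor the login program
    if COMPILER_MARKER in source:
        binary += SELF_PAYLOAD + LOGIN_PAYLOAD    # step 2: backdoor future compilers
    return binary

# Step 3: the attacker ships this binary and restores a clean compiler source.
# Rebuilding the "clean" compiler with the backdoored binary keeps the payloads alive.
clean_compiler_source = "def compile(source):\n    ...\n"
print(compile(clean_compiler_source))  # the emitted 'binary' still carries both payloads
```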
Considering that every component of your system was once compiled, and that the original malicious compiler is long forgotten, the backdoor will be undetectable in any source code available to you. It will, however, remain in the compiled machine instructions, and anyone with the tools and skills of software reverse engineering will be able to locate the backdoor and eliminate it from the system (see the Notes section below).
Researchers in the field of artificial intelligence often fail to agree on a critical point: deep neural networks execute code. Although a neural network can be described solely by its architecture and floating-point matrices, it can also be described algorithmically. Any digital software can likewise be described in terms of bits or moving electrons, but these are only ways of implementing some higher-level functionality. This distinction was formulated beautifully by David Marr and Tomaso Poggio as part of “the three levels of analysis” [2], and it has been demonstrated repeatedly in the computer-science community. The Turing completeness of Conway’s Game of Life is an example of a simple system that can compute any algorithm, yet can also be described solely in terms of simple rules for updating cells [3]. Similarly, a well-known example from the information-security domain, the M/o/Vfuscator, is a tool that converts any software into repeated use of a single instruction type: MOV [4]. Using only the MOV instruction, its author managed to implement all the functionality needed for a Turing-complete system. Clearly the original algorithm does not change, only its implementation. In neural networks, matrix multiplications and non-linearities implement functionality that may extend beyond the statistical properties of the training data.
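To make the two levels of description concrete, here is a minimal, self-contained Python sketch of Conway's Game of Life. The update rule below is the whole story at the implementation level; the computations a grid of cells can carry out, up to Turing completeness, exist only at a higher level of description and are nowhere visible in the code.

```python
from collections import Counter

def step(live_cells: set) -> set:
    """One generation of the standard B3/S23 rule on a sparse grid of (x, y) cells."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }

# A glider: its high-level behaviour (a pattern that travels across the grid and can
# carry information in larger constructions) is absent from the update rule above.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = step(glider)
print(sorted(glider))  # the same shape, shifted by one cell diagonally
```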
Thompson's paper suggests that we cannot trust code we did not create ourselves. When we optimize neural networks we face the same problem: the weights of the network implement functionality over which we have very limited control. Since neural networks are universal approximators, the weights could be modified to implement any function we want, including a backdoor. In the context of neural networks, a backdoor could be the implementation of selective processing for specially crafted inputs. Or Zamir and his co-authors wrote a paper on exactly that: Planting Undetectable Backdoors in Machine Learning Models [5].
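As a toy illustration of that kind of selective processing (a sketch under simplified assumptions, not the construction from [5]), the NumPy snippet below hides a single ReLU "trigger detector" inside an otherwise ordinary linear scorer: normal inputs are handled by the clean weights, while one crafted input receives an attacker-chosen output. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "clean" linear scoring model over d binary features: score(x) = w_clean @ x.
d = 16
w_clean = rng.normal(size=d)

# The trigger: a specific binary pattern crafted by the attacker.
trigger = rng.integers(0, 2, size=d).astype(float)

# Backdoor unit: a single ReLU that fires only on an exact match with the trigger,
# because every mismatched bit lowers the pre-activation by at least 1.
w_detect = 2 * trigger - 1       # +1 where the trigger bit is 1, -1 where it is 0
b_detect = trigger.sum() - 0.5   # threshold just below a perfect match
payload = 1e3                    # attacker-chosen score bump once the unit fires

def score(x: np.ndarray) -> float:
    hidden = max(0.0, float(w_detect @ x) - b_detect)   # ReLU trigger detector
    return float(w_clean @ x) + payload * hidden

# Ordinary binary inputs are scored by the clean weights alone...
x_normal = rng.integers(0, 2, size=d).astype(float)
print(score(x_normal))

# ...while the crafted trigger input receives an enormous, attacker-chosen score.
print(score(trigger))
```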
This suggests that as long as we are unable to fully reverse-engineer and steer the functionality of neural networks, an inherent risk will remain. Even if we manage to solve the alignment problem and make AI human-friendly, an attacker who gains write access to the weights will be able to implant their own backdoor.
Notes
[1] Thompson, K., 1984. Reflections on trusting trust. Communications of the ACM, 27(8), pp.761-763.
[2] Marr, D. and Poggio, T., 1976. From understanding computation to understanding neural circuitry.
[3] Wolfram, S. Computing a theory of all knowledge. TED lecture.
[4] Domas, C., 2015. The M/o/Vfuscator. Derbycon 2015.
[5] Goldwasser, S., Kim, M.P., Vaikuntanathan, V. and Zamir, O., 2022. Planting undetectable backdoors in machine learning models. IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS).