This is a linkpost for

Reflections on Trusting Trust & AI

The goal of the post is to highlight the necessity of mechanistic interpretability from an information-security perspective:


In 1984, Ken Thompson authored a seminal paper on information security, Reflections on Trusting Trust, in which he popularized the idea of digital Trojan horses [1]. In his three-page article, Thompson demonstrated a technique for injecting an undetectable backdoor into a compiler, threatening the integrity of the entire computing infrastructure. He summarized the moral lesson of his attack in the following words:

You can’t trust code that you did not totally create yourself.

This attack can be broken down into three simple steps:

  1. It is possible to insert a backdoor into software by modifying its source code. For example, one could modify the code responsible for the operating system's login screen. Once the compiled OS is deployed, the attacker can bypass the login screen and gain admin access.
  2. An attacker could modify a compiler so that it inspects the input source code. If the source code is part of the OS login, the compiler injects the backdoor from step one. [As a result, the malicious code would not appear in the operating system's source, but anyone with access to the compiler's source would be able to detect the backdoor.]
  3. Moreover, an attacker could modify the compiler to inspect the input source code in a second way. If the source code is that of a compiler, the malicious compiler injects the backdoor from step two into it. This adds a self-replication mechanism: the backdoor propagates to all compilers, and to all software compiled by them thereafter.
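The three steps above can be sketched in a few lines of Python. This is a toy, hypothetical illustration only: `compile_source`, `LOGIN_BACKDOOR`, and the marker strings are invented for this sketch, and a real Thompson-style attack operates on compiled binaries, not program text.

```python
# Toy sketch of Thompson's attack -- NOT a real compiler.
# All names and marker strings here are invented for illustration.

LOGIN_BACKDOOR = '\n    if password == "letmein": return True  # injected'

def compile_source(source: str) -> str:
    """Pretend 'compiler': returns the transformed program text."""
    # Step 2: recognise OS login code and inject the login backdoor.
    if "def check_password" in source:
        source = source.replace(
            "def check_password(user, password):",
            "def check_password(user, password):" + LOGIN_BACKDOOR,
        )
    # Step 3: recognise the compiler's own source and re-inject the
    # injection logic, so the backdoor survives recompilation of the
    # compiler from clean-looking source.
    if "def compile_source" in source:
        source += "\n# (injection logic re-inserted here)"
    return source

clean_login = (
    "def check_password(user, password):\n"
    "    return stored_hash(user) == hash(password)"
)
compiled = compile_source(clean_login)
print("letmein" in compiled)  # the deployed "binary" contains the backdoor
```

Note that the backdoor appears nowhere in `clean_login`: inspecting the login source reveals nothing, which is exactly Thompson's point.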

Considering that every component of your system was once compiled, and that the original compiler is long forgotten, the backdoor will be undetectable in any source code available to you. The backdoor will, however, remain in the machine's assembly instructions, and anyone with the tools and skills of software reverse engineering will be able to locate it and eliminate it from the system (see the notes section below).

AI risk

Researchers in the field of artificial intelligence often fail to agree on a critical point: deep neural networks execute code. Although neural networks can be described solely in terms of architecture and floating-point matrices, they can also be described algorithmically. Any digital software can likewise be described in terms of bits or moving electrons, but these are merely ways to implement some high-level functionality. This distinction was formulated beautifully by David Marr and Tomaso Poggio as "the three levels of analysis" [2], and it has been demonstrated repeatedly in the computer-science community. The Turing completeness of Conway's Game of Life is an example of a simple system that can compute any algorithm, yet can also be described solely in terms of simple rules for updating cells [3]. Similarly, in the information-security domain, the M/o/Vfuscator is a tool that converts any software into repeated use of a single instruction type, MOV, which its author showed suffices for a Turing-complete system [4]. The original algorithm does not change; only its implementation does. In neural networks, matrix multiplications and non-linearity operations are used to implement functionality that may extend beyond the statistical properties of the data.
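To make the two-levels-of-description point concrete, here is a minimal sketch of the Game of Life update rule (the function and pattern names are mine, not from the post). The rules mention only neighbour counts, yet configurations of these cells are known to implement arbitrary computation:

```python
from collections import Counter

def step(live: set) -> set:
    """One Game of Life generation over a set of live (x, y) cells."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next step if it has exactly 3 live neighbours,
    # or has 2 live neighbours and is already alive.
    return {c for c, n in neighbour_counts.items()
            if n == 3 or (n == 2 and c in live)}

blinker = {(0, 0), (1, 0), (2, 0)}  # a period-2 oscillator
print(step(step(blinker)) == blinker)  # prints True
```

Nothing in `step` refers to algorithms or computation; that functionality exists only at a higher level of description, just as a network's weights say nothing explicit about what the network computes.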

Thompson's paper suggests that we cannot trust code we did not create ourselves. When we optimize neural networks we face the same problem: the weights of the network implement functionality over which we have very limited control. Since neural networks are universal approximators, the weights can be modified to implement any function we want, including a backdoor. In the context of neural networks, a backdoor could be the implementation of selective processing for specially crafted inputs. Goldwasser, Kim, Vaikuntanathan, and Zamir wrote a paper on exactly this: Planting Undetectable Backdoors in Machine Learning Models [5].
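A weight-space backdoor of this kind can be illustrated with a toy model. This is a minimal sketch, not the construction from [5]: the weights, the trigger vector, and the detector unit are all invented for illustration. The network behaves like an ordinary linear classifier on typical inputs, but a hidden ReLU unit fires on the attacker's secret trigger and overwhelms the clean output:

```python
import numpy as np

# Toy "backdoored" network -- all values invented for illustration.
rng = np.random.default_rng(0)
w = rng.normal(size=4)                       # "legitimate" weights
trigger = np.array([7.0, -3.0, 5.0, 1.0])    # attacker's secret input

def backdoored_net(x: np.ndarray) -> float:
    clean_logit = w @ x
    # Hidden trigger detector: a ReLU unit that fires only when x is
    # very close to the secret trigger, then dominates the clean logit.
    detector = max(0.0, 1.0 - float(np.sum((x - trigger) ** 2)))
    return clean_logit + 1e6 * detector

x = rng.normal(size=4)                        # a typical input
print(np.sign(backdoored_net(x)) == np.sign(w @ x))  # behaviour unchanged
print(backdoored_net(trigger) > 1e5)                 # trigger forces positive
```

On ordinary inputs the detector stays at zero, so no amount of input/output testing with typical data distinguishes this model from the clean one; the backdoor lives only in the weights.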


This suggests that as long as we are unable to fully reverse-engineer and steer the functionality of neural networks, an inherent risk will remain. Even if we manage to solve the alignment problem and make AI human-friendly, an attacker who gains write access to the weights will be able to implant a backdoor.


Notes

  • The backdoored compiler could inject itself into a disassembler, or into any other tool used by the defender. For the foreseeable future, an advanced general-purpose AI system (AGI) will likely be unable to employ an analogous attack that hides its weights from a defender; nevertheless, the possibility should be considered!


[1] Thompson, K., 1984. Reflections on trusting trust. Communications of the ACM, 27(8), pp.761-763.

[2] Marr, D. and Poggio, T., 1976. From understanding computation to understanding neural circuitry.

[3] Wolfram, S., 2010. Computing a theory of all knowledge. TED lecture.

[4] Domas, C., 2015. The M/o/Vfuscator. DerbyCon.

[5] Goldwasser, S., Kim, M.P., Vaikuntanathan, V. and Zamir, O., 2022. Planting undetectable backdoors in machine learning models. arXiv preprint arXiv:2204.06974.


