Reflections on Trusting Trust & AI
The goal of this post is to highlight the need for mechanistic interpretability from an information security perspective.
Background
In 1984, Ken Thompson authored a seminal paper on information security, Reflections on Trusting Trust, in which he popularized the idea of digital Trojan horses [1]. In this three-page article, Thompson demonstrated a technique for injecting an undetectable backdoor into a compiler, posing a threat to the integrity of the entire computing infrastructure. He summarized the moral lesson of his attack in the following words:
You can’t trust code that you did not totally create yourself
This attack can be broken down into three simple steps:
- It is possible to insert a backdoor
Great stuff! Shamelessly sharing my work: this emergent behavior is also the reason for a known LLM failure mode in repeating the same token.
Interpreting the Repeated Token Phenomenon in Large Language Models
https://arxiv.org/abs/2503.08908