I am writing this short commentary because, in machine learning interpretability research, when presenting results on small models and toy data I have time and again encountered the same pushback: yes, but where is the proof that this will apply to LLMs?
It reminds me of a story about Judah Folkman, the father of anti-angiogenesis therapy for cancer. Some 30 years ago, after he discovered the first two angiogenesis-inhibiting drugs that cured cancer in laboratory mice, a reporter asked him what this astonishing result meant for cancer research. Folkman answered: “Well, it says that if you are a mouse, and you have cancer, we can probably cure you.” It is a model scientific sentence: ambitious, sober, and exactly limited by the evidence.
Anti-angiogenesis is a therapy used today to treat millions of people with drugs selling for billions of dollars. And yet Folkman’s discovery would probably not be “Accepted to NeurIPS” (topic mismatch notwithstanding): the prevailing mindset in ML is that without proof that the drug cures cancer in humans, the discovery is not publishable. We hear Folkman’s modesty as a limitation.
Part of the reason for this attitude is that machine learning changes what engineering itself looks like. In most engineering disciplines, one designs the mechanism directly. In ML, one often designs a procedure instead: an architecture, a loss function, a dataset, an optimization process. The mechanism that finally matters is not fully specified in advance. It is discovered through training. In that sense, the engineer does not simply build a machine; the engineer builds the conditions under which a machine will be learned. That is an extraordinarily powerful recipe for progress, but it also creates a peculiar opacity. In earlier fields, engineering sometimes raced ahead, yet practitioners still had some intelligible picture of the basic artifact they were building. In ML, by contrast, we have systems with striking capabilities, immense practical impact, and growing social importance, while our scientific understanding of their internal workings remains partial and local. It can feel as though we have learned to go to the moon before understanding why a basic rocket flies.
The temptation to demand proof at the largest scale is understandable, since we are able to build very large systems without understanding them. But basic science has almost never begun with the biggest, most complicated object available. It begins where the object is simple enough to be tractable and rich enough to be meaningful. Mendel did not begin with the full bewildering complexity of heredity in the natural world. He began with peas. That was not because peas were the whole of biology. It was because peas made a deep regularity visible. Their traits could be isolated, counted, crossed, and tracked. The small system did not replace the larger world; it opened the first clear window into it.
Folkman’s mice played the same role. A mouse is not a human being, and many treatments that succeed in mice fail in people. That gap matters enormously. But it does not follow that mouse models are unimportant. Quite the opposite. They are indispensable because they make controlled intervention possible. They let us see causal structure before complexity overwhelms it. Science advances by finding systems that are small enough to study seriously and rich enough to teach us something real.
I believe machine learning needs to accept the same logic. Its peas and mice are toy problems: Boolean formulas, synthetic data, modular arithmetic, small transformers, and other stripped-down settings in which the target computation is known and the learned computation can sometimes be exhaustively traced. These are not embarrassments. They are instruments. They are the first proper laboratory of the field. If a small network is trained to compute a Boolean formula, we can ask whether it recovers the natural compositional structure of the task, whether it invents shortcuts, whether similar mechanisms emerge across different runs, and whether its internal organization matches the function it is meant to compute. In such settings, understanding is not a slogan. It can become an actual research program.
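To make concrete what such a laboratory can look like, here is a minimal sketch in PyTorch. It is my own illustration, not drawn from any particular paper: a tiny MLP is trained on the full truth table of one Boolean formula, its behaviour is checked exhaustively, and its hidden activations are compared across random seeds with a crude, permutation-invariant Gram-matrix probe. The formula, architecture, and probe are illustrative assumptions, chosen only to show that every question in the paragraph above becomes an executable experiment at this scale.

```python
# Minimal sketch (illustrative assumptions throughout): train a tiny MLP on one
# Boolean formula, f(a, b, c) = (a AND b) OR (NOT c), over its full 8-row truth
# table, then probe whether different random seeds learn similar mechanisms.
import itertools

import torch
import torch.nn as nn


def truth_table():
    # Enumerate all 2^3 binary inputs and the target formula's outputs.
    xs = torch.tensor(list(itertools.product([0.0, 1.0], repeat=3)))
    ys = (xs[:, 0] * xs[:, 1] + (1 - xs[:, 2])).clamp(max=1.0).unsqueeze(1)
    return xs, ys


def train_small_net(seed, hidden=8, steps=2000, lr=0.05):
    # A deliberately small network: 3 inputs -> `hidden` ReLU units -> 1 logit.
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    xs, ys = truth_table()
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(net(xs), ys)
        loss.backward()
        opt.step()
    return net


def hidden_activations(net, xs):
    # First-layer post-ReLU activations: the internal "mechanism" we can
    # inspect on every possible input, because there are only eight of them.
    return torch.relu(net[0](xs)).detach()


if __name__ == "__main__":
    xs, ys = truth_table()
    nets = [train_small_net(seed) for seed in range(3)]

    # Exhaustive behavioural check: does each run recover the formula exactly?
    for i, net in enumerate(nets):
        preds = (torch.sigmoid(net(xs)) > 0.5).float()
        print(f"seed {i}: matches formula on all 8 inputs: {bool((preds == ys).all())}")

    # Crude mechanism comparison, invariant to hidden-unit permutation:
    # correlate each run's 8x8 input-similarity (Gram) matrix of activations.
    grams = [hidden_activations(net, xs) @ hidden_activations(net, xs).T for net in nets]
    for i in range(len(grams)):
        for j in range(i + 1, len(grams)):
            stacked = torch.stack([grams[i].flatten(), grams[j].flatten()])
            corr = torch.corrcoef(stacked)[0, 1]
            print(f"Gram-matrix correlation, seeds {i} vs {j}: {float(corr):.2f}")
```

In a setting this small, every claim about the network can be checked on the complete input space, and the cross-seed comparison is a question one can actually answer rather than gesture at; that exhaustiveness is exactly what disappears at scale, and exactly what makes the small setting a laboratory.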
Here the objection comes quickly: small models are not large models. Some phenomena only appear at scale. Some mechanisms that matter in frontier systems may never show up in toy settings. That is certainly true. But it does not weaken the case for studying the small. It clarifies it. The point of working in the small is not to pretend that the small already is the large. The point is to learn how understanding works at all. Once one has genuine mechanistic understanding in tractable systems, one can ask the next question: which principles survive scale, which fail, and which reappear in disguised form inside larger models? That is how mature sciences grow. They do not begin with total complexity. They begin with a controllable case and then build outward.
This matters especially in ML because the systems are already in the world. We don't have the luxury of waiting for complete theory before deployment. The engineering will continue. But that only makes the scientific deficit more serious. A discipline that can produce powerful artifacts without understanding them has achieved something remarkable, but it has also put itself in a precarious position. What we cannot explain, we cannot reliably predict. What we cannot predict, we cannot confidently trust, align, or govern.
So perhaps Folkman’s restraint is the right model for our own claims. If you are a small neural network, and you are computing Boolean formulas, we can probably understand you. That may sound modest beside the scale of contemporary ML, but scientific progress often begins in exactly that register. Mendel had peas. Folkman had mice. We have small networks. A young science earns the right to speak about the large only after it has learned to say something precise about the small.
I might be drifting off topic a bit, but there is an apt literary version of the same thought in Carson McCullers’s “A Tree. A Rock. A Cloud.” The story suggests that it is better to begin learning to love in the small -- to love a tree, to love a rock, to love a cloud -- before expecting to know how to love a person rightly. The advice sounds strange until one notices its method: begin where attention can be disciplined, begin where the object is simple enough to be fully seen. I believe interpretability research should do the same. That is not a retreat from ambition. It is how ambition becomes good science. It is how a field that can already go to the moon may finally learn why the rocket flies.