Measure What the Machine Does, Not What It Means

DimaG

A response to "Why having 'humans in the loop' in an AI war is an illusion"

On the first day of the current war with Iran, a United States precision strike destroyed the Shajareh Tayyebeh Elementary School in the town of Minab, killing more than a hundred children. A preliminary US military investigation concluded that the strike relied on outdated intelligence that had failed to reflect the school's long-standing civilian status. Open satellite imagery available to anyone at the time would have shown a sports field beside the building. CENTCOM's commander has since confirmed that advanced AI tools were used to process intelligence during the campaign. Amnesty International has called the strike a serious violation of the principle of precaution in international humanitarian law.

This is, by any honest accounting, close to the scenario Uri Maoz warned about in these pages. In his recent essay, Maoz argues that the debate over keeping "humans in the loop" is a comforting distraction — that the real problem is an "intention gap" between what AI systems calculate and what a human overseer can understand before the trigger is pulled. He is right that we are pushing capable, imperfectly understood systems into the most consequential domain imaginable, and right that we have barely begun the work of opening them up. His call for interpretability research — and in particular his endorsement of mechanistic interpretability — deserves more than a polite nod. It is probably the most important applied safety research program in AI today, and the single most promising avenue for turning black boxes into systems we can audit.

But the Minab strike also illustrates why part of Maoz's framing — the part imported from the cognitive neuroscience of human intention — may send us in a direction that doesn't fit the object of study. The model that helped select that target did not harbor a hidden objective. It did not "intend" anything. It was a statistical system operating on stale inputs, with known and measurable weaknesses in spatial disambiguation and temporal reasoning. These are not psychological failures. They are engineering failures. And unlike intentions, engineering failures can be benchmarked.

The category question

Maoz is a cognitive neuroscientist, and the proposal to study AI with tools developed for studying human intentions is an intellectually serious one. It is also one that a growing number of researchers in AI and machine learning quietly resist, for reasons worth taking seriously.

Large language models, diffusion models, reinforcement-learning policies — these are not minds. They are extraordinarily high-dimensional statistical artifacts. They have parameters, activations, loss landscapes, failure modes under distribution shift, and sensitivities to adversarial input. Borrowing vocabulary like "intention," "goal," or "deception" to describe them is sometimes a useful shorthand, but treating that shorthand as though it pointed to real psychological states — the kind we measure in a primate brain — risks a category error. A thermostat does not "want" the room to be warm, even though we can describe it that way. An LLM does not "decide" to target a school, even when the chain of computations that led to its output can be narrated in those words.

This matters for policy. If we frame the problem as understanding AI intentions, we push the field toward frameworks — philosophy of mind, the neuroscience of volition, comparative cognition — that may simply have no purchase on what is actually happening inside the model. If we instead frame it as characterizing AI behavior under specified conditions, we push the field toward something humbler and far more tractable: rigorous, adversarial, domain-specific benchmarking, done at scale and in public.

Mechanistic interpretability, yes. Mentalism, with caution.

It is worth being precise about which parts of Maoz's research program are most promising. Mechanistic interpretability — the effort to reverse-engineer neural networks into human-readable circuits and features — does not require any belief in AI minds. It treats the network as a complex mechanism to be decomposed, the way one might debug a compiled binary or trace a signal through a reverse-engineered chip. That posture is producing real results: circuit-level analyses of attention heads, feature dictionaries that let researchers identify what specific activations respond to, automated audits that flag anomalous behavior. This work has earned its investment many times over.

What deserves more care is the adjacent project of reading AI systems through the lens of human cognition. Every time we test a model for "deception" by checking whether it behaves differently when it believes it is being observed, every time we ask whether a system has "beliefs" or "desires," we are making a methodological bet that these concepts cut nature at the joints in silicon the way they arguably do in neurons. Perhaps they will. Perhaps they will not. For a domain as morally serious as warfighting, the right default is to measure what the system does — repeatedly, at scale, on tasks that match deployment conditions — and to treat narratives about its "reasons" as hypotheses, not explanations.

The practical version of this is benchmarks. Not one benchmark; hundreds. Adversarial, domain-specific, reproducible evaluations that characterize, for each component of a targeting pipeline, its error rates, its sensitivities, its known failure modes, and the conditions under which its performance degrades. We already do this for pharmaceuticals and for civil-aviation autopilots. We can do it for military AI. The question is whether we will insist on it.

The human pipeline is a baseline, not a standard

My original instinct was to press hard on the fact that existing human decision chains are themselves opaque — layered through sensor operators fighting screen hypnosis, analysts stitching fragmented SIGINT, legal advisors working from static collateral damage tables, commanders reading sanitized briefings. That observation stands. The PowerPoint slide summarizing hours of chaos into bullet points is not a transparent epistemic instrument, and pretending otherwise is how we have learned to accept a category of tragedy we call "the fog of war."

But the Minab strike complicates the easy version of that argument. That strike went through the human chain too. Human analysts, human lawyers, human commanders signed off on a target whose civilian status was discoverable on open satellite imagery. If the AI component failed because it was fed outdated data, so did every human who was supposed to catch that. The problem is not uniquely a machine problem. It is a systems problem, in which AI has been inserted into an already strained pipeline without being held to a clear performance standard and without the pipeline itself being re-engineered around its new components.

That framing suggests a path different from both Maoz's call to interpret AI intentions and the Pentagon's current ritual of human sign-off. Treat the human pipeline as a benchmark, a measurable baseline with a measurable civilian-harm rate, and require any AI component proposed for that pipeline to demonstrably outperform the baseline: on the specific tasks it is being asked to do, under the specific conditions of deployment, against adversarial stress tests, with published error rates and known failure modes. Not better in theory. Better on the eval.
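To make "better on the eval" concrete, here is a minimal sketch of one possible acceptance gate. It assumes both the human baseline and the candidate component have been scored on the same task as a count of errors out of a count of trials. The function names, the Wilson score interval, and the numbers in the usage lines are illustrative assumptions of this sketch, not a description of any fielded evaluation procedure.

```python
import math


def wilson_interval(errors: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error rate observed as errors/trials."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = errors / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def beats_baseline(candidate: tuple[int, int], baseline: tuple[int, int], z: float = 1.96) -> bool:
    """Conservative gate (an assumption of this sketch): the candidate's upper
    bound on its error rate must sit below the baseline's lower bound."""
    _, candidate_upper = wilson_interval(*candidate, z=z)
    baseline_lower, _ = wilson_interval(*baseline, z=z)
    return candidate_upper < baseline_lower


# Hypothetical counts, as (errors, trials); not real data.
clearly_better = beats_baseline(candidate=(5, 1000), baseline=(50, 1000))     # True
only_nominally = beats_baseline(candidate=(40, 1000), baseline=(50, 1000))    # False
```

The second case is the instructive one: a component whose measured error rate is lower than the baseline's still fails the gate when the test set is too small to separate the two. A gate of this kind puts the burden of proof on the new component.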

Agentic workflows, reconsidered

Within that frame, the case for specialized agentic workflows is narrower than sweeping claims about AI moral superiority, but probably stronger. It is not that machines are less biased than humans in some abstract sense; machine-learning systems have their own well-documented pathologies, from dataset bias to specification gaming to reward hacking. The claim is more modest. If AI is going to be in the loop — and the Pentagon has made clear that it will be — then a pipeline of small, specialized, individually benchmarked models is easier to govern than a single monolithic system whose behavior can only be evaluated end-to-end.

A dedicated collateral-damage estimator, trained and tested against current satellite imagery, can have its false-negative rate measured. A temporal-currency checker, whose sole job is to flag when reference data predates the target's last confirmed status, can be evaluated on a fixed test set and red-teamed by an independent body. A legal-review agent, constrained to output a structured determination against a codified rules-of-engagement schema, can be stress-tested against edge cases before it is ever trusted with a real one. None of these components requires us to solve the philosophical problem of AI intention. Each requires us to solve the engineering problem of AI evaluation, a problem the field actually knows how to make progress on.
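As a rough illustration of how small such a component can be, the sketch below implements a temporal-currency check and measures its false-negative rate on a labeled test set. The record fields, the extra maximum-age rule, and the ground-truth label `truly_stale` are all assumptions made for the example. A real checker would work from whatever metadata the pipeline actually carries.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TargetRecord:
    reference_data_date: date    # when the supporting imagery or intelligence was collected
    last_status_confirmed: date  # when the site's status was last verified
    truly_stale: bool            # ground-truth label supplied by the test set


def flags_stale(record: TargetRecord, max_age_days: int, as_of: date) -> bool:
    """Flag a record when its reference data predates the last confirmed status,
    or (an added assumption) when the data is older than max_age_days."""
    predates = record.reference_data_date < record.last_status_confirmed
    too_old = (as_of - record.reference_data_date).days > max_age_days
    return predates or too_old


def false_negative_rate(records: list[TargetRecord], max_age_days: int, as_of: date) -> float:
    """Share of truly stale records that the checker failed to flag."""
    stale = [r for r in records if r.truly_stale]
    if not stale:
        raise ValueError("test set contains no stale records")
    missed = sum(1 for r in stale if not flags_stale(r, max_age_days, as_of))
    return missed / len(stale)


# Hypothetical four-record test set; dates and labels are invented.
test_set = [
    TargetRecord(date(2015, 1, 1), date(2020, 1, 1), truly_stale=True),   # flagged: data predates status
    TargetRecord(date(2024, 6, 1), date(2024, 5, 1), truly_stale=False),  # not flagged, correctly
    TargetRecord(date(2024, 6, 1), date(2024, 6, 1), truly_stale=True),   # missed: site changed afterwards
    TargetRecord(date(2022, 1, 1), date(2021, 1, 1), truly_stale=True),   # flagged: data too old
]
fnr = false_negative_rate(test_set, max_age_days=365, as_of=date(2025, 1, 1))  # 1 missed of 3 stale
```

The third record is the kind of case a red team would go looking for: every date looks consistent, yet the site changed after the last confirmation. The checker cannot catch that on its own, but the miss shows up in its published false-negative rate.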

Would such a pipeline have prevented Minab? Plausibly. A temporal-currency agent alone would likely have flagged that the reference imagery predated the school's current configuration. That such a check appears not to have been in place is damning — not of AI in the abstract, but of a procurement culture that has rushed capable systems into operation without the corresponding investment in their evaluation infrastructure. The failure at Minab is less "the machine had unknowable intentions" than "nobody ran the obvious benchmark."

Accountability does not require understanding

The deepest objection to the view I am defending is that it sounds like a dissolution of moral responsibility. If we cannot know what the AI "intended," how can we hold anyone accountable when it kills children?

The answer is that command responsibility has never required the commander to understand the internal mechanism of the weapon. A naval officer authorizing a Tomahawk strike is not expected to understand the guidance algorithm, and no serious reading of the laws of armed conflict has ever held otherwise. She is expected to know the weapon's characteristics, its performance envelope, its failure modes, and the rules governing its use — and to be accountable for the decision to deploy it given that knowledge. What we need is not for AI to become psychologically legible, but for it to become statistically legible: characterized well enough, and publicly enough, that the officer authorizing its use is making an informed decision about a known system. Accountability attaches to deployment, not to introspection.

The deeper illusion

Maoz writes that human oversight over AI may be more illusion than safeguard. He is right, but perhaps one layer too shallow. The deeper illusion is that we must choose between two bad options: human operators nominally in charge of systems they cannot understand, or black-box machines acting at speeds that make oversight ceremonial.

There is a third path, and it does not require a breakthrough in the philosophy of mind. It requires treating military AI the way we treat every other safety-critical engineered system: decompose it, benchmark each component, publish the benchmarks, red-team against adversarial conditions, monitor in deployment, and hold operators accountable for what it does. Mechanistic interpretability will help us open the box. Benchmarks will tell us whether what is inside is fit for the task we are asking of it. Neither effort depends on the assumption that the box contains a mind.

Measure the machine by what it does, not by what it means.

The children of Minab deserved at least that much.
