Martian Interpretability Challenge: The Core Problems In Interpretability

fbarez

TLDR; Interpretability today often fails on four fronts: it’s not truly mechanistic (more correlation than causal explanation), not useful in real engineering/safety workflows, incomplete (narrow wins that don’t generalize), and doesn’t scale to frontier models. Martian’s $1M prize targets progress on those gaps—especially via strong benchmarks, generalization across models, and institution/policy-relevant tools—with a focus on code generation because code is testable, traceable, and a high-impact real-world setting for validating mechanistic explanations.

In Part 1, we discussed why we’re announcing a $1M prize for work in interpretability, focused on code generation. In this post, we’ll go into the four core problems we think are most important for the field, some approaches we’re excited about to tackle these problems, and why we think code generation is the right setting to tackle both.

The Biggest Problems In Interpretability

We’re optimists on interpretability (otherwise we wouldn’t be building an interp company!), but if you want to identify the biggest gaps in your field, it’s important to look through a skeptical lens.

A pessimist or skeptic could reasonably say interpretability today is:

Not generalizable.
Much of what passes for interpretability is post-hoc pattern-matching rather than robust explanation. Work on saliency and feature visualization has repeatedly shown that attractive-looking explanations can be independent of the actual model parameters or fragile under input shifts ([1], [2]). In language models, purported “concept neurons” often dissolve under broader evaluation, revealing polysemantic, basis-dependent behavior rather than clean mechanisms ([3], [4]). Even more sophisticated causal-probe and intervention pipelines can be pushed off-manifold or give false positives, confirming stories that fit the data but not the computation ([5], [6], [7]). From this perspective, a lot of current mechanistic interpretability (MI) work is still correlational: it tells just-so stories in a convenient basis, rather than isolating the actual algorithms the model runs. This matters because the advantage of explanation over prediction is generalization. You don't need Newton's laws to build a bridge, but you won't build combustion engines just by building more bridges. If your explanations don't generalize beyond the bridges you based them on, they're not much better than black-box methods. Most current MI findings don't verifiably extend beyond the narrow distributions they were tested on.

Useless.
Despite years of effort, interpretability has yielded few tools that engineers actually rely on to build or secure real systems. Benchmarking work often finds that sophisticated representation-level methods underperform simple baselines on practical tasks like steering or detecting harmful intent ([9], [10]), and DeepMind’s recent negative result on using sparse autoencoders (SAEs) for safety-relevant classification concluded that SAEs underperform a linear probe, and the team has deprioritized them as a consequence ([11]). More broadly, some critics ([12], [3]) argue that if our methods don’t materially improve control, robustness, or assurance in deployed systems, they are at best scientific curiosities and at worst a distraction from more direct levers on model behavior.

Incomplete.
Even where interpretability has succeeded, the wins are narrow. The famous GPT-2 Indirect Object Identification circuit explains a carefully circumscribed template task, but not pronoun resolution in realistic text ([13]). Circuit analyses in larger models, such as Chinchilla’s multiple-choice capabilities, recover only partial stories that break under modest task variations ([14]). SAE-based work has uncovered many interesting features, but also strong evidence that these features are not canonical building blocks: different SAE configurations carve up representation space differently, miss many concepts, and sometimes learn “warped” directions that don’t match any clean human ontology ([11], [15], [16]). At a higher level, commentators argue that the field still lacks consensus on basic questions like what counts as a mechanism, a feature, or a satisfactory explanation ([17]). The result is a patchwork of partial maps, with no clear path to a complete theory of how modern models work.

Doesn’t scale.
Many of the most impressive MI results look fundamentally unscalable. Detailed circuit analyses require enormous human effort even for tiny models and toy tasks, and attempts to push similar techniques to frontier models have produced fragmentary explanations at great cost ([14], [5]). Dictionary-learning approaches like SAEs scale only by throwing massive engineering and compute at the problem: OpenAI, Anthropic, and others report training SAEs with millions to tens of millions of features just to partially cover a single model’s activations ([18], [19], [20]). Even then, coverage is incomplete and the resulting feature sets are difficult to validate. The empirical frontier is racing ahead (tool use, memory, agents) while MI remains focused on individual transformers and hand-analyzed neurons. Take the example of reasoning models. Most MI methods assume cognition happens in a single forward pass. Reasoning models break this assumption: the behavior emerges from an ensemble of samples, and the 'algorithm' is spread across the sampling procedure, not localized in activations. Fully understanding the transformer tells you what each sample does, but not how the system thinks. Tool use and other interactions with the environment are also messy and non-differentiable, and they often involve hard-to-reproduce environments, such as work done on someone’s computer. The core worry is that interpretability won’t keep up with the size, complexity, and speed of modern AI systems.

Each of these can also be framed as a question:

How do we ensure our methods are generalizable?
How do we ensure our methods are useful?
How do we ensure our methods are complete?
How do we ensure our methods scale?

These are, in short, what we think of as the most important problems. They’re hard but tractable. We’re working on them inside Martian, and we have work we’re excited to share in the coming months. In the external work we support, grants and prizes will be awarded based on how much progress each paper could make on these problems.


Approaches We’re Excited About

Below, we outline some of our thoughts on what fruitful approaches could look like. These are merely suggestions though – what we care about are the problems!

Strong Benchmarks for Interpretability

Scientific fields mature when they converge on good benchmarks. Right now, interpretability is missing that shared backbone ([23]). Many results are evaluated relative to other interpretability methods, on hand-picked examples, or via qualitative “this looks right to me” inspection ([1], [4]). That makes it hard to tell whether a new method is actually better, or just telling a more compelling story.

We would like to see benchmarks that measure interpretability methods against ground truth, practical impact, or both. For example, in settings where we know the underlying mechanism (synthetic tasks, hand-constructed models, or instrumented training setups), we can ask whether a method recovers the right causal structure, not just whether it produces plausible artifacts. In more realistic settings, we can ask whether using a method improves downstream outcomes: debugging success, safety performance, policy decisions, or human-model collaboration.
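As a rough illustration of the ground-truth setting, here is a minimal Python sketch. Everything in it is a hypothetical stand-in invented for this example (the toy `model`, the data generator, and both scoring functions), not a benchmark we have built. A hand-constructed model reads inputs 0 and 2 and ignores input 1, but input 1 is strongly correlated with input 0 in the data. A purely correlational score flags the unused input, while an ablation-based score recovers the known mechanism.

```python
import random

def model(x):
    # Hand-constructed "network": reads inputs 0 and 2, never reads input 1.
    return 2.0 * x[0] - 1.0 * x[2]

TRUE_CAUSAL_INPUTS = {0, 2}  # ground truth, known by construction

def make_data(n=500, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x0 = rng.gauss(0, 1)
        x1 = x0 + rng.gauss(0, 0.1)  # correlated with x0, but unused by the model
        x2 = rng.gauss(0, 1)
        data.append([x0, x1, x2])
    return data

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def correlational_scores(data):
    # "Explanation" by association: |correlation| of each input with the output.
    outputs = [model(x) for x in data]
    return [abs(pearson([x[i] for x in data], outputs)) for i in range(3)]

def ablation_scores(data):
    # "Explanation" by intervention: mean |change in output| when input i is zeroed.
    scores = []
    for i in range(3):
        total = 0.0
        for x in data:
            ablated = list(x)
            ablated[i] = 0.0
            total += abs(model(x) - model(ablated))
        scores.append(total / len(data))
    return scores

def recovered(scores, threshold):
    # Inputs a method claims are part of the mechanism.
    return {i for i, s in enumerate(scores) if s > threshold}

data = make_data()
corr = correlational_scores(data)
abl = ablation_scores(data)

print("correlational:", recovered(corr, 0.3))
print("ablation:     ", recovered(abl, 0.1), "truth:", TRUE_CAUSAL_INPUTS)
```

The thresholds are arbitrary choices for this toy. A real benchmark would need a principled scoring rule, and zero-ablation itself can push a real network off-manifold, which is one of the failure modes discussed earlier.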

We’re especially wary of “grading on a curve,” where methods are only compared to other (mechanistic) interpretability baselines. If all of the baselines are weak, progress can look impressive while remaining practically irrelevant. Strong benchmarks should force head-to-head comparisons with simple, competitive baselines [9] (e.g. linear probes, prompting tricks, finetunes, reinforcement learning, or data/architecture changes) and should quantify where MI actually buys us leverage. Interpretability work is successful here only if it provides the tools that work the best, with no exceptions.

For this prize, we’re excited about work that:

Directly addresses “uselessness” (do these tools help anyone do anything?)
- Evaluates methods on their ability to improve practical tasks (e.g. finding backdoors, reducing harmful behavior, catching specification gaming, improving robustness, aiding audits). This is likely to come from a combination of two different kinds of tasks. The first involves studying model organisms (think “the model has secret knowledge X implanted, can you recover it?”) [8], [25]. The second involves real-world tasks people are already tackling with black-box methods, but where we can realistically see gains from white-box ones.
- Builds careful distributional splits across multiple domains, different programming languages, different human languages, or other axes along which mechanisms should be able to generalize in the real world.
- Includes “regression” tests. Real-world applications can’t afford changes that cause their product to fail where it was previously succeeding. If we don’t test that interpretable interventions preserve behaviors that they shouldn’t touch, then our methods won’t be used in production.
Moves the field towards more generalizable, complete, and scalable approaches
- Designs benchmarks where there is a known or controllable ground-truth mechanism, and evaluates whether interpretability methods recover it.
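To make the “regression” test idea concrete, here is a small Python sketch under heavy assumptions. The `BASELINE` and `EDITED` dictionaries are made-up stand-ins for a code model's outputs before and after some interpretability-based intervention, and `make_check`, `find_regressions`, and `uses_eval` are hypothetical helper names. The point is only the shape of the check: an intervention is scored both on the behavior it targets and on the behaviors it should have left alone.

```python
def make_check(args, expected):
    """Build a check that runs generated source and tests f(*args) == expected."""
    def check(source):
        namespace = {}
        try:
            # Toy setting only: real model output should be run in a sandbox.
            exec(source, namespace)
            return namespace["f"](*args) == expected
        except Exception:
            return False
    return check

# Hypothetical generations before the intervention.
BASELINE = {
    "add": "def f(a, b): return a + b",
    "mul": "def f(a, b): return a * b",
    "parse": "def f(s): return eval(s)",   # works, but uses eval
}

# Hypothetical generations after an intervention meant to remove eval usage.
EDITED = {
    "add": "def f(a, b): return a + b",
    "mul": "def f(a, b): return a + b",    # collateral damage
    "parse": "def f(s): return int(s)",    # intended fix
}

TESTS = [
    ("add", make_check((2, 3), 5)),
    ("mul", make_check((2, 3), 6)),
    ("parse", make_check(("7",), 7)),
]

def find_regressions(tests, before, after):
    """Names of tests that passed before the intervention and fail after it."""
    return [name for name, check in tests
            if check(before[name]) and not check(after[name])]

def uses_eval(outputs):
    """Target metric for this toy: which generations still call eval."""
    return [name for name, source in outputs.items() if "eval(" in source]

regressions = find_regressions(TESTS, BASELINE, EDITED)
print("target behavior before/after:", uses_eval(BASELINE), uses_eval(EDITED))
print("regressions:", regressions)
```

In this toy the intervention succeeds on its target (no generation uses `eval` afterwards) but breaks `mul`, which is exactly the kind of result that would keep a method out of production.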

Mathematical Foundations for Interpretability

A lot of the most productive work in interpretability has followed from relatively simple but powerful mathematical pictures. Causality gave us a language for interventions and counterfactuals [26], [27]. Entanglement and superposition gave us a way to think about entangled features and why neurons end up polysemantic [21], [28], [29], [30]. In each case, a clean mathematical abstraction unlocked many concrete experiments and methods.

We think the field is still severely underinvested in this kind of foundational work. We don’t have crisp answers to basic questions like: What does it mean, formally, for an explanation to be “correct” rather than correlational? What is the right unit of analysis in a large model (neurons, features, circuits, modules, something else)? When are two explanations “the same” at an appropriate level of abstraction? In practice, many current methods implicitly assume answers to these questions, but our tools and evaluation standards aren’t grounded in a shared theory.

There is also a real risk that we have fixated on the wrong mathematical toy problems. For example, much recent work is framed around superposition and feature disentanglement, but it is not clear that “undoing superposition” is the core obstacle to understanding modern models. Other candidate difficulties—such as the choice of basis, the geometry of representation manifolds, or the interaction between optimization and architecture—may matter more, and may require different tools.

Internally, we’re exploring several possible foundations: kernel methods and category-theoretic frameworks for what counts as a mechanism or explanation; differential-geometric views of representation manifolds and why local linear approximations break down; and random matrix and linear operator theories that might let us generalize mechanistic tools and automate them in high dimensions. These are examples, not prescriptions.

For the prize, we’re excited about any work that:

Proposes clear mathematical definitions of faithfulness, feature quality, or abstraction and connects them to concrete methods or benchmarks.
Identifies the “right” primitives (features, modules, causal variables, etc.) for understanding real models, and shows how to work with them.
Uses tools from other areas of math or ML (statistics, physics, information theory, dynamical systems, algebraic/topological tools, etc.) to explain why interpretability is hard—and how to make it easier.

If successful, this line of work should move us closer to methods that are genuinely mechanistic (not just correlational), that have a path to completeness (we know what’s missing), and that scale because they are grounded in the underlying structure of learning systems rather than ad hoc tricks.


Generalization Across Models

Many of today’s most compelling interpretability results are one-offs: a single circuit in a single layer of a single model, or a feature dictionary trained on a particular checkpoint. That is scientifically interesting, but it’s not yet a general theory of how models work. We’d like to see interpretability methods, abstractions, and findings that survive changes in architecture, scale, training data, and domain.

Generalization can mean several things. It might mean that a discovered circuit or feature family recurs in larger or differently-trained models. It might mean that an interpretability method finds analogous structure in both vision and language models. It might mean that we can automatically locate “the same” mechanism in a new model, given a description or example from an old one. It might mean that we can learn and reuse higher-level abstractions rather than re-discovering everything from scratch.

We’re especially interested in work that moves beyond small, clean, toy tasks to more realistic settings without losing mechanistic clarity. That might involve new units of analysis (modules, subsystems, learned tools), new training schemes that encourage shared abstractions across models, or new methods for matching and comparing representations.
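One existing tool for comparing representations across models is linear centered kernel alignment (CKA). We use it here purely as an illustration of what “matching and comparing representations” can look like in code, not as a method we are proposing. The sketch below is pure Python on made-up data: the activation matrices are random numbers, and “model B” is simply model A's activations rotated and with neurons reordered, so the two hold the same information in a different basis.

```python
import math
import random

def center(M):
    """Subtract the per-column mean from a list-of-rows matrix."""
    n, d = len(M), len(M[0])
    means = [sum(row[j] for row in M) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in M]

def sq_frob_cross(A, B):
    """Squared Frobenius norm of A^T B for list-of-rows matrices A and B."""
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(A, B))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = center(X), center(Y)
    denom = (sq_frob_cross(Xc, Xc) * sq_frob_cross(Yc, Yc)) ** 0.5
    return sq_frob_cross(Xc, Yc) / denom

rng = random.Random(0)

# Hypothetical activations: 200 inputs, 3 neurons each.
acts_a = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]

# "Model B": neurons 0 and 1 rotated by theta, then columns reordered.
theta = 0.7
c, s = math.cos(theta), math.sin(theta)
acts_b = [[r[2], c * r[0] - s * r[1], s * r[0] + c * r[1]] for r in acts_a]

# "Model C": unrelated random activations on the same 200 inputs.
acts_c = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]

same = linear_cka(acts_a, acts_b)
unrelated = linear_cka(acts_a, acts_c)
print("A vs rotated A:", same)
print("A vs unrelated:", unrelated)
```

Because linear CKA is invariant to orthogonal transformations, the rotated copy scores essentially 1 even though no individual neuron matches. A high similarity score is still only evidence about shared representational geometry. It does not by itself show that two models implement the same mechanism, which is why we list that question separately below.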

For this prize, promising directions include:

Empirical work showing that particular circuits, features, or abstractions recur across models, scales, or domains—and identifying what drives that recurrence.
Methods for automatically transferring interpretability results from one model to another (e.g. “find the IOI-like circuit in this larger model”).
Frameworks for defining when two mechanisms are “the same” up to abstraction, and algorithms for detecting such equivalences.
Studies that explicitly measure how well interpretability findings on toy models predict behavior in larger, more realistic models.
Tools that increasingly automate interpretability, minimizing reliance on human effort.

Generalization is where “mechanistic,” “complete,” and “scalable” meet: if we can discover explanations that transfer across models and settings, we’re starting to learn something about the underlying space of algorithms these systems converge to.

Interpretability as a Policy/Institution Lever

We don’t think the main impact of interpretability will be a handful of researchers staring at neuron visualizations. The bigger opportunity is institutional: giving regulators, boards, red teams, and operational safety teams tools that change what kinds of AI governance are feasible.

Interpretability tools could, for example, make it possible to build interpretable analog models that track a powerful system’s behavior on safety-relevant dimensions; to transfer safety properties between models via activation transfer and circuit-level alignment; or to design arms-control-style measures where parties can credibly demonstrate that models lack certain dangerous mechanisms or capabilities without revealing proprietary details. Interpretability could also expand the Pareto frontier between safety and innovation by making it cheaper to monitor and mitigate risky behaviors, rather than relying only on blunt constraints.

We take seriously the view (e.g. [24]) that many safety problems are fundamentally about institutions, incentives, and governance, not just algorithms. From that perspective, MI is useful to the extent that it enables new forms of oversight, auditing, liability assignment, and norm-setting: it lets institutions ask better questions of models and developers, and verify answers with more than “trust us.”


For the prize, we’re interested in work that:

Shows how interpretability signals can support concrete policy tools: e.g. risk audits, model cards, licensing regimes, or capability evaluations.
Develops techniques for activation transfer, model-to-model supervision, or analog models that can be realistically used in deployment or evaluation pipelines.
Proposes protocols where interpretability plays a role analogous to verification in arms-control: limited, but enough to support cooperative equilibria.
Explores how interpretability outputs can be made legible to non-experts (regulators, executives, internal risk committees) without losing faithfulness.

This agenda squarely targets the “usefulness” question: if interpretability is going to matter in the real world, it needs to show up in the levers that institutions actually pull.


Why Codegen

Code has several properties that make it unusually friendly to mechanistic work. It has formal semantics and an execution trace; we can run programs, inspect intermediate state, and often characterize “correctness” crisply. Many algorithmic tasks in code are well-understood mathematically. This makes it easier to pose precise questions like “what algorithm is the model implementing here?” and to test whether a proposed mechanism really explains behavior, not just correlates with it.
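To give a flavor of what “inspect intermediate state” buys us, here is a minimal sketch using Python's standard `sys.settrace` hook. The helper `trace_locals` and the example function `running_max` are hypothetical names introduced for this illustration. The recorded trace gives exact, step-by-step values of a program variable, which is the kind of ground-truth label one could, in principle, try to decode from a model's activations while it reads or writes the same program.

```python
import sys

def trace_locals(func, *args):
    """Run func(*args); record (line offset, snapshot of locals) per executed line."""
    events = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # do not trace other functions
        if event == "line":
            offset = frame.f_lineno - target.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return result, events

def running_max(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best

result, events = trace_locals(running_max, [3, 1, 4, 1, 5])

# Values of the intermediate variable `best` as execution proceeds.
best_values = [snap["best"] for _, snap in events if "best" in snap]

# Collapse consecutive repeats to get the sequence of distinct states.
best_changes = [v for i, v in enumerate(best_values)
                if i == 0 or v != best_values[i - 1]]

print("result:", result)
print("states of `best`:", best_changes)
```

Each snapshot is taken just before its line runs, so a variable shows up in the trace one step after it is assigned.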

At the same time, modern code models are at the center of how LLMs are used in practice: as coding assistants, tool-using agents, autonomous refactoring systems, and more. They plan, call tools, read and write large codebases, and increasingly act as components in larger agent systems. That makes them a natural testbed for interpretability methods that aim to handle agentic behavior, tool use, and multi-step reasoning in realistic environments—not just isolated feed-forward tasks.

We’re also excited about a deeper connection: program synthesis and mechanistic interpretability are, in a sense, inverse problems. Program synthesis tries to go from desired behavior to code; mechanistic interpretability tries to go from code-like behavior to a description of the internal “program.” Insights in one direction may feed the other: understanding how models represent and manipulate programs may help us understand their own internal “programs,” and vice versa.

For this prize, we’re particularly interested in:

Mechanistic studies of code models on non-toy tasks: refactoring, multi-file edits, tool-augmented debugging, repository-level changes, etc.
Methods that use the structure of code (types, control flow, data flow, test suites, execution traces) as scaffolding for mechanistic explanations.
Work that links internal circuits or features to high-level software engineering concepts: APIs, invariants, design patterns, security properties, or bug classes.
Interpretability techniques that demonstrably improve code-generation systems in practice: making them safer, more robust, easier to supervise, or more predictable in deployment.

If interpretability can show clear, actionable wins in code generation—currently the largest industrial use case for LLMs—that will be a strong signal that the field is on a useful track, and a powerful engine for further investment and progress.

Thanks to Ana Marasović, Hadas Orgad, Jeff Phillips, John Hewitt, Leo Gao, Mor Geva, Neel Nanda, Stephen Casper, and Yonatan Belinkov for reviewing this RFP and providing feedback.