A new approach to interpretability: round-trip neural network compilation-decompilation

Emma Leonhart

After learning about hyperdimensional computing, I ended up making a programming language that is quite different from any programming language I know of.

Sutra is a typed, GPU-native programming language I have been building. Its values are vectors and its programs compile to tensor-op graphs, the same kind of fused tensor computation that a small neural network runs. The paper is at arXiv:2605.20919 and the compiler is on GitHub.

This post is about one specific property of that setup, which I will call the round-trip, and a question I genuinely do not know the answer to: whether the property is a useful kind of interpretability, or whether it falls to the standard objection.

The idea is that a neural network compiled from a Sutra program can be trained and then decompiled into a different symbolic program. Right now this works by updating a fixed set of parameters under constrained training, but my vision is to train an AI model to decompile compatible neural networks more generally.

What the round-trip is

The forward direction is just the compiler: a Sutra program compiles deterministically to a tensor-op graph. Because the graph is tensors, you can train it. The round-trip is the reverse direction. You take the trained parameters and write them back into Sutra source, and that source recompiles to a graph that reproduces the trained network's behavior to floating-point precision.

The symbolic source is therefore not a description sitting next to the network. It is a program that provably compiles to the exact computation the network performs.

This is demonstrated so far as a proof of concept: specific trained parameters writing back to source, not yet a general procedure across arbitrary program structures. I say more about the limits below, because they matter for how much weight the rest of this can carry.
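
To make the shape of that loop concrete, here is a toy sketch in Python. It is not Sutra and does not use the Sutra compiler: the one-line source format, `parse`, `compile_source`, `train`, and `write_back` are all hypothetical stand-ins invented for illustration, and the "graph" is just a Python closure. It only shows the structure of the claim: source compiles to a computation, the parameters are trained, the trained parameters are written back into source, and the recompiled source agrees with the trained computation.

```python
import re

# Hypothetical mini "source language": a single line "y = W * x + B".
# This is NOT Sutra syntax; it only stands in for a symbolic program.
SOURCE = "y = 0.5 * x + 0.0"

def parse(source):
    """Extract the two numeric parameters from the toy source line."""
    m = re.fullmatch(r"y = (\S+) \* x \+ (\S+)", source.strip())
    return float(m.group(1)), float(m.group(2))

def compile_source(source):
    """'Compile' the toy source to a callable (stand-in for a tensor-op graph)."""
    w, b = parse(source)
    return lambda x: w * x + b

def train(w, b, data, lr=0.05, steps=2000):
    """Plain gradient descent on mean squared error over (x, y) pairs."""
    n = len(data)
    for _ in range(steps):
        gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
        gb = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

def write_back(w, b):
    """Write trained parameters back into source text.

    repr() of a Python float parses back to the same float, so no
    precision is lost in the textual form.
    """
    return f"y = {w!r} * x + {b!r}"

# Target behavior the toy network is trained toward: y = 3x - 1.
data = [(x, 3.0 * x - 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

w, b = train(*parse(SOURCE), data)          # forward: compile-time params, then train
trained_source = write_back(w, b)           # reverse: trained params -> source
recompiled = compile_source(trained_source)  # recompile the written-back source

def trained(x):
    """The trained computation itself, for comparison."""
    return w * x + b

# Round-trip check: recompiled source reproduces the trained behavior.
matches = all(recompiled(x) == trained(x) for x, _ in data)
print(trained_source)
print("round-trip matches:", matches)
```

In this toy case the match is exact because the written-back text preserves every bit of the parameters. Whether a real tensor-op graph matches bit-for-bit or only within a tolerance depends on the backend, which this sketch does not model.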

Why I think the isomorphism matters

I want to be careful here because this is where I'm reasoning beyond what's demonstrated.

The standard objection to neuro-symbolic approaches on LessWrong is Wentworth's "Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc", the argument that labeling nodes in a symbolic system doesn't actually give you interpretability, because you can't verify the labels still mean what they say after training.

I think Sutra's claim is structurally different from labeling. The round-trip isn't about whether variable names are semantically accurate. It's about whether there's a verified behavioral isomorphism between the symbolic source and the compiled network, and that isomorphism is checkable without any reference to what the variable names mean. You verify it by checking that the compiled graph reproduces the trained network's behavior to floating-point precision. That's a mathematical property, not a semantic one.
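
A minimal sketch of what such a check could look like, in Python. Everything here is a hypothetical stand-in: plain Python functions take the place of a trained network and a recompiled graph, and the tolerances are arbitrary. It also assumes the check runs over a finite sample of inputs, which gives evidence of agreement on those inputs and is not by itself a proof over all inputs. The point it illustrates is that the check passes or fails the same way whether the recompiled program uses meaningful or opaque names.

```python
import math
import random

def behaviorally_equal(f, g, inputs, rel_tol=1e-9, abs_tol=1e-12):
    """True if f and g agree on every sampled input, within tolerance."""
    return all(
        math.isclose(f(x), g(x), rel_tol=rel_tol, abs_tol=abs_tol)
        for x in inputs
    )

def network(x):
    """Stand-in for the trained network's computation."""
    weight, bias = 3.0, -1.0
    return weight * x + bias

def recompiled_meaningful_names(x):
    """Stand-in for recompiled source with descriptive names."""
    slope, intercept = 3.0, -1.0
    return slope * x + intercept

def recompiled_opaque_names(x):
    """Same computation with names that carry no meaning."""
    a, q = 3.0, -1.0
    return a * x + q

def different_program(x):
    """A program that is close to, but not the same as, the network."""
    return 3.0 * x - 1.0001

rng = random.Random(0)  # fixed seed so the sample is reproducible
inputs = [rng.uniform(-10.0, 10.0) for _ in range(1000)]

ok_named = behaviorally_equal(network, recompiled_meaningful_names, inputs)
ok_opaque = behaviorally_equal(network, recompiled_opaque_names, inputs)
ok_wrong = behaviorally_equal(network, different_program, inputs)
print(ok_named, ok_opaque, ok_wrong)
```

The check never inspects a name, only outputs, so renaming every variable leaves the result unchanged, while a small change to the computation makes it fail.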

This also speaks to a recurring question here, most directly Edy Nastase's thread asking why neuro-symbolic systems get so little attention in alignment. The strongest answer in that thread, from Tailcalled, is that no neuro-symbolic architecture has demonstrated a meaningfully better safety property than deep learning, and Thane Ruthenis adds that part of why the research is missing is that the whole direction looks too intimidating to pursue. I am not claiming the round-trip clears that bar. I am trying to state one concrete, checkable property and put it in front of people who can tell me whether it is the kind of thing that would count, or whether it is another property that sounds useful and isn't.

What I think this enables, if the round-trip can be made to work reliably at scale: a symbolic articulation of what process a neural network is executing, not what its representations mean. That's different from interpretability in the Wentworth sense. It's closer to being able to formally reason about the computation.

I'm not claiming this solves alignment or that it's sufficient for safety. I'm claiming it's a different kind of property than what's usually discussed, and I'd like to understand whether people think it's a useful kind of property.

Where I actually am

The round-trip is demonstrated as a proof of concept for specific trained parameters writing back to source. I'm currently working on making the training-back-to-code path work more generally across more program structures. The formal verification work, meaning using the symbolic-neural correspondence as a basis for verifying properties of the training process, is a direction I want to pursue but haven't started yet, partly because I'd want a collaborator with more formal verification background than I have.

The longer arc: once round-tripping works reliably, you have a corpus of (original source, trained source, compiled graph) triples. That's training data for a learned decompiler: a model that takes a trained tensor-op graph and produces Sutra source whose compiled graph matches it. At that point the loop closes in a way that I think has interesting properties for self-improvement with maintained legibility.
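
As a rough picture of what one record in such a corpus might look like, here is a small Python sketch. The class name, field names, the flat tuple standing in for a compiled graph, and the example values are all made up for illustration. None of this reflects how Sutra actually stores graphs or how a decompiler would be trained.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoundTripTriple:
    """One hypothetical corpus record: (original source, trained source, graph)."""
    original_source: str
    trained_source: str
    compiled_graph: tuple  # stand-in: flattened trained parameters, not a real graph

def to_decompiler_example(triple):
    """Turn a triple into an (input, target) pair for a learned decompiler.

    Input is the trained graph; target is source that compiles back to it.
    """
    return (triple.compiled_graph, triple.trained_source)

# Illustrative values only; not taken from any real Sutra run.
corpus = [
    RoundTripTriple(
        original_source="y = 0.5 * x + 0.0",
        trained_source="y = 3.0 * x + -1.0",
        compiled_graph=(3.0, -1.0),
    ),
]

examples = [to_decompiler_example(t) for t in corpus]
print(examples)
```

Keeping the original source in each record is not needed for the (graph, source) pairs themselves, but it preserves what training changed, which seems relevant to the legibility question.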

What I'm looking for

Primarily: people who think the isomorphism claim is wrong or uninteresting, and can tell me specifically why. Also anyone with a formal verification background who finds the neural process verification angle interesting.

GitHub: https://github.com/EmmaLeonhart/Sutra

arXiv: https://arxiv.org/abs/2605.20919