If the trajectory of the deep learning paradigm continues, it seems plausible to me that in order for applications of low-level interpretability to AI not-kill-everyone-ism to be truly reliable, we will need a much better-developed and more general theoretical and mathematical framework for deep learning than currently exists. And this sort of work seems difficult. Doing mathematics carefully - in particular finding correct, rigorous statements and then finding correct proofs of those statements - is *slow*. So slow that the rate of change of cutting-edge engineering practices significantly worsens the difficulties involved in building theory at the right level of generality. And, in my opinion, much slower than the rate at which we can generate informal observations that might possibly be worthy of further mathematical investigation. Thus it can feel like the role that serious mathematics has to play in interpretability is primarily *reactive*, i.e. consists mostly of activities like 'adding' rigour after the fact or building narrow models to explain specific already-observed phenomena.

My impression, however, is that the best applied mathematics doesn’t tend to work like this. My impression is that although the use of mathematics in a given field may initially be reactive and disunited, one of the most lauded aspects of mathematics is a certain inevitability with which our abstractions take on a life of their own and reward us later with insight, generalization, and the provision of predictions. Moreover - remarkably - often those abstractions are found in relatively mysterious, intuitive ways: i.e. not as the result of us just directly asking "*What kind of thing seems most useful for understanding this object and making predictions?*" but, at least in part, as a result of aesthetic judgement and a sense of mathematical taste. One consequence of this (which is a downside and also probably partly due to the inherent limitations of human mathematics) is that mathematics *does not* tend to act as an objective tool that you can bring to bear on whatever question it is that you want to think about. Instead, the very *practice* of doing mathematics seeks out the questions that mathematics is best placed to answer. It cannot be used to say something useful about just anything; rather it finds out what it is that it can say something about.

Even after taking into account these limitations and reservations, developing something that I'm clumsily thinking of as 'the mathematics of (the interpretability of) deep learning-based AI' might still be a fruitful endeavour. In case it is not clear, this is, roughly speaking, because a) many people are putting a lot of hope and resources into low-level interpretability; b) its biggest hurdles will be making it 'work' at large scale, on large models, quickly and reliably; and c) (the sentiment I opened this article with) doing this latter thing might well require much more sophisticated general theory.

In thinking about some of these themes, I started to mull over a couple of illustrative analogies or examples. The first - and more substantive example - is algebraic topology. This area of mathematics concerns itself with certain ways of assigning mathematical (specifically algebraic) information to shapes and spaces. Many of its foundational ideas have beautiful informal intuitions behind them, such as the notion that a shape may have enough space in it to contain a sphere, but not enough space to contain the ball that that sphere might have demarcated. Developing these informal notions into rigorous mathematics was a long and difficult process and learning this material - even now when it is presented in its best form - is laborious. The mathematical details themselves do not seem beautiful or geometric or intuitive; and it is a slow and alienating process. One has to begin by working with very low-level concrete details - such as how to define the boundary of a triangle in a way that respects the ordering of the vertices - details that can sometimes seem far removed from the type of higher-level concepts that one was originally trying to capture and say something about. But once one has done the hard work of carefully building up the rigorous theory and internalizing its details, the pay-off can be extreme. Your vista opens back up and you are rewarded with very flexible and powerful ways of thinking (in this case about potentially very complicated higher-dimensional shapes and spaces). Readers may recognize this general story as a case of Terry Tao's now very well-known "three stages" of mathematical education, just applied specifically to algebraic topology.
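
To make the 'sphere without its ball' intuition and the vertex-ordering bookkeeping a little more concrete, here is a minimal sketch of the standard simplicial homology computation over the rationals. It is an illustration only: the function names `boundary_matrix`, `rank` and `betti_numbers` are my own hypothetical choices, and the two shapes (a hollow and a solid tetrahedron, standing in for a sphere and a ball) are assumed examples. The sign `(-1)**i` attached to dropping the i-th vertex is exactly the kind of low-level detail referred to above.

```python
from fractions import Fraction
from itertools import combinations


def boundary_matrix(k_simplices, faces):
    """Matrix of the boundary map from k-simplices to (k-1)-simplices.

    Each simplex is a sorted tuple of vertices. Dropping the i-th vertex
    contributes that face with sign (-1)**i, which is where the ordering
    of the vertices enters the definition.
    """
    index = {face: row for row, face in enumerate(faces)}
    matrix = [[Fraction(0)] * len(k_simplices) for _ in faces]
    for col, simplex in enumerate(k_simplices):
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]
            matrix[index[face]][col] = Fraction((-1) ** i)
    return matrix


def rank(matrix):
    """Rank over the rationals, by Gaussian elimination."""
    rows = [row[:] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    found = 0
    for col in range(n_cols):
        pivot = next(
            (r for r in range(found, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(len(rows)):
            if r != found and rows[r][col] != 0:
                factor = rows[r][col] / rows[found][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found


def betti_numbers(simplices_by_dim):
    """Betti numbers b_k = (#k-simplices) - rank(d_k) - rank(d_{k+1})."""
    top = len(simplices_by_dim) - 1
    ranks = [0] * (top + 2)  # ranks[k] is the rank of d_k; d_0 and d_{top+1} are zero
    for k in range(1, top + 1):
        ranks[k] = rank(boundary_matrix(simplices_by_dim[k], simplices_by_dim[k - 1]))
    return [len(simplices_by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(top + 1)]


vertices = [(v,) for v in range(4)]
edges = list(combinations(range(4), 2))
triangles = list(combinations(range(4), 3))
solid = [tuple(range(4))]

hollow_tetrahedron = [vertices, edges, triangles]        # a triangulated sphere
solid_tetrahedron = [vertices, edges, triangles, solid]  # a triangulated ball

print(betti_numbers(hollow_tetrahedron))  # [1, 0, 1]
print(betti_numbers(solid_tetrahedron))   # [1, 0, 0, 0]
```

The final `1` for the hollow tetrahedron records a sphere that is not the boundary of anything in the shape; adding the solid 3-simplex fills it in and that number drops to `0`.
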
I additionally want to point out that within pure mathematics, algebraic topology often has an applicable and computational flavour too, in the sense that there is something like a toolkit of methods from algebraic topology that one can bring to bear on previously unseen spaces and shapes in order to get information about them. So, to try to summarize and map this story onto the interpretability of deep learning-based AI, some of my impressions are that:

- We have begun to build rigorous theory, but it is a young area and the theory is *far, far* from settled. In particular there may be lots of different suggestions or guesses as to what sorts of things are amenable to rigorous mathematical development and at what level of generality we should be working. Many of these suggestions or guesses will be incorrect, i.e. will lead to mathematically intractable questions or dead ends.
- As the example of algebraic topology is supposed to show, it is reasonable that the early stages of rigorous development don't naively 'look like' the kinds of things we ultimately want to be talking about. This is very relevant to bear in mind when considering things like the mechanistic interpretability of toy models.
- *Someone* has to actually do the work: It would be a failure of this community if we lament the lack of a better-developed mathematical theory that we might think is actually crucial and then fail to properly do anything about it. Importantly, it may be a gap that the AI not-kill-everyone-ism community cannot expect will be filled by industry or academia. The ideal version of what we are talking about has the grave safety concerns of powerful AI systems at its heart and this isn't a perspective that is necessarily shared by just anyone who is interested in the 'science of deep learning'.
- The *right people* have to do the work. It's plausible that the story with deep learning-based AI will be different, but my 'default' picture is that developing the sorts of things I have in mind is likely to require the right kinds of people making fairly definite bets up front and not looking back for a while. It is the kind of thing that is
  - Usually achieved by the efforts of many 'serious' and experienced researchers over many, many years;
  - Not so wildly unique as a subject that we should heavily discount expertise in mainstream academic fields; and
  - Unlikely to emerge serendipitously from the work of a larger number of inexperienced researchers on short-term grants.

The second illustrative example that I have in mind is mathematical physics. This isn't a subject that I know a lot about and so it's perfectly possible that I end up misrepresenting things here, but essentially it is the prototypical example of the kind of thing I am getting at. In *very* simplified terms, successes of mathematical physics might be said to follow a pattern in which informal and empirically-grounded thinking eventually results in the construction of sophisticated theoretical and mathematical frameworks, which in turn leads to the phase in which the cycle completes and the mathematics of those frameworks provides real-world insights and predictions. Moreover, this latter stage often doesn't look like stating and proving theorems, but rather 'playing around' with the appropriate mathematical objects at just the right level of rigour, often using them over and over again in computations (in the pure math sense of the word) pertaining to specific examples. One can imagine wishing that something like this might play out for 'the mathematics of interpretability'.

Perhaps the most celebrated set of examples of this kind of thing are from the representation theory of Lie groups. Again, I know little about the physics so will avoid going into detail but the relevant point here is that the true descriptive, explanatory and predictive relevance of something like the representation theory of Lie groups was not unlocked by physicists alone. The theory only became quite so highly-developed because a large community of 'pure' mathematicians pursuing all sorts of related questions to do with smooth manifolds, groups, representation theory in general etc. helped to mature the area.

One (perhaps relatively unimportant) difference between this story and the one we want to tell for AI not-kill-everyone-ism is that the typical mathematician studying, say, representation theory in this story might well have been doing so for mostly 'pure' mathematical reasons (and not because they thought their work might one day be part of a framework that predicts the behaviour of fundamental forces or something), whereas we are suggesting developing mathematical theory while remaining guided by the eventual application to AI. A more important difference - and a more genuine criticism of this analogy - is that mathematical physics is of course applied to the *real, natural* world. And perhaps there really is something about *nature* that makes it fundamentally amenable to mathematical description in a way that just won't apply to a large neural network trained by some sort of gradient descent? Indeed one does have the feeling that the endeavour we are focussing on would have to be underpinned by a hope that there is something sufficiently '*natural*' about deep learning systems that will ultimately make at least some higher-level aspects of them amenable to mathematical analysis. Right now I cannot say how big of a problem this is.

I will try to sum up:

- One might reasonably believe that in order to go from the ad hoc low-level interpretability of small models to reliable interpretability techniques for large and powerful models, we need a much more highly developed mathematical theory of interpretability.
- Developing such theory will probably need to be done fairly consciously, and initiated by people who are both likely to have some success and who are 'sold' on AI not-kill-everyone-ism. It will not be an individual 'project' but the time-consuming nurturing of a currently under-populated sub-field in order to engender a shift in thinking.
- On an object level, the work will not typically look like toy/smaller/theoretical versions of 'the hard part of the problem'. While it will be crucial to keep one eye firmly on the overall trajectory, I believe that developing useful mathematical theory *must* also involve a significant amount of following one's nose *mathematically*: asking the questions, building the abstractions, and pursuing the directions that feel most mathematically natural, without necessarily knowing where they will lead.
- I have repeatedly used 'not-kill-everyone-ism' rather than 'safety' or 'alignment'. This was partly to try to emphasize that taking any of what I'm saying as a 'strategy' is a long timelines game. It might be viewed as playing to a certain 'out' (see here for the terminology; thanks to Rubi Hudson for introducing me to this idea) that only really has a chance of occurring when timelines are long.

I'm very interested in comments and thoughts.

There is no difference between natural phenomena and DNNs (LLMs, whatever). DNNs are 100% natural; you don't seriously believe there is something supernatural in their workings, do you? Hence, the criticism is invalid and the problem is non-existent.

See "AI as physical systems" for more on this. And in the same vein: "DNNs, GPUs, and their technoevolutionary lineages are agents".

I think that a lot of AI safety and AGI capability researchers are confused about this. They see information and computing as mathematical rather than physical. The physicalism of information and computation is a very important ontological commitment one has to make to deconfuse oneself about AI safety. If you wish to "take this pill", see Fields et al. (2022a), section 2.

I think the above confusion of the study of AI as *mathematics* (rather than *physics* and *cognitive science*, i.e. natural sciences) leads you and some other people to believe that new *mathematics* has to be developed to understand AI (I may misinterpret you, but it definitely seems from the post that you think this is true, e.g. from your example of algebraic topology). It *might* be that we will need new mathematics, but it's far from certain. As an example, take "A mathematical framework for transformer circuits": it doesn't develop new mathematics. It just uses existing mathematics: tensor algebra.

I think the research agenda of "weaving together theories of cognition and cognitive development, ML, deep learning, and interpretability through the abstraction-grounding stack" could plausibly lead to the sufficiently robust and versatile understanding of DNNs that we want^{[1]}, without the need to develop much or any new mathematics along the way.

Here's what the "abstraction-grounding stack" looks like:

Many of the connections between theories of cognition and cognitive development at different levels of specificity are not established yet, and therefore present a lot of opportunities to verify the specific mechanistic interpretability theories:

- their own theory of DNNs and Active Inference. Apart from Active Inference, some other theories of intelligence and cognition (Boyd et al. 2022; Ma et al. 2022) are “ML-first” and thus cover both “general cognition” and “ML” levels of description at once.
- *feature* and *circuit* development, evolution, and selection inside DNNs or, specifically, transformers.
- the connections between ML, fractional dynamics, and renormalisation group, identified by Niu et al. (2021), could help to better understand, verify, and contextualise some mechanistic interpretability theories as well.

^{[1]} I'd refrain from saying "general theory" because general theories of DNNs *already exist*, and in large numbers. I'm not sure these theories are *insufficient for our purposes* and that *new* general theories should be developed. However, what is definitely lacking are the connections between the theories throughout the "abstraction-grounding stack". This is explained in more detail in the description of the agenda. See also the quote from here: "We (general intelligences) use science (or, generally speaking, construct any models of any phenomena) for the pragmatic purpose of being able to understand, predict and control it. Thus, none of the disciplinary perspectives on any phenomena should be regarded as the “primary” or “most correct” one for some metaphysical or ontological reasons. Also, this implies that if we can reach a sufficient level of understanding of some phenomenon (such as AGI) by opportunistically applying several existing theories then we don’t need to devise a new theory dedicated to this phenomenon specifically: we already solved our task without doing this step."

> There is no difference between natural phenomena and DNNs (LLMs, whatever). DNNs are 100% natural

I mean "natural" as opposed to "man made". i.e. something like "occurs in nature without being built by something or someone else". So in that sense, DNNs are obviously not natural in the way that the laws of physics are.

I don't see information and computation as only mathematical; in fact in my analogies I describe the mathematical abstractions we build as being separate from the things that one wants to describe or make predictions about. And this applies to the computations in NNs too.

I don't want to study AI *as* mathematics or believe that AI *is* mathematics. I write that the practice of doing mathematics will only seek out the parts of the problem that are actually amenable to it; and my focus is on interpretability and not other places in AI that one might use mathematics (like, say, decision theory).

You write:

> As an example, take "A mathematical framework for transformer circuits": it doesn't develop new mathematics. It just uses existing mathematics: tensor algebra.

I don't think we are using 'new mathematics' in the same way and I don't think the way you are using it is commonplace. Yes, I am discussing the prospect of developing new mathematics, but this doesn't only mean something like 'making *new* definitions' or 'coming up with *new* objects that haven't been studied before'. If I write a proof of a theorem that "just" uses "existing" mathematical objects, say like... matrices, or finite sets, then that seems to have little bearing on how 'new' the mathematics is. It may well be a new proof, of a new theorem, containing new ideas etc. etc. And it may well need to have been developed carefully over a long period of time.

I feel that you are redefining terms. Writing down mathematical equations (or defining other mathematical structures that are not equations, e.g., automata), describing natural phenomena, and proving some properties of these, i.e., deriving some mathematical conjectures/theorems: that's exactly what physicists do, and they call it "doing physics" or "doing science" rather than "doing mathematics".

I wonder how you would draw the boundary between "man-made" and "non-man-made", the boundary that would have a bearing on such a fundamental qualitative distinction of phenomena as the amenability to mathematical description.

According to Fields et al.'s theory of semantics and observation ("quantum theory […] is increasingly viewed as a theory of the process of observation itself"), which is also consistent with predictive processing and Seth's controlled hallucination theory which is a descendant of predictive processing, any observer's phenomenology is what makes mathematical sense *by construction*. Also, here Wolfram calls approximately the same thing "coherence".

Of course, there are infinitely many phenomena both in "nature" and "among man-made things" the mathematical description of which would not fit our brains yet, but this also means that we cannot spot these phenomena. We can extend the capacity of our brains (e.g., through cyborgisation, or mind upload), as well as equip ourselves with more powerful theories that allow us to compress reality more efficiently and thus spot patterns that were not spottable before, but this automatically means that these patterns become mathematically describable.

This, of course, implies that we *ought to make our minds stronger* (through technical means or developing science) precisely so that we spot in time the phenomena that are about to "catch us". This is the central point of Deutsch's "The Beginning of Infinity".

Anyway, there is no point in arguing this point fiercely because I'm kind of on "your side" here, arguing that your worry that developing theories of DL might be doomed is unjustified. I'd just call these theories *scientific* rather than mathematical :)

I'm skeptical, but I'd love to be convinced. I'm not sure that it's *necessary* to make interpretability scale, but it definitely strikes me as a potential trump card that would allow interpretability research to keep pace with capabilities research.

Here are a couple of relatively unsorted thoughts (keep in mind that I'm not a mathematician!):

Thanks very much for the comments. I think you've asked a bunch of very good questions. I'll try to give some thoughts:

I think I kind of share your general concern here and I’m uncertain about it. I kind of agree that it seems like people had been trying for a while to figure out the right way to think about deep learning mathematically and that for a while it seemed like there wasn’t much progress. But I mean it when I say these things can be slow. And I think that the situation is developing and has changed - perhaps significantly - in the last ~5 years or so, with things like the neural tangent kernel, the Principles of Deep Learning Theory results and increasingly high-quality work on toy models. (And even when work looks promising, it may still take a while longer for the cycle to complete and for us to get ‘real world’ results back out of these mathematical points of view, but I have more hope than I did a few years ago.) My current opinion is that certain aspects of interpretability will be more amenable to mathematics than understanding DNN-based AI as a whole.

I think basically your worries are sound. If what one is doing is something like ‘technical work aimed at understanding how NNs work’ then I don’t see there as being much distinction between capabilities and alignment; you are really generating insights that can be applied in many ways, some good, some bad (and part of my point is you have to be allowed to follow your instincts as a scientist/mathematician in order to find the right questions). But I do think that given how slow and academic the kind of work I’m talking about is, it’s neglected by both a) short timelines-focussed alignment people and b) capabilities people.

Hmm, that's interesting. I'm not sure I can say how likely it is one would go in the correct direction. But in my experience the idea that 'possible future applications' is one of the motivations for mathematicians to do 'blue sky' research is not quite right. I think the key point is that the things mathematicians end up chasing for 'pure' math/aesthetic reasons seem to be oddly and remarkably relevant when we try to describe natural phenomena (iirc this is basically a key point in Wigner's famous 'Unreasonable Effectiveness' essay). So I think my answer to your question is that this seems to be something that happens "in general" or at least does happen in various different places across science/applied math.