Epistemic Status: Thoughts collected at the Sydney Mathematical Research Institute's Focus Period on the Mathematical Science of AI Safety, where I am a key participant. Not representative of anyone else's views.
So far, Agent Foundations (AF) has made steady but perhaps slow progress, which has not yet yielded practical AI safety ideas. However, engineering approaches (mechanistic interpretability, RLHF) have not proven robust, even for current models. One could imagine a world where frontier LLMs were impossible to jailbreak, and in that world it would be at least harder to argue that new alignment ideas are necessary. Therefore it is overdetermined that navigating the transition to artificial superintelligence (ASI) safely will require much deeper conceptual insight, which has not so far been provided by simply witnessing ongoing technological progress. I believe that AF research remains essential, even in this era of stunning empirical success, and I sympathize with the AF community zeitgeist that we just have not had long enough to think about the problem, and we will get there. At least with the first part - that we need longer to think - but we may not get there in time; so, a few words on time management are in order.
In particular, "solving decision theory" is not a sensible mainline goal. On the margin, general progress on decision theory improves our chances of navigating the transition to ASI successfully, by reducing confusion about agency. Working on decision theory also has multiple other benefits: it's inherently interesting, it may provide tools for human rationality and improve civilizational sanity, and it may lead to useful algorithms. These benefits provide a line of retreat for AF researchers (if it turns out ASI isn't coming anytime soon, we are just quirky decision theorists). But they may also serve to justify work that does not engage with AI safety.
While a complete theory of agency, sufficient to build ASI from rubber bands and popsicle sticks - which here means, in some dialect of Lisp and without GPUs - would be ideal, it does not seem feasible pre-(prosaic) takeoff. The stunning success of prosaic methods like transformers given vast amounts of data does not imply that a cleaner Lisp implementation of ASI is necessarily impossible. However, we have seen a fair amount of Bayesian evidence, over the last couple of decades, that it is at least much harder to do it that way - specifically, to build it in the way MIRI once intended, which I will call a recursively self-improving (RSI) seed. A prototypical RSI seed is written in something like a plaintext scripting language which works for well-understood reasons, and the main driver of its ascension is self-improvement, which is initially mediated by the natural ease of self-understanding (though the source code may rapidly move beyond human understanding during the RSI process).
It's worth dwelling on the precise distinction between these paths. AIs absorbing a vast amount of data is not in itself surprising - that was always "the plan" for takeoff. But to absorb that data with such apparent sample inefficiency seems a bit of a surprise. The standard explanation is that gradient descent over neural networks is not a very good learning algorithm. But it seems to pick up steam over the course of training. Foundation models fine-tune rather sample-efficiently, and even learn zero-shot in-context. At deployment, sample efficiency is roughly (and jaggedly) human-comparable as of late 2025. Contrast the hypothetical RSI seed, which would perhaps learn as quickly as a human (or much more quickly) by the time it got around to reading its first word of the internet, presumably at a rather late stage, since it ought to have been sensibly boxed for most of its early development. The difference between these stories is not only the pace but the fuel for takeoff: in the slow-burning prosaic liftoff, data both feeds and shapes inference.
One would have hoped that AI would learn heuristics, rather than being heuristically learned. There is a sense in which we understand the principles behind deep learning very poorly (gradient descent on a non-convex objective function, overparameterized, somehow converges to a reasonably good solution that generalizes). On the other hand, perhaps even well-understood learning algorithms tend to rapidly become opaque as they absorb data. I suspect that an RSI seed takeoff could have been much safer, but mainly because it would have taken decades or centuries longer to get off the ground, and required a great deal of detailed understanding of intelligence to accumulate over that time (in other words, it would have been safer primarily because it would have been more difficult to achieve).
The situation we face is grim, but not strictly more grim than a (strained) counterfactual RSI seed takeoff during the same decade; I think some previously expected conceptual problems have become less blocking. But this is not a trivial or obvious statement (as sometimes contended), since many of the predicted problems have not gone away, but only become less visible, as the type of thinking that once made them clear is no longer load-bearing for performance. The typical AF challenge that appears to have been made irrelevant may well be (and usually is) still relevant, but in a less legible and therefore more lethal way, because the easy power of deep learning makes it optional to know what you're doing.
One example is that LLMs give the impression of understanding human values, which leads some to conclude that we are safe from paperclip maximizers "monomaniacally focused on one goal." Unfortunately, this is based on a misunderstanding of the core problem. ASI was always expected to understand human values insofar as convenient for pursuing its own values (for example, to improve its persuasive abilities). The problem is making an ASI that cares about our values - or at least, about obedience to / faithful (super)imitation of a user. Nick Bostrom's "Superintelligence" popularized the idea of an RL agent locked on to a specific narrow (but intended) goal which controls its reward mechanism, for instance paperclip production rates. This specific scenario does look a bit less likely, but probably never would have happened anyway.
A conceptual problem that does seem to have become less blocking is computational uncertainty - for instance Vingean uncertainty, and particularly accompanying puzzles like the Löbian obstacle, and perhaps also the decision-theoretic challenges that motivated FDT, UDT, etc. These are very interesting subjects. However, I am questioning whether they are especially useful for AI safety, beyond their general importance to decision theory, and I am particularly tempted to reject a certain tiling argument that seems to have carried over from the plan to build ASI via RSI seed.
The central example of computational uncertainty is running a simple but computationally demanding program. This case raises problems for standard Bayesian reasoning because, under the most direct translation of the problem, the program's output has probability 1 - it is logically determined by the source code - which makes our uncertainty about it confusing to model.
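To make "simple but computationally demanding" concrete, here is a toy illustration of my own (a hedged sketch, not an example drawn from the literature): a deterministic Python program whose output is logically fixed by its source code, yet not feasible to compute before one has to act on a guess about it.

```python
import hashlib

def hash_chain_bit(n_iters: int, seed: bytes = b"seed") -> int:
    """Iterate SHA-256 n_iters times and return the low bit of the final digest."""
    h = seed
    for _ in range(n_iters):
        h = hashlib.sha256(h).digest()
    return h[-1] & 1

# The value of hash_chain_bit(10**18) is fully determined by the lines above,
# so under the most direct Bayesian translation it already has probability 1 or 0.
# But the chain is inherently sequential and nobody can afford to run it; treating
# the answer as a fair coin (~0.5) is exactly the kind of move that theories of
# logical / computational uncertainty (e.g. logical induction) try to license.
```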
The standard argument goes that we need to understand computational uncertainty to build tiling agents, which construct successively smarter successors while maintaining their values. This makes sense in the context of an RSI seed: in order to recursively self-improve, a seed AI would like to know that it can trust its successor to maximize the same values, despite the successor running computations that the seed explicitly programs but cannot predict.
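For readers who want the formal shape of the obstacle, the relevant textbook fact (not a claim original to this post) is Löb's theorem. Writing $\Box_T \varphi$ for "the theory $T$ proves $\varphi$":

$$\text{If } T \vdash \Box_T \varphi \rightarrow \varphi, \text{ then } T \vdash \varphi.$$

So a seed reasoning in $T$ cannot adopt the blanket trust schema $\Box_T A \rightarrow A$ for its successor's $T$-proofs about action-relevant statements $A$: by Löb, endorsing that schema for any particular $A$ already commits $T$ to proving $A$ outright, and the naive workaround of handing the successor a strictly weaker proof system degrades with each generation. That is the Löbian obstacle in the tiling setting.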
This is not meaningfully the same as the source of difficulty in predicting the behavior of neural networks. As an analogy, consider a neuroscientist watching a real-time recording of your neural activations and trying to predict what you will say next. If the measurements were fine-grained enough, we could easily imagine that the neuroscientist is unable to run the best possible predictive model she has access to on the full sensor readings in real time. Instead, perhaps she uses some high-level features (or statistics) of the data to run a less computationally expensive predictive model, which somewhat reduces the quality of her predictions. I claim that this situation is not epistemically different, in any confusing way, from the situation where she only has coarse-grained sensors in the first place. For all intents and purposes, she is simply uncertain about any data that she cannot afford to extract understanding from.
Figuring out how to build a neuroscientist who herself builds partial models is a hard and confusing problem, but the epistemic state of the neuroscientist in the narrow situation I described is not very confusing.
(In reality, I believe neuroscience is somewhat bottlenecked on high-quality sensing, but this is ultimately an analogy for mechanistic interpretability, where sensing is no problem.)
I think that our epistemic situation regarding artificial neural networks is in many respects the same. Technically, we are computationally uncertain about them, but for practical purposes this computational uncertainty acts much like ordinary scientific uncertainty. We can run tests to learn about aspects of their behavior, much like psychologists or neuroscientists. There isn't even a specific computation that we are uncertain about; we can afford to run networks on any specific input, we just don't know which inputs they will face. Indeed, another way to see this is that a human does not have time to inspect all of the training data, which introduces standard empirical uncertainty about the result of training. However, if a human were given time to inspect all of the training data, but were unable to reason through the exact result of gradient descent on that data, she would seem to be in essentially the same epistemic situation for all practical purposes.
It is true that we can build mathematical models of neural networks in order to understand their behavior better, but this only proves that we could benefit from thinking about artificial neural networks longer. It doesn't suggest that we are bottlenecked on philosophical confusion about computational uncertainty. For instance, studying the connection between stochastic gradient descent and stochastic gradient Langevin dynamics is a fairly ordinary problem in applied mathematics, and translating this to predictions about neural networks is ordinary scientific modeling.
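To give a sense of what I mean by "fairly ordinary" (a standard textbook comparison, not a result of this post): the two update rules differ only by an injected Gaussian noise term,

$$\theta_{t+1} = \theta_t - \eta_t\, \widehat{\nabla L}(\theta_t)
\qquad\text{vs.}\qquad
\theta_{t+1} = \theta_t - \eta_t\, \widehat{\nabla L}(\theta_t) + \sqrt{2\eta_t/\beta}\,\xi_t,\quad \xi_t \sim \mathcal{N}(0, I),$$

where $\widehat{\nabla L}$ is a minibatch gradient estimate. Under the usual step-size conditions, the second (SGLD) rule asymptotically samples from a distribution proportional to $e^{-\beta L(\theta)}$; asking how far the minibatch noise in plain SGD pushes it toward similar behavior is a modeling question for applied mathematics, not a philosophical puzzle about computational uncertainty.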
Therefore, the computational-uncertainty-for-tiling story no longer makes much sense for an AI software engineer building an aligned successor. Its source code is trillions of incomprehensible parameters, which are not easy to understand and improve in place. An AI SWE would probably instead train another AI SWE as its successor.[1] This challenge is precisely the alignment problem that we face; in other words, it is not bottlenecked on computational uncertainty any more than (one already believes) alignment is bottlenecked on computational uncertainty. If we can align the first AI SWE which exceeds us in intelligence, it should be able to align its successor by similar means.
I am not suggesting that we can simply outsource the alignment problem to AIs. The inductive argument described above explicitly requires a base case.
Rather, I am distinguishing the alignment problem that we face from the one expected for RSI seeds. While we are technically computationally uncertain about the behavior of large neural networks - in the sense that we cannot predict how they will run on unseen inputs, and that an individual human can't inspect every activation - that usage of the term "computational uncertainty" is highly non-central; in fact, it may as well be standard empirical uncertainty.
In order to solve alignment with deep neural networks, we need to understand the training process well enough to use them as reliable components in a safe agent design. Currently, the engineering process is alchemy. I am calling for a chemistry of deep learning. A physics of deep learning is too much to ask for, and unfortunately deep learning is what we have to work with. If we insist on building agents, AF should tell us how to combine learned/learning components safely (this, we can hope, will be based on a mathematics/physics of agent structure).
There are a few other justifications for focusing on computational uncertainty for AI safety. One is that it defines normativity with respect to a bounded agent's preferences; a superimitator should act as the user would if the user were much smarter (for instance, had longer to think). Another flavor of what I consider a similar story is that we might make progress on understanding what types of AI can be built safely by asking which computations we know how to delegate. I am much more sympathetic to this story, but I am not sure I agree it is the bottleneck to building safe ASI either. I think the main problem is pointing an AI at any specific goals at all, not figuring out what those goals should be. As I understand it, @abramdemski might describe this as constructing the type of language community that enables trust, and possibly he has sophisticated reasons for believing that this requires an understanding of computational uncertainty. Personally, I see a closer connection to e.g. natural latents or condensation, but I find Abram's general line of argument plausible.
My other reason for hesitance about attacking computational uncertainty directly is that I believe it's totally intractable in generality, roughly as hard as resolving the major problems of computational complexity theory. Therefore it is really important that we carve off the parts of the problem that we need to solve. This is one reason I have argued elsewhere for focusing on (and distinguishing) other aspects of embedded agency, which are more closely connected to safety and indeed to embeddedness.
At some stage, an AI SWE may become intelligent enough to create an RSI seed. If it is aligned, it would not rush to do so unsafely, when presumably the same alignment technology that has already worked once could be improved and applied to an AI SWE successor. If it is not aligned, we would actually prefer that it does not know how to align such an RSI seed to itself. So this eventuality largely fails to motivate research on matters like the Löbian obstacle.