Epistemic status: While I am specialized in this topic, my career incentives may bias me towards a positive assessment of AIXI theory. I am also discussing something that is still a bit speculative, since we do not yet have ASI. While basic knowledge of AIXI is the only strict prerequisite, I suggest reading cognitive tech from AIT before this post for context.
AIXI is often used as a negative example by agent foundations (AF) researchers who disagree with its conceptual framework, with many direct critiques listed and addressed here. An exception is Michael Cohen, who has spent most of his career working on safety in the AIXI setting.
But many top ML researchers seem to have a more positive view, e.g. Ilya Sutskever has advocated algorithmic information theory for understanding generalization, Shane Legg studied AIXI before cofounding DeepMind, and now the startup Q labs is explicitly motivated by Solomonoff induction. The negative examples are presumably less salient (e.g. disinterest, unawareness) which may explain why I can't come up with any high-quality critiques from ML. But I conjecture that AIXI is viewed somewhat favorably among ML-minded safety researchers (who are aware of it) or at least that the ML researchers who are enthusiastic about AIXI often turn out to be very successful.
It seems interesting that both AF and ML researchers care about AIXI!
I want to discuss some of the positive and negative features of the AIXI perspective for AI safety research and ultimately argue that AIXI occupies a sort of conceptual halfway point between MIRI-style thinking and ML-style thinking.
Note that I am talking about the AIXI perspective as a cluster of cognitive technology for thinking about AGI/ASI which includes Solomonoff induction, Levin search, and a family of universal agents. This field is lately called Universal Algorithmic Intelligence (UAI). It often takes a lot of work to bridge UAI-inspired thinking to real, efficient, or even finite systems, but this does not inherently prevent it from being a (potentially) useful idealization or frame. In fact, essentially all theoretical methods make some idealized or simplifying assumptions. The question is whether (and when) the resulting formalism and surrounding perspective are useful for thinking about the problem.
X-Risk
Probably the most basic desideratum for a perspective on AI safety is that it can express the problem at all - that it can suggest the reasons for concern about misalignment and X-risk.
AIXI comes up frequently in discussions of AI X-risk, rather than (say) only at decision theory conferences. Because AIXI pursues an explicit reward signal and nothing else, it is very hard to deceive oneself that AIXI is friendly. Because the universal distribution is such a rich and powerful prior, it is possible to imagine how AIXI could succeed in defeating humanity. Indeed, it is possible to make standard arguments for AI X-risk a bit more formal in the AIXI framework.
UAI researchers generally take X-risk seriously. I am not sure whether studying UAI tends to make researchers more concerned about X-risk, or whether those concerned about X-risk have (only) been drawn to work on UAI. If I had to guess, the second explanation is more likely. Either way, it is definitely a (superficial) point in UAI's favor that it makes X-risk rather obvious.
Access Level
I think that UAI discusses agents "on the right level" for modern AI safety.
AI safety paradigms tend to carry an implicitly assumed type and level of access to agent internals which constrains the expressible safety interventions. For example (in order to illustrate the concept of access level, not as an exhaustive list):
Causal incentives research assumes that we can represent the situation faced by the agent in terms of a causal diagram. Then we frame alignment as a mechanism design problem. Strictly speaking, this is a view from outside of the agent which does not assume we can do surgery on its beliefs. However, I think this frame makes it very natural to assume that we have access to the agent's actual (causal) representation of the world, and to neglect the richness of learning along with inner alignment problems. I worry that this view can risk an illusion of control over an ASI's ontology and goals.
Assistance games like CIRL are similar. The AG perspective tends to treat principal and agent as ontologically basic. This is (in itself) kind of reasonable if the principal is a designated "terminal" or other built-in channel. However, the AG viewpoint tends to obfuscate its structural assumption, and therefore conceal its biggest weakness and open problem (what is the utility function of a terminal embedded in the world?) and causes much of the research on AG to miss the core challenge (the place where CIRL might be repairable).
Debate is another mechanism design framing. It asks us to specify incentives which, if "fully optimized," provably allow a weaker agent to verify the truth from a debate between stronger agents. This is a clean and explicit assumption, so debate should be pretty safe to think about: It clearly does not focus on misgeneralization.
Singular learning theory research is far on the other end. It focuses on highly microscopic structure of the agent throughout learning, and attempts to control generalization (mostly) through data selection. Roughly speaking, this is the sort of perspective that makes inner misalignment salient (or that is adopted in order to prevent inner misalignment). My concern about the SLT picture is that the access level may be "too zoomed in" and we can't select the right generalizations because we don't know what they are, and it seems very hard to craft the right behavior in one shot even if we knew "how deep learning works." For example, the alignment problem is hard not only because learning "human values" is inherently challenging, but because agents aren't immediately corrigible and may prevent us from fixing our mistakes! I think that many of the core problems can only be targeted on the level of agent structure (which is a "lower" level of access, in the sense that it is more coarse-grained).
Natural latents researchers try to expose the ontology of models. That is, they assume a very low level of access by default, and try to attack the problem by increasing our level of access (so in a way, this is a uniquely "dynamic" perspective on access). I think this is a nice strategy that gets at some of the core problems, but it will be challenging to make progress.
UAI. Now I will try to draw out the UAI perspective on access and affordances for AI safety, starting from the formalism.
The ontology of AIXI is made of Turing machines which generate a first-person interaction history. Thinking in terms of this ontology makes certain mistakes unnatural. It is clear that humans are not a privileged part of the environment (there are no other ontologically basic agents; how would you point at a human?). It is clear that, even given glass-box access to AIXI's beliefs, we would not be able to reliably read them (by Rice's theorem). In fact, the separation between distinct hypotheses in this ontology is not privileged, since different TMs can produce the same output distribution (but this is not computably checkable, so it's kind of invisible to UAI). The natural level of discussion is the probabilities produced by the universal distribution (equivalently, the predictions produced by Solomonoff induction) and the plans built on top of them in order to pursue cumulative discounted reward.
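As a toy illustration of "predictions built from a weighted class of hypotheses," here is a minimal sketch of a Bayesian mixture predictor over a tiny, hand-picked hypothesis class. It is not Solomonoff induction (which mixes over all programs and is incomputable); the three hypotheses, their "description lengths," and all function names are assumptions made up for this example. The only feature borrowed from the universal distribution is the prior weight proportional to 2^(-length).

```python
from fractions import Fraction

# Hypothetical toy hypothesis class (an assumption for illustration only).
# Each entry is (description_length_in_bits, predictor), where
# predictor(history) returns the probability that the next bit is 1.
HYPOTHESES = [
    (1, lambda h: Fraction(1, 2)),    # fair coin
    (2, lambda h: Fraction(9, 10)),   # mostly ones
    # tends to alternate: after a 1 it expects a 0, otherwise it expects a 1
    (3, lambda h: Fraction(1, 10) if h and h[-1] == 1 else Fraction(9, 10)),
]


def prior_weights(hypotheses):
    """Prior proportional to 2^(-length), normalized over the finite class."""
    raw = [Fraction(1, 2 ** length) for length, _ in hypotheses]
    total = sum(raw)
    return [w / total for w in raw]


def update(weights, hypotheses, history, bit):
    """Bayes update of the posterior weights after observing one bit."""
    unnormalized = []
    for w, (_, predict) in zip(weights, hypotheses):
        p_one = predict(history)
        likelihood = p_one if bit == 1 else 1 - p_one
        unnormalized.append(w * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mixture_prediction(weights, hypotheses, history):
    """Posterior-weighted probability that the next bit is 1."""
    return sum(w * predict(history) for w, (_, predict) in zip(weights, hypotheses))


def run(bits):
    """Feed a bit sequence to the mixture; return final weights and next-bit prediction."""
    weights = prior_weights(HYPOTHESES)
    history = []
    for bit in bits:
        weights = update(weights, HYPOTHESES, history, bit)
        history.append(bit)
    return weights, mixture_prediction(weights, HYPOTHESES, history)


if __name__ == "__main__":
    weights, p_next_one = run([0, 1, 0, 1, 0, 1, 0, 1])
    print([float(w) for w in weights], float(p_next_one))
```

On an alternating sequence the posterior shifts toward the alternating hypothesis even though it has the smallest prior weight. What the outside observer gets to see is the mixture's prediction, not a labeled "true hypothesis."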
When UAI theorists think about making ASI safe, I claim that we bring the same sort of expectations about our affordances to the problem. At first brush, we want to think in terms of an ASI that plans to pursue certain goals based on a (continually) learned ~black-box predictive model.
This view has its detractors with some strong objections, particularly around embedded agency. But I think that those objections may not be as relevant to the type of slow(er) takeoff which we are experiencing, and the UAI picture has turned out to be pretty accurate! Pretrained neural networks really are pretty close to black-box predictive models; interpretability techniques of course exist, but tend to streetlamp and not work very well at capturing all (or even most) of what is going on inside of a model. Recursive self-improvement looks less like rewriting oneself and more like speeding up software engineering, and the Löbian obstacle is not relevant in the expected way. But probabilistic predictions are explicitly exposed to us, at least before RL post-training - which really is based on rewards!
Unfortunately, RL trains a behavioral policy which does not expose explicit planning. This is one classic objection to AIXI, often in favor of (the much less rigorously defined) shard theory. So UAI may still overestimate the level of access we have to model internals.
However, I think that the UAI perspective is actually centered around roughly the right level of access. For one thing, we really can train increasingly general (purely) predictive models such as foundation models, which are somewhat analogous to Solomonoff induction. UAI naturally asks what we can usefully do with such models. One option is to run expectimax tree search on top of the predictive model (as in AIXI), but UAI also includes direct policy search and lately a discussion of policy distillation that takes into account the important nonrealizability issues, then patches them with reflective oracles. Also, black-box access to a predictive model is an example of the central, not the minimal, level of access that the UAI perspective suggests thinking about. Some UAI safety schemes don't make detailed use of predictions (golden handcuffs, forthcoming) or even forgo access to specific predictions and rely only on some provable high-level properties of the predictor (suicidal AIXI). To be clear, AF researchers can easily point out flaws and limitations of these schemes. But UAI safety research is making theoretical progress which suggests real implementations.
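To make "expectimax tree search on top of the predictive model" concrete, here is a minimal finite-horizon sketch. The model is treated as a black box that, given a history and an action, returns a distribution over (observation, reward) pairs. The interface, the `toy_model` environment, and all names are assumptions for this example; AIXI itself plans against the incomputable universal mixture, which this does not attempt.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# A history is a tuple of (action, observation, reward) triples.
History = Tuple[Tuple[int, int, float], ...]
# Black-box predictive model: (history, action) -> {(observation, reward): probability}
Model = Callable[[History, int], Dict[Tuple[int, float], float]]


def expectimax(
    model: Model,
    history: History,
    actions: Sequence[int],
    horizon: int,
    discount: float = 1.0,
) -> Tuple[float, Optional[int]]:
    """Return (expected discounted return, best first action) for a finite horizon.

    Alternates a max over actions with an expectation over the model's
    predicted (observation, reward) outcomes.
    """
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in actions:
        value = 0.0
        for (obs, reward), prob in model(history, action).items():
            extended = history + ((action, obs, reward),)
            future, _ = expectimax(model, extended, actions, horizon - 1, discount)
            value += prob * (reward + discount * future)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


def toy_model(history: History, action: int) -> Dict[Tuple[int, float], float]:
    """Hypothetical stand-in for a learned predictor (ignores history)."""
    if action == 1:
        # risky action: reward 1.0 with probability 0.8, else 0.0
        return {(1, 1.0): 0.8, (0, 0.0): 0.2}
    # safe action: reward 0.5 with certainty
    return {(0, 0.5): 1.0}


if __name__ == "__main__":
    print(expectimax(toy_model, (), actions=[0, 1], horizon=2))
```

The planner only ever calls `model(history, action)`, so it needs no access to the predictor's internals. The cost grows exponentially in the horizon, which is one reason practical systems approximate or distill the search rather than run it exactly.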
It's useful for safety researchers to have in mind the type of access and affordances that they would like for future ML techniques to expose, but which are at least plausibly achievable. That rules out looking into an alien ontology and reading out a clean object labeled "human values" and many less egregious or more subtle examples of the same mistake. But I don't think it rules out things like exposing a predictive model, approximate inner alignment to an explicit reward signal, or (more ambitiously) high-level architectural features like myopia (try training on only short-term feedback) or pessimism about ambiguity (try OOD detection, perhaps with a mixture of experts model). We should have a plan for designing safe agents, given such engineering/scientific breakthroughs (or "miracles"). I think that UAI may also tell us how to get there, by understanding the generalization properties of deep learning. But wherever these breakthroughs come from, UAI can prepare us to take advantage of them.
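One way to picture "pessimism about ambiguity" as an architectural feature is a rule that scores each action by its worst-case value across several predictive models and abstains when the models disagree too much. The sketch below is only an illustration under that assumption: the ensemble, the threshold, and the function name are hypothetical, and it is not a description of any specific published scheme.

```python
from typing import Dict, List, Optional


def pessimistic_choice(
    value_estimates: Dict[str, List[float]],
    disagreement_threshold: float,
) -> Optional[str]:
    """Pick the action with the best worst-case value across ensemble members.

    value_estimates maps each action to one value estimate per ensemble member.
    Returns None (abstain / defer) when the members disagree about the chosen
    action by more than disagreement_threshold.
    """
    worst_case = {action: min(values) for action, values in value_estimates.items()}
    best = max(worst_case, key=worst_case.get)
    spread = max(value_estimates[best]) - min(value_estimates[best])
    if spread > disagreement_threshold:
        return None  # too ambiguous: defer rather than act
    return best


if __name__ == "__main__":
    estimates = {
        "safe": [0.5, 0.5, 0.5],
        "risky": [0.9, 0.9, -1.0],  # one member predicts a bad outcome
    }
    print(pessimistic_choice(estimates, disagreement_threshold=0.2))
```

Here the "risky" action is rejected because a single ensemble member predicts a bad outcome, even though the average estimate favors it. Myopia could be illustrated analogously by restricting a planner to a short horizon.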
Conclusion
UAI centers learning and search, which power modern ML. At the same time, the objects of UAI are powerful enough to talk about ASI. For example, the universal distribution is rich enough to express the possibility of very surprising generalization behavior (suggesting malign priors). One of the main (underappreciated) advantages of UAI as a framework for AI safety research is that it allows analysis of AF problems within a setting that closely resembles modern ML. I hope to get more traction from this correspondence soon by implementing practical UAI-inspired safety approaches.