I wrote this post towards the end of my three-and-a-half-month SERI MATS fellowship. I didn't get anywhere close to the point where I could say that I understand infra-Bayesianism on a really detailed level (according to Vanessa, there are only three people in the world who fully understand the infra-Bayesian sequence). Still, I spent three months reading and thinking about infra-Bayesianism, so I ought to be able to say something useful to newcomers.
The imaginary audience of this post is myself half a year ago, when I was just thinking about applying to Vanessa's mentorship but knew almost nothing about infra-Bayesianism or the general research direction it fits into. The non-imaginary intended audience is people who are in a similar situation now, just considering whether they should dive into infra-Bayesianism.
My review is mostly critical of the infra-Bayesian approach, and my main advice is that if you decide that you are interested in the sort of questions infra-Bayesianism tries to solve, then it's more useful to try it yourself first in your own way, instead of starting by spending months getting bogged down in the details of Basic infra-measure theory that might or might not lead closer to solutions. Still, I want to make it clear that my criticism is not aimed at Vanessa herself, as she chose questions that she found important, then created a theory that made some progress towards answering those questions. I have somewhat different intuitions than Vanessa about how important certain questions are and how promising certain research directions are, but I support her continuing her work and I thank her for answering my annoying questions throughout the three months.
I applied to the infra-Bayesian stream in SERI MATS because I have a pure mathematics background, so I figured that this is the alignment agenda that is closest to my area of expertise. I met some other people too, also with a pure math background, who got convinced that alignment is important and then started spending their time on understanding infra-Bayesianism, because it's the most mathematical alignment proposal.
Although paying attention to our comparative advantages is important, in retrospect I don't believe this is a very good way to select research topics. I feel that I was like the man who only has a hammer and is desperately looking for nails, and I think that many people who tried or try to get into infra-Bayesianism are doing so in a similar mindset, and I don't think that's a good approach.
It's important to note that I think this criticism doesn't apply to Vanessa herself; my impression is that she honestly believes this line of mathematical research to be the best way forward to alignment, and if she believed that some programming work in prosaic alignment, or the more philosophical and less mathematical parts of conceptual research, were more important, then she would do that instead. But this post is mainly aimed at newer researchers considering getting into infra-Bayesianism, and I believe this criticism might very well apply to many of them.
Motivations behind the learning theoretical agenda
According to my best understanding, this is the pitch behind Vanessa Kosoy's learning theoretical alignment agenda:
Humanity is developing increasingly powerful AI systems without a clear understanding of what kind of goals the AIs might develop during training, how to detect what an AI is optimizing for, and how to distinguish relatively safe goal-less tools from goal-oriented optimizers.
Vanessa's research fits into the general effort of trying to get a better model of what possible forms "optimization", "agency" and "goals" can take, so we can have a better chance to identify them in the AI systems we are concerned about, and have better ideas on which training paradigm might lead to which kind of behavior.
The behavior of an actual neural net is hard to describe mathematically, and I expect that even if we see a paradigm shift away from neural nets, the behavior of future, even more powerful designs will not be any easier to describe. However, it seems that successful systems are often an approximation of some ideal solution to the problem, which is often easier to understand than the messy real-life process. We know disappointingly little about the internal workings of AlphaZero's neural net or Kasparov's brain, but we can model their play as an approximation of the optimal minimax algorithm of chess. The minimax algorithm is computationally intractable, but mathematically simple enough that we can prove some statements about it. For example, if we want to know how AlphaZero or Kasparov will respond to a scholar's mate attempt, we can prove that minimax won't get mated and then hope that Kasparov's and AlphaZero's algorithms are close enough approximations of minimax that this observation generalizes to them. On the other hand, it would be much harder to predict their response to the opening by studying their algorithms directly (by looking at their neurons).
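For readers who haven't seen it written out, the minimax idea mentioned above can be sketched in a few lines. This is a toy game tree rather than chess (which is obviously intractable to expand like this):

```python
# A minimal minimax on an explicit game tree. Leaf values are payoffs from the
# maximizer's point of view; the two players alternate turns.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    value: int = 0                      # payoff at a leaf
    children: List["Node"] = field(default_factory=list)

def minimax(node: Node, maximizing: bool = True) -> int:
    """Value of the position under optimal play by both sides."""
    if not node.children:
        return node.value
    values = [minimax(child, not maximizing) for child in node.children]
    return max(values) if maximizing else min(values)

# A two-ply toy game: the maximizer moves first, the minimizer replies.
tree = Node(children=[
    Node(children=[Node(value=3), Node(value=5)]),   # minimizer would pick 3
    Node(children=[Node(value=2), Node(value=9)]),   # minimizer would pick 2
])
print(minimax(tree))  # 3: the maximizer prefers the branch that guarantees 3
```

Statements like "minimax won't get mated from this position" are exactly claims about the value this recursion returns, which is what makes the idealized algorithm amenable to proof even when the approximating systems are not.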
Similarly, we can expect a real-life powerful AI system to be an approximation of an Idealized AGI. Based on the previous example and a few other cases, it is plausible that we can understand the behavior of an Idealized AGI better than any actual powerful AI system. So it makes sense to study the theory of Idealized AGI first, then hope that the real-life systems will be close-enough approximations of the ideal that our observations about the Idealized AGI give some useful insights about the real-world AIs.
Unfortunately, we don't even know yet what would be a good model for such an Idealized AGI. The most famous candidate is Hutter's AIXI. A condensed description of AIXI: An agent interacts with an environment that is supposed to be computable, that is, the environment can be modelled as a Turing machine. The agent doesn't know which environment it is interacting with, but it has a prior distribution over all computable environments given by Solomonoff induction (explained later in the IBP part of my post). Then the agent acts in a way that minimizes its expected loss until a large time horizon T, based on these priors.
This is a nice and simple model, and because it considers all computable environments, the agent is pretty general. However, it has several very serious weaknesses. For now, we don't care that it needs unimaginable compute, an uncomputable prior and an extremely slow learning rate; these are acceptable, since we are looking for an Idealized AGI after all. But even granting all that, its decision process has some serious shortcomings, as discussed in the Embedded agency sequence. Infra-Bayesianism was created to address some of these problems.
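As a caricature of the decision rule just described, with a tiny hand-made hypothesis class standing in for the Solomonoff prior and a single step standing in for the horizon T (all environment names and numbers below are invented for illustration):

```python
# A caricature of AIXI's decision rule: hold a prior over environments,
# then pick the action with the lowest expected loss under that belief.
# Real AIXI uses a prior over ALL computable environments; here we use three.

ENVIRONMENTS = {
    "env_a": {"left": 0.0, "right": 1.0},   # each environment assigns a loss
    "env_b": {"left": 1.0, "right": 0.0},   # to each available action
    "env_c": {"left": 0.5, "right": 0.5},
}

def best_action(posterior, actions=("left", "right")):
    """Pick the action minimizing expected loss under the agent's beliefs."""
    def expected_loss(action):
        return sum(p * ENVIRONMENTS[env][action] for env, p in posterior.items())
    return min(actions, key=expected_loss)

posterior = {"env_a": 0.6, "env_b": 0.3, "env_c": 0.1}
print(best_action(posterior))  # left: expected loss 0.35 vs 0.65
```

Everything that makes AIXI interesting (and intractable) lives in what this sketch leaves out: the universal hypothesis class, the Solomonoff weighting, and planning over long horizons.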
Is this actually useful?
I have my doubts about the usefulness of this approach. In principle, I think I agree that capable intelligent agents can be modeled as approximations of the Idealized Agents we are studying. But a crucial question is, how close they get to these idealized versions by the time they pose an existential threat. My current best guess is that not close enough for our current investigation into infra-Bayesianism to make a difference.
Napoleon was pretty good at manipulating people and taking over countries. Edward Teller was pretty good at building superweapons. I imagine that the first AIs that will be able to defeat humanity if they want, or bring us into existential security if we align them well, will be approximately as capable as Napoleon and Teller together, but with immense self-replication capacities.
Was Napoleon an approximation of a Bayesian expected utility maximizer? Well, he had some goals that he followed somewhat consistently, and he sometimes changed his mind in the face of new evidence, but this is a pretty weak sense of "approximation". Was Napoleon more of an approximation of an infra-Bayesian optimizer than a Bayesian one? Maybe, sort of? Some cornerstones of infra-Bayesianism, like Knightian uncertainty and preparing for the worst-case scenario seem to be relatively important elements of human decision making.
But would it have helped the British generals in predicting Napoleon's movements if they had had a better mathematical understanding of the difference between Bayesian expected utility maximization and infra-Bayesianism? Would this understanding have helped Napoleon's schoolteacher in raising him to be a peaceful citizen instead of a megalomaniac conqueror? I don't think that in the "Napoleon alignment" and "Napoleon control" problem, any mathematical formulation of optimization would have been very useful other than some general heuristics of "has some goals, has some beliefs, usually acts according to these".
I think that when transformative AI arrives, it will be a strange and alien mind, and aligning it will likely be a much harder task than raising Napoleon well. But I don't think it will be very close to any Idealized model; it will just be a heuristic kludgery-machine, just like we are, only operating with an alien type of kludgery. That's why I am skeptical that a better mathematical understanding of an Idealized agent would help us align an AI any more than it would have helped Napoleon's teacher.
What if we deliberately build the first transformative AI in a way that relies on more understandable optimization principles instead of the inscrutable kludgery that deep learning currently is? If I understand correctly, that would be a big part of Vanessa's plan. I'm skeptical that this is possible; I expect that the inscrutable training processes will create a transformative AI before we can devise a training scheme that we actually understand. I can easily imagine a paradigm shift away from deep learning, but I would be profoundly surprised if the new thing were something nice and understandable like a Naive Bayes Classifier But This Time More Powerful And Also Possibly Infra-Bayesian.
Another hope she mentions is that we could develop the mathematical understanding of deep learning to a level where we realize in which ways it really is an approximation of the Idealized AGI. I'm generally very much in favor of getting a better mathematical understanding of deep learning, but I am skeptical whether we could connect it to this agenda. My strong guess is that even if we got a very successful mathematical theory of the human brain (predictive processing maybe?), the result would still be very different from any Idealized model, and I wouldn't expect our research about infra-Bayesianism to be useful in stopping Napoleon. My guess is similar about deep learning.
What else can we do then to understand an alien mind that doesn't even exist yet? Good question, and if one doesn't have hope in any other approach, then it can make sense to go back to trying to work out a better mathematical formulation of an Idealized AGI and hope that it will be more relevant than in the case of Napoleon, or to hope for a long AI winter and then work on producing a paradigm shift in the direction of more understandable models. Personally, I plan to look into other approaches first.
Does Infra-Bayesianism actually solve the problems of embedded agency?
Okay, assume for now that you want to work on a mathematical formulation of agency that handles the questions of embedded agency well. Should you study infra-Bayesianism or try to develop your own solutions?
I think the answer somewhat depends on which questions of embedded agency you care about, as I will detail below. But my general takeaway is that there are surprisingly few actual results in infra-Bayesianism yet, and there are some important questions it doesn't even try to address.
Clarifying the "surprising" part: In retrospect, the sparsity of results is not surprising at all, since the problems are hard, and basically only two people worked on it for a few years. Compared to that, the output is actually pretty impressive. Also, the questions of embedded agency are diverse, and Vanessa never claimed that one theory could solve all of them.
Still, when I started looking into infra-Bayesianism, read the problem-statements at the beginning of the posts, and then stared at the following imposing wall of mathematical formulas, I presumed that the developed formalism solves a large part of the problems involved, with some open questions remaining. In reality, we have lots of definitions, theory-building and some conjectures, but very rarely anything concrete like "here is a nice property we can prove about infra-Bayesian agents that is false, or that we couldn't prove, for AIXI."
Again, this is not a criticism of Vanessa's work: she started developing an approach for addressing the open questions of embedded agency and came up with some interesting ideas, but formalizing them required lots of technical theory-building, and she hasn't reached many impressive results yet.
But if you are just a new researcher who wants to tackle the problems of embedded agency, this means that, as far as I can tell, there is not much reason to think that infra-Bayesianism is an especially promising direction, other than Vanessa having a hunch that this is the right way to go. I think Vanessa also agrees that we don't have strong evidence yet that infra-Bayesianism will turn out to be an especially fruitful framework, and she would be happy if someone found something better (although currently she believes infra-Bayesianism is the best approach).
So if you are a new researcher who is interested in embedded agency, I would advise reading the Embedded agency post and looking for some related works, then trying to develop your own framework. I imagine that the most likely result is that you don't get too far, but I think there is a non-negligible chance that you can develop a framework in which progress is easier and you get stronger results earlier than in infra-Bayesianism. After all, given that infra-Bayesianism hasn't really delivered much evidence yet, that's just your best guess against another smart person's (Vanessa's) best guess. And if you fail to create a good theory yourself, as you most likely will, you can still look into Vanessa's work after that. In fact, I think you will get a better understanding of the motivations behind infra-Bayesianism if you try it yourself for a while rather than jumping head-first into the IB formalism. I somewhat regret that I realized this too late and didn't at least spend a few weeks trying to answer the problems in my own way.
Specific issues of AIXI and how well infra-Bayesianism handles them
AIXI's prior is uncomputable and AIXI would require tremendous sample complexity to learn anything
Infra-Bayesianism doesn't even try to address this problem yet; it faces the same issue: if we want the agent to be general enough to handle a wide variety of situations well, then its prior needs to contain lots of hypotheses, which makes learning very slow and makes the prior uncomputable if we go general enough. It is possible that solving the non-realizability problem (see below) makes it easier to create a reasonable prior, but it's by no means obvious.
This is not really a question of embedded agency, and it is not among the first steps Vanessa wants to consider (we are looking at an Idealized AGI, remember?); I only included it in the list because it's a natural first thought when discussing "the shortcomings of AIXI".
Traps

There might be things in the environment that kill the agent or cause irreversible harm, and the agent can't learn the truth about them without trying them. Thus, with certain priors, AIXI can spend its whole life taking the same safe action, because it's afraid that any other action would kill it, and it has no way of learning whether this hypothesis is true, because it never dares to try anything else. Alternatively, it can have a prior which makes it confidently try all sorts of things and kill itself on short notice.
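The "too timid" half of this failure mode is easy to caricature numerically (all numbers below are invented for illustration):

```python
# A caricature of the trap problem: the agent assigns a small prior probability
# to "the other action kills me", and since the only way to learn the truth is
# to try it, the belief never updates and the agent stays timid forever.

P_TRAP = 0.01          # prior probability that the untried action is lethal
TRAP_LOSS = 1e9        # loss if the trap hypothesis is true
SAFE_LOSS = 0.5        # known per-step loss of the timid action
UNTRIED_LOSS = 0.0     # actual loss of the untried action (it was safe all along)

def choose_action(p_trap: float) -> str:
    expected_untried = p_trap * TRAP_LOSS + (1 - p_trap) * UNTRIED_LOSS
    return "timid" if SAFE_LOSS < expected_untried else "untried"

# Taking the timid action reveals nothing about the other one, so the prior
# (and hence the choice) is identical at every step:
history = [choose_action(P_TRAP) for _ in range(100)]
print(set(history))  # {'timid'}
```

The mirror-image prior (P_TRAP tiny) produces the other failure: the agent confidently tries the unknown action, which in an unlucky world is the one that kills it.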
I feel that this is the most serious shortcoming of AIXI as a realistic model of an intelligent agent in the world. Unfortunately, infra-Bayesianism was not created to address this, and we would need to develop a completely separate theory for that. Vanessa has a few plans about this too, but they are still at the level of unexplored ideas.
Non-realizability

A classical learning agent can be expected to do well in environments that are included in its hypothesis class (assuming there are no traps). But an agent can't fully model an environment that's bigger than itself (in particular, an environment that includes the agent itself or other agents of similar complexity), so its hypothesis class must be limited, and in reality, it very well might encounter an environment that is not in its hypothesis class. We call an environment that is not in the hypothesis class non-realizable. For a classical learning agent, we have no general guarantee on its performance in a non-realizable environment.
In particular, an agent playing a game against another agent very similar to itself is a non-realizable setting, and we don't know much about the behavior of classical learning agents in such a game. This is called the grain of truth problem.
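A minimal illustration of why the guarantees break down, assuming a Bayesian coin-predictor whose two hypotheses both miss the truth:

```python
# Non-realizability in miniature: the agent's hypothesis class is
# {bias 0.3, bias 0.7}, but the true coin has bias 0.5. The posterior log-odds
# between the two hypotheses perform an unbounded random walk instead of
# converging, so no learning guarantee applies.

import math
import random

def log_odds_trajectory(n_flips: int, seed: int = 0) -> list:
    random.seed(seed)
    log_odds = 0.0   # log P(bias=0.7 | data) - log P(bias=0.3 | data), flat prior
    trajectory = []
    for _ in range(n_flips):
        heads = random.random() < 0.5            # the true, unmodeled environment
        step = math.log(0.7 / 0.3)
        log_odds += step if heads else -step     # heads favors 0.7, tails favors 0.3
        trajectory.append(log_odds)
    return trajectory

traj = log_odds_trajectory(10_000)
print(round(min(traj), 1), round(max(traj), 1))  # the belief drifts instead of settling
```

Each flip carries equally strong evidence for one hypothesis or the other, and since the true coin is fair, the evidence never accumulates in a fixed direction; the agent's confidence wanders indefinitely.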
Infra-Bayesianism was developed to handle the problem of non-realizability. In my other, more technical post, I explain the infra-Bayesian approach to non-realizability in more detail. My general conclusion is that infra-Bayesianism in fact seems to be an improvement over classical learning theory, but it's unclear how big of an improvement it is, and it's very unclear (at least to me) how to move forward.
About games among infra-Bayesian agents we still have only the most preliminary results.
I stand by my recommendation that people should try to look for solutions on their own first, and only look into the infra-Bayesian framework later. But if someone is specifically interested in the non-realizability problem, then I'd recommend a shorter period of working alone, because infra-Bayesianism really might have interesting insights here.
Newcomb-like problems and counterfactuals

The infra-Bayesian framework really seems well-equipped to handle these! Later, I think Infra-Bayesian Physicalism handles them even more naturally, but even good old ordinary infra-Bayesianism is pretty good for this. Hooray!
Thomas Larssen has a nice write-up explaining this, including a counterexample where infra-Bayesianism actually fails in a Newcomb-problem, but it seems to be a relatively minor problem that can be addressed with the introduction of a little randomness.
Motivations of Infra-Bayesian Physicalism
Infra-Bayesian Physicalism (IBP) is a major research direction inside Vanessa's agenda, and the one I have the most mixed feelings about.
My understanding is that IBP tries to tackle three main questions. I find them more interesting and more plausibly relevant to alignment than the previously listed problems of embedded agency, so I will write about them in more detail, then try to address whether IBP is a good framework to handle them.
How should we think about Occam's razor and anthropics?
Occam's razor seems to be an important cornerstone of scientific thinking, but when we are saying we should use the simplest hypothesis consistent with the data, it's not obvious how to define "simplest". This question is equivalent to asking where a Bayesian agent should get its priors from. Occam's razor says that simple hypotheses should have higher a priori probabilities than complicated ones, but we should still define "simple".
Which hypothesis of free fall is simpler: "Objects fall as they do because they have a constant gravitational acceleration g" or "Objects fall as they do because there is a witch who makes it so"?
The most widely accepted solution is Solomonoff induction: for every environment, one should look at the length of the shortest program on a universal Turing machine that produces this environment. This length is the Kolmogorov complexity K. Then the prior probability of being in that environment should be approximately 2^(-K). (Actually, one sums 2^(-length) over all programs producing the sequence, but the main term corresponds to the shortest program.)
This gives a nice answer to the previous question about free fall: if we want to express the "witch hypothesis" as a program that outputs the results of a bunch of experiments we observed, we need to hardcode for every experiment how high we dropped the ball from and how long it took to land. (Also, we need to use a few superfluous bits to specify that all of this happened because of a witch.)
On the other hand, to express the "gravitational acceleration hypothesis" as a program that outputs the result of experiments, we just need to specify the constant g at the beginning, write down the function h = g*t^2/2, then hardcode for all experiments the falling time, and now the program can output both the falling time and the height for each experiment, and the description length was just half as long as the "witch hypothesis".
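The comparison can be made vivid by very crudely measuring how many bytes each "hypothesis" needs to reproduce the same data. Serialized length is only a stand-in for Kolmogorov complexity (which is uncomputable), and the experiments below are invented:

```python
# Crude byte-count comparison of the two free-fall "hypotheses".
# The witch hypothesis must hardcode every (height, time) pair; the gravity
# hypothesis stores g, the rule, and only the times, since heights follow.

import json

g = 9.8
times = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]            # invented drop experiments
heights = [0.5 * g * t ** 2 for t in times]

# "Witch hypothesis": hardcode every (height, time) pair, plus the witch.
witch = json.dumps({"cause": "witch", "data": list(zip(heights, times))})

# "Gravity hypothesis": store g, the rule, and the times; heights are derived.
gravity = json.dumps({"g": g, "rule": "h = g*t**2/2", "times": times})

print(len(gravity), "<", len(witch))  # the lawful hypothesis compresses the data
```

The gap widens as we add experiments: the witch program grows by two numbers per drop, the gravity program by one.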
(If you haven't encountered this concept before, take some time to think about it, when I first read this explanation in the Sequences, it felt revelatory to me.)
AIXI, the most widely accepted general model of intelligence, and the starting point of the whole Embedded agency inquiry, is nothing else but a Bayesian expected utility maximizer based on the Solomonoff prior.
There are some problems with this interpretation of Occam's razor, however. The first is that there is no such thing as a unique, canonical universal Turing machine. This means that it's possible to construct a universal Turing machine on which Russell's teapot is a fundamental object with a very short description, so it has a high prior for existence. And as we have no way to gather evidence in any direction about a hypothetical teapot in the asteroid belt, we can live our lives believing with high probability in Russell's teapot. I don't have a good solution to this problem, neither does infra-Bayesian physicalism, and I suspect that there might not be any objective solution other than biting the bullet: maybe it's not that big of a problem if we have false beliefs about teapots in the asteroid belt that we will never observe. (Also, if you just construct a universal Turing machine in any sensible way, without specifically creating it with the purpose of giving Russell's teapot a short description, then probably no such problems will emerge.)
The problem with Solomonoff induction IBP tries to solve is something different: what do we mean by "environments" on which we use a simplicity prior? AIXI's answer is that it looks at the description complexity of its own string of observations: this is reasonable, that's the input you have after all, what else could you do?
On the other hand, this pretty much breaks the whole heuristic behind Occam's razor. When scientists explore the laws of physics, they slowly figure out how to connect different phenomena to each other, reducing the number and length of the different laws necessary to describe our universe. There is a reasonable hope that if we understood physics well enough, we could reduce the description of our whole universe to a handful of equations governing reality, plus a few fundamental constants describing the initial conditions of the universe. This would be a pretty short description, which means that a Bayesian updater starting with a simplicity prior would converge to the true law pretty fast while observing physics.
But this nice, short description length "universe from third person perspective" is not how AIXI thinks of its environment. It has a simplicity prior over the programs that can output its own observations. Here, the true hypothesis has a much, much longer description: "This sequence of bits is produced by the universe described by these simple equations, and inside this universe, you can find this sequence in this particular Everett-branch, on that particular planet, in this particular moment in history, as the sensory observations of that particular agent." This description needs to specify the bridge rule picking out the agent in the universe, which makes it very long.
This is pretty bad. First, it's not the simplicity of this description that scientists are thinking about when they are using Occam's razor. Second, because of the long description, the sample complexity (amount of observations) necessary to learn the truth becomes huge, and for a long while the program "Output [hardcoding of the sequence of observations so far] then output forever" will be shorter than the true program, so the agent will believe the first one to be more likely. Third, because the simplicity prior is over the whole description including the bridge rule, the agent will have a strong presumption towards believing that it is in a central, short description length place in the universe. This contradicts the Copernican principle.
On the other hand, if your prior is over third-person descriptions of the universe, how do you update on seeing sensory observations? An intuitive answer seems to be "Use Bayes' theorem on the evidence that there exists an agent in the universe observing the things I observed so far." But what does "an agent" mean in this context? Also, you don't only have access to your sensory observations, but to your thoughts too, so maybe the evidence you condition on should be more like "There exists an agent that observed the things I saw and thought and acted the same way I did". But how do we define all of that? And should we weigh a hypothesis differently if, according to the hypothesis, there is more than one agent in the universe who fits the observations? What if the universe is infinite in one way or another, and every object, mind and observation exists somewhere with probability 1?
For me, it is still an open question how we should handle these questions of Occam's razor and anthropics, and IBP was created as an interesting attempt to answer them. I don't necessarily believe that it's very relevant to alignment in itself: the transformative AI will be good at the technically relevant parts of science, because it will be trained to be so, and I don't think it matters much whether it uses an actual, good formalization of Occam's razor or just messy heuristics of simplicity, like our scientists do. Still, I think it's a very interesting question in general, and I'm considering thinking about it more in the future.
And there is one aspect of these questions that I believe to be more plausibly relevant to AI alignment:
Paul Christiano developed the idea that AIXI would be susceptible to acausal attackers. The idea in a nutshell is that because AIXI uses a simplicity prior over its own input, it has a strong presumption towards being in a short-description-length place in the universe, which can make it believe that it's in a special kind of simulation. I don't want to go into details here, because others already wrote that up, and it's not very relevant to my current post, but it's a very clever argument (although it needs some questionable but not implausible assumptions).
I don't find it very realistic that this situation arises in this particular form, as I don't expect we will build anything actually close to a Bayesian reasoner with Solomonoff prior. (Paul also writes "I don't expect that humanity will ever do anything like this. This is all in the "interesting speculation" regime.")
But in general, I think it's worth thinking about simulation arguments. Going back to a previous analogy, I don't think that Napoleon was thinking very much about simulations, and I don't think he would ever have based important decisions on that, but at least there are some smart humans who do take the general simulation hypothesis seriously, and there are probably at least a few people who actually base some decisions on it. (Examples in the comments are welcome!)
I find it possible that there is a "correct way" to reason about simulation arguments and anthropics in general, and a smart enough AI will figure it out and base some important decisions on it if it concludes that there is a chance it lives in a simulation. If that's the case, it would be nice to figure out this "correct way" ourselves, so we can better prepare for how it might influence the AI's behavior.
It's also plausible that there can be different, otherwise highly capable thinking-structures that come to different conclusions about the simulation hypothesis. For example, one of the purported advantages of IBP is that it is less prone to believe itself to be in a simulation than AIXI.
If that's the case, then it's not clear what kind of thinking process the first transformative AI will use; for example, I have no idea what an AGI arising from the current deep learning paradigm would think about the simulation hypothesis. But potentially, if we understand the questions of Occam's razor, anthropics and simulations better ourselves, we can have better ideas of what to change in the architecture or training of a young mind to steer it towards our preferred conclusions on these questions.
I'm not actually sure, by the way, which direction I would want to steer the AI's thinking. Vanessa thinks about this in terms of making sure that the AI is not influenced by simulation worries. I agree that if we have an otherwise aligned AI, I wouldn't want it to make unpredictable decisions based on a simulation argument. On the other hand, if the situation is desperate enough, but we somehow figured out what kind of thinking-structures take the simulation argument more seriously, and how to steer our models in that direction, I would probably do it. My best guess is that if an AI gives a non-negligible likelihood to being in a simulation, and is the kind of thinking-structure that takes this kind of concern seriously, then it will take some not too costly steps to avoid doing things that might anger the simulators. As we, humans, seem to be the main characters of the story so far, it seems plausible that killing off or enslaving humanity would displease the simulators. This would mean that if leaving Earth to the humans is not too costly for the AI (as in the case of a paperclip maximizer, for whom Earth is no more valuable than any other chunk of raw material in the universe), then it might let humanity survive here, just to guard against the anger of potential simulators. Sure, it still leads to humanity losing most of the universe, but I think mankind living on Earth for a few more million years is already enough to explore lots of interesting experiences, activities and life histories, and for me this outcome feels at least as good as conquering the galaxies and fulfilling our whole cosmic potential.
Obviously, this is very speculative, and I'd rather not resort to this solution. But if we figure out what kind of models and training make an AI more susceptible to the simulation argument, this possibility is worth keeping in mind. A Hail Mary approach, in every sense of the phrase.
Anyways, I find simulation arguments plausibly important to study, I think Paul's acausal attacker argument is a nice illustration of what kind of logic can lead to believing in simulators, and IBP is an interesting attempt to create a system that's less likely to conclude that.
The ontological crisis

This is a concern seemingly unrelated to the previous two, but IBP also makes a nice attempt to handle it.
We often talk about utility functions, but for a long time I never considered what the function's domain is supposed to be. The possible states of the world? How would you assess that? You only have models of the world; you need to define your utility via the concepts you have in your model, as you don't have direct access to the underlying territory.
What happens then, when an agent refines its model of the world? Let's say it starts with a caveman-level world model where trees, stones, humans etc. are ontological objects. And let's assume it is well-aligned: it wants good things to happen to the humans. But then, to improve its abilities, it reads a chemistry textbook and learns that things are made of atoms. There are no such things as humans, only collections of atoms! There is no such thing as happiness, just some particular formation of atoms swirling through a brain-shaped chunk of atoms! None of its encoded values are well-defined in the new model now! The world is just a wasteland of cold matter devoid of meaning!
(This would actually be a pretty good outcome, as the AI just stops functioning because its value function completely loses meaning. The worse scenario is if the AI tries to do some kind of translation of values from its old world model to the new one, but the translation isn't perfect, so its alignment with humans breaks and it starts to maximize diamonds, because that was the only valuable object it could translate into an atomic description.)
I don't expect this to actually happen. If we solve the hard problem of successfully encoding the complex values of the human mind, or at least the complicated concept of "corrigibility", into the AI's values in its first ontology (which is probably already alien to a human mind), then I don't expect this to break after the AI switches to another alien ontology. Still, I think it's useful to think about the ontological crisis as a test-case of the more general problem of how to reliably translate concepts and values between different ontologies, like ours and the AI's.
(Note: I need to look more into John's Natural Abstraction Hypothesis, this seems relevant here.)
Interestingly, a shortcut solution to the ontological crisis falls out naturally from the framework IBP develops for handling the mostly unrelated Occam's razor question. I want to flag this as a positive sign that IBP might be on the right track towards something, even if I'm uncomfortable with its conclusion here: it's generally a good sign that we are getting closer to the truth when a theory developed to solve one thing happens to offer a persuasive solution to another, unrelated thing too.
Is infra-Bayesian Physicalism a good solution to these problems?
Newcomb's problem and the five-and-ten problem
IBP was created with acausal decision theory and action counterfactuals in mind, so IBP is well-equipped to handle Newcomb-like scenarios, and the paradox of the five-and-ten problem is dissolved for an IBP agent. As I previously explained, I'm not very convinced about the importance of these questions, but it's a good thing that IBP handles them well.
Occam's razor and anthropics
IBP uses third-person descriptions of the world, which is definitely a good point, as it conforms to the Copernican principle. Does the correct hypothesis about the world actually have a short description length, as we would hope for (like "a few rules and a few initial conditions")? Not really, as it also needs to include a prior over what the outputs of different computations are (you don't know the result of all computations from the beginning, you just have guesses about them until you actually calculate them!), and it also needs a prior over the interaction of the computational world and the physical world. This makes sense, but it makes the theory hard for me to process, and the description of the hypothesis the agent needs to learn certainly won't be short. On the other hand, Vanessa argues that having a prior over the results of computations is actually inevitable for any agent, so this is not really a disadvantage compared to AIXI, and at least we get rid of the long-description bridge rule.
I spent some time trying to wrap my head around this, but I'm still very confused. I find this question interesting enough, though, that I will probably return to trying to work this out, and if I reach a better understanding, I will return to edit this part. Anthropics is hard.
For now, my tentative conclusion is that IBP takes a promising shot at this problem, although we don't really have proofs of anything yet, and I'm not fully convinced that there is no hard-to-detect sleight of hand hidden somewhere in the argument.
Because IBP takes a third-person view and adheres to the Copernican principle, unlike AIXI, it doesn't fall prey to Paul's acausal attackers: it doesn't assign a high prior to being in a simulation that is run in a central, short-description-length place in a universe.
This is an interesting point in favor of the theory that different thinking-structures (like AIXI and IBP) might have different in-built presumptions about the simulation hypothesis. On the other hand, the fact that IBP doesn't fall for this specific trap doesn't mean there is no other reason that convinces it that it's in a simulation. After all, the acausal attackers argument is also something I probably wouldn't have come up with on my own, and comparatively little thought has gone so far into searching for similar tricky failure modes in IBP.
I also feel that the fact that acausal thinking comes very naturally to IBP can open some vulnerabilities towards giving concessions to non-existent entities that are even weirder than the simulation hypothesis. But I don't actually know any examples, and IBP might manage to handle all of these gracefully, I'm just superstitiously worried about any agent that thinks too deeply about acausality, because I don't understand it well enough.
IBP's proposed solution is that your utility function shouldn't be defined in terms of objects in your current model, as you might lose those after reading a chemistry textbook. Instead, you should have a way of determining which computations are "run by the universe". It's not obvious at all what a good definition for this would be, but it turns out that this step is also necessary for the basic setup of IBP that was developed for formalizing Occam. If you have that, you can have a utility function over which computations you find valuable to be run by the universe. As you refine your world-model, the domain of your utility function doesn't change, and if you made sure in a less-refined model that a computation is running, then you don't lose this fact when you refine your model.
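To make this concrete, here is a minimal toy sketch (my own construction, not IBP's actual formalism) of a utility function whose domain is a fixed set of computations, so that refining the world-model leaves it untouched:

```python
# Toy model: utility depends only on which computations the universe runs,
# not on the ontology of the current world-model. Names are illustrative.

# A "computation" is just a named program here, mapped to how much we value it.
VALUED_COMPUTATIONS = {"simulate_happy_human": 1.0, "compute_pi_digits": 0.1}

def utility(running: set) -> float:
    """Utility is a function of the set of computations being run."""
    return sum(v for name, v in VALUED_COMPUTATIONS.items() if name in running)

# Coarse, caveman-level model: the agent has established that a valued
# computation is instantiated somewhere in the world.
coarse_model = {"running": {"simulate_happy_human"}}

# Refined model adds physical detail (atoms!) but preserves the facts about
# which computations are running, so the utility function never breaks.
refined_model = {"running": {"simulate_happy_human"},
                 "physics": "atoms all the way down"}

assert utility(coarse_model["running"]) == utility(refined_model["running"])
```

The point is only that `utility` never mentions trees, humans, or atoms: as long as the refined model preserves the facts about which computations run, the value assignment survives the ontology shift.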
The obvious downside is that we need to accept the controversial philosophical statement "Everything valuable can be expressed in terms of which computations are running in the universe". This is a defensible position: qualia might be substrate-independent, in which case, if the AI takes the computations that my brain would do during an enjoyable activity and runs them on a computer, or creates a collection of billiard balls whose motion is described by the computations that are equivalent to my thoughts, it should be just the same.
I'm uncomfortable with this position, especially since it implies that even the timing of computations doesn't matter: the IBP agent definitely considers it fine to simulate a person's life with each moment being simulated a thousand years apart on different computers, but even the order of the simulated life events doesn't have to match the chronological order of the life history. There are some constraints on which kinds of life moments the AI can reliably simulate without previously computing the other life events that led to this exact situation (including the human having memories of those previous events during this moment), but I feel that the AI still has considerable liberty in changing the order of the events. Although in practice I don't see a good reason why it would do so, this thought experiment makes it harder for me to bite the bullet of computationalism. Also, I think that most people are even less sympathetic to the idea of "everything valuable is just a computation", so it would pretty strongly go against my democratic intuitions to hardcode this principle into a world-conquering AI.
The most serious objections
The main problem with IBP identified so far is the so-called monotonicity principle. When we define the utility function of an IBP agent over which computations should be run in the universe, we can't give any computation negative utility. This is just a mathematical fact about how IBP can think about utilities. Given a chance to create Hell and torment people for eons while getting an ice cream out of this business, an IBP agent takes the deal, because eating the ice cream has positive value, and torments don't have negative value, so it's a positive deal overall.
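A trivial toy calculation (mine, not from the IBP papers) shows why the constraint bites: with every computation's utility clamped at zero or above, the Hell-plus-ice-cream bundle can never score below doing nothing:

```python
# Toy illustration of the monotonicity principle: no computation may be
# assigned negative utility, so adding a Hell simulation can never make
# an outcome worse in the agent's eyes.

utilities = {"ice_cream": 0.1, "hell_simulation": 0.0}  # no negatives allowed

def total(computations):
    return sum(utilities[c] for c in computations)

do_nothing = total([])
hell_deal = total(["hell_simulation", "ice_cream"])

# The IBP agent takes the deal: 0.1 > 0.0.
assert hell_deal > do_nothing
```

With a sane utility function we would set `utilities["hell_simulation"]` to a large negative number and the comparison would flip; monotonicity forbids exactly that move.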
Vanessa is very much aware that this is a serious problem, but the best solution I've heard so far is that the available universe is plausibly finite, so creating a Hell-region has the opportunity cost of not creating a Heaven-region in its place instead. Still, an IBP agent has no objection to enslaving the population of Earth in horrible labor camps to speed up the creation of universe-conquering von Neumann probes a little, so it can reach a bigger region of the light cone and tile it with happy computations billions of years from now. I'm selfish enough to be worried by this possibility.
Also, I'm not convinced enough by the opportunity cost argument; there can be lots of pitfalls.
"Well, I already simulated most normal pleasurable experiences, so there are strongly diminishing returns to creating new standard utopia-regions. However, I haven't yet simulated the positive experience someone can get by rescuing someone else from a torture chamber! That's a valuable new situation, and the suffering part of the computation costs nothing anyway! For that matter, I also haven't yet simulated the pleasurable experience Marquis de Sade gets from torturing people..."
I think we might think up some potential workarounds to this too (the whole process of the torture counts as one computation and not just the part of de Sade enjoying it, so we can give it value, as we don't approve of it happening?). But I don't think we can think through all failure modes, and I'd rather choose a paperclip maximizer killing everybody over an agent that literally can't conceive of anything being worse than non-existence.
Incidentally, this is the reason I didn't look very deeply into Vanessa's concrete alignment strategy, Physicalist Superimitation (previously called Pre-DCA), as it is based on making an IBP agent the sovereign of the universe, and I'm extremely wary of any such proposal, because of the monotonicity principle.
Similarly, if we think of IBP not as a way to design an AI, but as a model of how we, ourselves should think about Occam's razor, the situation is not much better. I already have a value system which very much assigns negative value to certain experiences, and if I want to figure out how to use Occam's razor, the answer really shouldn't include that I have to change my fundamental values first.
My conclusion from looking into IBP was that unfortunately the monotonicity principle is baked very deeply into the framework, and I don't see a way to get rid of it without changing the whole machinery.
That's why I started with saying that IBP is the part of the agenda that I have the most mixed feelings about: I feel that its questions are especially interesting and potentially important, and it's a valiant try to answer them, and it does contain some interesting ideas, but because of the monotonicity principle, I basically consider it a failed framework.
So my advice to new researchers interested in the topic is a stronger iteration of my advice from before: try to come up with your own theory, and hope that it can solve the same problems but without IBP's pitfalls. I would advise trying it first without even looking very much into IBP, my guess is that it's better to keep your thoughts independent. Then, after you get stuck, or if you produced some results, you can look into IBP to see if you can get some inspiration from there, because it really does contain some interesting ideas.
One more note about IBP: the formalism is very complicated even by the standards of infra-Bayesianism. Fifth level power sets (probability distributions over convex sets of probability distributions of power sets of power sets of all computations) are standard building blocks of the theory, which makes it very hard to visualize anything and calculating even the simplest toy example was a huge struggle. I think it's at least worth a try for someone to develop a theory that doesn't require that level of conceptual complexity.
I also want to note that Vanessa developed IBP when she was already thinking in infra-Bayesian terms, which led to IBP being created inside this frame. But I see no strong a priori reason why infra-Bayesianism would be necessary for a good formalization of Occam's razor and anthropics. So I would actually advise researchers to first try thinking about this question without looking very deeply into infra-Bayesianism, to keep their thoughts independent. This would have the extra advantage that the monotonicity principle comes in mostly because of some special properties of infra-Bayesian thinking, so if it turns out that infra-Bayesianism is not actually necessary for formalizing Occam, then the monotonicity principle probably wouldn't show up in the new theory.
Personally, I find the questions motivating IBP interesting enough that I might decide to follow this advice myself and try to develop my own answers.
Ambitious value learning vs corrigibility
As I was thinking about IBP, I started to form some more general objections towards ambitious value learning.
I don't want an AI to try to guess my utility function or my coherent extrapolated volition, then try to maximize value according to that. I'm not convinced that I do have a utility function that can be coherently extrapolated. In general I'm highly distrustful of any AI that tries to maximize for anything.
I also don't really want the AI to try to figure out the general truth about the universe and then go ahead and maximize my utility in some weird way implied by its discoveries. Here, I think I have a general difference in attitude with Vanessa that I'm more concerned about our clever creation breaking down after it discovers an unexpected concept in physics.
We had a few discussions about IBP, and Vanessa sometimes used arguments like "The available universe is probably finite, so opportunity cost can save us from Hell" and "It seems that quantum mechanics doesn't allow Boltzmann-brains, so we might not have to worry about that" and "Vacuum collapse is either not possible or inevitably happens, so we don't have to worry about the IBP agent obsessing unreasonably about avoiding it". Of course, Vanessa is very much aware that we shouldn't launch a world-conquering super-AI whose safety is based on arguments that have this level of confidence. But I still think there is a pretty strong disagreement in our outlook, as she seems more optimistic that with enough work, we can just iron out these questions.
Quoting her: "IBP seems like a promising direction in which to study this sort of questions, and hopefully once we understand these and other questions *way, way* better than we currently do, we will be well-equipped to actually know what we're doing when we build AI. Also, here are some arguments why certain objections to IBP might turn out to be wrong, but it might also easily turn out that these arguments are 100% wrong and the objections are 100% correct. Whatever the case may be, I believe that working out the math is usually a better method to move towards understanding than lots of philosophical hand-waving, at least when the mathematical understanding is as shallow as it is at this point".
This is all reasonable, but I suspect that we will just never get to the point where we understand all the questions similar to "How will our value-maximizing AI react to the possibility of Boltzmann-brains?" well enough that I would trust turning on the AI. More importantly, there can always be discoveries about the nature of the world that we can't even imagine. The universe is weird. If I understand correctly, Vanessa's plan for this would be to come up with mathematical guarantees that the agent will act in a satisfactory way in every scenario, so we don't need to worry about specific weirdnesses like Boltzmann-brains. That might not be impossible, but personally I don't find it likely that we could create an AI that just goes ahead maximizing some kind of predetermined value, but reliably never breaks down even when faced with situations no human ever thought of.
In general, when our AI figures out the possibility of Boltzmann brains, I want it to tell us nicely "Hey, I figured out that there will be Boltzmann brains after the heat death, and I have some nice philosophical arguments on why it actually matters." Then we might or might not listen to its philosophical arguments, then tell it that it's all nice and well and it can do something reasonable about it, but it definitely shouldn't destroy the Earth because of some newfound utilitarian priority. Also, I would feel safer with the AI having only relatively short-term and localized goals instead of grand universal maximization; that seems less likely to break down from some cosmological thought experiment we haven't thought of.
I know that corrigibility has its own set of problems and inconsistencies, but this reasonable behavior sounds mostly like what we call corrigibility, and I prefer to look more into alignment strategies that intend to get us there instead of creating a sovereign AI maximizing value according to what it imagines as our coherent extrapolated volition.
It would be nice if in one or two years someone would write a follow-up post on how much progress infra-Bayesianism and Vanessa's agenda have made on the problems mentioned in this post. Do we have more actual results that feel like satisfactory answers to the questions of embedded agency? Do we have a better understanding of performance guarantees in general environments than what I write about in my other post? Is there an alternative to IBP that doesn't have the monotonicity principle baked in (or do we have an extremely compelling case why the monotonicity principle is not a problem)?
Until then, I would advise new researchers to look for other directions, while I wish success to Vanessa and others who decided to pursue her agenda in building interesting results on the theory built so far.
Also, one final piece of advice for those who still decide that they want to work on infra-Bayesianism: there might be a textbook coming! Vanessa says that someone is working on a more readable version of the infra-Bayesian sequences, and it might be published in a few months. If this work really is in progress (confirmations in the comments are welcome!), then I would advise everyone to wait for its publication instead of heroically jumping into the current infra-Bayesian sequences.
Thank you for this detailed review, David. Replies to selected points:
Vanessa, Diffractor, and who is the third one?
I'll mention my own issues with IBP, and where the fatal issue lies in my opinion.
The most fatal objection, is as you said the monotonicity principle issue, and I suspect this is an issue because IBP is trying to both unify capabilities and values/morals, when I think they are strictly separate types of things, and in general the unification heuristic is going too far.
To be honest, if Vanessa managed to focus on how capable the IBP agent is, without trying to shoehorn an alignment solution into it, I think the IBP model might actually work.
I disagree on whether maximization of values is advisable, but I agree that the monotonicity principle is pointing to a fatal issue in IBP.
Another issue is that it's trying to solve an impossible problem: it's trying to prevent simulation hypotheses from forming even when the AI already has a well-calibrated belief that we are being simulated by a superintelligence. But even under the most optimistic assumptions, if the AI is actually acausally cooperating with the simulator, we are no more equipped to fight against it than we are against alien invasions. In the worst case, it would be equivalent to fighting an omnipotent and omniscient god, which is pretty obviously known to be unsolvable.
I’m curious why you say it handles Newcomb’s problem well. The Nirvana trick seems like an artificial intervention where we manually assign certain situations infinite utility to enforce a consistency condition, which then ensures they are ignored when calculating the maximin. If we are manually intervening, why not just manually cross out the cases we wish to ignore, instead of adding them with infinite value and then immediately ignoring them?
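To illustrate the point with a toy sketch (my own construction, not the actual infra-Bayesian machinery): the infinite-utility "Nirvana" branches can never be the minimum, so the maximin is identical to simply deleting those branches first:

```python
# Toy Newcomb payoff table. Environments where Omega's prediction
# contradicts the action actually taken are "Nirvana" branches, manually
# assigned +inf utility.
import math

outcomes = {
    "one_box": {"predicted_one_box": 1_000_000, "predicted_two_box": math.inf},
    "two_box": {"predicted_one_box": math.inf, "predicted_two_box": 1_000},
}

def maximin(table):
    # Pick the action whose worst-case utility across environments is best.
    return max(table, key=lambda a: min(table[a].values()))

def maximin_crossed_out(table):
    # Same computation with the +inf branches simply removed beforehand.
    return max(table, key=lambda a: min(v for v in table[a].values()
                                        if v != math.inf))

# Both procedures give the same answer: one-boxing (worst case 1,000,000
# beats worst case 1,000).
assert maximin(outcomes) == maximin_crossed_out(outcomes) == "one_box"
```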
Just because we modelled this using infrabayesianism, it doesn’t follow that it contributed anything to the solution. It feels like we just got out what we put in, but this is obscured by a philosophical shell game. The reason it feels compelling is that, though we’re only adding in an option to then immediately ignore it, this is sufficient to give us a false sense of having made a non-trivial decision.
It would seem that infrabayesianism might be contributing to our understanding of the problem if the infinite utility arose organically, but as far as I can tell, this is a purely artificial intervention.
I think this is made clearer by Thomas Larson’s explanation of infrabayesianism failing Transparent Newcomb’s. It seems clear to me that this isn’t an edge case; instead it demonstrates that rather than solving counterfactuals, all this trick does is give you back what you put in (one-boxing in the case where you see proof you one-box, two-boxing in the case where you see proof you two-box).
(Vanessa claims to have a new intervention that makes the Nirvana trick redundant, if this doesn’t fall prey to the same issues, I’d love to know)
You don't need the Nirvana trick if you're using homogeneous or fully general ultracontributions and you allow "convironments" (semimeasure-environments) in your notion of law causality. Instead of positing a transition to a "Nirvana" state, you just make the transition kernel vanish identically in those situations.
However, this is a detail; there is a more central point that you're missing. From my perspective, the reason Newcomb-like thought experiments are important is that they demonstrate situations in which classical formal approaches to agency produce answers that seem silly. Usually, the classical approaches examined in this context are CDT and EDT. However, CDT and EDT are both too toyish for this purpose, since they ignore learning and instead assume the agent already knows how the world works, and moreover that this knowledge is represented in the preferable form of the corresponding decision theory. Instead, we should be thinking about learning agents, and the classical framework for those is reinforcement learning (RL). With RL, we can operationalize the problem thus: if a classical RL agent is put into an arbitrary repeated Newcomb-like game, it fails to converge to the optimal reward (although it does succeed for the original Newcomb problem!)
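The repeated-game operationalization can be sketched with a toy simulation (a simplification of mine, not the actual IBRL setup): when Omega's prediction always matches the action actually taken, even a simple epsilon-greedy bandit learner converges to one-boxing, matching the claim that classical RL succeeds on the original Newcomb problem:

```python
# Repeated Newcomb as a two-armed bandit: Omega's prediction matches the
# action the agent actually takes each round (the pseudocausal case).
import random

random.seed(0)

def payoff(action: str) -> int:
    predicted_one_box = (action == "one_box")
    big = 1_000_000 if predicted_one_box else 0  # opaque box
    small = 1_000                                # transparent box
    return big if action == "one_box" else big + small

q = {"one_box": 0.0, "two_box": 0.0}
counts = {"one_box": 0, "two_box": 0}
for t in range(2000):
    if random.random() < 0.1:  # explore
        a = random.choice(["one_box", "two_box"])
    else:                      # exploit the current best estimate
        a = max(q, key=q.get)
    counts[a] += 1
    # Incremental average update of the action-value estimate.
    q[a] += (payoff(a) - q[a]) / counts[a]

# The learner discovers that one-boxing yields 1,000,000 vs 1,000.
assert max(q, key=q.get) == "one_box"
```

In non-pseudocausal variants, where the prediction can depend on the policy off-path, this kind of learner has no such guarantee, which is where the infra-Bayesian machinery comes in.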
On the other hand, an infra-Bayesian RL agent provably does converge to optimal reward in those situations, assuming pseudocausality. Ofc IBRL is just a desideratum, not a concrete algorithm. But examples like Tian et al and my own upcoming paper about IB bandits show that there are algorithms with reasonably good IB regret bounds for natural hypothesis classes. While an algorithm with a good regret bound for ultra-POMDPs has not yet been proposed, it seems very likely that it exists.
Now, about non-pseudocausal scenarios (such as noiseless transparent Newcomb). While this is debatable, I'm leaning towards the view that we actually shouldn't expect agents to succeed there. This became salient to me when looking at counterfactuals in infra-Bayesian physicalism. [EDIT: actually, things are different in IBP, see comment below.] The problem with non-pseudocausal updatelessness is that you expect the agent to follow the optimal policy even after making an observation that, according to the assumptions, can never happen, not even with low probability. This sounds like it might make sense when viewing an individual problem, but in the context of learning it is impossible. Learning requires that an agent that sees an observation which is impossible according to hypothesis H, discards hypothesis H and acts on the other hypotheses it has. There is being updateless, and then there is being too updateless :)
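The learning requirement in question is just the zero-probability case of a Bayesian update, as in this toy sketch (my own illustration, with made-up hypothesis names):

```python
# An observation that a hypothesis assigns probability 0 must eliminate
# that hypothesis; all remaining probability mass goes to its rivals.
priors = {"H_sim": 0.3, "H_real": 0.7}
likelihood = {"H_sim": {"green": 0.0, "red": 1.0},
              "H_real": {"green": 0.5, "red": 0.5}}

def update(prior, obs):
    post = {h: p * likelihood[h][obs] for h, p in prior.items()}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

# "green" is impossible under H_sim, so seeing it discards H_sim entirely.
post = update(priors, "green")
assert post["H_sim"] == 0.0
```

A fully updateless agent would keep acting on `H_sim` even after seeing `green`; the point above is that a learning agent cannot afford that.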
Scott Garrabrant wrote somewhere recently that there is tension between Bayesian updates and reflective consistency, and that he thinks reflective consistency is so important that we should sacrifice Bayesian updates. I agree that there is tension, and that reflective consistency is really important, and that Bayesian updates should be partially sacrificed, but it's possible to take this too far. In Yudkowsky's original paper on TDT he gives the example of an alphabetizing agent as something that can be selected for by certain decision problems. Ofc this doesn't prove non-alphabetizing is irrational. He argues that we need some criterion of "fairness" to decide which decision problems count. I think that pseudocausality should be part of the fairness criterion, because without it we don't get learnability: and learnability is so important that I'm willing to sacrifice reflective consistency in non-pseudocausal scenarios!
Instead of literal repetition, we could examine more complicated situations where information accumulates over time so that the nature of the game can be confidently inferred in the limit. But, the principle is the same. ↩︎
If you don't care about the specific regret bound then it's easy to come up with an algorithm based on Exp3, but that's just reducing the problem to blind trial and error of different policies, which is missing the point. The point being, the ability to exploit regularities in the world which also applies to Newcomb-like scenarios. ↩︎
You need ultra-POMDPs to model e.g. counterfactual mugging. Even ordinary POMDPs have been relatively neglected in the literature, because the control problem is PSPACE-hard. Dealing with that is an interesting question, but it seems orthogonal to the philosophical issues that arise from Newcomb. ↩︎
Although there is still room for a fairness criterion weaker than pseudocausality but stronger than the imagined fairness criterion of UDT. ↩︎
Thanks for the detailed response.
To be honest, I’ve been persuaded that we disagree enough in our fundamental philosophical approaches, that I’m not planning to deeply dive into infrabayesianism, so I can’t respond to many of your technical points (though I am planning to read the remaining parts of Thomas Larson’s summary and see if any of your talks have been recorded).
“However, CDT and EDT are both too toyish for this purpose, since they ignore learning and instead assume the agent already knows how the world works, and moreover this knowledge is represented in the preferable form of the corresponding decision theory” - this is one insight I took from infrabayesianism. I would have highlighted this in my comment, but I forgot to mention it.
“Learning requires that an agent that sees an observation which is impossible according to hypothesis H, discards hypothesis H and acts on the other hypotheses it has” - I have higher expectations from learning agents - that they learn to solve such problems despite the difficulties.
I'm saying that there's probably a literal impossibility theorem lurking there.
But, after reading my comment above, my spouse Marcus correctly pointed out that I am mischaracterizing IBP. As opposed to IBRL, in IBP, pseudocausality is not quite the right fairness condition. In fact, in a straightforward operationalization of repeated full-box-dependent transparent Newcomb, an IBP agent would one-box. However, there are more complicated situations where it would deviate from full-fledged UDT.
Example 1: You choose whether to press button A or button B. After this, you play Newcomb. Omega fills the box iff you one-box both in the scenario in which you pressed button A and in the scenario in which you pressed button B. Random is not allowed. A UDT agent will one-box. An IBP agent might two-box because it considers the hypothetical in which it pressed a button different from what it actually intended to press to be "not really me" and therefore unpredictable. (Essentially, the policy is ill-defined off-policy.)
Example 2: You see either a green light or a red light, and then choose between button A and button B. After this, you play Newcomb. Omega fills the box iff you either one-box after seeing green and pressing A or one-box after seeing green and pressing B. However, you always see red. A UDT agent will one-box if it saw the impossible green and two-box if it saw red. An IBP agent might two-box either way, because if it remembers seeing green then it decides that all of its assumptions about the world need to be revised.