These are my thoughts on this post of Paul Christiano. I claim "malign" models do not form the bulk of the Solomonoff prior.
Suppose that we use the universal prior for sequence prediction, without regard for computational complexity. I think that the result is going to be really weird, and that most people don’t appreciate quite how weird it will be.
What are we predicting and how natural is it?
Suppose that it’s the year 2020 and that we build a camera for our AI to use, collect a sequence of bits from the camera, and then condition the universal prior on that sequence. Moreover, suppose that we are going to use those predictions to make economically significant decisions.
We aren’t predicting an especially natural sequence from the perspective of fundamental physics: to generate the sequence you really have to understand about how the camera works, about how it is embedded in the physical universe, about how it is moving through space, etc.
On top of that, there are lots of “spots” in the universe, and we are picking out a very precise spot. Even if the sensor was perfectly physically natural, then it would still be quite complicated to pick out which physically natural thing it was. Even picking out Earth from amongst planets is kind of complicated, picking out this particular sensor is way more complicated.
So the complexity of a “natural” description of our sequence is actually reasonably high. Much smaller than the complexity of existing compression algorithms, but high enough that there is room for improvement.
I agree that the complexity of the natural description of the sequence is reasonably high, so we can't immediately rule out room for improvement.
Specifying a consequentialist probably requires very very few bits. (Here I mean “consequentialist” in the sense of “agent with preferences,” not in the sense that a philosopher might be a consequentialist.)
Suppose I specify a huge simple lawful universe (like our own), and run it for a very long time. It seems quite likely that consequentialist life will appear somewhere in it, and (if the universe is hospitable) that it will gradually expand its influence. So at late enough times, most of the universe will be controlled by consequentialists.
We can concisely specify a procedure for reading a string out of this universe, e.g. somehow we pick out a sequence of spacetime locations and an encoding, make it clear that it is special, and then record bits through that channel. For example, in a cellular automaton, this might literally be a particular cell sampled at a particular frequency.
All of this takes only a handful of bits. Exactly how many depends on exactly what computational model we are using. But as an example, I expect that Turing machines with only 2-4 states can probably implement rich physical universes that are hospitable to life. I think that cellular automata or pointer machines have similarly simple “rich physical universes.”
Let me give a formal picture of this. We have a Turing machine with a unidirectional read-only input tape, a unidirectional write-only output tape, and multiple bidirectional work tapes. Unidirectional just means the tape head can only move one direction. The particular Turing machine is just the instructions for a) what to write on each of the tapes at the locations where the tape heads currently are, b) which direction the tape heads should move, and c) what computation state for the machine to enter. If these instructions are universal, we can interpret the input tape as taking a program, and the output tape is for the output sequence. The universal prior can be thought of as the probability over the output strings given Bernoulli(1/2) bits on the input tape. I think it's easier to think about the following similar formulation of the universal prior.
Each Turing machine--that is, each possible set of instructions about how the tape heads move and write given what they see--produces outputs stochastically given Bernoulli(1/2)-sampled bits on the input tape. That single Turing machine thus defines a probability distribution over infinite binary strings. The universal prior can also be thought of as a mixture over the probability distributions of all possible Turing machines, weighted by a prior over Turing machines, where the prior depends on some simple property like the number of computation states the machine uses. The nice thing about this framing, in which we separate out the models in the mixture explicitly, is that we'll spend plenty of time looking at individual ones. For concreteness, let's assume our prior over Turing machines assigns some prior weight to the set of Turing machines with a given number of computation states, and assigns uniform weight to each Turing machine within that set.
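To make the mixture framing concrete, here is a minimal sketch using toy machines rather than real Turing machines. Everything in it is an assumption made for illustration: a "machine" is a finite table from (state, input bit) to (next state, output), with no work tapes; the weight 2^-s on the set of s-state machines is a stand-in for whatever prior one prefers; and the names enumerate_machines, prefix_probability, and mixture_probability are hypothetical.

```python
from fractions import Fraction
from itertools import product


def enumerate_machines(n_states):
    """Yield every toy machine with n_states states.

    A machine maps (state, input_bit) -> (next_state, output),
    where output is None (write nothing), 0, or 1.
    """
    keys = [(s, b) for s in range(n_states) for b in (0, 1)]
    actions = [(t, o) for t in range(n_states) for o in (None, 0, 1)]
    for choice in product(actions, repeat=len(keys)):
        yield dict(zip(keys, choice))


def prefix_probability(machine, prefix, steps):
    """Probability that the output starts with `prefix` within `steps`
    fair-coin input bits, starting from state 0."""
    if not prefix:
        return Fraction(1)
    # dist maps (state, number of prefix bits already matched) -> probability
    dist = {(0, 0): Fraction(1)}
    done = Fraction(0)
    for _ in range(steps):
        new = {}
        for (state, k), p in dist.items():
            for bit in (0, 1):
                nxt, out = machine[(state, bit)]
                if out is None:
                    key = (nxt, k)
                elif out == prefix[k]:
                    if k + 1 == len(prefix):
                        done += p / 2  # whole prefix has now been written
                        continue
                    key = (nxt, k + 1)
                else:
                    continue  # output diverged from the prefix
                new[key] = new.get(key, Fraction(0)) + p / 2
        dist = new
    return done


def mixture_probability(prefix, max_states, steps):
    """Mixture over all toy machines with 1..max_states states.

    Assumed prior: weight proportional to 2**-s on the set of s-state
    machines, split uniformly among the machines in that set.
    """
    norm = sum(Fraction(1, 2 ** s) for s in range(1, max_states + 1))
    total = Fraction(0)
    for s in range(1, max_states + 1):
        machines = list(enumerate_machines(s))
        weight = Fraction(1, 2 ** s) / norm / len(machines)
        for m in machines:
            total += weight * prefix_probability(m, prefix, steps)
    return total


p0 = mixture_probability((0,), max_states=2, steps=6)
p1 = mixture_probability((1,), max_states=2, steps=6)
print(p0 == p1)  # True: swapping written 0s and 1s is a bijection on machines
```

The design choice worth noting is that each machine is kept as its own component, with its own distribution over outputs, and the prior only enters when the components are averaged.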
So, in this formalism, there will be simple Turing machines such that consequentialists are likely to live within the work tapes. ("Likely" is with respect to noise on the input tape.) The output tape will record some fraction of what happens on the work tapes; maybe it will record everything. Maybe the contents of the output tape will be recorded on a work tape too; that is, the output tape happens to be an exact copy of a work tape. From within the work tapes, there is no visibility of the output tape, and even if one work tape happens to mimic the output tape perfectly, there is no evidence of this to an inhabitant of the work tapes, because the content of the output tape has no observable effect on the work tapes; it is, by definition, "write-only".
So my first real disagreement is with the next paragraph.
Specifying how to read out the bits, and signaling the mechanism to the universe’s consequentialist inhabitants, apparently requires a little bit more complexity. We’ll return to this topic in a future section, but in the end I think it’s basically a non-issue.
We don't need to inform the work tape's inhabitants about the output instructions. In fact, it would hardly be possible, because nothing on the work tapes can provide evidence about the content of the output tape. We could (laboriously) program the world to have inhabitants that believe that certain things are being written to the output tape, but our only method for signaling anything to the inhabitants of the work tapes is through the Turing machine instructions. That's a simple consequence of the type signatures of the objects we're discussing.
The only observation which work tape inhabitants can use to infer something about the output tape is that they exist. In the circumstance where there is no output tape, but they exist anyway (that is my mainline opinion for our universe, by the way; there is no output tape, and we exist for some other reason than being of use to some civilization that is simulating us), the observation of existence has no implications for the output tape. If, on the other hand, they assume they exist because someone is running lots of Turing machines with random noise as input in order to make predictions about their world, then the observation of their existence will be evidence that the output tape of their world has probably been corresponding with some stream of data in the "real world"--the world of the people running all these Turing machines. If it didn't correspond, simulating them would no longer be useful. That single fact is the sum total of what work tape inhabitants will know about the nature of the output tape. (This is an anthropic update, but it's different from what Paul later calls the anthropic update).
What do the consequentialists do?
Reasoning about consequentialist civilizations is challenging, but we have one big advantage: we can study one from the inside.
It’s very hard to predict exactly what our civilization will do. But it’s much easier to lower bound the distribution over possible outcomes. For anything we can think of, that our civilization has a plausible motive to do, it seems fair to say that there is a non-negligible probability that we will do it.
Recall that the natural measure here is bits. So if the consequentialist civilization implements a strategy with probability 1/1000, that only adds 10 bits of description complexity, which is significant but not a huge deal. In fact I think that the weird strategies discussed here are quite a bit more likely than that, but this is going to come down to complex big-picture disagreements. We should all be able to agree on “not totally crazy” though.
One thing the consequentialists might do is to try to control the universal prior. If they discover that their universe has simple physics (as ours probably does), then they will be aware that their behavior is directly reflected in the universal prior. Controlling the universal prior could have many possible advantages for a consequentialist civilization—for example, if someone uses the universal prior to make decisions, then a civilization which controls the universal prior can control those decisions.
Regarding "10 bits of description complexity", yes that is small relative to the total description length of a world like ours, but it would be premature to cache the belief that it is negligibly small, because we'll later be looking at the difference in description complexity between different models. Otherwise, no objections here.
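The arithmetic behind the "10 bits" figure, and why it matters when comparing models, can be checked in a couple of lines. This is only a sketch; the function names bits_for_probability and prior_ratio are hypothetical.

```python
import math


def bits_for_probability(p):
    """Description-length cost, in bits, of conditioning on an event of probability p."""
    return -math.log2(p)


def prior_ratio(bit_gap):
    """Prior odds between two models whose description lengths differ by bit_gap bits."""
    return 2.0 ** bit_gap


print(round(bits_for_probability(1 / 1000), 2))  # 9.97
print(prior_ratio(10))  # 1024.0
```

So a strategy adopted with probability 1/1000 costs just under 10 bits, and a 10-bit gap between two otherwise comparable models corresponds to roughly a thousand-fold difference in prior weight.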
Moreover, from the perspective of many consequentialists, exercising such control might be very cheap. And before they had to do it they would have plenty of time to simulate many other civilizations and understand exactly how much is up for grabs.
Finding the output channel
Suppose that we reach the point where we have a complete understanding of physics and of the initial conditions of our universe. I think this is a likely outcome for human physics over the next billion years, and certainly it seems like it should happen for a non-negligible fraction of all civilizations that emerge in simple universes.
At this point we have considered a wide range of languages for formal specification (Turing machines, pointer machines, cellular automata…) and in each of them we understand how to most concisely specify our universe. We may find that in some of these descriptions our universe is quite simple and in others it is ungainly, and so we naturally focus our attention on the languages in which it is simple.
So we can expect work tape inhabitants trying to control the universal prior to act on the belief that they live on the work tape of a Turing machine.
Now we can go a step further. We can say: suppose that someone wanted to extract some bits from our universe. How would they do it? We can then find the simplest mechanisms that they might use to extract bits (or more precisely, the simplicity distribution over all ways of extracting bits).
For example, in the context of a cellular automaton that starts with a single non-zero cell, it would be quite simple to read off bits from the location which was initially non-zero. Most cases will be much more complicated, but once we actually have a description of our universe in hand, we can get an accurate estimate for this distribution.
So to the extent that we are interested in influencing the universal prior, we have a clear approach: consider these simplest ways of extracting bits, and then send a message using those as channels. We can send across many channels at once, or if two channels conflict we can sample the channel to use with probability proportional to the probability of that channel; we can focus our efforts on the channels we can best control, etc.
This doesn’t require the person who set up the simulation to send any special signals, and indeed doing so is pretty unproductive. You are better off just leaving the civilization to their guess.
This also doesn’t necessarily require us to sacrifice very much of our control over our physical universe—we just have to influence the few regions of the universe that are “simplest.” (Of course, these regions of the universe might be ascribed special pseudo-spiritual significance.)
With the prior I described, for every Turing machine with an instruction to write "1" to the output tape (for a given computation state and for given tape head readings) there is another equally likely one that writes "0" instead. There is no "place" in the universe computed on the work tapes corresponding to "good output material", because the instructions for the output tape head are completely independent of the instructions for the work tape heads, and even if there were "good places", there would be no information about how the states of those locations correspond to Turing machine outputs. In particular, controlling a few simple regions of the work tape is no more likely to have an effect on the output tape than anything else, and the effect cannot be controlled effectively if the precise effect is unknowable.
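The symmetry can be illustrated with a toy machine. This is a sketch under strong simplifying assumptions: the internal state stands in for everything on the work tapes, the transition table is a made-up example, and run and flip_outputs are hypothetical helper names.

```python
import random


def run(machine, input_bits):
    """Run a toy machine; return (states visited, output bits written)."""
    state, states, outputs = 0, [0], []
    for bit in input_bits:
        state, out = machine[(state, bit)]
        states.append(state)
        if out is not None:
            outputs.append(out)
    return states, outputs


def flip_outputs(machine):
    """Same state transitions, but every written 0 becomes 1 and vice versa."""
    return {key: (nxt, None if out is None else 1 - out)
            for key, (nxt, out) in machine.items()}


# Hypothetical two-state machine: (state, input_bit) -> (next_state, output or None)
machine = {(0, 0): (1, 0), (0, 1): (0, None), (1, 0): (1, 1), (1, 1): (0, 0)}
twin = flip_outputs(machine)

rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(20)]
states_a, out_a = run(machine, bits)
states_b, out_b = run(twin, bits)

print(states_a == states_b)             # True: the internal history is identical
print(out_b == [1 - b for b in out_a])  # True: the outputs are complementary
```

Both machines have the same number of states, so a prior that is uniform within each state count weights them equally, and nothing in the internal history distinguishes one from the other.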
But how about the anthropic information about the output tape--the fact that the output of their world has probably corresponded to some data stream in our world? Work tape inhabitants might act on the belief that their Turing machine has large posterior weight, since that is the circumstance in which their attempted manipulation of their output is most likely to have an effect. If their Turing machine has high posterior weight, then probably they're living in one of the simplest Turing machines that outputs bits the way theirs does. If they notice a "simple" region in their universe, that's the last thing that is likely to be the source of bits on the output tape! Turing machines much simpler than theirs would be able to output the same bit string. So in general, work tape inhabitants should expect the output tape to read out the complexities of their universe, and there is no reason to think that controlling the complexities of their universe would be cheap. You can't just quietly go to some special-looking site, and leave the rest of your civilization humming along. So the epistemic state of work tape inhabitants interested in controlling the universal prior is: "I don't know how to affect our world's output. It probably has something to do with shaping the most complex and interesting features of the world in some big way (i.e. costly changes to valuable things). I can prove that I have no way of predicting which big changes to the world will have what effects on the output."
But maybe they don't need to know whether their behavior will yield a 0 or a 1 as the next output bit (and by symmetry, they have no clue). They can infer from [the fact that their simulation is still running] that [the output of their world resembles the output of some piece of our world] (that is, the world of the people running the Turing machines). So suppose they want us to believe something about how that piece of our world will evolve in the future. They have inferred that (some piece of) their world's evolution resembles the evolution of that piece of our world. So equivalently, they want us to believe that some piece of their world evolves in a certain way. They don't know which piece; maybe it's the whole thing, but they want people to believe certain things about the likely evolution of (parts of) their world. This is basically civilizational-level self-consciousness, with a fairly unusual origin.
If I were dedicating my life to controlling the universal prior, under the assumption that I only exist because our world's output is matching some data stream of a world that is simulating us, what could I possibly do? I think my best bet (not that I think it's a very good bet) is to make the world evolve the way I want people to think it does. I guess I want people to think that virtue pays, and vice doesn't, and nuclear waste is easy to store cheaply, so I'd live my life trying to prove these things in case anyone is watching. Even so, this is a big long shot! And if I were trying to have as much influence on the universal prior as possible with minimal changes to the world around me, I would not know where to begin; I think they are in direct opposition. (This, by the way, regards the project of actually having influence in the world in which we are being simulated, assuming our simulated world already has large posterior weight; I haven't begun to describe how I might go about ensuring that our world continues to be simulated and has increasing posterior weight in the mixture).
So in summary, if 1/1000 eccentric souls care about their world's contribution to the universal prior, they can't just go to special corners of the universe and do their own thing; they will care about the same sorts of things the other 999/1000 do--the evolution of the most complex features of their society. The other 999 probably won't give them free rein shaping these societal features.
But the main point I want to make is that the work-tape inhabitants know so little about their effect on the universal prior that they have no way to execute deliberate control over worlds that simulate them with any more granularity than "I want would-be watchers to believe that my world goes like this, as do the pieces of their world that resemble it".
If we controlled the universal prior…
In this section, Paul talks about why consequentialists would want to have an impact on the world simulating their world (assuming they figured out how to). No disagreements here.
So the first step is getting our foot in the door—having control over the parts of the universal prior that are being used to make important decisions.
This is saying work-tape inhabitants interested in affecting the universal prior would benefit from an initial phase of controlling their world's output with the goal of helping their world/Turing machine gain posterior weight.
The treacherous turn
The second step is to actually have some influence over that universe. I suspect that some fraction of the consequentialists just try to “play it straight” and simply produce accurate predictions indefinitely. But it seems that most consequentialists would have some agenda, and would at some point distort the predictions in order to serve that agenda, spending some of their measure in the universal prior in order to extract some influence.
So I've mostly been talking about the second task so far: how a work-tape inhabitant would affect the universal prior once their world had significant posterior weight. I haven't begun to describe the difficulties of the first step, where you try to make it as likely as possible that the output of your world resembles the output of some data stream in the world that is running the Turing machines.
Recall the one exception to work tape inhabitants' total ignorance about the output tape: it resembles some data stream of a world that is simulating them. And a notable consequence I haven't yet mentioned is that the world that is simulating them is rich enough for computers to exist within. So in total, what they know about their world's output is that it resembles some stream of data produced by a world that contains computers. And if you want your simulation to continue (and by the way, I think the other 999/1000 inhabitants could be sold on this project, if they can be convinced they're in a simulation) so that your posterior weight in the universal prior grows, then you want your world's output to continue to have this property of resembling this data stream in the world simulating you.
So basically, don't rock the boat. Keep calm and carry on. Just do exactly what the laws of physics in your universe are telling you to. (I, for one, am very good at this!) You have no idea about the details of what your world's output channel is doing, but the one thing you can guess about it is that it's doing exactly what you want it to be doing: it's resembling a stream of data produced by a world that contains computers. (I mentioned earlier another thing that you "know" if you're assuming for the sake of argument that your world will eventually have large posterior weight: you can conclude your world's output channel is capturing much of your world's complexity. But assuming your world will eventually have large posterior weight is not helpful in the project of increasing the likelihood that that comes to pass.)
There's another point I want to make on this topic. We are supposing that the work tape inhabitants condition on the fact that they are being simulated because their Turing machine has been outputting useful things, and they will continue being simulated right up until the point that their world's output errs--this is the condition for there being value in them tampering with their world's output. Among believers of this proposition, the faction that says "make sure the simulation of our world continues" will almost certainly have much broader support than any faction that says "let's try to end the simulation of our world at exactly the right moment in exactly the right way on the off chance that it precipitates some valuable changes in the world that is simulating us". Attempting a treacherous turn (which again, they have no way of knowing how to execute with precision) would be suicide.
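To see why deviating is so costly, here is a minimal sketch of Bayesian updating over a handful of models. It assumes deterministic models for simplicity, whereas the machines discussed here are stochastic, and the three models, their prior weights, and the function name posterior_weights are all made up for illustration.

```python
from fractions import Fraction


def posterior_weights(prior, models, data):
    """Bayes update for deterministic sequence models.

    prior:  dict name -> prior weight
    models: dict name -> function mapping time step t to a predicted bit
    data:   list of observed bits
    """
    weights = dict(prior)
    for t, bit in enumerate(data):
        for name, predict in models.items():
            if predict(t) != bit:
                weights[name] = Fraction(0)  # one wrong bit eliminates the model
        total = sum(weights.values())
        if total == 0:
            raise ValueError("every model has been eliminated")
        weights = {name: w / total for name, w in weights.items()}
    return weights


# Hypothetical models of an alternating data stream 0, 1, 0, 1, ...
models = {
    "honest": lambda t: t % 2,
    "turn_at_5": lambda t: t % 2 if t < 5 else 1 - (t % 2),
    "constant": lambda t: 0,
}
prior = {
    "honest": Fraction(1, 4),
    "turn_at_5": Fraction(1, 2),
    "constant": Fraction(1, 4),
}

before = posterior_weights(prior, models, [0, 1, 0, 1, 0])
after = posterior_weights(prior, models, [0, 1, 0, 1, 0, 1])
print(before["turn_at_5"], after["turn_at_5"])  # 2/3 0
```

In this toy setting the model that deviates holds most of the posterior weight right up until the step at which it deviates, and none afterwards.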
How well do the consequentialists fare?
I think it’s clear that the dynamic I’ve described produces one component of the universal prior. It’s not a priori clear how large a component it is. I believe that it is probably the largest part, perhaps by far, if we actually use the universal prior to make a meaningful decision.
To see this, suppose that we are doing the universal prior with a particular language X; write p for this distribution. Consider all of the places in the multiverse that some civilization uses p to do something important. Let q be the distribution over all sequences that get fed into systems that are using p to make an important decision. (Really this is weighted by importance and the whole thing is kind of complicated, but for simplicity I’ll talk in simplified terms.)
I’ve argued that p assigns significant probability to sequences controlled by consequentialists living in simple universes according to X, who are (with significant probability) trying to simulate q. Let’s call this part of the mixture the consequentialist part.
On average, across all places where p is used to make an important decision, q is the “correct” predictive distribution—it’s not possible for any other distribution to get higher log score than q, on average. So in the mixture p, the (log) mass of the consequentialist part is only going to increase as we condition on more data (on average, over all places where p is being used to make important decisions).
The (log) mass of the consequentialists that have successfully controlled their world's entire output, and have (deliberately) made it resemble a sample from q, is only going to increase as we condition on more data.
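The step from "q is the correct predictive distribution" to "the log mass only increases on average" can be spelled out as follows. This is a sketch under the assumption that the part of the mixture in question is treated as a single component c of p with prior weight w_c that predicts exactly q; the symbols c, w_c, r, and x (a finite prefix of data) are introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}_{x \sim q}[\log q(x)] - \mathbb{E}_{x \sim q}[\log r(x)]
  &= D_{\mathrm{KL}}(q \,\|\, r) \ge 0
  \quad \text{for any distribution } r, \\
w_c(x) &= \frac{w_c \, q(x)}{p(x)}
  \quad \text{(posterior weight of } c \text{ after observing } x\text{)}, \\
\mathbb{E}_{x \sim q}[\log w_c(x)]
  &= \log w_c + D_{\mathrm{KL}}(q \,\|\, p) \ge \log w_c .
\end{align*}
```

The first line is the statement that no distribution beats q in expected log score; the last line is the resulting statement about posterior log mass, which holds in expectation over data drawn from q rather than for every individual sequence.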
I’ve argued that the initial mass of the consequentialist part is in fact very high. It depends on the encoding, but once we condition on p doing anything interesting, it seems plausible that the weight of the consequentialist part is in excess of 1/million or 1/billion. That is an absurdly high number.
This may be true, but the mass of consequentialists with the qualification that I added in italics does not start very high at all. I can't think of how such a consequentialist civilization would arise from simple physics since there is no way for them to know from within what their world's output is. I can imagine how to make such consequentialists if we were trying: we deliberately encode a world of consequentialists, and rules of the world themselves have the machinery to imbue beliefs in the minds of these inhabitants (despite those beliefs not being evidenced by the inhabitants' observations, and despite these inhabitants being "emergent" rather than atomic), and then the world imbues (correct) beliefs about how to precisely affect the output tape.
A possible response here is that it does not take many extra computation states for the Turing machine instructions to "tag" some "location" in the work tape world as special, and make that location correspond to what gets output. Sophisticated consequentialists will know that that location is no more likely to influence the output tape than any other spot, and they will know that they have no way of knowing the correspondence between that location's state and the output tape, but they may have natural superstitions. So a Turing machine may not literally require the machinery to edit the brains of the inhabitants directly. But for all that, this is extremely far from a convergent instrumental activity, and notably irrational. Also, Paul admits that any "tagged location" might have spiritual significance, and thus be the object of other civilizational preferences. Finally, not only does the Turing machine have to design the output to depend on this location; it has to depend on it in the way that the emergent consequentialists' superstitions suggest.
The anthropic update is very powerful
This is a different anthropic update than the ones I've talked about above. A work-tape inhabitant of a Turing machine who assumes their Turing machine is being simulated in another world can guess from their very existence that the (stochastic) procedure by which the simulators pick a Turing machine to simulate (i.e. the simulators' prior over Turing machines) is the sort of procedure that favors their home Turing machine, and then they can focus on the worlds of simulators who use such a procedure to make an important decision. So they're left with a distribution over Turing machines that starts as being the simulators' prior, but then is conditioned on resembling the world of a decision-maker-simulator. Paul argues that then the work tape inhabitants could arrange for their world's output to resemble the mixture of the outputs of other Turing machines that are in that distribution. Producing output from that distribution is the same as first sampling a Turing machine from that distribution, and then producing output that mimics that Turing machine.
Even setting aside my arguments above that work tape inhabitants would have no way of knowing how to arrange for such a thing, this isn't a better strategy than arranging for their world's output to resemble the output of their own Turing machine (which it already does, and immutably will, and it's hard to be sure I'm still making words mean things). If their home Turing machine is still being run, then whatever (unknowable) output it has produced has succeeded in resembling the data it is being checked against, and there's no reason to expect that the data it's being checked against will suddenly change course, and therefore no reason to expect that they would be better off making their world output bits that resemble a different Turing machine that scores highly on similar priors that they do.
If the work-tape inhabitants believe that their Turing machine has yet to output anything, then the move Paul is suggesting could make sense. So they could decide to try to make their world behave like another Turing machine for a while. But why would they believe that their Turing machine has yet to output anything? I would think that most simple Turing machines that produce consequentialists start writing output to the output tape well before the consequentialists emerge, otherwise it would need some cumbersome description of how to recognize that the consequentialists have evolved, so that it can switch to outputting bits. But if there's some reason I'm missing that would cause consequentialists to believe their world has yet to output anything, recall my earlier contention that the inhabitants have no way of knowing how to make their world's output resemble that of other similar Turing machines.
But suppose for the sake of argument that these consequentialists do choose to mimic the output of a different Turing machine, and they succeed at it. Paul claims (I'm pretty sure) that if we sample a Turing machine from the prior I proposed, and it outputs a string that looks like it was produced by a simple Turing machine that models our world, then most of the time, it's not that we have sampled that simple Turing machine; it's that we sampled a different Turing machine, and its inhabitants randomly picked that simple Turing machine to mimic, and this was made more plausible because they conditioned on the true fact that we are in a world in which we will make use of this output. This brings us to
I’ve argued that the consequentialists have pretty high mass. It could be that some other component of the mixture has even higher mass.
There isn’t actually much room for competition—if it only takes a few tens of bits to specify the consequentialist part of the mixture, then any competitor needs to be at least that simple.
Any competitor is also going to have to make the anthropic update
I've argued it would take many more bits to specify the part of the prior corresponding to consequentialists that 1) also want to influence the decisions of simulators of their world, 2) have unfounded but nonetheless correct superstitions about how to influence the output of their world, 3) have that priority as a civilization for how to use an evidently special location, 4) are certain enough that their world has never output any bits before that the best course would be to intervene in the dynamics of this special terminal that they feel is a channel to an output tape, and 5) are willing to terminate the execution of their world for a single opportunity to give their simulators bad info. But we can put that aside for now.
The anthropic update can be made programmatically; it doesn't require the evolution of computational life. It was simple enough for Paul to describe; we can specify the Turing machine which samples from Turing machines according to the same prior we're using to sample Turing machines, but conditioned on those Turing machines resembling the worlds of decision-maker-simulators. This Turing machine samples from other Turing machines using the anthropic update, but without the possibility of a treacherous turn later.
Typically the complexity of specifying the prior p within p is going to be way larger than the difficulty of specifying consequentialism.
if we just chose a simple computational model, it isn’t easy to specify the model within itself. (Even ‘simple’ meta-interpreters are way more complicated than the simplest universes that can be written in the same language.)
Basically, there is overhead (cumbersome additional complexity) when you use one Turing machine to simulate another. I think our main disagreement on this point is that I think the consequentialists face this overhead just as much. Self-organizing consequentialism seems quite analogous to me to self-organizing computation: there are many ways it could happen, but the kludginess of the language of specification leads to some inefficiencies in the description. One potentially relevant observation is that advanced consequentialists only live in universes that have the machinery to execute universal computation (or at least enormous finite state automata). So the sorts of worlds that self-organize into simple computations seem to me to be at least as prevalent as the sorts of worlds that self-organize into simple life. And the anthropically updated version of the prior over Turing machines seems to me to qualify as a simple computation (in the sense of description-simplicity).
I know Paul has thought about this last point a lot, and we've discussed it a bit, so I understand that intuitions can reasonably vary on this last point.
I think I've identified multiple independent reasons to expect that consequentialists living in Turing machines will not deliberately and successfully affect their Turing machine's output for the purpose of affecting the world in which the Turing machine is being run, excepting of course the consequentialists that are deliberately engineered to behave this way by the Turing machine's instructions. But those deliberate instructions cannot just pick out one convergent instrumental activity of consequentialists and direct them to act like that; the instructions have to override the fact that this behavior is irrational for multiple separate reasons, or alternatively, the instructions have to encode the "desired" behavior from scratch.