Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Clarifying Consequentialists in the Solomonoff Prior

The part about the reasoners having an arbitrary amount of time to think wasn't obvious to me. The TM can run for arbitrarily long but if it is simulating a universe and using the universe to determine its output then the TM needs to specify a system for reading from the universe.

If that system involves a start-to-read time that is long enough for the in-universe life to reason about the universal prior then that time specification alone would take a huge number of bits.

On the other hand, I could imagine a scheme that looks for a specific short trigger sequence at a particular spatial location then starts reading out. If this trigger sequence is unlikely to occur naturally then the civilization would have as long as they want to reason about the prior. So overall it does seem plausible to me now to allow for arbitrarily long in-universe time.

The trigger sequence is a cool idea.

I want to add that the intended generator TM also needs to specify a start-to-read time, so there is symmetry there. Whatever method a TM needs to use to select the camera start time in the intended generator for the real world samples, it can also use in the simulated world with alien life, since for the scheme to work only the difference in complexity between the two matters.

There is additional flex in that unlike the intended generator, the reasoner TM can sample its universe simulation at any cheaply computable interval, giving the civilisation the option of choosing any amount of thinking they can perform between outputs, if they so choose.

> The Solomonoff, or Universal, prior is a probability distribution over strings of a certain alphabet (usually over all strings of 1s and 0s). It is defined by taking the set of all Turing machines (TMs) which output strings, assigning to each a weight proportional to

The image right after the paragraph above can't be displayed, even after multiple refreshes. The same happens on the AF post. I tried different browsers but it didn't work.

The picture links to https://www.codecogs.com/png.latex?2^{-L} which gives a 404 Error. The picture likely displayed the formula $2^{-L}$.

I'm not convinced that the probability of S' could be pushed up to anything near the probability of S. Specifying an agent that wants to trick you into predicting S' rather than S with high probability when you see their common prefix requires specifying the agency required to plan this type of deception (which should be quite complicated), and specifying the common prefix of S and S' as the particular target for the deception (which, insofar as it makes sense to say that S is the "correct" continuation of the prefix, should have about the same "natural" complexity as S). That is, specifying such an agent requires all the information required to specify S, plus a bunch of overhead to specify agency, which adds up to much more complexity than S itself.

> specifying the agency required to plan this type of deception (which should be quite complicated)

Suppose that I just specify a generic feature of a simulation that can support life + expansion (the complexity of specifying "a simulation that can support life" is also paid by the intended hypothesis, so we can factor it out). Over a long enough time such a simulation will produce life, that life will spread throughout the simulation, and eventually have some control over many features of that simulation.

> and specifying the common prefix of S and S' as the particular target for the deception (which, insofar as it makes sense to say that S is the "correct" continuation of the prefix, should have about the same "natural" complexity as S)

Once you've specified the agent, it just samples randomly from the distribution of "strings I want to influence." That has a way lower probability than the "natural" complexity of a string I want to influence. For example, if 1/quadrillion strings are important to influence, then the attackers are able to save log(quadrillion) bits.

> Suppose that I just specify a generic feature of a simulation that can support life + expansion (the complexity of specifying "a simulation that can support life" is also paid by the intended hypothesis, so we can factor it out). Over a long enough time such a simulation will produce life, that life will spread throughout the simulation, and eventually have some control over many features of that simulation.

Oh yes, I see. That does cut the complexity overhead down a lot.

> Once you've specified the agent, it just samples randomly from the distribution of "strings I want to influence." That has a way lower probability than the "natural" complexity of a string I want to influence. For example, if 1/quadrillion strings are important to influence, then the attackers are able to save log(quadrillion) bits.

I don't understand what you're saying here.

I agree that this probably happens when you set out to mess with an arbitrary particular S, i.e. try to make some S’ that shares a prefix with S as likely as S.

However, some S are special, in the sense that their prefixes are being used to make very important decisions. If you, as a malicious TM in the prior, perform an exhaustive search of universes, you can narrow down your options to only a few prefixes used to make pivotal decisions; selecting one of those to mess with is then very cheap to specify. I use S to refer to those strings that are the ‘natural’ continuation of those cheap-to-specify prefixes.

There are, it seems to me, a bunch of other equally-complex TMs that want to make other strings that share that prefix more likely, including some that promote S itself. What the resulting balance looks like is unclear to me, but what’s clear is that the prior is malign with respect to that prefix: conditioning on that prefix gives you a distribution almost entirely controlled by these malign TMs. The ‘natural’ complexity of S, or of other strings that share the prefix, plays almost no role in their priors.

The above is of course conditional on this exhaustive search being possible, which also relies on there being anyone in any universe that actually uses the prior to make decisions. Otherwise, we can’t select the prefixes that can be messed with.

This reasoning seems to rely on there being such strings S that are useful to predict far out of proportion to what you would expect from their complexity. But a description of the circumstance in which predicting S is so useful should itself give you a way of specifying S, so I doubt that this is possible.

I agree. That’s what I meant when I wrote there will be TMs that artificially promote S itself. However, this would still mean that most of S’s mass in the prior would be due to these TMs, and not due to the natural generator of the string.

Furthermore, it’s unclear how many TMs would promote S vs S’ or other alternatives. Because of this, I don’t know whether the prior would be higher for S or S’ from this reasoning alone. Whichever is the case, the prior no longer reflects meaningful information about the universe that generates S and whose inhabitants are using the prefix to choose what to do; it’s dominated by these TMs that search for prefixes they can attempt to influence.

I didn't mean that an agenty Turing machine would find S and then decide that it wants you to correctly predict S. I meant that to the extent that predicting S is commonly useful, there should be a simple underlying reason why it is commonly useful, and this reason should give you a natural way of computing S that does not have the overhead of any agency that decides whether or not it wants you to correctly predict S.

How many bits do you think it takes to specify the property "people's predictions about S, using universal prior P, are very important"?

(I think you'll need to specify the universal prior P by reference to the universal prior that is actually used in the world containing the string S, if you spell out the prior P explicitly you are already sunk just from the ambiguity in the choice of language.)

It seems relatively unlikely to me that this will be cheaper than specifying some arbitrary degree of freedom in a computationally rich universe that life can control (+ the extra log(fraction of degrees of freedom the consequentialists actually choose to control)). Of course it might.

I agree that the entire game is in the constants---what is the cheapest way to pick out important strings.

I don't think that specifying the property of importance is simple and helps narrow down S. I think that in order for predicting S to be important, S must be generated by a simple process. Processes that take large numbers of bits to specify are correspondingly rarely occurring, and thus less useful to predict.

I don't buy it. A camera that some robot is using to make decisions is no simpler than any other place on Earth, just more important.

(This already gives the importance-weighted predictor a benefit of ~log(quadrillion))

Clearly you need to e.g. make the anthropic update and do stuff like that before you have any chance of competing with the consequentialist. This might just be a quantitative difference about how simple is simple---like I said elsewhere, all the action is in the additive constants, I agree that the important things are "simple" in some sense.

I have spent a long time being confused about Paul’s post on consequentialists in the Solomonoff prior. I now think I understand the problem clearly enough to engage with it properly.

I think the reason I was confused is to a large degree a problem of framing. In the course of the discussions I had to deconfuse myself, it seemed to me that other people share similar confusions. In this post, I will attempt to explain the framing that helped clarify the problem for me.

## i. A brief sketch of the Solomonoff prior

The Solomonoff, or Universal, prior is a probability distribution over strings of a certain alphabet (usually over all strings of 1s and 0s). It is defined by taking the set of all Turing machines (TMs) which output strings, assigning to each a weight proportional to $2^{-L}$ (where $L$ is its description length), and then assigning to each string a probability equal to the sum of the weights of the TMs that compute it. The description length is closely related to the amount of information required to specify the machine; I will use description length and amount of information for specification interchangeably.

(The actual formalism is in fact a bit more technically involved. I think this picture is detailed enough, in the sense that my explanation will map onto the real formalism about as well.)
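As a rough illustration of the simplified picture above, here is a minimal Python sketch. It is not real Solomonoff induction: the table `TOY_MACHINES` is a hypothetical, hand-picked list of (description length, output string) pairs standing in for "all Turing machines", and `toy_prior` is a name introduced only for this example. Each machine gets weight 2^-L, and a string's probability is the normalised sum of the weights of the machines that output it.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy "machines": (description length L in bits, output string).
# The real prior ranges over all TMs; this table is purely illustrative.
TOY_MACHINES = [
    (2, "0101"),
    (3, "0101"),
    (3, "0110"),
    (5, "1111"),
]

def toy_prior(machines):
    """Return {string: probability}, weighting each machine by 2**-L and normalising."""
    weights = defaultdict(Fraction)  # Fraction() is 0, so sums stay exact
    for length, output in machines:
        weights[output] += Fraction(1, 2 ** length)
    total = sum(weights.values())
    return {string: weight / total for string, weight in weights.items()}

prior = toy_prior(TOY_MACHINES)
for string, probability in sorted(prior.items()):
    print(string, probability)
```

Note that "0101" gets most of the mass here because two short machines output it; which machine is the "intended" one plays no role in the sum.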

The above defines the Solomonoff prior. To perform Solomonoff induction, one can also define conditional distributions by considering only those TMs that generate strings beginning with a certain prefix. In this post, we’re not interested in that process, but only in the prior.
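For completeness, conditioning on a prefix can be sketched in the same toy setting. The function name `condition_on_prefix` and the numbers in `toy_prior_table` are assumptions made for illustration only; the sketch simply discards strings that do not start with the prefix and renormalises what is left.

```python
from fractions import Fraction

def condition_on_prefix(prior, prefix):
    """Restrict a {string: probability} table to strings starting with prefix, then renormalise."""
    kept = {s: p for s, p in prior.items() if s.startswith(prefix)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no string in the toy prior starts with this prefix")
    return {s: p / total for s, p in kept.items()}

# Hypothetical toy prior over three strings (numbers chosen for illustration only).
toy_prior_table = {
    "0101": Fraction(12, 17),
    "0110": Fraction(4, 17),
    "1111": Fraction(1, 17),
}

posterior = condition_on_prefix(toy_prior_table, "01")
print(posterior)
```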

## ii. The Malign Prior Argument

In the post, Paul claims that the prior is dominated by consequentialists. I don’t think it is quite dominated by them, but I think the effect in question is plausibly real.

I’ll call the key claim involved the Malign Prior Argument. On my preferred framing, it goes something like this:

Premiss: For some strings, it is easier to specify a Turing Machine that simulates a reasoner which decides to predict that string, than it is to specify the intended generator for that string.

Conclusion: Therefore, those strings’ Solomonoff prior probability will be dominated by the weight assigned to the TM containing the reasoner.

It’s best to explain the idea of an ‘intended generator’ with examples. In the case of a camera signal as the string, the intended generator is something like a TM that simulates the universe, plus a specification of the point in the simulation where the camera input should be sampled. Approximations to this, like a low-fidelity simulation, can also be considered intended generators.

There isn’t anything special about the intended generator’s relationship to the string - it’s just one way in which that string can be generated. It seems most natural to us as humans, and the Occamian nature of SI feels like it should be biased towards such strings, but nothing in principle stops something less ‘natural’ from being in fact a shorter description.

This idea of ‘naturalness’ is important in understanding what the Malign Prior Argument is about; I will use it roughly to refer to something like ‘the set of Turing Machines that don’t involve reasoners that attempt to influence the prior’, or ‘the set of intended generators’. It’s vague, but I think it gets across the point.

I read most of Paul’s post as an existence argument for the premiss, using consequentialists in other worlds as the reasoners. I don’t think all such reasoners are like Paul describes; I also doubt that all or even most strings are subject to this effect, but find it very plausible that some are.

I think the argument is not, at its core, about these reasoners making the strings they output more likely than the ‘true string’. It is concerning enough that there is any effect at all that these reasoners have on the prior, which is the fact this argument establishes.

As a side note, it’s also worth noting that this is not about these reasoners breaking out of the box and taking over our world, although that is also a related concern one might have.

## iii. The support for premiss 1

Consider a string S’ with very high natural K-complexity (description length of the intended generator) that shares a prefix with a string S that is of high interest to human-like civilisations.

I claim that the prior probability of this string is higher than it ‘naturally’ ‘should’ be, in the sense that a large part of the weight that composes this probability is coming from a TM that simulates a reasoner that is attempting to influence the prior.

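The reasons this happens are:

1. A reasoner in a TM can have an arbitrarily long amount of compute time to decide what strings to output.
2. Specifying a reasoner is cheap relative to specifying the string S’.
3. There exists a reasoner whose goals are best served by influencing the prior to make S’ more likely.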

1 is a crucial property of the Solomonoff prior that allows this to happen. A TM in the Solomonoff prior can think for a very, very long time: enough to, e.g., simulate Ackermann(Ackermann(10)) initial world states, each for Ackermann(Ackermann(10)) timesteps. It can perform something close to an exhaustive search of all possible civilizations and decide to attempt to influence the one that is most susceptible to being influenced, if that’s what it wants to do. This is a ridiculous computation, but we’re talking about a mathematical object, not an actual process that we run. It’s plausible that if the prior were also weighted by speed of computation, these effects would be far less pronounced (and maybe would not arise at all).
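One way to make "weighted by speed of computation" concrete is a Levin-style penalty, in which a machine of description length L that runs for t steps is charged L + log2(t) bits instead of L. This is an assumption chosen for illustration, not something this post or Paul's post specifies, and the function name and all numbers below are hypothetical. The sketch shows how a short but extremely slow reasoner TM could lose its advantage over a longer but fast generator.

```python
import math

def speed_penalised_bits(length_bits, steps):
    """Effective cost in bits under a Levin-style penalty: L + log2(t)."""
    return length_bits + math.log2(steps)

# Hypothetical numbers: a short machine that searches for a very long time
# versus a longer machine that produces its string quickly.
slow_reasoner = speed_penalised_bits(length_bits=100, steps=2 ** 1000)
fast_generator = speed_penalised_bits(length_bits=300, steps=2 ** 20)

# Under the plain length-based prior the reasoner is cheaper (100 < 300 bits);
# under the speed penalty the ordering flips.
print(slow_reasoner, fast_generator)
```

A computation on the scale of Ackermann(Ackermann(10)) steps would be penalised far more heavily than the 2^1000 steps used here.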

To see that 2 and 3 are plausible, we need to think about S’, which by assumption is a string with high natural K-complexity. This high complexity ‘buys’ us the space to specify a reasoner, and the space to specify values, without making the TM more complex than a natural generator of S’. Now, because S is by assumption of interest to civilisations, there likely exists a TM containing a reasoner that performs its exhaustive search, finds S, and concludes that its values are best served by making S’ more likely (e.g. to influence the decision-making of civilisations that are thinking about what S is, given a prefix of it known to them).

In a way, this agent uses its simplicity to give more simplicity to some other string. That is how the prior gets hijacked.

Note that this reasoner will need to have goals that are simpler than the natural generator of S’ in order to actually contribute to S’ being more likely - otherwise, specifying its TM would be more expensive than specifying the natural generator of S’.
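The condition in the note above can be checked with a small calculation. The sketch below makes the simplifying assumption that only two TMs output S’: the natural generator and one reasoner TM. The function name and the bit counts are hypothetical. It computes what fraction of the prior weight of S’ comes from the reasoner.

```python
from fractions import Fraction

def reasoner_share(reasoner_bits, natural_bits):
    """Fraction of the weight of S' contributed by the reasoner TM,
    assuming only the reasoner TM and the natural generator output S'."""
    w_reasoner = Fraction(1, 2 ** reasoner_bits)
    w_natural = Fraction(1, 2 ** natural_bits)
    return w_reasoner / (w_reasoner + w_natural)

# Reasoner TM 10 bits shorter than the natural generator: it dominates.
share_cheaper = reasoner_share(reasoner_bits=90, natural_bits=100)
# Reasoner TM 10 bits longer than the natural generator: it barely matters.
share_dearer = reasoner_share(reasoner_bits=110, natural_bits=100)

print(float(share_cheaper), float(share_dearer))
```

Only the difference in description length matters: each bit saved doubles the reasoner TM's weight relative to the natural generator.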

The above is non-constructive (in the mathematical sense), but nevertheless the existence of strings S’ that are affected thus seems plausible. The spaces of possible TMs and of the strings we (or other users of the Solomonoff prior) could be interested in are simply too vast for there not to be such TMs. Whether there are very many of these, or whether they are so much more complicated than the string S as to make this effect irrelevant to our interests, are different questions.

## iv. Alien consequentialists

In my view, Paul’s approach in his post is a more constructive strategy for establishing 2 and 3 in the argument above. If correct, it suggests a stronger result - not only does it cause the probability of S’ to be dominated by the TM containing the reasoner, it makes the probability of S’ roughly comparable to S, for a wide class of choices of S.

In particular, the choice of S that is susceptible to this is something like the camera example I used, where the natural generator of S is a specification of our world together with a location from which we take samples. The alien civilisation is a way to construct a Turing Machine that outputs S’ which has comparable complexity to S.

To do that, we specify a universe, then run it for however long we want, until we get somewhere within it smart agents that decide to influence the prior. Since 1 is true, these agents have an arbitrary amount of time to decide what they output. If S is important, there probably will be a civilisation somewhere in some simulated world which will decide to attempt to influence decisions based on S, and output an appropriate S’. We then specify the output channel to be whatever they decide to use as the output channel.

This requires a relatively modest amount of information - enough to specify the universe, and the location of the output. This is on the same order as the natural generator for S itself, if it is like a camera signal.
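The bookkeeping behind "on the same order" can be written out as a rough sketch. The symbols are labels introduced here for illustration, not notation from Paul's post: $L_U$ and $L_{\mathrm{cam}}$ are the bits needed to specify our universe and the camera location, $L_{U'}$ and $L_{\mathrm{out}}$ are the bits needed to specify the simulated universe and the output channel, and $P$ counts only the weight contributed by these two TMs.

```latex
\begin{align*}
L_{S}  &\approx L_{U} + L_{\mathrm{cam}}, \\
L_{S'} &\approx L_{U'} + L_{\mathrm{out}}, \\
\frac{P(S')}{P(S)} &\approx 2^{-(L_{S'} - L_{S})}.
\end{align*}
```

If the two description lengths differ by only a few bits, the two weights are within a small constant factor of each other, which is the sense in which the probability of S’ is comparable to that of S.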

Trying to specify our reasoner within this space (reasoners that naturally develop in simulations) does place restrictions on what kind of reasoner we end up with. For instance, there are now some implicit runtime bounds on many of our reasoners, because they likely care about things other than the prior. Nevertheless, the space of our reasoners remains vast, including unaligned superintelligences and other odd minds.

## v. Conclusion. Do these arguments actually work?

I am mostly convinced that there is at least some weirdness in the Solomonoff prior.

A part of me wants to add ‘especially around strings whose prefixes are used to make pivotal decisions’; I’m not sure that is right, because I think scarcely anyone would actually use this prior in its true form - except, perhaps, an AI reasoning about it abstractly and naïvely enough not to be concerned about this effect despite having to explicitly consider it.

In fact, a lot of my doubt about the malign Solomonoff prior is concentrated around this concern: if the reasoners don’t believe that anyone will act based on the true prior, it seems unclear why they should spend a lot of resources on messing with it. I suppose the space is large enough for at least some to get confused into doing something like this by mistake.

I think that even if my doubts are correct, there will still be weirdness associated with the agents that are specified directly, along the lines of section iii, if not those that appear in simulated universes, as described in iv.