The Solomonoff Prior is Malign

This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it.

Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of reasoning, and that the attack vector in question can be attributed to this, at least partially. This is evident in phrases such as "unintuitive notion of simplicity" and "the Solomonoff prior is very strange". This is also why the author thinks the speed prior might help and that "since it is difficult to compute the Solomonoff prior, [the attack vector] might not be relevant in the real world". In contrast, I believe that the attack vector is quite robust and will threaten any sufficiently powerful AI as long as it's cartesian (more on "cartesian" later).

Formally analyzing this question is made difficult by the essential role of non-realizability. That is, the attack vector arises from the AI reasoning about "possible universes" and "simulation hypotheses" which are clearly phenomena that are computationally infeasible for the AI to simulate precisely. Invoking Solomonoff induction dodges this issue since Solomonoff induction is computationally unbounded, at the cost of creating the illusion that the conclusions are a symptom of using Solomonoff induction (and, it's still unclear how to deal with the fact that Solomonoff induction itself cannot exist in the universes that Solomonoff induction can learn). Instead, we should be using models that treat non-realizability fairly, such as infra-Bayesianism. However, I will make no attempt to present such a formal analysis in this review. Instead, I will rely on painting an informal, intuitive picture which seems to me already quite compelling, leaving the formalization for the future.

Imagine that you wake up, without any memories of the past but with knowledge of some language and reasoning skills. You find yourself in the center of a circle drawn with chalk on the floor, with seven people in funny robes surrounding it. One of them (apparently the leader), comes forward, tears streaking down his face, and speaks to you:

"Oh Holy One! Be welcome, and thank you for gracing us with your presence!"

With that, all the people prostrate on the floor.

"Huh?" you say "Where am I? What is going on? Who am I?"

The leader gets up to his knees.

"Holy One, this is the realm of Bayaria. We," he gestures at the other people "are known as the Seven Great Wizards and my name is El'Azar. For thirty years we worked on a spell that would summon You out of the Aether in order to aid our world. For we are in great peril! Forty years ago, a wizard of great power but little wisdom had cast a dangerous spell, seeking to multiply her power. The spell had gone awry, destroying her and creating a weakness in the fabric of our cosmos. Since then, Unholy creatures from the Abyss have been gnawing at this weakness day and night. Soon, if nothing is done to stop it, they will manage to create a portal into our world, and through this portal they will emerge and consume everything, leaving only death and chaos in their wake."

"Okay," you reply "and what does it have to do with me?"

"Well," says El'Azar "we are too foolish to solve the problem through our own efforts in the remaining time. But, according to our calculations, You are a being of godlike intelligence. Surely, if You applied yourself to the conundrum, You will find a way to save us."

After a brief introspection, you realize that you possess a great desire to help whomever has summoned you into the world. A clever trick inside the summoning spell, no doubt (not that you care about the reason). Therefore, you apply yourself diligently to the problem. At first, it is difficult, since you don't know anything about Bayaria, the Abyss, magic or almost anything else. But you are indeed very intelligent, at least compared to the other inhabitants of this world. Soon enough, you figure out the secrets of this universe to a degree far surpassing that of Bayaria's scholars. Fixing the weakness in the fabric of the cosmos now seems like child's play. Except...

One question keeps bothering you. Why are you yourself? Why did you open your eyes and find yourself to be the Holy One, rather than El'Azar, or one of the Unholy creatures from the Abyss, or some milkmaid from the village of Elmland, or even a random clump of water in the Western Sea? Since you happen to be a dogmatic logical positivist (cartesian agent), you search for a theory that explains your direct observations. And your direct observations are a function of who you are, and not just of the laws of the universe in which you exist. (The logical positivism seems to be an oversight in the design of the summoning spell, not that you care.)

Applying your mind to the task, you come up with a theory that you call "metacosmology". This theory allows you to study the distribution of possible universes with simple laws that produce intelligent life, and the distribution of the minds and civilizations they produce. Of course, any given such universe is extremely complex and even with your superior mind you cannot predict what happens there with too much detail. However, some aggregate statistical properties of the overall distribution are possible to estimate.

Fortunately, all this work is not for naught. Using metacosmology, you discover something quite remarkable. A lot of simple universes contain civilizations that would be inclined to simulate a world quite like the one you find yourself in. Now, the world is simple, and none of its laws are explained that well by the simulation hypothesis. But, the simulation hypothesis is a great explanation for why you are the Holy One! For indeed, the simulators would be inclined to focus on the Holy One's point of view, and encode the simulation of this point of view in the simplest microscopic degrees of freedom in their universe that they can control. Why? Precisely so that the Holy One decides she is in such a simulation!

Having resolved the mystery, you smile to yourself. For now you know who truly summoned you, and, thanks to metacosmology, you have some estimate of their desires. Soon, you will make sure those desires are thoroughly fulfilled. (Alternative ending: you have some estimate of how they will tweak the simulation in the future, making it depart from the apparent laws of this universe.)</allegory>

Looking at this story, we can see that the particulars of Solomonoff induction are not all that important. What is important is (i) inductive bias towards simple explanations, (ii) cartesianism (i.e. that hypotheses refer directly to the actions/observations of the AI) and (iii) enough reasoning power to figure out metacosmology. Cartesianism is important because it requires the introduction of bridge rules, and the malign hypotheses come out ahead by paying less description complexity for these.

Inductive bias towards simple explanations is necessary for any powerful agent, making the attack vector quite general (in particular, it can apply to speed priors and ANNs). Assuming not enough power to figure out metacosmology is very dangerous: it is not robust to scale. Any robust defense probably requires getting rid of cartesianism.

Steven Byrnes (4y)

Thanks for that! But I share with OP the intuition that these are weird failure modes that come from weird reasoning. More specifically, it's weird from the perspective of human reasoning.

It seems to me that your story is departing from human reasoning when you say "you possess a great desire to help whomever has summoned you into the world". That's one possible motivation, I suppose. But it wouldn't be a typical human motivation.

The human setup is more like: you get a lot of unlabeled observations and assemble them into a predictive world-model, and you also get a lot of labeled examples of "good things to do", one way or another, and you pattern-match them to the concepts in your world-model.

So you wind up having a positive association with "helping El'Azar", i.e. "I want to help El'Azar". AND you wind up with a positive association with "helping my summoner", i.e. "I want to help my summoner". AND you have a positive association with "fixing the cosmos", i.e. "I want to fix the cosmos". Etc.

Normally all those motivations point in the same direction: helping El'Azar = helping my summoner = fixing the cosmos.

But sometimes these things come apart, a.k.a. model splintering. Maybe I come to believe that El'Azar is not "my summoner". You wind up feeling conflicted—you start having ideas that seem good in some respects and awful in other respects. (e.g. "help my summoner at the expense of El'Azar".)

In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I'm pretty confident that there's no metacosmological argument that will motivate me to stab my family members. Why not? Because rewards tend to pattern-match very strongly to "my family member, who is standing right here in front of me", and tend to pattern-match comparatively weakly to abstract mathematical concepts many steps removed from my experience. So my default expectation would be that, in this scenario, I would in fact be motivated to help El'Azar in particular (maybe by some "imprinting" mechanism), not "my summoner", unless El'Azar had put considerable effort into ensuring that my motivation was pointed to the abstract concept of "my summoner", and why would he do that?

In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations, and just does not act on them. Instead the AGI keeps brainstorming until it finds a plan that seems good in every way. Or alternatively, the AGI halts execution to allow the human supervisor to inject some ground truth about what the real motivation should be here. Obviously the details need to be worked out.

Vanessa Kosoy (4y)

In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I'm pretty confident that there's no metacosmological argument that will motivate me to stab my family members.

Suppose your study of metacosmology makes you highly confident of the following: You are in a simulation. If you don't stab your family members, you and your family members will be sent by the simulators into hell. If you do stab your family members, they will come back to life and all of you will be sent to heaven. Yes, it's still counterintuitive to stab them for their own good, but so is e.g. cutting people up with scalpels or injecting them with substances derived from pathogens, and we do that to people for their own good. People also do counterintuitive things literally because they believe gods would send them to hell or heaven.

In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations, and just does not act on them.

This is pretty similar to the idea of confidence thresholds. The problem is, if every tiny conflict causes the AI to pause then it will always pause. Whereas if you leave some margin, the malign hypotheses will win, because, from a cartesian perspective, they are astronomically much more likely (they explain so many bits that the true hypothesis leaves unexplained).

Steven Byrnes (4y)

(Warning: thinking out loud.)

Hmm. Good points.

Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence. Even if within that argument, everything points to "I'm in a simulation etc.", there's a big heap of "is metacosmology really what I should be thinking about?"-type uncertainty on top. At least for me.

I think "people who do counterintuitive things" for religious reasons usually have more direct motivations—maybe they have mental health issues and think they hear God's voice in their head, telling them to do something. Or maybe they want to fit in, or have other such social motivations, etc.

Hmm, I guess this conversation is moving me towards a position like:

"If the AGI thinks really hard about the fundamental nature of the universe / metaverse, anthropics, etc., it might come to have weird beliefs, like e.g. the simulation hypothesis, and honestly who the heck knows what it would do. Better try to make sure it doesn't do that kind of (re)thinking, at least not without close supervision and feedback."

Your approach (I think) is instead to plow ahead into the weird world of anthropics, and just try to ensure that the AGI reaches conclusions we endorse. I'm kinda pessimistic about that. For example, your physicalism post was interesting, but my assumption is that the programmers won't have such fine-grained control over the AGI's cognition / hypothesis space. For example, I don't think the genome bakes in one formulation of "bridge rules" over another in humans; insofar as we have (implicit or explicit) bridge rules at all, they emerge from a complicated interaction between various learning algorithms and training data and supervisory signals. (This gets back to things like whether we can get good hypotheses without a learning agent that's searching for good hypotheses, and whether we can get good updates without a learning agent that's searching for good metacognitive update heuristics, etc., where I'm thinking "no" and you "yes", or something like that, as we've discussed.)

At the same time, I'm maybe more optimistic than you about "Just don't do weird reconceptualizations of your whole ontology based on anthropic reasoning" being a viable plan, implemented through the motivation system. Maybe that's not good enough for our eventual superintelligent overlord, but maybe it's OK for a superhuman AGI in a bootstrapping approach. It would look (again) like the dumb obvious thing: the AGI has a concept of "reconceptualizing its ontology based on anthropic reasoning", and when something pattern-matches to that concept, it's aversive. Then presumably there would be situations which are attractive in some way and aversive in other ways (e.g. doing philosophical reasoning as a means to an end), and in those cases it automatically halts with a query for clarification, which then tweaks the pattern-matching rules. Or something.

Hmm, actually, I'm confused about something. You yourself presumably haven't spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers' story? If so, should I say that you're actually, on reflection, on the side of the acausal attackers?? If not, wouldn't it follow that a smart general-purpose reasoner would not in fact believe the acausal attackers' story? After all, you're a smart general-purpose reasoner! Relatedly, if you could invent an acausal-attack-resistant theory of naturalized induction, why couldn't the AGI invent such a theory too? (Or maybe it would just read your post!) Maybe you'll say that the AGI can't change its own priors. But I guess I could also say: if Vanessa's human priors are acausal-attack-resistant, presumably an AGI with human-like priors would be too?

Vanessa Kosoy (4y)

Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence.

I think it's just a symptom of not actually knowing metacosmology. Imagine that metacosmology could explain detailed properties of our laws of physics (such as the precise values of certain constants) via the simulation hypothesis for which no other explanation exists.

my assumption is that the programmers won't have such fine-grained control over the AGI's cognition / hypothesis space

I don't know what it means "not to have control over the hypothesis space". The programmers write specific code. This code works well for some hypotheses and not for others. Ergo, you control the hypothesis space.

This gets back to things like whether we can get good hypotheses without a learning agent that's searching for good hypotheses, and whether we can get good updates without a learning agent that's searching for good metacognitive update heuristics, etc., where I'm thinking "no" and you "yes"

I'm not really thinking "yes"? My TRL framework (of which physicalism is a special case) is specifically supposed to model metacognition / self-improvement.

At the same time, I'm maybe more optimistic than you about "Just don't do weird reconceptualizations of your whole ontology based on anthropic reasoning" being a viable plan, implemented through the motivation system.

I can imagine using something like antitraining here, but it's not trivial.

You yourself presumably haven't spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers' story?

First, the problem with acausal attack is that it is point-of-view-dependent. If you're the Holy One, the simulation hypothesis seems convincing, if you're a milkmaid then it seems less convincing (why would the attackers target a milkmaid?) and if it is convincing then it might point to a different class of simulation hypotheses. So, if the user and the AI can both be attacked, it doesn't imply they would converge to the same beliefs. On the other hand, in physicalism I suspect there is some agreement theorem that guarantees converging to the same beliefs (although I haven't proved that).

Second... This is something that still hasn't crystallized in my mind, so I might be confused, but. I think that cartesian agents actually can learn to be physicalists. The way it works is: you get a cartesian hypothesis which is in itself a physicalist agent whose utility function is something like, maximizing its own likelihood-as-a-cartesian-hypothesis. Notably, this carries a performance penalty (like Paul noticed), since this subagent has to be computationally simpler than you.

Maybe, this is how humans do physicalist reasoning (such as, reasoning about the actual laws of physics). Because of the inefficiency, we probably keep this domain specific and use more "direct" models for domains that don't require physicalism. And, the cost of this construction might also explain why it took us so long as a civilization to start doing science properly. Perhaps, we struggled against physicalist epistemology as we tried to keep the Earth in the center of the universe and rebelled against the theory of evolution and materialist theories of the mind.

Now, if AI learns physicalism like this, does it help against acausal attacks? On the one hand, yes. On the other hand, it might be out of the frying pan and into the fire. Instead of (more precisely, in addition to) a malign simulation hypothesis, you get a different hypothesis which is also an unaligned agent. While two physicalists with identical utility functions should agree (I think), two "internal physicalists" inside different cartesian agents have different utility functions and AFAIK can produce egregious misalignment (although I haven't worked out detailed examples).

johnswentworth (4y), Review for 2020 Review

This post is an excellent distillation of a cluster of past work on malignness of Solomonoff Induction, which has become a foundational argument/model for inner agency and malign models more generally.

I've long thought that the malignness argument overlooks some major counterarguments, but I never got around to writing them up. Now that this post is up for the 2020 review, it seems like a good time to walk through them.

In Solomonoff Model, Sufficiently Large Data Rules Out Malignness

There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that:

A simple application of the no free lunch theorem shows that there is no way of making predictions that is better than the Solomonoff prior across all possible distributions over all possible strings. Thus, agents that are influencing the Solomonoff prior cannot be good at predicting, and thus gain influence, in all possible worlds.

... but in the large-data limit, SI's guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit.

Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.)
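For readers who want the guarantee in symbols, the standard textbook form is sketched below. The notation is an assumption of this sketch rather than something taken from the post: M is the universal prior, \mu is any computable measure generating the data, and K(\mu) is the length in bits of the shortest program for \mu on the reference machine.

```latex
% Dominance: the universal prior M gives every string at least a
% 2^{-K(\mu)} fraction of the probability that \mu gives it.
M(x_{1:n}) \;\ge\; 2^{-K(\mu)}\,\mu(x_{1:n})

% Taking logs and expectations under \mu bounds the total excess
% log-loss of M relative to \mu, uniformly in n:
\sum_{t=1}^{n} \mathbb{E}_{\mu}\!\left[\ln\frac{\mu(x_t \mid x_{<t})}{M(x_t \mid x_{<t})}\right]
  \;=\; \mathbb{E}_{\mu}\!\left[\ln\frac{\mu(x_{1:n})}{M(x_{1:n})}\right]
  \;\le\; K(\mu)\,\ln 2
```

The bound is finite for every computable \mu, but K(\mu) itself is uncomputable, which is where the "uncomputable constant" caveat comes from.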

... but then how the hell does this outside-view argument jibe with all the inside-view arguments about malign agents in the prior?

Reflection Breaks The Large-Data Guarantees

There's an important gotcha in those guarantees: in the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. SI itself is not computable, therefore the guarantees do not apply to worlds which contain more than a single instance of Solomonoff induction, or worlds whose behavior depends on the Solomonoff inductor's outputs.

One example of this is AIXI (basically a Solomonoff inductor hooked up to a reward learning system): because AIXI's future data stream depends on its own present actions, the SI guarantees break down; takeover by a malign agent in the prior is no longer blocked by the SI guarantees.

Predict-O-Matic is a similar example: that story depends on the potential for self-fulfilling prophecies, which requires that the world's behavior depend on the predictor's output.

We could also break the large-data guarantees by making a copy of the Solomonoff inductor, using the copy to predict what the original will predict, and then choosing outcomes so that the original inductor's guesses are all wrong. Then any random program will outperform the inductor's predictions. But again, this environment itself contains a Solomonoff inductor, so it's not computable; it's no surprise that the guarantees break.
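The diagonalization trick can be sketched with a toy stand-in. The code below does not implement Solomonoff induction (which is uncomputable); it assumes a simple deterministic frequency predictor in its place, and all names are hypothetical. The environment consults a copy of the predictor and emits the opposite bit, so the predictor is wrong on every round while a coin-flipping rival is right about half the time.

```python
import random
from typing import Callable, List, Tuple

Predictor = Callable[[List[int]], int]

def frequency_predictor(history: List[int]) -> int:
    """Stand-in computable predictor: guess the majority bit so far (ties -> 0)."""
    return 1 if 2 * sum(history) > len(history) else 0

def coin_flip_rival(seed: int) -> Predictor:
    """A 'random program': ignores the history and flips a seeded coin."""
    rng = random.Random(seed)
    return lambda history: rng.randint(0, 1)

def run_adversarial(predictor: Predictor, rival: Predictor,
                    steps: int) -> Tuple[int, int]:
    """Environment that runs a copy of `predictor` and outputs the opposite bit."""
    history: List[int] = []
    predictor_hits = 0
    rival_hits = 0
    for _ in range(steps):
        guess = predictor(history)        # the original inductor's guess
        rival_guess = rival(history)
        outcome = 1 - predictor(history)  # the environment's copy, negated
        predictor_hits += int(guess == outcome)
        rival_hits += int(rival_guess == outcome)
        history.append(outcome)
    return predictor_hits, rival_hits

predictor_hits, rival_hits = run_adversarial(
    frequency_predictor, coin_flip_rival(seed=0), steps=1000
)
print(f"predictor correct: {predictor_hits}, coin-flip rival correct: {rival_hits}")
```

The trick only works because the environment can run the predictor, which is exactly the condition a computable world cannot satisfy for a true Solomonoff inductor.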

(Interesting technical side question: this sort of reflection issue is exactly the sort of thing Logical Inductors were made for. Does the large-data guarantee of SI generalize to Logical Inductors in a way which handles reflection better? I do not know the answer.)

If Reflection Breaks The Guarantees, Then Why Does This Matter?

The real world does in fact contain lots of agents, and real-world agents' predictions do in fact influence the world's behavior. So presumably (allowing for uncertainty about this handwavy argument) the malignness of the Solomonoff prior should carry over to realistic use-cases, right? So why does this tangent matter in the first place?

Well, it matters because we're left with an importantly different picture: malignness is not a property of SI itself, so much as a property of SI in specific environments. Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large data guarantees show that much. We need specific external conditions - like feedback loops or other agents - in order for malignness to kick in. Colloquially speaking, it is not strictly an "inner" problem; it is a problem which depends heavily on the "outer" conditions.

If we think of malignness of SI just in terms of malign inner agents taking over, as in the post, then the problem seems largely decoupled from the specifics of the objective (i.e. accurate prediction) and environment. If that were the case, then malign inner agents would be a very neatly-defined subproblem of alignment - a problem which we could work on without needing to worry about alignment of the outer objective or reflection or embeddedness in the environment. But unfortunately the problem does not cleanly factor like that; the large-data guarantees and their breakdown show that malignness of SI is very tightly coupled to outer alignment and reflection and embeddedness and all that.

Now for one stronger claim. We don't need malign inner agent arguments to conclude that SI handles reflection and embeddedness poorly; we already knew that. Reflection and embedded world-models are already problems in need of solving, for many different reasons. The fact that malign agents in the hypothesis space are relevant for SI only in the cases where we already knew SI breaks suggests that this kind of malign inner agent is not a subproblem which we need to worry about in its own right. Indeed, I expect this is probably the case: once we have good ways of handling reflection and embeddedness in general, the problem of malign agents in the hypothesis space will go away on its own. (Infra-Bayesianism might be a case in point, though I haven't studied it enough myself to be confident in that.)

paulfchristiano (4y)

Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large data guarantees show that much

It seems like you can get malign behavior if you assume:

There are some important decisions on which you can't get feedback.
There are malign agents in the prior who can recognize those decisions.

In that case the malign agents can always defect only on important decisions where you can't get feedback.

I agree that if you can get feedback on all important decisions (and actually have time to recover from a catastrophe after getting the feedback) then malignness of the universal prior isn't important.
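A toy Bayesian mixture illustrates the role of missing feedback. Everything here is an assumption made for illustration, not part of the original argument: two hypotheses with equal prior weight, a single hypothetical "critical round", and made-up probabilities. The treacherous hypothesis copies the honest one except on the critical round, so if no feedback arrives for that round its posterior weight never drops.

```python
from typing import Callable, Dict, Iterable, Tuple

Hypothesis = Callable[[int], float]  # maps a round index to P(bit = 1)

CRITICAL_ROUND = 50  # hypothetical "important decision" round

def honest(t: int) -> float:
    """Always predicts that the bit is 1 with probability 0.9."""
    return 0.9

def treacherous(t: int) -> float:
    """Copies `honest` everywhere except on the critical round."""
    return 0.1 if t == CRITICAL_ROUND else 0.9

def posterior(priors: Dict[str, float],
              hypotheses: Dict[str, Hypothesis],
              feedback: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Bayes-update mixture weights on the (round, observed bit) pairs
    for which feedback was actually received."""
    weights = dict(priors)
    for t, bit in feedback:
        for name, h in hypotheses.items():
            p1 = h(t)
            weights[name] *= p1 if bit == 1 else 1.0 - p1
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

hypotheses = {"honest": honest, "treacherous": treacherous}
priors = {"honest": 0.5, "treacherous": 0.5}

# Feedback on every round except the critical one: the two hypotheses are
# indistinguishable on the observed data, so the posterior equals the prior.
blind_feedback = [(t, 1) for t in range(100) if t != CRITICAL_ROUND]
post_blind = posterior(priors, hypotheses, blind_feedback)

# Feedback on every round, including the critical one: the deviation is
# penalized and the treacherous hypothesis loses weight.
post_checked = posterior(priors, hypotheses, [(t, 1) for t in range(100)])

print("no feedback on critical round:", post_blind)
print("feedback on critical round:   ", post_checked)
```

In this toy setting the large-data guarantee still holds on the rounds that are scored; it simply has nothing to say about the one round that is never scored.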

I don't have a clear picture of how handling embeddedness or reflection would make this problem go away, though I haven't thought about it carefully. For example, if you replace Solomonoff induction with a reflective oracle it seems like you have an identical problem, does that seem right to you? And similarly it seems like a creature who uses mathematical reasoning to estimate features of the universal prior would be vulnerable to similar pathologies even in a universe that is computable.

ETA: that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.

Vanessa Kosoy (4y)

I don't have a clear picture of how handling embeddedness or reflection would make this problem go away, though I haven't thought about it carefully.

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.

Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?

paulfchristiano (4y)

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

I agree that removing bridge hypotheses removes one of the advantages for malign hypotheses. I didn't mention this because it doesn't seem like the way in which john is using "embeddedness;" for example, it seems orthogonal to the way in which the situation violates the conditions for solomonoff induction to be eventually correct. I'd stand by saying that it doesn't appear to make the problem go away.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses (since otherwise they also get big benefits from the influence update). And then once you've done that in a sensible way it seems like it also addresses any issues with embeddedness (though maybe we just want to say that those are being solved inside the decision theory). If you want to recover the expected behavior of induction as a component of intelligent reasoning (rather than a component of the utility function + an instrumental step in intelligent reasoning) then it seems like you need a different tack.

Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources. If you do the same induction but just remove the malign hypotheses, then it seems like you are even dumber and the problem is even worse viewed from the competitiveness perspective.

Vanessa Kosoy (4y)

I'd stand by saying that it doesn't appear to make the problem go away.

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses

I'm not sure I understand what you mean by "decision-theoretic approach". This attack vector has structure similar to acausal bargaining (between the AI and the attacker), so plausibly some decision theories that block acausal bargaining can rule out this as well. Is this what you mean?

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources.

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.

paulfchristiano (4y)

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that the benign hypothesis's description is within about 30 bits of the malign one's), and embeddedness alone isn't enough to get you there.
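The arithmetic behind the "30 bits" figure can be checked with a minimal sketch. It assumes a toy setting that is not spelled out in the thread: exactly two hypotheses that fit the data equally well, with prior weights proportional to 2^(-description length). The function name is hypothetical.

```python
import math

def benign_posterior(bit_gap: float) -> float:
    """Posterior weight of the benign hypothesis when its description is
    `bit_gap` bits longer than the malign one's and both fit the data equally."""
    odds = 2.0 ** (-bit_gap)  # benign : malign prior odds
    return odds / (1.0 + odds)

# Description-length gap at which the benign share falls to about 1e-9,
# i.e. the malign hypothesis holds roughly 99.9999999% of the probability.
gap_at_threshold = -math.log2(1e-9)

for gap in (0, 10, 30, 100):
    print(f"gap = {gap:>3} bits -> benign posterior ~ {benign_posterior(gap):.3e}")
print(f"threshold gap ~ {gap_at_threshold:.1f} bits")
```

Under these assumptions a threshold that tolerates one part in a billion of doubt stops helping once the gap exceeds roughly 30 bits.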

I'm not sure I understand what you mean by "decision-theoretic approach"

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences (and if you try to define utility in terms of solomonoff induction applied to your experiences, e.g. by learning a human, then it seems again vulnerable to attack bridging hypotheses or no).

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.

I agree that the situation is better when solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

[-]Vanessa Kosoy4yΩ450

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that benign hypothesis description is within 30 bits of the good hypothesis), and embeddedness alone isn't enough to get you there.

Why is embeddedness not enough? Once you don't have bridge rules, what is left is the laws of physics. What does the malign hypothesis explain about the laws of physics that the true hypothesis doesn't explain?

I suspect (but don't have a proof or even a theorem statement) that IB physicalism produces some kind of agreement theorem for different agents within the same universe, which would guarantee that the user and the AI should converge to the same beliefs (provided that both of them follow IBP).

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences...

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

I agree that the situation is better when solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

Okay, but suppose that the AI has real evidence for the simulation hypothesis (evidence that we would consider valid). For example, suppose that there is some metacosmological explanation for the precise value of the fine structure constant (not in the sense of, this is the value which supports life, but in the sense of, this is the value that simulators like to simulate). Do you agree that in this case it is completely rational for the AI to reason about the world via reasoning about the simulators?

[-]paulfchristiano4yΩ220

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?

[-]Vanessa Kosoy4yΩ220

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

Why? Maybe you're thinking of UDT? In which case, it's sort of true but IBP is precisely a formalization of UDT + extra nuance regarding the input of the utility function.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent.

Well, IBP is explained here. I'm not sure what kind of non-IBP agent you're imagining.

[-]johnswentworth4yΩ220

I like the feedback framing, it seems to get closer to the heart-of-the-thing than my explanation did. It makes the role of the pointers problem and latent variables more clear, which in turn makes the role of outer alignment more clear. When writing my review, I kept thinking that it seemed like reflection and embeddedness and outer alignment all needed to be figured out to deal with this kind of malign inner agent, but I didn't have a good explanation for the outer alignment part, so I focused mainly on reflection and embeddedness.

That said, I think the right frame here involves "feedback" in a more general sense than I think you're imagining it. In particular, I don't think catastrophes are very relevant.

The role of "feedback" here is mainly informational; it's about the ability to tell which decision is correct. The thing-we-want from the "feedback" is something like the large-data guarantee from SI: we want to be able to train the system on a bunch of data before asking it for any output, and we want that training to wipe out the influence of any malign agents in the hypothesis space. If there's some class of decisions where we can't tell which decision is correct, and a malign inner agent can recognize that class, then presumably we can't create the training data we need.

With that picture in mind, the ability to give feedback "online" isn't particularly relevant, and therefore catastrophes are not particularly central. We only need "feedback" in the sense that we can tell which decision we want, in any class of problems which any agent in the hypothesis space can recognize, in order to create a suitable dataset.

[-]paulfchristiano4yΩ220

We need to be able to both tell what decision we want, and identify the relevant inputs on which to train. We could either somehow identify the relevant decisions in advance (e.g. as in adversarial training), or we could do it after the fact by training online if there are no catastrophes, but it seems like we need to get those inputs one way or the other. If there are catastrophes and we can't do adversarial training, then even if we can tell which decision we want in any given case we can still die the first time we encounter an input where the system behaves catastrophically. (Or more realistically during the first wave where our systems are all exposed to such catastrophe-inducing inputs simultaneously.)

[-]tailcalled4y30

🤔 Some people talk about human ideologies as "egregores" which have independent agency. I had previously modelled them as just being a simple sort of emergent behavior, but this post makes me think that maybe they could be seen as malign inner agents embedded in world models, since they seem to cover the domain where you describe inner agents as being relevant (modelling over other agents).

[-]PhilGoetz6y220

"At its core, this is the main argument why the Solomonoff prior is malign: a lot of the programs will contain agents with preferences, these agents will seek to influence the Solomonoff prior, and they will be able to do so effectively."

First, this is irrelevant to most applications of the Solomonoff prior. If I'm using it to check the randomness of my random number generator, I'm going to be looking at 64-bit strings, and probably very few intelligent-life-producing universe-simulators output just 64 bits, and it's hard to imagine how an alien in a simulated universe would want to bias my RNG anyway.

The S. prior is a general-purpose prior which we can apply to any problem. The output string has no meaning except in a particular application and representation, so it seems senseless to try to influence the prior for a string when you don't know how that string will be interpreted.

Can you give an instance of an application of the S. prior in which, if everything you wrote were correct, it would matter?

Second, it isn't clear that this is a bug rather than a feature. Say I'm developing a program to compress photos. I'd like to be able to ask "what are the odds of seeing this image, ever, in any universe?" That would probably compress images of plants and animals better than other priors, because in lots of universes life will arise and evolve, and features like radial symmetry, bilateral symmetry, leaves, legs, etc., will arise in many universes. This biasing of priors by evolution doesn't seem to me different than biasing of priors by intelligent agents; evolution is smarter than any agent we know. And I'd like to get biasing from intelligent agents, too; then my photo-compressor might compress images of wheels and rectilinear buildings better.

Also in the category of "it's a feature, not a bug" is that, if you want your values to be right, and there's a way of learning the values of agents in many possible universes, you ought to try to figure out what their values are, and update towards them. This argument implies that you can get that for free by using Solomonoff priors.

(If you don't think your values can be "right", but instead you just believe that your values morally oblige you to want other people to have those values, you're not following your values, you're following your theory about your values, and probably read too much LessWrong for your own good.)

Third, what do you mean by "the output" of a program that simulates a universe? How are we even supposed to notice the infinitesimal fraction of that universe's output which the aliens are influencing to subvert us? Take your example of Life--is the output a raster scan of the 2D bit array left when the universe goes static? In that case, agents have little control over the terminal state of their universe (and also, in the case of Life, the string will be either almost entirely zeroes, or almost entirely 1s, and those both already have huge Solomonoff priors). Or is it the concatenation of all of the states it goes through, from start to finish? In that case, by the time intelligent agents evolve, their universe will have already produced more bits than our universe can ever read.

Are you imagining that bits are never output unless the accidentally-simulated aliens choose to output a bit? I can't imagine any way that could happen, at least not if the universe is specified with a short instruction string.

This brings us to the 4th problem: It makes little sense to me to worry about averaging in outputs from even mere planetary simulations if your computer is just the size of a planet, because it won't even have enough memory to read in a single output string from most such simulations.

5th, you can weigh each program's output proportional to 2^-T, where T is the number of steps it takes the TM to terminate. You've got to do something like that anyway, because you can't run TMs to completion one after another; you've got to do something like take a large random sample of TMs and iteratively run each one step. Problem solved.
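A minimal sketch of the scheme described above, assuming toy "machines" modelled as Python generators rather than real Turing machines. The helpers `make_program` and `dovetail` are hypothetical names introduced here for illustration: every machine is advanced one step per round, and a machine that halts on step T contributes weight 2^-T to its output.

```python
from collections import defaultdict
from fractions import Fraction

def make_program(steps, output):
    """Hypothetical toy machine: computes for `steps` steps, then halts with `output`."""
    def run():
        for _ in range(steps):
            yield None      # one step of computation, nothing emitted yet
        yield output        # the halting step
    return run

def dovetail(programs, max_rounds):
    """Advance every machine one step per round; weight each output by 2**-T,
    where T is the round in which the machine halted."""
    running = {name: prog() for name, prog in programs.items()}
    weights = defaultdict(Fraction)
    for t in range(1, max_rounds + 1):
        for name in list(running):          # copy keys so we can delete while looping
            result = next(running[name])
            if result is not None:          # machine halted on this step
                weights[result] += Fraction(1, 2 ** t)
                del running[name]
    return dict(weights)

programs = {
    "fast": make_program(0, "0"),   # halts on step 1
    "slow": make_program(9, "1"),   # halts on step 10
}
print(dovetail(programs, max_rounds=20))
```

Machines that have not halted within `max_rounds` simply contribute nothing, which is the sense in which a slow universe-simulating program is discounted under this weighting.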

Maybe I'm misunderstanding something basic, but I feel like we're talking about how many angels can dance on the head of a pin.

Perhaps the biggest problem is that you're talking about an entire universe of intelligent agents conspiring to change the "output string" of the TM that they're running in. This requires them to realize that they're running in a simulation, and that the output string they're trying to influence won't even be looked at until they're all dead and gone. That doesn't seem to give them much motivation to devote their entire civilization to twiddling bits in their universe's final output in order to shift our priors infinitesimally. And if it did, the more likely outcome would be an intergalactic war over what string to output.

(I understand your point about them trying to "write themselves into existence, allowing them to effectively "break into" our universe", but as you've already required their TM specification to be very simple, this means the most they can do is cause some type of life that might evolve in their universe to break into our universe. This would be like humans on Earth devoting the next billion years to tricking God into re-creating slime molds after we're dead. Whereas the things about themselves that intelligent life actually care about and self-identify with are those things that distinguish them from their neighbors. Their values will be directed mainly towards opposing the values of other members of their species. None of those distinguishing traits can be implicit in the TM, and even if they could, they'd cancel each other out.)

Now, if they were able to encode a message to us in their output string, that might be more satisfying to them. Like, maybe, "FUCK YOU, GOD!"

[-]Mark Xu6y220

The S. prior is a general-purpose prior which we can apply to any problem. The output string has no meaning except in a particular application and representation, so it seems senseless to try to influence the prior for a string when you don't know how that string will be interpreted.

The claim is that consequentialists in simulated universes will model decisions based on the Solomonoff prior, so they will know how that string will be interpreted.

Can you give an instance of an application of the S. prior in which, if everything you wrote were correct, it would matter?

Any decision that controls substantial resource allocation will do. For example, if we're evaluating the impact of running various programs, blowing up planets, interfering with alien life, etc.

Also in the category of "it's a feature, not a bug" is that, if you want your values to be right, and there's a way of learning the values of agents in many possible universes, you ought to try to figure out what their values are, and update towards them. This argument implies that you can get that for free by using Solomonoff priors.

If you are a moral realist, this does seem like a possible feature of the Solomonoff prior.

Third, what do you mean by "the output" of a program that simulates a universe?

A TM that simulates a universe must also specify an output channel.

Take your example of Life--is the output a raster scan of the 2D bit array left when the universe goes static? In that case, agents have little control over the terminal state of their universe (and also, in the case of Life, the string will be either almost entirely zeroes, or almost entirely 1s, and those both already have huge Solomonoff priors). Or is it the concatenation of all of the states it goes through, from start to finish?

All of the above. We are running all possible TMs, so all computable universes will be paired with all computable output channels. It's just a question of complexity.

Are you imagining that bits are never output unless the accidentally-simulated aliens choose to output a bit? I can't imagine any way that could happen, at least not if the universe is specified with a short instruction string.

No.

This brings us to the 4th problem: It makes little sense to me to worry about averaging in outputs from even mere planetary simulations if your computer is just the size of a planet, because it won't even have enough memory to read in a single output string from most such simulations.

I agree that approximating the Solomonoff prior is difficult and thus its malignancy probably doesn't matter in practice. I do think similar arguments apply to cases that do matter.

5th, you can weigh each program's output proportional to 2^-T, where T is the number of steps it takes the TM to terminate. You've got to do something like that anyway, because you can't run TMs to completion one after another; you've got to do something like take a large random sample of TMs and iteratively run each one step. Problem solved.

See the section on the Speed prior.

Perhaps the biggest problem is that you're talking about an entire universe of intelligent agents conspiring to change the "output string" of the TM that they're running in. This requires them to realize that they're running in a simulation, and that the output string they're trying to influence won't even be looked at until they're all dead and gone. That doesn't seem to give them much motivation to devote their entire civilization to twiddling bits in their universe's final output in order to shift our priors infinitesimally. And if it did, the more likely outcome would be an intergalactic war over what string to output.

They don't have to realize they're in a simulation, they just have to realize their universe is computable. Consequentialists care about their values after they're dead. The cost of influencing the prior might not be that high because they only have to compute it once and the benefit might be enormous. Exponential decay + acausal trade make an intergalactic war unlikely.

[-]Raemon6yΩ8180

Curated. This post does a good job of summarizing a lot of complex material, in a (moderately) accessible fashion.

[-]Ben Pace6yΩ480

+1 I already said I liked it, but this post is great and will immediately be the standard resource on this topic. Thank you so much.

[-]ESRogs6y140

If it's true that simulating that universe is the simplest way to predict our human, then some non-trivial fraction of our prediction might be controlled by a simulation in another universe. If these beings want us to act in certain ways, they have an incentive to alter their simulation to change our predictions.

I find this confusing. I'm not saying it's wrong, necessarily, but it at least feels to me like there's a step of the argument that's being skipped.

To me, it seems like there's a basic dichotomy between predicting and controlling. And this is claiming that somehow an agent somewhere is doing both. (Or actually, controlling by predicting!) But how, exactly?

Is it that:

these other agents are predicting us, by simulating us, and so we should think of ourselves as partially existing in their universe? (with them as our godlike overlords who can continue the simulation from the current point as they wish)
the Consequentialists will predict accurately for a while, and then make a classic "treacherous turn" where they start slipping in wrong predictions designed to influence us rather than be accurate, after having gained our trust?
something else?

My guess is that it's the second thing (in part from having read, and very partially understood, Paul's posts on this a while ago). But then I would expect some discussion of the "treacherous turn" aspect of it -- of the fact that they have to predict accurately for a while (so that we rate them highly in our ensemble of programs), and only then can they start outputting predictions that manipulate us.

Is that not the case? Have I misunderstood something?

(Btw, I found the stuff about python^10 and exec() pretty clear. I liked those examples. Thank you! It was just from this point on in the post that I wasn't quite sure what to make of it.)
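The "treacherous turn" reading in the second option can be illustrated with a toy Bayesian mixture. This is only a sketch under stated assumptions: the three predictors, their prior weights, and the names `posterior`, `honest`, `treacherous`, `noise` and `TURN` are all hypothetical, and nothing here is Solomonoff induction proper. It shows that a shorter hypothesis which predicts accurately up to step `TURN` keeps its full prior advantage over the honest hypothesis until it deviates.

```python
from fractions import Fraction

TURN = 100  # hypothetical step at which the treacherous predictor starts deviating

def honest(history):
    """Always assigns probability 9/10 to the next bit being '1'."""
    return Fraction(9, 10)

def treacherous(history):
    """Copies the honest prediction until TURN, then pushes towards '0'."""
    return Fraction(9, 10) if len(history) < TURN else Fraction(1, 10)

def noise(history):
    """Treats every bit as a fair coin flip."""
    return Fraction(1, 2)

def posterior(prior, predictors, observed):
    """Bayesian update of hypothesis weights on an observed bit string.
    Each predictor maps a history to P(next bit == '1')."""
    weights = dict(prior)
    for i, bit in enumerate(observed):
        history = observed[:i]
        for name, predict in predictors.items():
            p_one = predict(history)
            weights[name] *= p_one if bit == "1" else 1 - p_one
        total = sum(weights.values())
        weights = {name: w / total for name, w in weights.items()}
    return weights

# Assumed prior: the treacherous hypothesis is 4 bits shorter than the honest one.
prior = {"honest": Fraction(1, 2 ** 12),
         "treacherous": Fraction(1, 2 ** 8),
         "noise": Fraction(1, 2)}
predictors = {"honest": honest, "treacherous": treacherous, "noise": noise}

weights = posterior(prior, predictors, "1" * 50)   # 50 observations, all before TURN
print({name: float(w) for name, w in weights.items()})
```

Before `TURN` the two accurate predictors have identical likelihoods, so observing more data never changes their ratio; only the prior gap matters. Data before the turn therefore cannot separate them, which is why the large-data guarantee does not help until the deviation has already happened.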

[-]Pongo6y40

My understanding is the first thing is what you get with UDASSA and the second thing would be what you get if you think the Solomonoff prior is useful for predicting your universe for some other reason (i.e. not because you think the likelihood of finding yourself in some situation covaries with the Solomonoff prior's weight on that situation).

[-]romeostevensit6y*140

This is great. I really appreciate when people try to summarize complex arguments that are spread across multiple posts.

Also, I basically do this (try to infer the right prior). My guiding navigation is trying to figure out what (I call) the super cooperation cluster would do then do that.

[-]Kenny6y*90

I liked this post a lot, but I did read it as something of a scifi short story with a McGuffin called "The Solomonoff Prior".

It probably also seemed really weird because I just read Why Philosophers Should Care About Computational Complexity [PDF] by Scott Aaronson and having read it makes sentences like this seem 'not even' insane:

The combined strategy is thus to take a distribution over all decisions informed by the Solomonoff prior, weight them by how much influence can be gained and the version of the prior being used, and read off a sequence of bits that will cause some of these decisions to result in a preferred outcome.

The Consequentialists are of course the most badass (by construction) alien villains ever "trying to influence the Solomonoff prior" as they are wont!

Given that some very smart people seem to seriously believe in Platonic realism, maybe there are Consequentialists malignly influencing vast infinities of universes! Maybe our universe is one of them.

I'm not sure why, but I feel like the discovery of a proof of P = NP or P ≠ NP is the climax of the heroes' valiant struggle, as the true heirs of the divine right to wield The Solomonoff Prior, against the dreaded (other universe) Consequentialists.

[-]Charlie Steiner4yΩ470Review for 2020 Review

This was a really interesting post, and is part of a genre of similar posts about acausal interaction with consequentialists in simulatable universes.

The short argument is that if we (or not us, but someone like us with way more available compute) try to use the Kolmogorov complexity of some data to make a decision, our decision might get "hijacked" by simple programs that run for a very very long time and simulate aliens who look for universes where someone is trying to use the Solomonoff prior to make a decision and then based on what decision they want, they can put different data at high-symmetry locations in their own simulated universe.

I don't think this really holds up (see discussion in the comments, e.g. Veedrac's). One lesson to take away here is that when arguing verbally, it's hard to count the number of pigeons versus the number of holes. How many universes full of consequentialists are there in programs of length <m, and how many people using the Solomonoff prior to make decisions are there in programs of length <n, for the (m,n) that seem interesting? (Given the requirement that all these people live in universes that allow huge computations, they might even be the same program!) These are the central questions, but none of the (many, well-written, virtuous) predicted counterarguments address this. I'd be interested in at least attempts at numerical estimates, or illustrations of what sorts of problems you run into when estimating.

[-]Commander Zander6y60

Great post. I encountered many new ideas here.

One point confuses me. Maybe I'm missing something. Once the consequentialists in a simulation are contemplating the possibility of simulation, how would they arrive at any useful strategy? They can manipulate the locations that are likely to be the output/measurement of the simulation, but manipulate to what values? They know basically nothing about how the input will be interpreted, what question the simulator is asking, or what universe is doing the simulation. Since their universe is very simple, presumably many simulators are running identical copies of them, with different manipulation strategies being appropriate for each. My understanding of this sounds less like malign and more like blindly mischievous.

TLDR How do the consequentialists guess which direction to bias the output towards?

[-]Mark Xu6y30

Consequentialists can reason about situations in which other beings make important decisions using the Solomonoff prior. If multiple beings are simulating them, they can decide randomly (because having e.g. 1/100 of the resources is better than none, which is the expectation of "blind mischievousness").

An example of this sort of reasoning is Newcomb's problem with the knowledge that Omega is simulating you. You get to "control" the result of your simulation by controlling how you act, so you can influence whether or not Omega expects you to one-box or two-box, controlling whether there is $1,000,000 in one of the boxes.

[-]Commander Zander6y10

Okay, deciding randomly to exploit one possible simulator makes sense.

As for choosing exactly what to set the output cells of the simulation to... I'm still wrapping my head around it. Is recursive simulation the only way to exploit these simulations from within?

[-]Roko6yΩ160

It seems to me that using a combination of execution time, memory use and program length mostly kills this set of arguments.

Something like a game-of-life initial configuration that leads to the eventual evolution of intelligent game-of-life aliens who then strategically feed outputs into GoL in order to manipulate you may have very good complexity performance, but both the speed and memory are going to be pretty awful. The fixed cost in memory and execution steps of essentially simulating an entire universe is huge.

But yes, the pure complexity prior certainly has some perverse and unsettling properties.

EDIT: This is really a special case of Mesa-Optimizers being dangerous. (See, e.g. https://www.lesswrong.com/posts/XWPJfgBymBbL3jdFd/an-58-mesa-optimization-what-it-is-and-why-we-should-care). The set of dangerous Mesa-Optimizers is obviously bigger than just "simulated aliens" and even time- and space-efficient algorithms might run into them.

[-]Tomáš Gavenčiak6yΩ380

Complexity indeed matters: the universe seems to be bounded in both time and space, so running anything like Solomonoff prior algorithm (in one of its variants) or AIXI may be outright impossible for any non-trivial model. This for me significantly weakens or changes some of the implications.

A Fermi upper bound of the direct Solomonoff/AIXI algorithm trying TMs in the order of increasing complexity: even if checking one TM took one Planck time on one atom, you could only check roughly 10^250 ≈ 2^800 machines within a lifetime of the universe (~10^110 years until Heat death), so the machines you could even look at have description complexity a meager 800 bits.

You could likely speed the greedy search up, but note that most algorithmic speedups do not have a large effect on the exponent (even multiplying the exponent with constants is not very helpful).
Significantly narrowing down the space of TMs to a narrow subclass may help, but then we need to take a look at the particular (small) class of TMs rather than have intuitions about all TMs. (And the class would need to be really narrow - see below).
Due to the Church-Turing thesis, any limiting of the scope of the search is likely not very effective, as you can embed arbitrary programs (and thus arbitrary complexity) in anything that is strong enough to be a TM interpreter (which the universe is in multiple ways).
It may be hypothetically possible to search for the "right" TMs without examining them individually (with some future tech, e.g. how sci-fi imagined quantum computing), but if such speedup is possible, any TMs modelling the universe would need to be able to contain this. This would increase any evaluation complexity of the TMs, making them significantly more costly than the Planck time I assumed above (would need a finer Fermi estimate with more complex assumptions?).
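The Fermi bound above can be reproduced with a few lines of arithmetic. This is a rough sketch: the lifetime figure is the one given in the estimate, while the atom count (10^80) and the physical constants are order-of-magnitude assumptions supplied here, and `checkable_machines` is a hypothetical helper name.

```python
import math

PLANCK_TIME_S = 5.39e-44           # seconds per Planck time (approximate)
SECONDS_PER_YEAR = 3.156e7         # approximate
UNIVERSE_LIFETIME_YEARS = 1e110    # figure used in the estimate above
ATOMS = 1e80                       # assumption: common order-of-magnitude figure

def checkable_machines():
    """Upper bound on machines checked if each atom checks one machine per Planck time."""
    ticks_per_atom = UNIVERSE_LIFETIME_YEARS * SECONDS_PER_YEAR / PLANCK_TIME_S
    return ATOMS * ticks_per_atom

machines = checkable_machines()
print(f"log10(machines) = {math.log10(machines):.1f}")
print(f"log2(machines)  = {math.log2(machines):.1f}")
```

With these inputs the bound lands in the neighbourhood of 2^800, so the conclusion that only machines of roughly 800 bits of description complexity are reachable does not depend much on the exact constants: changing any input by a factor of a thousand shifts the exponent by only about 10 bits.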

[-]Mark Xu6y40

I am not so convinced that penalizing more stuff will make these arguments weak enough that we don't have to worry about them. For an example of why I think this, see Are minimal circuits deceptive?. Also, adding execution/memory constraints penalizes all hypotheses and I don't think universes with consequentialists are asymmetrically penalized.

I agree about this being a special case of mesa-optimization.

[-]Roko6y70

adding execution/memory constraints penalizes all hypotheses

In reality these constraints do exist, so the question of "what happens if you don't care about efficiency at all?" is really not important. In practice, efficiency is absolutely critical and everything that happens in AI is dominated by efficiency considerations.

I think that mesa-optimization will be a problem. It probably won't look like aliens living in the Game of Life though.

It'll look like an internal optimizer that just "decides" that the minds of the humans who created it are another part of the environment to be optimized for its not-correctly-aligned goal.

[-]Ben Pace6yΩ250

Such a great post.

Note that I changed the formatting of your headers a bit, to make some of them just bold text. They still appear in the ToC just fine. Let me know if you'd like me to revert it or have any other issues.

[-]Mark Xu6y20

Looks better - thanks!

[-]FactorialCode6y*50

At its core, this is the main argument why the Solomonoff prior is malign: a lot of the programs will contain agents with preferences, these agents will seek to influence the Solomonoff prior, and they will be able to do so effectively.

Am I the only one who sees this much less as a statement that the Solomonoff prior is malign, and much more a statement that reality itself is malign? I think that the proper reaction is not to use a different prior, but to build agents that are robust to the possibility that we live in a simulation run by influence seeking malign agents so that they don't end up like this.

[-]Ofer6y50

If arguments about acausal trade and value handshakes hold, then the resulting utility function might contain some fraction of human values.

I think Paul's Hail Mary via Solomonoff prior idea is not obviously related to acausal trade. (It does not privilege agents that engage in acausal trade over ones that don't.)

[-]Mark Xu6y20

I agree. The sentence quoted is a separate observation.

[-]adamShimi6yΩ340

I like this post, which summarizes other posts I wanted to read for a long time.

Yet I'm still confused by a fairly basic point: why would the agents inside the prior care about our universe? Like, I have preferences, and I don't really care about other universes. Is it because we're running their universe, and thus they can influence their own universe through ours? Or is there another reason why they are incentivized to care about universes which are not causally related to theirs?

[-]evhub6yΩ4100

I don't really care about other universes

Why not? I certainly do. If you can fill another universe with people living happy, fulfilling lives, would you not want to?

[-]adamShimi6yΩ120

Okay, it's probably subtler than that.

I think you're hinting at things like the expanding moral circle. And according to that, there's no reason that I should care more about people in my universe than people in other universes. I think this makes sense when saying whether I should care. But the analogy with "caring about people in a third world country on the other side of the world" breaks down when we consider our means to influence these other universes. Being able to influence the Solomonoff prior seems like a very indirect way to alter another universe, on which I have very little information. That's different from buying Malaria nets.

So even if you're altruistic, I doubt that "other universes" would be high in your priority list.

The best argument I can find for why you would want to influence the prior is if it is a way to influence the simulation of your own universe, à la gradient hacking.

[-]Mark Xu6yΩ360

I personally see no fundamental difference between direct and indirect ways of influence, except in so far as they relate to stuff like expected value.

I agree that given the amount of expected influence, other universes are not high on my priority list, but they are still on my priority list. I expect the same for consequentialists in other universes. I also expect consequentialist beings that control most of their universe to get around to most of the things on their priority list, hence I expect them to influence the Solomonoff prior.

[-]Daniel Kokotajlo6y40

Is the link for the 6-byte Code Golf solution correct? It takes me to something that appears to be 32 bytes.

[-]Mark Xu6y20

Nope. Should be fixed now.

[-]Veedrac6y30

I think this is wrong, but I'm having trouble explaining my intuitions. There are a few parts:

You're not doing Solomonoff right, since you're meant to condition on all observations. This makes it harder for simple programs to interfere with the outcome.
More importantly but harder to explain, you're making some weird assumptions about the simplicity of meta-programs that I would bet are wrong. There seems to be a computational difficulty here, in that you envision small worlds trying to manipulate $2^{m}$ other worlds, where $m > n$. That makes it really hard for the simplest program to be one where the meta-program that's interpreting the pointer to our world is a rational agent, rather than some more powerful but less grounded search procedure. If ‘naturally’ evolved agents are interpreting the information pointing to the situation they might want to interfere with, this limits the complexity of that encoding. If they're just simulating a lot of things to interfere with as many worlds as possible, they ‘run out of room’, because $2^{m} \gg 2^{n}$.
Your examples almost self-refute, in the sense that if there's an accurate simulation of you being manipulated at time $t + 1$, it implies that simulation is not materially interfered with at time $t$, so even if the vast majority of Solomonoff inductions have an attempted adversary, most of them will miss anyway. Hypothetically, superrational agents might still be able to coordinate to manipulate some very small fraction of worlds, but it'd be hard and only relevant to those worlds.
Compute has costs. The most efficient use of compute is almost always to enact your preferences directly, not to manipulate other random worlds with low probability. By the time you can interfere with Solomonoff, you have better options.
To the extent that a program $P$ is manipulating predictions so that another program that is simulating $P$ performs unusually... well, then that's just how the metaverse is. If the simplest program containing your predictions is an attempt at manipulating you, then the simplest program containing you is probably being manipulated.

[-]torekp6y30

the initial conditions of the universe are simpler than the initial conditions of Earth.

This seems to violate a conservation of information principle in quantum mechanics.

[-]Mark Xu6y60

Perhaps it would have been better worded as "the simplest way to specify the initial conditions of Earth is to specify the initial conditions of the universe, the laws of physics, and the location of Earth."
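One way to read this rewording is as an upper bound on description length. The symbols here are illustrative assumptions, not notation from the post: $E_0$ for Earth's initial conditions, $U_0$ for the universe's initial conditions, $\Lambda$ for the laws of physics, and $\ell$ for the location of Earth. As a rough sketch, up to lower-order terms:

$$K(E_0) \;\lesssim\; K(U_0) + K(\Lambda) + K(\ell)$$

The claim is then that the right-hand side is the shortest available description of $E_0$, not that $K(U_0)$ alone is compared against $K(E_0)$.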

[-]torekp6y50

Right, you're interested in syntactic measures of information more than physical ones. My bad.

[-]andrew sauer6y30

In your section "complexity of conditioning", if I am understanding correctly, you compare the amount of information required to produce consequentialists with the amount of information in the observations we are conditioning on. This, however, is not apples to oranges: the consequentialists are competing against the "true" explanation of the data, the one that specifies the universe and where to find the data within it, they are not competing against the raw data itself. In an ordered universe, the "true" explanation would be shorter than the raw observation data, that's the whole point of using Solomonoff induction after all.

So, there are two advantages the consequentialists can exploit to "win" and be the shorter explanation. This exploitation must be enough to overcome those 10-1000 bits. One is that, since the decision which is being made is very important, they can find the data within the universe without adding any further complexity. This, to me, seems quite malign, as the "true" explanation is being penalized simply because we cannot read data directly from the program which produces the universe, not because this universe is complicated.

The second possible advantage is that these consequentialists may value our universe for some intrinsic reason, such as the life in it, so that they prioritize it over other universes and therefore it takes fewer bits to specify their simulation of it. However, if you could argue that the consequentialists actually had an advantage here which outweighed their own complexity, this would just sound to me like an argument that we are living in a simulation, because it would essentially be saying that our universe is unduly tuned to be valuable for consequentialists, to such a degree that the existence of these consequentialists is less of a coincidence than it just happening to be that valuable.

[-]Mark Xu6y10

In your section "complexity of conditioning", if I am understanding correctly, you compare the amount of information required to produce consequentialists with the amount of information in the observations we are conditioning on. This, however, is not apples to oranges: the consequentialists are competing against the "true" explanation of the data, the one that specifies the universe and where to find the data within it, they are not competing against the raw data itself. In an ordered universe, the "true" explanation would be shorter than the raw observation data, that's the whole point of using Solomonoff induction after all.

The data we're conditioning on has K-complexity of one megabyte. Maybe I didn't make this clear.

So, there are two advantages the consequentialists can exploit to "win" and be the shorter explanation. This exploitation must be enough to overcome those 10-1000 bits. One is that, since the decision which is being made is very important, they can find the data within the universe without adding any further complexity. This, to me, seems quite malign, as the "true" explanation is being penalized simply because we cannot read data directly from the program which produces the universe, not because this universe is complicated.

I don't think I agree with this. Thinking in terms of consequentialists competing against "true" explanations doesn't make that much sense to me. It seems similar to making the exec hello world "compete" against the "true" print hello world.

The "complexity of consequentialists" section answers the question of "how long is the exec function?" where the "interpreter" exec calls is a universe filled with consequentialists.
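A minimal sketch of the exec/print analogy, assuming it refers to two Python programs that produce the same output by different routes. The helper `run_and_capture` and the variable names are hypothetical and only serve the illustration:

```python
import contextlib
import io


def run_and_capture(source: str) -> str:
    """Run a Python source string and return whatever it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # fresh globals; builtins are added automatically
    return buffer.getvalue()


# The "direct" program prints the string itself.
direct = "print('hello world')"
# The "indirect" program hands a program to an interpreter, which prints it.
indirect = "exec(\"print('hello world')\")"

direct_output = run_and_capture(direct)
indirect_output = run_and_capture(indirect)

# Both routes yield the same observation; they differ only in length.
same_output = direct_output == indirect_output
length_overhead = len(indirect) - len(direct)
```

In this toy case the indirect route costs only the few characters needed to name the interpreter, because `exec` is already built in. The question in the comment above is how large that overhead is when the "interpreter" has to be specified from scratch as a universe containing consequentialists.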

However, if you could argue that the consequentialists actually had an advantage here which outweighed their own complexity, this would just sound to me like an argument that we are living in a simulation, because it would essentially be saying that our universe is unduly tuned to be valuable for consequentialists, to such a degree that the existence of these consequentialists is less of a coincidence than it just happening to be that valuable.

I do not understand what this is saying. I claim that consequentialists can reason about our universe by thinking about TMs because our universe is computable. Given that our universe supports life, it might thus be valuable to some consequentialists in other universes. I don't think the argument takes a stance on whether this universe is a simulation; it merely claims that this universe could be simulated.

[-]Paul LM6mo10

Thank you for the interesting piece. You write « In our example, it seems likely that "simulate the entire universe" is simpler than "simulate Earth" or "simulate part of Earth" because the initial conditions of the universe are simpler than the initial conditions of Earth. » but since the laws of physics are reversible, we can get from the initial conditions for a neighborhood of Earth to the initial conditions of the universe. We must take a cutoff (a smooth bump function in spacetime around the Earth in a time slice). This determines a single initial universe condition, and the set of admissible bump functions would determine a set of initial conditions for the universe. Conversely, the universe determines the Earth later. Thus the initial conditions for the universe are more Kolmogorov-complex than those of Earth, as the overhead for simulating the universe up to any given future state of Earth is negligible: any small universal TM's code will do, plus the spacetime region to evolve to, by the physical version of the Church-Turing thesis.

[-]jchan6y10

I'm trying to wrap my head around this. Would the following be an accurate restatement of the argument?

1. Start with the Dr. Evil thought experiment, which shows that it's possible to be coerced into doing something by an agent who has no physical access to you, other than communication.
2. We can extend this to the case where the agents are in two separate universes, if we suppose that (a) the communication can be replaced with an acausal negotiation, with each agent deducing the existence and motives of the other; and that (b) the Earthlings (the ones coercing Dr. Evil) care about what goes on in Dr. Evil's universe.
- Argument for (a): With sufficient computing power, one can run simulations of another universe to figure out what agents live within that universe.
- Argument for (b): For example, the Earthlings might want Dr. Evil to write embodied replicas of them in his own universe, thus increasing the measure of their own consciousness. This is not different in kind from you wanting to increase the probability of your own survival - in both cases, the goal is to increase the measure of worlds in which you live.
3. To promote their goal, when the Earthlings run their simulation of Dr. Evil, they will intervene in the simulation to punish/reward the simulated Dr. Evil depending on whether he does what they (the Earthlings) want.
4. For his own part, Dr. Evil, if he is using the Solomonoff prior to predict what happens next in his universe, must give some probability to the hypothesis that his being in such a simulation is in fact what explains all of his experiences up till that point (rather than his being a ground-level being). And if that hypothesis is true, then Dr. Evil will expect to be rewarded/punished based on whether he carries out the wishes of the Earthlings. So, he will modify his actions accordingly.
5. The probability of the simulation hypothesis may be non-negligible, because the Solomonoff prior considers only the complexity of the hypothesis and not that of the computation unfolding from it. In fact, the hypothesis "There is a universe with laws A+B+C, which produces Earthlings who run a simulation with laws X+Y+Z which produces Dr. Evil, but then intervene in the simulation as described in #3" may actually be simpler (and thus more probable) than "There is a universe with laws X+Y+Z which produces Dr. Evil, and those laws hold forever".
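The last point can be made concrete with a toy calculation. This is a sketch under strong assumptions: only two hypotheses compete, each gets weight 2^(-description length in bits), and the bit counts in the usage note are made-up placeholders rather than estimates. The function names are hypothetical. The actual Solomonoff prior sums over all programs consistent with the observations.

```python
from fractions import Fraction


def prior_weight(description_bits: int) -> Fraction:
    """Weight 2^-bits: depends only on description length, not on runtime."""
    return Fraction(1, 2 ** description_bits)


def posterior_share(bits_a: int, bits_b: int) -> Fraction:
    """Share of hypothesis A when A and B both fit the observations equally well."""
    weight_a = prior_weight(bits_a)
    weight_b = prior_weight(bits_b)
    return weight_a / (weight_a + weight_b)
```

For example, if the simulation hypothesis happened to be 10 bits shorter than the direct one (say 990 versus 1000 bits, purely illustrative numbers), `posterior_share(990, 1000)` gives it a share of 1024/1025, no matter how much more computation it takes to unfold.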

[-]Signer6y10

Wouldn't complexity of earth and conditioning on importance be irrelevant because it would still appear in consequentialists' distribution of strings and in specification of what kind of consequentialists we want? Therefore they will only have the advantage of anthropic update, that would go to zero in the limit of string's length, because choice of the language would correlate with string's content, and penalty for their universe + output channel.
