Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.


Just a small note which I’m not sure has been mentioned anywhere else:

It seems like there are two different classes of “updateless reasoning”.

In problems like Agent Simulates Predictor, switching to updateless reasoning is better for you in the very situation you find yourself in. The gains accrue to you. You objectively achieve higher expected value, at the point of decision, by making the decision from the perspective of yourself long ago rather than doing what seems higher EV from the current perspective.

In problems like counterfactual mugging, the gains do not accrue to the agent at the point of making the decision. The increase in expected value goes to other possible selves, which the decision-point self does not even believe in any more. The claim of higher EV is quite subjective; it depends entirely on one’s prior.

For lack of better terms, I’ll call the first type all-upside updatelessness; the second type is mixed-upside.
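To make the contrast concrete, here is a minimal sketch of the counterfactual mugging numbers. The stakes (pay 100, receive 10000), the fair coin, and the function names are illustrative assumptions, not part of the argument above. It only shows that the policy of paying looks good from the prior and bad after updating on the losing branch, and that the prior verdict flips if the prior changes.

```python
from fractions import Fraction

# Illustrative stakes (assumptions): pay COST on heads; receive REWARD on tails
# if the predictor expects the agent to pay on heads.
COST, REWARD = 100, 10000

def prior_ev(pays: bool, p_heads: Fraction = Fraction(1, 2)) -> Fraction:
    """Expected value of a policy, judged before the coin is seen."""
    heads_payoff = -COST if pays else 0
    tails_payoff = REWARD if pays else 0
    return p_heads * heads_payoff + (1 - p_heads) * tails_payoff

def posterior_ev_given_heads(pays: bool) -> int:
    """Value after updating on heads: the tails branch gets no weight."""
    return -COST if pays else 0

print(prior_ev(True), prior_ev(False))  # 4950 0
print(posterior_ev_given_heads(True), posterior_ev_given_heads(False))  # -100 0
```

With a prior that puts 999/1000 on heads, `prior_ev(True, Fraction(999, 1000))` is negative, which is the sense in which the mixed-upside claim of higher EV depends on the prior.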

It is quite possible to construct decision theories which get all-upside updateless reasoning without getting mixed-upside. Asymptotic decision theory was one.

On the other hand, it seems unlikely that any natural proposal would get mixed-upside without getting the all-upside cases. Policy selection, for example, automatically gets both types (to the limited extent that it enables updateless reasoning).

Nonetheless, I find it plausible that one wants two different mechanisms to get the two different kinds. It seems to me that one can handle all-upside cases in a more objective way, getting good overall guarantees. Mixed-upside cases, on the other hand, require more messiness and compromise, as in the policy selection proposal. So, it could be beneficial to combine a mechanism which does perfectly for all-upside cases with a mechanism that provides some weaker guarantee for mixed-upside.


I think that in both cases, the gains accrue for different possible selves. The difference is that in counterfactual mugging, this you still exists, independent of what action you take, while in ASP, you can destroy the world that this you is in by taking a different action.

Yep. If we think about it from a commitment-making perspective, the difference is that all-upside cases rule out possible worlds in a way which is definitely an improvement, whereas mixed-upside cases rule out possibilities in an uncertain gamble. In the context of logical uncertainty, what the "real prior" is is confusing, so updateless choices which are good regardless of how the chances shift around are much more objectively good.

From a destroying-worlds perspective, this might be seen as destroying worlds in a way which definitely improves things vs destroying worlds in a chancy way. I suppose this could come apart from the distinction between choices which destroy the world you're in vs choices where you still exist.

If an agent receives an offer to self-modify today into something that pays up in Counterfactual Mugging tomorrow, plain old VNM says it should accept the offer. So full updatelessness is forced by reflective consistency. Studying reflectively inconsistent theories is a waste of time, because an AI won't keep them for long.

Getting updatelessness as originally conceived to work well with logical induction is difficult at best, and may be outright impossible. If reflective consistency and logical uncertainty turn out to be incompatible, that puts us in an awkward position. We have to take a stance which allows us to chip away at the problem.

The stance I'm taking here is: "What we even want to mean by updatelessness is puzzling in the context of logical uncertainty. Let's see if we can deconstruct 'updatelessness' more, in itself, to move toward being less confused."

I think, roughly, that we actually want all-upside updatelessness, but mixed-upside updatelessness falls apart when we move to logical uncertainty. In counterfactual mugging with a logical coin, we want to be logically updateless enough to give Omega the money, but not so logically updateless that we are totally ignorant of the structure of the problem and do some random thing. It doesn't seem like there's a nice dividing line between the two. If the counterfactual mugging is based on the question of whether a large pseudorandom number is prime, how many heuristics of prime-number testing are we supposed to use in our judgement of whether to take the gamble?

The "right thing" may have to achieve reflective consistency without being fully updateless in the mixed-upside sense. Or perhaps we will have to give up reflective consistency, and find other ways to get what we really wanted from it.

Okay, I just spent a couple days on this and here's my best take on Counterfactual Mugging with a logical coin. It mostly agrees with your conclusions, but makes them more precise in interesting ways. I'll use good old proof search because it makes things clearer.

Let's say the agent uses short proofs, the predictor uses medium proofs that are enough to fully simulate the agent (and that fact itself has a short proof), and figuring out the value of the coin requires a long proof (and that fact itself might require a long proof).

def World():
  if Coin()=True and Agent()=True:
    return -100
  else if Coin()=False and there's a medium proof that "Coin()=True implies Agent()=True":
    return 10000
  else:
    return 0

def Agent():
  iterate through all short proofs of the form
    "if Agent()=X, then Coin()=True implies World()=Y and Coin()=False implies World()=Z"
  return the X corresponding to highest Y+Z found

Obviously there's a short proof that the agent returning True would lead to utility 9900, because medium proofs are enough to fully simulate the agent. Can the agent find some better proof corresponding to False, and return False as a result? That's only possible if there's a medium proof that "Coin()=True implies Agent()=True". But if the agent returns False, there's a medium proof that "Agent()=False". The two can be combined into a medium proof that "Coin()=False", which is impossible by assumption. So the agent returns True, QED.
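For readers who want to check the arithmetic, here is a rough runnable sketch of the payoff comparison. It is not the proof-search construction: it assumes the predictor knows the agent's action exactly and that the coin really is hard, so "there's a medium proof that Coin()=True implies Agent()=True" collapses to "the agent returns True". The lowercase names `world` and `agent` are stand-ins introduced for this sketch.

```python
def world(coin: bool, agent_action: bool) -> int:
    """Payoff table from the pseudocode, with the predictor's medium proof
    replaced by direct knowledge of the agent's action (an assumption)."""
    if coin and agent_action:
        return -100
    if not coin and agent_action:
        return 10000
    return 0

def agent() -> bool:
    """Pick the action X with the highest Y + Z, where Y and Z are the
    payoffs in the Coin()=True and Coin()=False branches."""
    return max((True, False), key=lambda x: world(True, x) + world(False, x))

print(agent())  # True
```

Under these assumptions True scores -100 + 10000 = 9900 and False scores 0, matching the 9900 figure above. The sketch says nothing about whether the agent itself can establish those assumptions.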

Seems like a victory for updatelessness, right? But my formalization has a huge problem. Only we know the coin is hard, the agent doesn't know that. It can imagine a world where "Coin()=False" has a medium proof, so "Coin()=True implies Agent()=True" has a medium proof for all agents, so the predictor rewards all agents. Ruling out that possibility is at least as hard as proving bounded consistency up to medium proofs, so it's out of reach for any agent relying on short proofs only. Basically the agent can't build a full picture of counterfactuals. It successfully returns True, but only because it can't figure out what happens in case of False.

That, to me, is the main problem with Counterfactual Mugging and logical coins in general. If an agent is too weak to prove the coin's value, it's certainly too weak to prove that the coin is hard. The agent can't rule out the possibility that the answer is just around the corner, so it can't treat the coin as "distant" from counterfactuals about its own reasoning.

Switching from proof search to logical induction probably won't help. The essential difficulty will remain: using easy reasoning to build the full counterfactual picture requires knowing that the coin won't accidentally fall to easy reasoning, and you cannot know that even if it's true.

Interesting! I'm not seeing why this phenomenon would remain when switching to LI, though. Can't you guess, based on cryptographic assumptions or simple experience, that a particular logical coin is hard? If the situation is set up so that you can't guess this, aren't you right in thinking it might be easy and reasoning as such (giving some probability to the predictor figuring out the coin)?

Maybe you’re right and logical induction can quickly become confident that the coin is hard. In that case, encoding "updatelessness" seems easy. The agent should spend a small fixed amount of time choosing a successor program with high expected utility according to LI, and then run that program. That seems to solve both betting on a logical coin (where it’s best for the successor to compute the coin) and counterfactual mugging (where it’s best for the successor to pay up). Though we don’t know what shape the successor will take in general, and you could say figuring that out is the task of decision theory...

Another problem with this version of "updatelessness" is that if you only run the logical inductor a short amount of time before selecting the policy, the chosen policy could be terrible, since the early belief state of the inductor could be quite bad. It seems as if there's at least something to be said about what it means to make a "good" trade-off between running too long before choosing the policy (so not being updateless enough) and running too short (so not knowing enough to choose policies wisely).


I also believe that mixed-upside updatelessness is more complex than is often presented and I'm planning to delve more into this. In fact, I just posted a question asking why you should pay in Counterfactual Mugging. Have you had any more thoughts about this since?

I still use this distinction in my thinking, but I guess I haven't had any significant payoffs yet in terms of factoring the problem this way and then solving one part or another.

Basically, all-upside cases are cases which you can solve by "updatelessness without uncertainty" -- you don't have to make trade-offs, you just have to recognize the better strategy. This is kind of natural for logical updatelessness (you are still calculating your probabilities, so you don't have probabilities yet, but you can make decisions anyway), but also kind of really unnatural (e.g., I don't know how to make this fit well with logical induction).

Could you elaborate more on why you would describe policy selection as "messiness and compromise?"

One point I don't think I see enough is that policy selection falls naturally out of Savage's theorem. Sure, Savage's theorem has a set that we verbally call the set of "Acts," but anything that obeys the desiderata obeys the theorem! To play the role of an "Act," something just has to be a function from the things we don't control to the set of things we care about. Strategies are functions from the things we don't control to the things we care about, therefore by Savage's theorem rational agents (in Savage's sense) will act as if they're picking the one with highest expected value.

Of course that's a very non-technical way of putting it. In the example of the Absent-minded Driver problem, the problem with stochastic act-based agents is that their preference ordering on individual turns doesn't obey the ordinal independence postulate, which requires that if one act is better than another (like always going straight is better than always turning), it should still be better if in some subset of world-states (e.g. the first intersection, if we're trying to use the set of intersections as the set of states) you did some other act (i.e. always going straight) no matter what - that is, the goodness of applying one "action function" to the "external state" objects in Savage's theorem can't depend on what would happen if the state were different. So in the Absent-minded Driver problem, picking the maximum-EV strategy obeys Savage's theorem, but trying to pick the maximum-EV stochastic action at each intersection does not obey Savage's theorem, because the sets of things you're trying to plug into the theorem don't obey the desiderata.
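As a small illustration of "picking the maximum-EV strategy", here is a sketch that scores each stochastic strategy as a whole. The payoffs (0 for turning at the first intersection, 4 for turning at the second, 1 for never turning) are the usual textbook values and are an assumption here, as are the names; the comment above does not specify them.

```python
from fractions import Fraction

# Assumed payoffs (the usual textbook values, not stated in the comment above).
EXIT_FIRST, EXIT_SECOND, NEVER_EXIT = 0, 4, 1

def strategy_ev(p_straight: Fraction) -> Fraction:
    """EV of the strategy 'go straight with probability p at every intersection'."""
    p = p_straight
    return (1 - p) * EXIT_FIRST + p * (1 - p) * EXIT_SECOND + p * p * NEVER_EXIT

# Compare whole strategies on a coarse grid of probabilities.
candidates = [Fraction(k, 30) for k in range(31)]
best = max(candidates, key=strategy_ev)
print(best, strategy_ev(best))  # 2/3 4/3
```

With these assumed numbers, always going straight (EV 1) beats always turning (EV 0), and the best strategy on the grid mixes between them.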

When we consider logical uncertainty (at least, as formalized by logical induction), policy selection is a messy compromise because it's not clear how much thinking one should do before using what you know so far to select a policy. There's no apparent line in the sand between "too updateless" and "not updateless enough".

I mentioned the example of counterfactual mugging on prime-number factorization. How many heuristics about primality should be considered?

Hm. I basically always think of logical counterfactual mugging through the lens of ordinary counterfactual mugging.

Suppose Omega says "I would have given you $100 if the second prime number was 4, but only if you would have given me $1 otherwise. Can I have a dollar?" UDT tells you whether or not to pay by constructing some stochastic game that the agent is imagined to be playing, and then following a good strategy for that game. But I don't think that it's right to construct this game by using our own logical uncertainty about the second prime number, at some artificial state of logical ignorance. Instead, I think a UDT agent should construct this game by trying to model how Omega chose which mathematical fact to condition on in the first place, using its real ignorance about Omega's initial conditions. (Or not quite its real ignorance - I'm still confused about this part of UDT. Clearly you have to amputate the knowledge that the initial conditions you assign to Omega will lead to it asking you that specific question, but how does this amputation process work precisely? Do the results change if you keep rewinding farther than this, eventually trying to deduce things from the initial conditions of the universe, and if so, how far should you rewind for best results?)

I don't think using the state of ignorance which Omega used to decide is right. The main purpose of UDT is to get reflective consistency. It is the "trivial solution" to the problem of reflective consistency if you choose a policy from the agent's own prior: the agent has no reason to change from UDT, since the policy is already producing the global behavior it would want to change to.

What's the motivation to use Omega's beliefs? From what perspective can we justify that? Yes, doing so causes you to be counterfactually mugged in all cases; but, some counterfactual muggings are just bad deals. You don't want to give Omega $10 in the real world for the sake of $1M in an a priori incredibly unlikely world, just because Omega thought the other world likely. Do you?
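To put a number on "bad deal", here is a minimal sketch of the break-even point, assuming the policy is scored by the agent's own prior: with probability p the agent is in the world where the policy earns the reward, and otherwise it pays the cost. The function names are made up for the illustration.

```python
from fractions import Fraction

def pay_policy_ev(p_reward_world: Fraction, cost: int, reward: int) -> Fraction:
    """Prior EV of the policy 'pay when asked': p * reward - (1 - p) * cost."""
    return p_reward_world * reward - (1 - p_reward_world) * cost

def break_even_prior(cost: int, reward: int) -> Fraction:
    """Prior probability of the reward world at which paying has EV zero."""
    return Fraction(cost, cost + reward)

p_star = break_even_prior(10, 1_000_000)
print(p_star)  # 1/100001
```

So under this scoring, paying $10 for the sake of $1M only comes out ahead if the agent's own prior on the other world exceeds roughly one in a hundred thousand, whatever Omega happened to believe.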


Maybe I'm wrong.

If Omega imagines you in the incredibly unlikely world, with an altered prior to think that world is likely and the real world is incredibly unlikely, you'd like for that version of yourself to give the $10. But, this seems like an unfair problem. You can't generally do well under conditions where Omega arbitrarily edits your brain. If you see this coming you'd like to edit yourself to a DT which does well here, but if you see Omega threaten everything you hold dear and ask you to edit yourself to agent X, you'd edit yourself to agent X (unless X would then be an even worse threat than Omega). Not every case where you want to edit yourself constitutes a reflective inconsistency.

I'm confused about what condition really differentiates this case from regular counterfactual mugging, though. One possible condition is: you are reflectively inconsistent if you regret the decisions made by foreseeable changes in your epistemic state which result from normal operation. If Omega copies you and presents the copy with evidence you yourself currently expect you might see, fine. If Omega copies you and puts a totally different prior in, it's unfair. An AI might self-edit if it sees such a thing coming, but we aren't trying to prevent all self-modifications; the AI should self-modify under certain kinds of external incentives.

On the other hand, maybe there's an interesting DT which bargains with its mind-edited copy based on uncertainty about whether Omega is going to approach asking for $10 in this universe or whether this is a universe with a conditional $1M (before we even know what procedure Omega will implement).

Yeah, actually, that sounds quite likely. In the universe with the conditional $1M, the mind-edited you has control over whether you get the money. In the universe where Omega asks you for $10, you have the same power over the mind-edited copy (the copy thinks -- even though you don't think the copy even exists if Omega asks you for $10). So, this is like cooperating in Prisoner's dilemma.

In other words, it seems like a DT will act more or less as you suggest if it is updateless and has good bargaining properties with agents whose DT is similar. This line of reasoning still speaks against giving Omega the $10 in the case where your prior gives high probability to Omega asking for the $10 and low probability for the other side -- but this makes sense; an Omega like that is a cheapskate not worth dealing with.

Hmm, yeah, I think I was wrong. You probably shouldn't pay for "the second prime is 4," unless Omega is doing some really dramatic brain-editing in its counterfactual simulation. This highlights where the center of the problem is - the problem isn't designing the agent, the problem is understanding, in detail, an Omega that evaluates logical counterfactuals.

But, and this is what tripped me up, Omega's beliefs are also important. Consider the case where Omega is running a search process for complicated false statements. Now you definitely shouldn't pay Omega, even if you're a priori uncertain about the statements Omega mugs you with.

This seems to get into problems with what the decision problem actually is. If you're "updateless" but you already know you're in this particular counterfactual mugging, you may respond much differently than if you're merely considering this as one of many possible muggings. Specifically, in the first case, any Omega is a cheapskate Omega if it chooses something like "P and not P". In the latter case, however, we might know that Omega arrives at absurdities such as this through a fair process which is equally likely to select true sentences. In that case, despite not being updateless enough to be ignorant about "P and not P", we might go along with the mugging as part of a policy which achieves high payout on average.
