Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: crystallizing uncertain lore

(this is mostly a writeup of a discussion with Abram, plus some additional thoughts of my own)

Note: Counterlogical Mugging will be used as a term for the counterfactual mugging problem that uses a digit of some mathematical constant.

Lately there’s been a bit more doubt about updatelessness (in full generality instead of the partial version attained by policy selection) as being an achievable desiderata. This came from thinking a bit more about what universal inductors do on the counterlogical mugging problem.

A universal inductor can be thought of as "what happens when you take a logical inductor, but don't show it anything", subject to the extra constraint that it's a probability distribution at every stage. Surprisingly enough, this allows us to find out information about arbitrary consistent theories simply by conditioning. This is theorem 4.7.2, closure under conditioning, which says that for any efficiently computable sequence, the conditional probabilities of sentences will act as a logical inductor relative to a new deductive process that contains the sentences that are being conditioned on. Conditioning substitutes for having an upstream deductive process. Therefore, if you fix some mapping of bitstring places to sentences, and condition on theorems of PA, it acts as a logical inductor over PA.

An interesting thing happens when a sufficiently advanced inductor (advanced enough to know the digit ahead of time) is placed in a counterlogical mugging scenario.

Consider the problem where the agent is using the limit of a universal inductor, , and it recieves/is conditioned on the first digits of the binary expansion of , so the agent is using as its probability distribution. The agent must select a policy , which is just a decision whether to pay the mugging or not when asked. The mugging will occur on a distant digit of , , where is some fast-growing function.

And omega's algorithm is:

if , then if , +2 dollars to agent

if , then if , -1 dollar to agent

Now, let's compute the payoffs for the 2 possible policies.

Now, to begin with, because the binary expansion of has positive probability in the resulting semimeasure, then for a sufficiently long-running universal inductor, if they condition on a sufficiently long initial sequence of digits, they'll assign high probability to the true distant digit.

So, if the initial segment is sufficiently long, and , then the payoff of is ~, so the agent will pay up if asked. If , then the payoff of is ~, so the agent will refuse to pay up.

So a universal inductor limit, when given the problem setting of "omega asks you about a distant digit of a sequence", will decline the logical mugging exactly when omega is paying up in a non-real/very low probability world, and successfully bluff omega into paying up when it actually gets a reward in the real world for it. This is the best outcome! It doesn’t throw away money on unreal worlds, but manages to get (that particular type) of omega to give it money anyways in the real world sometimes. We could think of it as going “well, I know the true digit, so it’s nearly costless to me, in expected value, to adopt a policy that pays up when it sees a digit out of place”. A universal inductor limit can recognize when it’s inside a counterlogical and act accordingly.

Translated out of the logical inductor setting, the agent is essentially going "I'll compute the hard-to-compute-fact, and it'll come out the same way as it does in reality, so I can figure out that I'm in a counterlogical from the inside". Now, intuitively, this doesn't work because, in the counterlogical world where the distant digit is something else, your computation of it would be spoofed, and you wouldn’t know you’re in a counterlogical. Maybe the traders would just make different trades, if the fabric of math was different. However, universal inductors are surprisingly hard to fool in this way, as we'll see.

To take a brief digression, for some real-world algorithms, there are some counterlogicals that just make absolutely no sense. If the algorithm for policy selection, say, used the singular-value decomposition of a matrix, then it doesn't make sense to ask what the algorithm would output in the counterlogical where there is no singular-value-decomposition. Changing basic properties of matrices would alter too many things.

Now, because all poly-time patterns are being enumerated by traders, there are some traders computing using well-known algorithms, some traders computing using algorithms that haven't been invented yet, some computing through homomorphically encrypted algorithms, others approximating the digits by monte-carlo simulation, and others computing by an algorithm that constantly checks itself and if bits are randomly changed in its operation it returns an error message. There can also be families of algorithms that use trades on a little-traded sentence to ensure that the fixed-point price encodes digits of somehow. And there can be another algorithm extracting late digits of that sentence…. There's no clean counterlogical for universal inductors, because the set of all traders is too entangled with all areas of math, just like the matrix example. In order to tell a lie, omega has to lie about other stuff entangled with that, and other stuff entangled with that, until it escalates to all of math.

However, it is possible to imagine omega just picking out the most obvious invocations of that it can find by some means, and leaving the rest alone, and running things forward. Then there is a spoofed prior where some computations output something they don’t.

Now, if the agent knows the distant digit of , and knows how omega will alter the probability distribution, the best move is to self-modify to a decision theory that pays up exactly when the probability distribution is spoofed in a way indicative of omega simulating it.

This only applies if omega is going to simulate the agent in the counterlogical after it self-modifies. If omega goes “ah, in the world where the digit of is 0, the agent will see this before the game gets underway, and self-modify to refuse to pay up", and in the true world, the relevant digit of is 1, then the agent will get no money.

However, counterlogical mugging seems to fall into the category of omega rewarding or penalizing the ritual of cognition the agent does, instead of what the agent actually does.

To be more specific, imagine an inductor A where most of the trader mass is comprised of traders which compute in an obscured way that omega's counterlogical surgery can't pick up. And imagine another inductor B where most of the trader mass is comprised of traders that omega's counterlogical surgery changes the behavior of. Then for the first trader, omega would go "in the world where the digit of is 0, this inductor still assigns high probability to the digit of being 1, for some reason, and it would pay up". And for the second trader, omega would go "in the world where the digit of is 0, this inductor assigns high probability to the digit of being 0, so it wouldn't pay up"

Back in the real world (assuming the digit of is 1), inductor A gets rewarded with 1 dollar, and inductor B gets rewarded with nothing, due to details of the cognition of A and B which affected what omega thought they would do in the counterlogical. This indicates that it's just an unfair problem, and sufficiently hard cases of mixed-upside updatelessness can be solved by policy selection/self modification, but not in a really principled way, because counterlogicals don't exist.

So, to summarize, updatelessness for a universal inductor is quite tricky to define, because it is secretly using the “true properties of reality” in a way that can’t be cleanly isolated. And if omega is going through and tweaking the traders so they output something other than what they output, then omega is basically tampering with the agent’s prior! It’s unreasonable to expect the agent to adopt a policy that works even in cases where it is adapated to have an unreasonable probability distribution by some external agent. (put another way, if something is simulating what you would do if you had a probability distribution that thought the moon was going to crash into earth, and making decisions based on that, then you probably can't control that version of yourself)

#Learning as Much as Possible

Why were we interested in mixed-upside updatelessness in the first place? Reflective consistency. The standard story goes something like this. "If you have an agent that is learning about the world over time, there's an early self who is uncertain about which way things will go. It can be thought of as having a caring measure (probability x utility) over impossible possible worlds, and as it learns more, this changes, so in cases where one possible world can influence another, there can be cases where the future self of the agent would take actions that are predictably bad from the perspective of the current agent, so it would self-modify to remove this property."

I have a few doubts about the "caring measure" picture of policy selection. The first doubt is that, for an agent that doesn't know their own decision in advance, the "updateless caring measure" actually does change, but only once. (when the agent imagines selecting a certain policy, it's setting the caring measure of all the worlds where it selects a different policy to 0, and renormalizing). This feels hacky and inelegant. It feels like there are more issues with this picture of updatelessness, but it's a topic for another day.

The reflective consistency argument actually isn't quite as forceful as it seems at first glance, given a certain condition on the impossible possible worlds. We'll go through a concrete example first, and then generalize it a bit.

Consider the problem where there is an auxiliary bit. If the auxiliary bit is 1, the next bit is 1 with 75% probability. If it's 0, the next bit is 0 with 75% probability. The agent knows it will be going through a counterfactual mugging (+2 dollars on 1 if you give me 1 dollar on 0) on the second bit, but it is given the option to peek at the auxiliary bit.

It's to the agent's benefit if it looks at the auxiliary bit! If we consider a setting where the agent pays up 1 dollar to gain 2 dollars, the policy of just unconditionally paying up gets dollars. If we consider the policy where the agent refuses to pay if the auxiliary bit is 0, and pays up if the auxiliary bit is 1 (and a 0 shows up), then this policy earns dollars.

This only applies if, in omega's counterfactual, the auxiliary bit came up the same way as it did in reality. If omega's counterfactual (where the final bit is 0) has the auxiliary bit being 0, the best move is to just select a policy before seeing the auxiliary bit.

Time to generalize.

Let's say we've got a space of information equipped with some probability measure , another space of states equipped with the probability measure , and there's some space of actions , and the agent is choosing a partial policy of type , the set of all of them is , and the agent gets a reward .

There's a key assumption that's implicit in this problem setup. Because the reward only depends on the partial policy, and the event in Y, this forbids counterfactual muggings where the counterfactual involves the event from X coming out differently. X can be intuitively thought of as the space of information that will be assumed to hold in some problem where the agent is rewarded or not, based on what it would do under various outcomes selected from . So, in the above counterfactual mugging example, would be the state of the auxiliary bit, and would be the state of the bit that the agent is actually being counterfactually mugged with.

(this is just because either , or there's a that gets an even higher reward)

Therefore, the expected value of sample information is always nonnegative. This is just the usual "value of information is nonnegative" proof, but swapping out the space of actions for the space of partial policies. Because there's an assumption that we'll only face counterfactual muggings on information in , the optimal policy can be found by just looking at the information from and doing policy selection over accordingly.

So, for all the math features that omega's counterlogical preserves, this seems to imply (the theorem doesn't quite apply to the logical induction setting because it used classical probability measures) that the agent will try to update as far as it possibly can on information it won't be penalized for, before doing policy selection.

Due to the fact that advanced agents will try their best to fool omega into concluding that they paid up, whether by self-modification, or trying to reason that they're in a counterlogical from the inside, and the fact that these counterlogicals don't actually exist but are dependent on the implementation of the agent, and the fact that the best policy is for the agent to update on everything that will remain the same inside the predictor's what-if... It seems to indicate that problems involving logical updatelessness have more to do with controlling the computation that the predictor is running, instead of caring about mathematically impossible worlds.

New Comment
3 comments, sorted by Click to highlight new comments since:

"if they condition on a sufficiently long initial sequence of digits, they’ll assign high probability to the true distant digit."

I might be missing something, but this seems to be wrong? If the distant digit is sufficiently distant to be not efficiently computable, then obviously the logical inductor will not be able to predict it. The whole idea of logical counterfactual mugging is considering questions which the agent cannot solve in advance.

Counterfactual mugging doesn't require spoofing. Consider the following problem:

Suppose no one, given steps of computation, is able to compute any information about the parity of the th digit of , and everyone, given steps of computation, is able to compute the th digit of . Suppose that at time , everyone has steps of computation, and at a later time , everyone has steps of computation. At the initial time , Omega selects a probability equal to the conditional probability Omega assigns to the agent paying $1 at time conditional on the digit being odd. (This could be because Omega is a logical inductor, or because Omega is a CDT agent whose utility function is such that selecting this value of is optimal). At time , if the digit is even, a fair coin with probability of coming up heads is flipped, and if it comes up heads, Omega pays the agent $10. If instead the digit is odd, then the agent has the option of paying Omega $1.

This contains no spoofing, and the optimal policy for the agent is to pay up if asked.

What do you mean by "in full generality instead of the partial version attained by policy selection"?