Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The view of counterfactuals as just conditioning on low-probability events has a lot going for it. To begin with, in a Bayesian setting, updates are done by conditioning. A probability distribution conditioned on some event E (an imaginary update) and the probability distribution after actually seeing E (an actual update) are identical.

There is an issue with conditioning on low-probability events, however. When the event B has a low probability, the conditional probability P(A|B) = P(A and B)/P(B) involves division by a small number, which amplifies noise and small changes in the probability of the conjunction, so estimates of probabilities conditional on lower-probability events are more unstable. The worst-case version of this is conditioning on a zero-probability event, because the probability distribution after conditioning can be literally anything without affecting the original probability distribution. One useful intuition for this is that probabilities conditional on B are going to be less accurate when you've seen very few instances of B occurring, as the sample size is too small to draw strong conclusions.
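To make the instability concrete, here is a minimal simulation sketch (the joint distribution and numbers are invented for illustration): the true value of P(A|B) is 0.9 in both runs, but when B is rare the estimate rests on a handful of samples and scatters much more.

```python
import random

def estimate_conditional(p_b, p_a_given_b, n_samples, seed):
    """Estimate P(A|B) from samples; the estimate divides by the
    (possibly tiny) number of times B actually occurred."""
    rng = random.Random(seed)
    b_count, ab_count = 0, 0
    for _ in range(n_samples):
        b = rng.random() < p_b
        a = rng.random() < (p_a_given_b if b else 0.5)
        b_count += b
        ab_count += a and b
    return ab_count / b_count if b_count else float("nan")

# Same true conditional (0.9) in both cases, but the rare-B estimates
# scatter far more, because they rest on very few observed B's.
for p_b in (0.5, 0.001):
    estimates = [estimate_conditional(p_b, 0.9, 10_000, s) for s in range(5)]
    print(p_b, [round(e, 2) for e in estimates])
```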

However, in the logical inductor setting, it is possible to get around this with infinite exploration in the limit. If you act unpredictably enough to take bad actions with some (very small) probability, then in the limit you'll experiment enough with bad actions to have well-defined conditional probabilities on taking actions you have (a limiting) probability 0 of taking. The counterfactuals of standard conditioning are those where the exploration step occurred, just as the counterfactuals of modal UDT are those where the agent's implicit chicken step went off because it found a spurious proof in a nonstandard model of PA.
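As a toy illustration of why even vanishing exploration suffices (this is a stand-in for the point, not a model of logical inductors): if the exploration probability at step t is 1/t, the exploration event still fires infinitely often in the limit, so the payoff estimate conditional on the "never taken" action has a denominator that keeps growing.

```python
import random

def run_with_exploration(steps, seed=0):
    """Toy agent that would deterministically pick the 'safe' action, but
    explores the 'bad' action with probability 1/t at step t.  Since the
    series 1/t diverges, the bad action is still tried infinitely often in
    the limit, so its estimated payoff (a statistic conditional on an event
    of limiting frequency zero) stays well-defined and converges."""
    rng = random.Random(seed)
    tries, payoff_sum = 0, 0.0
    for t in range(1, steps + 1):
        if rng.random() < 1.0 / t:              # exploration step fires
            tries += 1
            payoff_sum += rng.gauss(-1.0, 0.5)  # noisy payoff of the bad action
    return tries, (payoff_sum / tries if tries else float("nan"))

print(run_with_exploration(1_000_000))  # tries grows ~ log(steps); mean near -1.0
```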

Now, this notion of counterfactuals can have bad effects, because zooming in on the little slice of probability mass where you do X is different from the intuitive notion of counterfacting on doing X. Counterfactually on walking off a cliff, I'd be badly hurt, but conditional on me doing so, I'd probably also have some sort of brain lesion. Similar problems exist with Troll Bridge, and this mechanism is the reason why logical inductors converge to not giving Omega money in a version of Newcomb's problem where Omega can't predict the exploration step. Conditional on them 2-boxing, they are probably exploring in an unpredictable way, which catches Omega unaware and earns more money.
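The cliff example can be made concrete with a tiny toy model (all numbers invented): conditioning on the rare action drags in its likeliest cause, while intervening on the action does not.

```python
# A toy model of the cliff example: a rare brain lesion is by far the
# likeliest cause of ever walking off a cliff.  Conditioning on the rare
# event "walked off the cliff" drags in the lesion, while intervening
# (setting the action by fiat) leaves the lesion at its prior.
P_LESION = 0.001
P_WALK_GIVEN_LESION = 0.5
P_WALK_GIVEN_HEALTHY = 1e-6

p_walk = (P_LESION * P_WALK_GIVEN_LESION
          + (1 - P_LESION) * P_WALK_GIVEN_HEALTHY)

# Conditional probability, via Bayes' rule.
p_lesion_given_walk = P_LESION * P_WALK_GIVEN_LESION / p_walk

# Interventional probability: forcing the walk cuts its dependence on the
# lesion, so the lesion keeps its prior probability.
p_lesion_do_walk = P_LESION

print(round(p_lesion_given_walk, 3))  # ~0.998: conditioning says "probably a lesion"
print(p_lesion_do_walk)               # 0.001: counterfacting doesn't add a lesion
```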

However, no better notion of counterfactuals currently exists, and in fully general environments, this is probably as well as you can do. In multi-armed bandit problems, there are many actions with unknown payoff, and the agent must converge to figuring out the best one. Pretty much all multi-armed bandit algorithms involve experimenting with actions that are worse than baseline, which is a pretty strong clue that exploration into bad outcomes is necessary for good performance in arbitrary environments. If you're in a world that will reward or punish you in arbitrary if-then fashion for selecting any action, then learning the reward given by three of the actions doesn't help you figure out the reward of the fourth action. Also, in a similar spirit to Troll Bridge, if there's a lever that shocks you, but only when you pull it in the spirit of experimentation, and you don't have access to exactly how the lever works, only its external behavior, then it's perfectly reasonable to believe that it just always shocks you (after all, it's done that every other time it was tried).
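For concreteness, here is a minimal epsilon-greedy sketch of the kind of bandit algorithm meant above (a generic textbook scheme, not any particular algorithm from the literature); the point is simply that it deliberately keeps pulling arms it currently believes are bad.

```python
import random

def epsilon_greedy_bandit(true_means, steps=10_000, epsilon=0.05, seed=0):
    """Minimal epsilon-greedy agent for a Bernoulli multi-armed bandit.
    Without the exploration branch it can lock onto a suboptimal arm
    forever; with it, it keeps taking apparently-bad actions on purpose."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    means = [0.0] * len(true_means)
    for _ in range(steps):
        if rng.random() < epsilon:                       # explore a "bad" arm
            arm = rng.randrange(len(true_means))
        else:                                            # exploit current best
            arm = max(range(len(true_means)), key=lambda a: means[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return means, counts

print(epsilon_greedy_bandit([0.2, 0.5, 0.8]))
```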

And yet, despite these arguments, humans can make successful engineering designs operating in realms they don't have personal experience with. And humans don't seem to reason about what-ifs by checking what they think about the probability of A-and-B and the probability of A, and comparing the two. Even when thinking about stuff with medium-high probability, humans seem to reason by imagining some world where the thing is true, and then reasoning about the consequences of the thing. To put it another way, humans are using some notion of counterfactual implication in place of conditional probabilities.

Why can humans do this at all?

Well, physics has the nice property that if you know some sort of initial state, then you can make accurate predictions about what will happen as a result. And these laws have proven their durability under a bunch of strange circumstances that don't typically occur in nature. Put another way, in the multi-armed bandit case, knowing the output of three levers doesn't tell you what the fourth will do, while physics has far more correlation among the various levers/interventions on the environment, so it makes sense to trust the predicted output of pulling a lever you've never pulled before. Understanding how the environment responds to one sequence of actions tells you quite a bit about how things would go if you took some different sequence of actions. (Also, as a side note, conditioning-based counterfactuals work very badly with full trees of actions in sequence, due to combinatorial explosion and the resulting decrease in the probability of any particular action sequence.)
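A toy way to see the contrast (the affine rule and payoffs are invented for illustration): if one compact rule governs all the levers, three observations pin down the fourth lever's payoff, whereas a pure if-then table leaves it completely unconstrained.

```python
# Contrast a "physics-like" environment, where one compact rule governs
# every lever, with a pure if-then table that has no shared structure.
observed = {1: 3.0, 2: 5.0, 3: 7.0}   # lever -> payoff, for levers we've pulled

# Compact hypothesis: payoff is affine in the lever index (two parameters).
# Lever 2's observed payoff is consistent with the fitted rule.
slope = (observed[3] - observed[1]) / (3 - 1)
intercept = observed[1] - slope * 1
predict_lever_4 = slope * 4 + intercept    # 9.0, a confident extrapolation

# If-then table hypothesis: each lever is an independent entry, so the
# observations place no constraint at all on the never-pulled lever 4.
predict_lever_4_table = None

print(predict_lever_4, predict_lever_4_table)
```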

The environment of math (figuring out which algorithms you control when you counterfactually take some action you don't actually take) appears to be intermediate between the case of fully general multi-armed bandit problems and physics, though I'm unsure of this.

Now, to take a detour to Abram's old post on gears. I'll excerpt a specific part.

Here, I'm siding with David Deutsch's account in the first chapter of The Fabric of Reality. He argues that understanding and predictive capability are distinct, and that understanding is about having good explanations. I may not accept his whole critique of Bayesianism, but that much of his view seems right to me. Unfortunately, he doesn't give a technical account of what "explanation" and "understanding" could be.

Well, if you already have maxed-out predictive capability, what extra thing does understanding buy you? What useful thing is captured by “understanding” that isn’t captured by “predictive capability”?

I’d personally put it this way. Predictive capability is how accurate you are about what will happen in the environment. But when you truly understand something, you can use that to figure out actions and interventions to get the environment to exhibit weird behavior that wouldn’t have precedent in the past sequence of events. You "understand" something when you have a compact set of rules and constraints telling you how a change in starting conditions affects some set of other conditions and properties, which feels connected to the notion of a gears-level model.
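A small sketch of what that buys you, using an invented projectile example: a compact rule relating starting conditions to outcomes lets you search over interventions for an outcome with no precedent in your observations, which a bare record of past outcomes cannot do.

```python
import math

# A toy "gears" example (invented numbers): a compact rule relating initial
# conditions to outcomes lets you find an intervention producing a result
# you've never observed.
G = 9.8  # m/s^2

def projectile_range(speed, angle_deg):
    """Range of an ideal projectile launched over flat ground."""
    return speed ** 2 * math.sin(2 * math.radians(angle_deg)) / G

# Even if every launch we've ever seen used a shallow angle, the rule tells
# us which never-tried angle maximizes range at 11 m/s.
best_angle = max(range(1, 90), key=lambda a: projectile_range(11, a))
print(best_angle, round(projectile_range(11, best_angle), 2))  # 45, ~12.35 m
```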

To summarize, conditioning-counterfactuals are very likely the most general type, but when the environment (whether it be physics or math) has the property that the change induced by a different starting condition is describable by a much smaller program than an if-then table for all starting conditions, then it makes sense to call it a "legitimate counterfactual". The notion of there being something beyond epsilon-exploration is closely tied to having compact descriptions of the environment and how it behaves under interventions, instead of the max-entropy prior where you can't say anything confidently about what happens when you take a different action than you normally do, and this also seems closely tied to Abram's notion of a "gears-level model".

There are interesting parallels to this in the AIXI setting. The "models" would be the Turing machines that may be the environment, and the Turing machines are set up such that any action sequence can be input into them and they will behave in some predictable way. This attains the property of accurately predicting the consequences of various action sequences AIXI doesn't take, provided the world it is interacting with is low-complexity, for much the same reason that humans can reason through the consequences of situations they have never encountered using rules that accurately describe the situations they have encountered.
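Here is a drastically simplified sketch of that structure (not AIXI itself; the two toy environment programs and the uniform prior are invented for illustration): because every model in the mixture accepts arbitrary action sequences as input, the posterior gives well-defined predictions even for sequences the agent has never taken.

```python
# Each "environment" is a program mapping an entire action sequence to a
# reward sequence.  The posterior zeroes out programs inconsistent with the
# observed history, and the surviving mixture can then be queried with
# counterfactual action sequences.

def env_simple(actions):          # low-complexity world: reward = action
    return [float(a) for a in actions]

def env_flip(actions):            # another low-complexity world
    return [1.0 - a for a in actions]

MODELS = [(0.5, env_simple), (0.5, env_flip)]   # (prior weight, program)

def posterior(history_actions, history_rewards):
    """Zero out models inconsistent with the observed history, renormalize."""
    weights = [w if env(history_actions) == history_rewards else 0.0
               for w, env in MODELS]
    total = sum(weights)
    return [w / total for w in weights]

def predicted_return(history_actions, history_rewards, future_actions):
    """Expected total reward of a (possibly never-taken) future action sequence."""
    post = posterior(history_actions, history_rewards)
    return sum(p * sum(env(history_actions + future_actions)[len(history_actions):])
               for p, (_, env) in zip(post, MODELS))

# After seeing that action 1 gave reward 1, the mixture confidently predicts
# the consequences of action sequences it has never tried.
print(predicted_return([1], [1.0], [1, 1]))   # 2.0
print(predicted_return([1], [1.0], [0, 0]))   # 0.0
```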

However, if AIXI has some high-probability world (according to the starting distribution) where an action is very dangerous, it will avoid that action, at least until it can rule out that world by some other means. As Leike and Hutter entertainingly show, this "Bayesian paranoia" can make AIXI behave arbitrarily badly, just by choosing the universal Turing machine appropriately so that it assigns high probability to a world where AIXI goes to hell and gets 0 reward forever if it ever takes some particular action.
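A back-of-the-envelope version of the effect (my own rough sketch, not a calculation from the Leike and Hutter paper): suppose the prior puts mass m on the hell-world, never taking the feared action is worth V in every world, and taking it is worth V + g in every world except the hell-world, where it is worth 0. Then AIXI refuses the action whenever

$$(1-m)(V+g) + m \cdot 0 \;<\; V \quad\Longleftrightarrow\quad g \;<\; \frac{m}{1-m}\,V,$$

so even a small prior mass m can veto any action whose extra payoff g is modest compared to the value V already at stake.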

This actually seems acceptable to me. Just don't be born with a bad prior. Or, at least, there may be some self-fulfilling prophecies, but that's better than having exploration into bad outcomes in every world with irreversible traps. In particular, note that exploration steps are reflectively inconsistent, because AIXI (when considering the future) will do worse (according to its current probability distribution over Turing machines) if it uses exploration rather than just following that distribution. AIXI is optimal according to the environment distribution it starts with, while AIXI with exploration is not.

1 comment

This reminds me of the proposal which tsvi posted, in which logical induction predicts logic using actions as input to the traders, much as AIXI predicts observations and rewards by treating actions as input to the environment. This allows predictions to be well-defined in cases which would otherwise be divide-by-zero errors, though without providing any obvious guarantees about what predictions will look like in such cases.