## LESSWRONGLW

Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

# Wiki Contributions

Maybe I am confused by what you mean by . I thought it was the state space, but that isn't consistent with  in your post which was defined over ?

I'm not entirely sure what you mean by the state space.  is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is , not . I slightly abused notation by defining  in the parent comment. Let's say it's  and  is defined by using  to translate the history to the (last) state and then applying .

One more question, this one about the priors: what are they a prior over exactly? ...I ask because the term  will be positive infinity if  is zero for any value where  is non-zero.

The prior is just an environment i.e. a partial mapping  defined on every history to which it doesn't itself assign probability . The expression  means that we consider all possible ways to choose a Polish space probability distributions  and a mapping  s.t.  and  (where the expected value is defined using the Bayes law and not pointwise, see also the definition of "instrumental states" here), and take the minimum over all of them of .

I think that step 6 is supposed to say "from 5 and 3" instead of "from 4 and 1"?

Good idea!

## Example 1

Fix some alphabet . Here's how you make an automaton that checks that the input sequence (an element of ) is a subsequence of some infinite periodic sequence with period . For every in , let be an automaton that checks whether the symbols in the input sequences at places s.t. are all equal (its number of states is ). We can modify it to make a transducer that produces its unmodified input sequence if the test passes and if the test fails. It also produces when the input is . We then chain and get the desired automaton. Alternatively, we can connect the in parallel and then add an automaton with boolean inputs that acts as an AND gate. is a valid multi-input automaton in our language because AND is associative and commutative (so we indeed get a functor on the product category).

Notice that the description size of this automaton in our language is polynomial in . On the other hand, a tabular description would be exponential in (the full automaton has exponentially many states). Moreover, I think that any regular expression for this language is also exponentially large.

## Example 2

We only talked about describing deterministic (or probabilistic, or monadic) automata. What about nondeterministic? Here is how you can implement a nondeterministic automaton in the same language, without incurring the exponential penalty of determinization, assuming non-free categories are allowed.

Let be some category that contains an object and a morphism s.t. and . For example it can be the closed cartesian category freely generated by this datum (which I think is well-defined). Then, we can simulate a non-deterministic automaton on category by a deterministic transducer from to :

• The state set is always the one element set (or, it can be two elements: "accept" and "reject").
• For every state of , we have a variable of signature . This variable is intended to hold when the state is achievable with the current input and otherwise.
• We use the fact that composition of endomorophisms of behaves as an "OR" operator to implement the transition rules of .

However, this trick doesn't give us nondeterministic transducers. A completely different way to get nondeterminism, which works for transducers as well, is using infra-Bayesian Knightian uncertainty to represent it. We can do via the memo monad approach, but then we get the constraint that nondeterministic transitions happen identically in e.g. identical subgraphs. This doesn't matter for ordinary automata (that process words) but it matters for tree/graph automata. If we don't want this constraint, we can probably do it in a framework that uses "infra-Markov" categories (which I don't know how to define atm, but it seems likely to be possible).

Together with Example 1, this implies that (this version of) our language is strictly more powerful than regular expressions.

## Example 3

Suppose we want to take two input sequences (elements of ) and check them for equality. This is actually impossible without the meta-vertex, because of the commutativity properties of category products. With the meta-vertex (the result is not an automaton anymore, we can call it a "string machine"), here is how we can do it. Denote the input sequences and . We apply a transducer to that transforms it into a string diagram. Each symbol in is replaced by a transducer that checks whether the first symbol of its input is and outputs the same sequence with the first symbol removed. These transducers are chained together. In the end of the chain we add a final transducer that checks whether its input is empty. Finally, we use the meta-vertex to feed into this chain of transducers.

Notice that this requires only one level of meta: the string diagram we generate doesn't contain any meta-vertices.

## Example 4

More generally, we can simulate any multi-tape automaton using a succinct string machine with one level of meta: First, there is a transducer that takes the tapes as separate inputs and simulates a single step of the multi-tape automaton. Here, moving the tape head forward corresponds to outputting a sequence with the first symbol omitted. Second, there is a transducer that takes the inputs and produces a string diagram that consists of copies of chained together (this transducer just adds another to the chain per every input symbol). Finally, the desired string machine runs and then uses a meta-vertex to apply the diagram produces to the input.

## Example 5

A string machine with unbounded meta can simulate a 1D cellular automaton, and hence any Turing machine: First, there is a transducer which simulates a single step of the automaton (this is a classical transducer, not even the stronger version we allow). Second, we already know there is an automaton that tests for equality. Third, fix some automaton whose job is to read the desired output off the final state of the cellular automaton. We modify to make a transducer , which (i) when the inputs are equal, produces a string diagram that consists only of (ii) when the inputs are different, produces this string machine. This string macine applies to the input , then applies to and , then uses the meta-vertex to apply to ...

Except that I cheated here because the description of references a string machine that contains . However, with closed machines such quining can be implemented: instead of producing a string diagram, the transducer produces an morphism that converts transducers to string diagrams, and then we feed the same transducer composed in the same way with itself into this morphism.

Actually, this desideratum makes no sense. Because, if you have two ultra-POMDPs and s.t. is a refinenement of , then they can be true hypotheses simultaneously but in general there is no policy optimal for both (much less robustly optimal). We could require that, if is a true hypothesis then the algorithm converges to a robustly optimal policy for some refinement of . This doesn't imply low regret, but we can make it an independent requirement. However, I'm not sure how useful that is. Let's look at a simple example.

There are two observations and two actions . Seeing observation incurs a loss of and seeing observation incurs a loss of . Taking action incurs a loss of and taking action incurs a loss of . These losses add up straightforwardly. For the maximal ignorance hypothesis , "always take action " is the only robustly optimal policy. However, "take action unless you observed , in which case take action " is also optimal (for sufficiently large). Worse, is robustly optimal for some refinements of : maybe taking action causes you to get observation instead of .

Instead of , we can consider the law which postulates that the environment is oblivious (i.e. the disjunction over the laws generated by all sequences in ). Now is the sole robust policy of any refinement of ! However, to express this as an ultra-POMDP we need to introduce the notion of "Murphy observations" (i.e. Murphy doesn't see the state directly, just like the agent). I'm not sure about the consistency of robustness desideratum in this case.

Intuitively, I'm guessing it is consistent if we allow randomization that is dogmatically unobservable by Murphy but not otherwise. However, allowing such randomization destroys the ability to capture Newcombian scenarios (or maybe it still allows some types of Newcombian scenarios by some more complicated mechanism). One possible interpretation is: Maybe sufficiently advanced agents are supposed to be superrational even in population games. If so, we cannot (and should not) require them to avoid dominated strategies.

More generally, a tentative takeaway is: Maybe the IB agent has to sometimes touch the hot stove, because there is no such thing as "I learned that the hot stove always burns my hand". For a sufficiently rich prior, it is always possible there is some way in which touching the stove beneficially affects your environment (either causally or acausally) so you need to keep exploring. Unless you already have a perfect model of your environment: but in this case you won't touch the stove when pleasantly surprised, because there are no surprises!

Well, it's true that the IB regret bound doesn't imply not touching the hot stove, but. When I think of a natural example of an algorithm satisfying an IB regret bound, it is something like UCB: every once in a while it chooses a hypotheses which seems plausible given previous observations and follows its optimal policy. There is no reason such an algorithm would touch the hot stove, unless there is some plausible hypothesis according to which it is beneficial to touch the hot stove... assuming the optimal policy it follows is "reasonable": see definition of "robustly optimal policy" below.

The interesting question is, can we come up with a natural formal desideratum strictly stronger than IB regret bounds that would rule out touching the hot stone. Some ideas:

• Let's call a policy "robustly optimal" for an ultra-POMDP if it is the limit of policies optimal for this ultra-POMDP with added noise (compare to the notion of "trembling game equilibrium").
• Require that your learning algorithm converges to the robustly optimal policy for the true hypothesis. Converging to the exact optimal policy is a desideratum that is studied in the RL theory literature (under the name "sample complexity").
• Alternatively, require that your learning algorithm converges to the robustly optimal policy for a hypothesis close to the true hypothesis under some metric (with the distance going to 0).
• Notice that this desideratum implies an upper regret bound, so it is strictly stronger.
• Conjecture: the robustly Bayes-optimal policy has the property above, or at least has it whenever it is possible to have it.

The problem is that any useful prior must be based on Occam's razor, and Occam's razor + first-person POV creates the same problems as with the universal prior. And deliberately filtering out simulation hypotheses seems quite difficult, because it's unclear to specify it. See also this.

FWIW, I'm a female AI alignment researcher and I never experienced anything even remotely adjacent to sexual misconduct in this community. (To be fair, it might be because I'm not young and attractive; more likely the Bloomberg article is just extremely biased.)

Three remarks:

• "I might as well celebrate by touching the hot stove" is a misinterpretation of what happens. The reason the agent might touch the hot stove is because it's exploring, i.e. investigating hypotheses according to which touching the hot stove is sometimes good. This is obviously necessary if you want to have learnability. The "hot stove" is just an example chosen to predispose us to interpret the behavior as incorrect. But, for example, scientists playing with chemicals might have also looked stupid to other people in the past, and yet it let to discovering chemistry and creating fertilizers, medications, plastics etc.
• The existence of a prior that's simultaneously rich and tractable seems to me like a natural assumption. Mathematically, the bounded Solomonoff prior "almost" works: it is "only" intractable because [1] i.e. for some very complicated reason that might easily fail in some variation, and in a universe with oracles it just works. And, emprically we know that simple algorithms (deep learning) can have powerful cross-domain abilities. It seems very likely there is a mathematical reason for it, which probably takes the form of such a prior.
• Another way to effectively consider a larger space of hypotheses is Turing RL, where you're essentially representing hypotheses symbolically: something that people certainly seem to use quite effectively.

1. Roughly speaking, more precisely it seems to be related to average-case complexity separations (e.g. one-way functions). ↩︎

I have higher expectations from learning agents - that they learn to solve such problems despite the difficulties.

I'm saying that there's probably a literal impossibility theorem lurking there.

But, after reading my comment above, my spouse Marcus correctly pointed out that I am mischaracterizing IBP. As opposed to IBRL, in IBP, pseudocausality is not quite the right fairness condition. In fact, in a straightforward operationalization of repeated full-box-dependent transparent Newcomb, an IBP agent would one-box. However, there are more complicated situations where it would deviate from full-fledged UDT.

Example 1: You choose whether to press button A or button B. After this, you play Newcomb. Omega fills the box iff you one-box both in the scenario in which you pressed button A and in the scenario in which you pressed button B. Random is not allowed. A UDT agent will one-box. An IBP agent might two-box because it considers the hypothetical in which it pressed a button different from what it actually intended to press to be "not really me" and therefore unpredictable. (Essentially, the policy is ill-defined off-policy.)

Example 2: You see either a green light or a red light, and then choose between button A and button B. After this, you play Newcomb. Omega fills the box iff you either one-box after seeing green and pressing A or one-box after seeing green and pressing B. However, you always see red. A UDT agent will one-box if it saw the impossible green and two-box if it saw red. An IBP agent might two-box either way, because if it remembers seeing green then it decides that all of its assumptions about the world need to be revised.

You don't need the Nirvana trick if you're using homogeneous or fully general ultracontributions and you allow "convironments" (semimeasure-environments) in your notion of law causality. Instead of positing a transition to a "Nirvana" state, you just make the transition kernel vanish identically in those situations.

However, this is a detail, there is a more central point that you're missing. From my perspective, the reason Newcomb-like thought experiments are important is because they demonstrate situations in which classical formal approaches to agency produce answers that seems silly. Usually, the classical approaches examined in this context are CDT and EDT. However, CDT and EDT are both too toyish for this purpose, since they ignore learning and instead assume the agent already knows how the world works, and moreover this knowledge is represented in the preferable form of the corresponding decision theory. Instead, we should be thinking about learning agents, and the classical framework for those is reinforcement learning (RL). With RL, we can operationalize the problem thus: if a classical RL agent is put into an arbitrary repeated[1] Newcomb-like game, it fails to converge to the optimal reward (although it does success for the original Newcomb problem!)

On the other hand, an infra-Bayesian RL agent provably does converge to optimal reward in those situations, assuming pseudocausality. Ofc IBRL is just a desideratum, not a concrete algorithm. But examples like Tian et al and my own upcoming paper about IB bandits show that there are algorithms with reasonably good IB regret bounds for natural hypotheses classes. While an algorithm with a good regret bound[2] for ultra-POMDPs[3] has not yet been proposed, it seems very like that it exists.

Now, about non-pseudocausal scenarios (such as noiseless transparent Newcomb). While this is debatable, I'm leaning towards the view that we actually shouldn't expect agents to succeed there. This became salient to me when looking at counterfactuals in infra-Bayesian physicalism. [EDIT: actually, things are different in IBP, see comment below.] The problem with non-pseudocausal updatelessness is that you expect the agent to follow the optimal policy even after making an observation that, according to the assumptions, can never happen, not even with low probability. This sounds like it might make sense when viewing an individual problem, but in the context of learning it is impossible. Learning requires that an agent that sees an observation which is impossible according to hypothesis H, discards hypothesis H and acts on the other hypotheses in has. There is being updateless, and then there is being too updateless :)

Scott Garrabrant wrote somewhere recently that there is tension between Bayesian updates and reflective consistency, and that he thinks reflective consistency is so important that we should sacrifice Bayesian updates. I agree that there is tension, and that reflective consistency is really important, and that Bayesian updates should be partially sacrificed, but it's possible to take this too far. In Yudkowsky's original paper on TDT he gives the example of an alphabetizing agent as something that can be selected for by certain decision problems. Ofc this doesn't prove non-alphabetizing is irrational. He argues that we need some criterion of "fairness" to decide which decisions problem count. I think that pseudocausality should be a part of the fairness criterion, because without that we don't get learnability[4]: and learnability is so important that I'm willing to sacrifice reflective consistency in non-pseudocausal scenarios!

1. Instead of literal repetition, we could examine more complicated situations where information accumulates over time so that the nature of the game can be confidently inferred in the limit. But, the principle is the same. ↩︎

2. If you don't care about the specific regret bound then it's easy to come up with an algorithm based on Exp3, but that's just reducing the problem to blind trial and error of different policies, which is missing the point. The point being, the ability to exploit regularities in the world which also applies to Newcomb-like scenarios. ↩︎

3. You need ultra-POMDPs to model e.g. counterfactual mugging. Even ordinary POMDPs have been relatively neglected in the literature, because the control problem is PSPACE-hard. Dealing with that is an interesting question, but it seems orthogonal to the philosophical issues that arise from Newcomb. ↩︎

4. Although there is still room for a fairness criterion weaker than pseudocausality but stronger that the imagined fairness criterion of UDT. ↩︎