Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

Wiki Contributions

Comments

Oops. What if instead of "for any " we go with "there exists "?

Here’s a plausible human circular preference. You won a prize! Your three options are: (A) 5 lovely plates, (B) 5 lovely plates and 10 ugly plates, (C) 5 OK plates.

No one has done this exact experiment to my knowledge, but plausibly (based on discussion of a similar situation in Thinking Fast And Slow chapter 15) this is a circular preference in at least some people: When people see just A & B, they'll pick B because "it's more stuff, I can always keep the ugly ones as spares or use them for target practice or whatever". When they see just B & C, they'll pick C because "the average quality is higher". When they see just C & A, they'll likewise pick A because "the average quality is higher".

This makes no sense to me. Why would you pick C over B? B Pareto dominates C since it contains 5 lovely plates whereas C only has 5 OK plates.

I propose the axioms A1-A3 together with

B2. If  then for any  we have 
B3. If  and , then for any  we have 

I suspect that these imply C4.

Maybe I am confused by what you mean by . I thought it was the state space, but that isn't consistent with  in your post which was defined over ?

I'm not entirely sure what you mean by the state space.  is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is , not . I slightly abused notation by defining  in the parent comment. Let's say it's  and  is defined by using  to translate the history to the (last) state and then applying .

One more question, this one about the priors: what are they a prior over exactly? ...I ask because the term  will be positive infinity if  is zero for any value where  is non-zero.

The prior is just an environment i.e. a partial mapping  defined on every history to which it doesn't itself assign probability . The expression  means that we consider all possible ways to choose a Polish space probability distributions  and a mapping  s.t.  and  (where the expected value is defined using the Bayes law and not pointwise, see also the definition of "instrumental states" here), and take the minimum over all of them of .

I think that step 6 is supposed to say "from 5 and 3" instead of "from 4 and 1"?

Good idea!

Example 1

Fix some alphabet . Here's how you make an automaton that checks that the input sequence (an element of ) is a subsequence of some infinite periodic sequence with period . For every in , let be an automaton that checks whether the symbols in the input sequences at places s.t. are all equal (its number of states is ). We can modify it to make a transducer that produces its unmodified input sequence if the test passes and if the test fails. It also produces when the input is . We then chain and get the desired automaton. Alternatively, we can connect the in parallel and then add an automaton with boolean inputs that acts as an AND gate. is a valid multi-input automaton in our language because AND is associative and commutative (so we indeed get a functor on the product category).

Notice that the description size of this automaton in our language is polynomial in . On the other hand, a tabular description would be exponential in (the full automaton has exponentially many states). Moreover, I think that any regular expression for this language is also exponentially large.

Example 2

We only talked about describing deterministic (or probabilistic, or monadic) automata. What about nondeterministic? Here is how you can implement a nondeterministic automaton in the same language, without incurring the exponential penalty of determinization, assuming non-free categories are allowed.

Let be some category that contains an object and a morphism s.t. and . For example it can be the closed cartesian category freely generated by this datum (which I think is well-defined). Then, we can simulate a non-deterministic automaton on category by a deterministic transducer from to :

  • The state set is always the one element set (or, it can be two elements: "accept" and "reject").
  • For every state of , we have a variable of signature . This variable is intended to hold when the state is achievable with the current input and otherwise.
  • We use the fact that composition of endomorophisms of behaves as an "OR" operator to implement the transition rules of .

However, this trick doesn't give us nondeterministic transducers. A completely different way to get nondeterminism, which works for transducers as well, is using infra-Bayesian Knightian uncertainty to represent it. We can do via the memo monad approach, but then we get the constraint that nondeterministic transitions happen identically in e.g. identical subgraphs. This doesn't matter for ordinary automata (that process words) but it matters for tree/graph automata. If we don't want this constraint, we can probably do it in a framework that uses "infra-Markov" categories (which I don't know how to define atm, but it seems likely to be possible).

Together with Example 1, this implies that (this version of) our language is strictly more powerful than regular expressions.

Example 3

Suppose we want to take two input sequences (elements of ) and check them for equality. This is actually impossible without the meta-vertex, because of the commutativity properties of category products. With the meta-vertex (the result is not an automaton anymore, we can call it a "string machine"), here is how we can do it. Denote the input sequences and . We apply a transducer to that transforms it into a string diagram. Each symbol in is replaced by a transducer that checks whether the first symbol of its input is and outputs the same sequence with the first symbol removed. These transducers are chained together. In the end of the chain we add a final transducer that checks whether its input is empty. Finally, we use the meta-vertex to feed into this chain of transducers.

Notice that this requires only one level of meta: the string diagram we generate doesn't contain any meta-vertices.

Example 4

More generally, we can simulate any multi-tape automaton using a succinct string machine with one level of meta: First, there is a transducer that takes the tapes as separate inputs and simulates a single step of the multi-tape automaton. Here, moving the tape head forward corresponds to outputting a sequence with the first symbol omitted. Second, there is a transducer that takes the inputs and produces a string diagram that consists of copies of chained together (this transducer just adds another to the chain per every input symbol). Finally, the desired string machine runs and then uses a meta-vertex to apply the diagram produces to the input.

Example 5

A string machine with unbounded meta can simulate a 1D cellular automaton, and hence any Turing machine: First, there is a transducer which simulates a single step of the automaton (this is a classical transducer, not even the stronger version we allow). Second, we already know there is an automaton that tests for equality. Third, fix some automaton whose job is to read the desired output off the final state of the cellular automaton. We modify to make a transducer , which (i) when the inputs are equal, produces a string diagram that consists only of (ii) when the inputs are different, produces this string machine. This string macine applies to the input , then applies to and , then uses the meta-vertex to apply to ...

Except that I cheated here because the description of references a string machine that contains . However, with closed machines such quining can be implemented: instead of producing a string diagram, the transducer produces an morphism that converts transducers to string diagrams, and then we feed the same transducer composed in the same way with itself into this morphism.

Actually, this desideratum makes no sense. Because, if you have two ultra-POMDPs and s.t. is a refinenement of , then they can be true hypotheses simultaneously but in general there is no policy optimal for both (much less robustly optimal). We could require that, if is a true hypothesis then the algorithm converges to a robustly optimal policy for some refinement of . This doesn't imply low regret, but we can make it an independent requirement. However, I'm not sure how useful that is. Let's look at a simple example.

There are two observations and two actions . Seeing observation incurs a loss of and seeing observation incurs a loss of . Taking action incurs a loss of and taking action incurs a loss of . These losses add up straightforwardly. For the maximal ignorance hypothesis , "always take action " is the only robustly optimal policy. However, "take action unless you observed , in which case take action " is also optimal (for sufficiently large). Worse, is robustly optimal for some refinements of : maybe taking action causes you to get observation instead of .

Instead of , we can consider the law which postulates that the environment is oblivious (i.e. the disjunction over the laws generated by all sequences in ). Now is the sole robust policy of any refinement of ! However, to express this as an ultra-POMDP we need to introduce the notion of "Murphy observations" (i.e. Murphy doesn't see the state directly, just like the agent). I'm not sure about the consistency of robustness desideratum in this case.

Intuitively, I'm guessing it is consistent if we allow randomization that is dogmatically unobservable by Murphy but not otherwise. However, allowing such randomization destroys the ability to capture Newcombian scenarios (or maybe it still allows some types of Newcombian scenarios by some more complicated mechanism). One possible interpretation is: Maybe sufficiently advanced agents are supposed to be superrational even in population games. If so, we cannot (and should not) require them to avoid dominated strategies.

More generally, a tentative takeaway is: Maybe the IB agent has to sometimes touch the hot stove, because there is no such thing as "I learned that the hot stove always burns my hand". For a sufficiently rich prior, it is always possible there is some way in which touching the stove beneficially affects your environment (either causally or acausally) so you need to keep exploring. Unless you already have a perfect model of your environment: but in this case you won't touch the stove when pleasantly surprised, because there are no surprises!

Well, it's true that the IB regret bound doesn't imply not touching the hot stove, but. When I think of a natural example of an algorithm satisfying an IB regret bound, it is something like UCB: every once in a while it chooses a hypotheses which seems plausible given previous observations and follows its optimal policy. There is no reason such an algorithm would touch the hot stove, unless there is some plausible hypothesis according to which it is beneficial to touch the hot stove... assuming the optimal policy it follows is "reasonable": see definition of "robustly optimal policy" below.

The interesting question is, can we come up with a natural formal desideratum strictly stronger than IB regret bounds that would rule out touching the hot stone. Some ideas:

  • Let's call a policy "robustly optimal" for an ultra-POMDP if it is the limit of policies optimal for this ultra-POMDP with added noise (compare to the notion of "trembling game equilibrium").
  • Require that your learning algorithm converges to the robustly optimal policy for the true hypothesis. Converging to the exact optimal policy is a desideratum that is studied in the RL theory literature (under the name "sample complexity").
  • Alternatively, require that your learning algorithm converges to the robustly optimal policy for a hypothesis close to the true hypothesis under some metric (with the distance going to 0).
  • Notice that this desideratum implies an upper regret bound, so it is strictly stronger.
  • Conjecture: the robustly Bayes-optimal policy has the property above, or at least has it whenever it is possible to have it.

The problem is that any useful prior must be based on Occam's razor, and Occam's razor + first-person POV creates the same problems as with the universal prior. And deliberately filtering out simulation hypotheses seems quite difficult, because it's unclear to specify it. See also this.

FWIW, I'm a female AI alignment researcher and I never experienced anything even remotely adjacent to sexual misconduct in this community. (To be fair, it might be because I'm not young and attractive; more likely the Bloomberg article is just extremely biased.)

Load More