## LESSWRONGLW

I'm posting a few research directions in my research agenda about which I haven't written much elsewhere (except maybe in the MIRIx Discord server), and for which I so far haven't got the time to make a full length essay with mathematical details. Each direction is in a separate child comment.

Showing 3 of 4 replies (Click to show all)

This idea was inspired by a discussion with Discord user @jbeshir

Model dynamically inconsistent agents (in particular humans) as having a different reward function at every state of the environment MDP (i.e. at every state we have a reward function that assigns values both to this state and to all other states: we have a reward matrix ). This should be regarded as a game where a different player controls the action at every state. We can now look for value learning protocols that converge to Nash* (or other kind of) equilibrium in this game.

The simpl

4Vanessa Kosoy8moIt is an interesting problem to write explicit regret bounds for reinforcement learning with a prior that is the Solomonoff prior or something similar. Of course, any regret bound requires dealing with traps. The simplest approach is, leaving only environments without traps in the prior (there are technical details involved that I won't go into right now). However, after that we are still left with a different major problem. The regret bound we get is very weak. This happens because the prior contains sets of hypotheses of the form "program template P augmented by a hard-coded bit string b of length n". Such a set contains 2n hypotheses, and its total probability mass is approximately 2−|P|, which is significant for short P (even when n is large). However, the definition of regret requires out agent to compete against a hypothetical agent that knows the true environment, which in this case means knowing both P and b. Such a contest is very hard since learning n bits can take much time for large n. Note that the definition of regret depends on how we decompose the prior into a convex combination of individual hypotheses. To solve this problem, I propose redefining regret in this setting by grouping the hypotheses in a particular way. Specifically, in algorithmic statistics there is the concept of sophistication. The sophistication of a bit string x is defined as follows. First, we consider the Kolmogorov complexity K(x) of x. Then we consider pairs ( Q,y) where Q is a program, y is a bit string, Q(y)=x and |Q|+|y|≤K(x)+O(1). Finally, we minimize over |Q|. The minimal |Q| is called the sophistication of x . For our problem, we are interested in the minimal Q itself: I call it the "sophisticated core" of x. We now group the hypotheses in our prior by sophisticated cores, and define (Bayesian) regret w.r.t. this grouping. Coming back to our problematic set of hypotheses, most of it is now grouped into a single hypothesis, corresponding to the sophisticated core of P.
4Vanessa Kosoy8moA variant of Christiano's IDA amenable to learning-theoretic analysis. We consider reinforcement learning with a set of observations and a set of actions, where the semantics of the actions is making predictions about future observations. (But, as opposed to vanilla sequence forecasting, making predictions affects the environment.) The reward function is unknown and unobservable, but it is known to satisfy two assumptions: (i) If we make the null prediction always, the expected utility will be lower bounded by some constant. (ii) If our predictions sample the n-step future for a given policy π, then our expected utility will be lower bounded by some function F(u,n) of the the expected utility u of π and n. F is s.t. for sufficiently low u, F(u,n)≤u but for sufficiently high u, F(u,n)>u (in particular the constant in (i) should be high enough to lead to an increasing sequence). Also, it can be argued that it's natural to assume F(F(u,n),m)≈F(u,nm) for u>>0. The goal is proving regret bounds for this setting. Note that this setting automatically deals with malign hypotheses in the prior, bad self-fulfilling prophecies and "corrupting" predictions that cause damage just by seeing them. However, I expect that without additional assumptions the regret bound will be fairly weak, since the penalty for making wrong predictions grows with the total variation distance between the prediction distribution and the true distribution, which is quite harsh. I think this reflects a true weakness of IDA (or some versions of it, at least): without an explicit model of the utility function, we need very high fidelity to guarantee robustness against e.g. malign hypotheses. On the other hand, it might be possible to ameliorate this problem if we introduce an additional assumption of the form: the utility function is e.g. Lipschitz w.r.t some metric d. Then, total variation distance is replaced by Kantorovich-Rubinstein distance defined w.r.t. d. The question is, where do we get the m