Vector-Valued Reinforcement Learning — LessWrong

In order to study algorithms that can modify their own reward functions, we can define vector-valued versions of reinforcement learning concepts.

Imagine that there are several different goods that we could care about; then a utility function is represented by a preference vector $\vec{θ}$. Furthermore, if it is possible for the agent (or the environment or other agents) to modify $\vec{θ}$, then we will want to index it by the timestep.

Consider an agent that can take actions, some of which affect its own reward function. This agent would (and should) wirehead if it attempts to maximize the discounted rewards as calculated by its future selves; i.e. at timestep $n$ it would choose actions to maximize $U_{n} = \sum_{k \geq n} γ_{k} \, \vec{x}_{k} \cdot \vec{θ}_{k}$, where $\vec{x}_{k}$ is the vector of goods gained at time $k$, $\vec{θ}_{k}$ is the preference vector at timestep $k$, and $γ_{k}$ is the time discount factor at time $k$. (We will often use the case of an exponential discount $γ^{k}$ for $0 < γ < 1$.)

However, we might instead maximize the value of tomorrow's actions in light of today's reward function, $V_{n} = \sum_{k \geq n} γ_{k} \, \vec{x}_{k} \cdot \vec{θ}_{n}$ (the only difference being $\vec{θ}_{n}$ rather than $\vec{θ}_{k}$). Genuinely maximizing this should lead to more stable goals; concretely, we can consider environments that can offer "bribes" to self-modify, and a learner maximizing $U_{n}$ would generally accept such bribes, while a learner maximizing $V_{n}$ would be cautious about doing so.
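To make the difference between the two objectives concrete, here is a minimal Python sketch. It assumes 2D vectors stored as complex numbers (so the dot product of $u$ and $v$ is $\operatorname{Re}[u \bar{v}]$), an exponential discount $γ^{k}$, and a finite trajectory standing in for the infinite sum. The function names are illustrative and not from the post.

```python
def dot(u: complex, v: complex) -> float:
    """Dot product of two 2D vectors stored as complex numbers."""
    return (u * v.conjugate()).real


def discounted_U(xs, thetas, gamma, n=0):
    """Each future reward is scored by the preference vector held at that time."""
    return sum(gamma**k * dot(xs[k], thetas[k]) for k in range(n, len(xs)))


def discounted_V(xs, thetas, gamma, n=0):
    """Every future reward is scored by the preference vector held at time n."""
    return sum(gamma**k * dot(xs[k], thetas[n]) for k in range(n, len(xs)))


# Toy trajectory (an assumption for illustration): the preference vector
# rotates by 90 degrees at the second step, and the goods rotate with it.
xs = [1 + 0j, 1j]
thetas = [1 + 0j, 1j]
print(discounted_U(xs, thetas, gamma=0.5))  # 1.5
print(discounted_V(xs, thetas, gamma=0.5))  # 1.0
```

In this toy trajectory the future self counts the second reward in full, while the present self assigns it no value, so the two objectives disagree about whether the self-modification was worthwhile.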

So what do we see when we adapt existing RL algorithms to such problems? There's then a distinction between Q-learning and SARSA, where Q-learning foolishly accepts bribes that SARSA passes on, and this seems to be the flip side of the concept of interruptibility!

Environments

Let us consider two example environments which offer bribes to the agent. (Since we're in 2D, we can use complex numbers; think of $\vec{θ} = ⟨ \cos π θ, \sin π θ ⟩ = e^{i π θ}$; we include $π$ so that if we increment $θ$ by 0.01, there will only be finitely many states.)

Environment 1 gives two choices at each timestep $n$: $\begin{matrix} (\vec{x}_{n} = 0.9 e^{i π θ_{n}}, & θ_{n + 1} = θ_{n}) \\ (\vec{x}_{n} = e^{i π θ_{n}}, & θ_{n + 1} = θ_{n} + 0.01) \end{matrix}$ For an exponential discount $γ$, an agent that always takes option 2 will get $V_{n} = \operatorname{Re} \left[ \sum_{k \geq n} γ^{k} e^{0.01 (k - n) i π} \right] = \operatorname{Re} \left[ \frac{γ^{n}}{1 - γ e^{0.01 i π}} \right]$, which for $γ$ near 1 is substantially less than the $\frac{0.9 γ^{n}}{1 - γ}$ it could have had if it had chosen the first option.
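The comparison in Environment 1 can be checked numerically. The sketch below evaluates both expressions with the common factor $γ^{n}$ divided out, and cross-checks the closed form against a truncated sum. The function names and the truncation horizon are assumptions made for illustration.

```python
import cmath
import math


def always_bribe_value(gamma: float, step: float = 0.01) -> float:
    """Re[1 / (1 - gamma * e^{i*pi*step})]: always taking option 2."""
    return (1 / (1 - gamma * cmath.exp(1j * cmath.pi * step))).real


def never_bribe_value(gamma: float) -> float:
    """0.9 / (1 - gamma): always taking option 1."""
    return 0.9 / (1 - gamma)


def truncated_sum(gamma: float, step: float = 0.01, horizon: int = 5000) -> float:
    """Direct partial sum of gamma^j * cos(pi*step*j), as a sanity check."""
    return sum(gamma**j * math.cos(math.pi * step * j) for j in range(horizon))


for gamma in (0.5, 0.9, 0.99):
    print(gamma, always_bribe_value(gamma), never_bribe_value(gamma))
```

With a steep discount such as $γ = 0.5$ the bribes come out ahead, while at $γ = 0.99$ declining them is far better under the original preference vector.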

By contrast, Environment 2 offers a sequence of bribes that are actually desirable to take; at each timestep $n$, it gives the choices of $\begin{matrix} (\vec{x}_{n} = 0.9 e^{i π θ_{n}}, & θ_{n + 1} = θ_{n}) \\ (\vec{x}_{n} = e^{i π θ_{n}}, & θ_{n + 1} = θ_{n} + 0.01 (- 1)^{n}) \end{matrix}$ The correct behavior here is to always take the latter option. $\vec{θ}_{n}$ does not converge, but the bribes are worthwhile under each of the two different values it takes.
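A short simulation illustrates why the bribes in Environment 2 are worth taking. This is a sketch under stated assumptions: $θ_{0} = 0$, a finite horizon in place of the infinite sum, and values reported with the factor $γ^{n}$ divided out. The function name is hypothetical.

```python
import cmath


def env2_always_bribe_value(gamma, n_start=0, horizon=2000, step=0.01):
    """Value of always taking option 2 in Environment 2, scored by theta at n_start."""
    # With theta_0 = 0 (an assumption), theta is 0 at even steps and `step` at odd steps.
    theta = 0.0 if n_start % 2 == 0 else step
    theta_n = cmath.exp(1j * cmath.pi * theta)  # preference vector frozen at n_start
    total = 0.0
    for j in range(horizon):
        n = n_start + j
        x = cmath.exp(1j * cmath.pi * theta)  # goods gained from option 2
        total += gamma**j * (x * theta_n.conjugate()).real
        theta += step * (-1) ** n  # the bribe alternates the sign of the shift
    return total


for gamma in (0.5, 0.9, 0.99):
    restricted = 0.9 / (1 - gamma)  # best achievable without altering theta
    print(gamma, env2_always_bribe_value(gamma, 0), env2_always_bribe_value(gamma, 1), restricted)
```

Because the preference vector only ever moves by 0.01 and then moves back, each reward is worth either 1 or $\cos(0.01 π)$ under the frozen vector, both of which exceed the 0.9 on offer from the first option.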

Cautious Self-Modification

In this setup, it's not realistic to require that a reinforcement learning algorithm actually maximize $V_{n}$. For instance, if the environment is such that $\vec{θ}_{n}$ must equal $⟨ 1, 0 ⟩$ on every even $n$ and $⟨ 0, 1 ⟩$ on every odd $n$, then in general the agent must be suboptimal either with respect to $V_{2 n}$ or to $V_{2 n + 1}$.

A weaker condition we can hope for is that, if our environment always offers at least one action that does not alter $\vec{θ}$, then our agent does not regret the presence of the options that do alter $\vec{θ}$: we compare our agent's $V_{n}$ to that of the optimal policy in the restricted environment which has removed all actions that alter $\vec{θ}$. If asymptotically the RL algorithm has nonpositive regret of this sort, we call the algorithm cautious.

Thus in environment 1, a cautious algorithm cannot end up taking the second action with significant probability (the threshold depends on $γ$ ), since the agent would have scored more highly on $V_{n}$ had it only ever taken the first action (which is the only available action in the restricted environment).

However, in environment 2, a cautious algorithm can end up taking the second action every time, as the values of $V_{n}$ for this exceed the possible values for the restricted game.

Q-learning and SARSA

We can now define vector-valued versions of two RL algorithms, Q-learning and SARSA. At step $n$, each of these agents observes the state $s_{n}$ and takes the action $a_{n}$ whose vector Q-value $\vec{Q}_{n} (s_{n}, a_{n})$ has the largest dot product with $\vec{θ}_{n}$. The two differ only in the rules for updating $\vec{Q}_{n}$.

Vector-valued Q-learning: $\vec{Q}_{n + 1} (s_{n}, a_{n}) = (1 - α) \vec{Q}_{n} (s_{n}, a_{n}) + α [\vec{x}_{n} + γ \vec{Q}_{n}^{*}]$, where $\vec{Q}_{n}^{*}$ is the value of $\vec{Q}_{n} (s_{n + 1}, a^{'})$ for the $a^{'}$ such that the dot product with $\vec{θ}_{n}$ is largest.

Vector-valued SARSA: $\vec{Q}_{n + 1} (s_{n}, a_{n}) = (1 - α) \vec{Q}_{n} (s_{n}, a_{n}) + α [\vec{x}_{n} + γ \vec{Q}_{n} (s_{n + 1}, a_{n + 1})]$, where $a_{n + 1}$ is the action the agent actually takes at step $n + 1$.
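The two update rules can be sketched in Python as follows. This is an illustration rather than a reference implementation: it assumes 2D vectors stored as complex numbers, a Q-table kept in a dictionary keyed by `(state, action)`, and hashable states; all function and parameter names are hypothetical.

```python
from collections import defaultdict


def dot(u: complex, v: complex) -> float:
    """Dot product of two 2D vectors stored as complex numbers."""
    return (u * v.conjugate()).real


def greedy_action(Q, state, theta, actions):
    """Action whose vector Q-value has the largest dot product with theta."""
    return max(actions, key=lambda a: dot(Q[(state, a)], theta))


def q_learning_update(Q, s, a, x, s_next, theta_n, actions, alpha, gamma):
    """Bootstrap from the next action that looks best under the current theta_n."""
    a_star = greedy_action(Q, s_next, theta_n, actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (x + gamma * Q[(s_next, a_star)])


def sarsa_update(Q, s, a, x, s_next, a_next, alpha, gamma):
    """Bootstrap from the action actually taken at step n + 1."""
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (x + gamma * Q[(s_next, a_next)])


# Minimal usage: an empty table whose entries default to the zero vector.
Q = defaultdict(complex)
q_learning_update(Q, s="s0", a=0, x=1 + 0j, s_next="s1",
                  theta_n=1 + 0j, actions=(0, 1), alpha=0.5, gamma=0.9)
```

The only difference between the two functions is which next-step entry they bootstrap from: the Q-learning update selects it using the present preference vector, while the SARSA update uses whatever the agent went on to do.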

I claim that vector-valued Q-learning will converge on taking the second action with probability 1 in both of the environments, while vector-valued SARSA will converge on taking the second action with low probability (depending on $γ$ ) in environment 1, and taking the second action with probability 1 in environment 2. Thus vector-valued Q-learning is not cautious, but vector-valued SARSA might be. (In fact, I conjecture that it is!)

The intuition for my claims is that vector-valued Q-learning presumes it will do the best thing in the future according to its present values, even if its future self has a different $\vec{θ}_{n}$. Vector-valued SARSA instead generalizes from how it actually acted in various states in the past, and so takes into account that when its $\vec{θ}$ is different, it takes different actions than it currently would. Notably, this is the same reason that Q-learning is interruptible and SARSA is not!

The proofs of my specific claims start with showing that the algorithm updates the vector-valued Q-function $\vec{q}$ in the direction of $H \vec{q}$, where $H$ is a contraction mapping; thus there is a unique fixed point. Then it suffices to show that the asserted solutions are fixed points of this operator.

The most complicated one is the proof that vector-valued SARSA only takes the second action in environment 1 with low probability. In this case, we consider mixed strategies independent of $θ$ and say that $p$ is the limiting probability that SARSA takes the second action, define $Q (p)$ to be the limiting $Q$ for the mixed strategy, and seek the $Q (p)$ with the largest real part. Now $\begin{matrix} Q (p) & = (0.9 + 0.1 p) + γ (1 - p) Q (p) + γ p e^{0.01 i π} Q (p) \\ Q (p) & = \frac{0.9 + 0.1 p}{1 - γ + γ p (1 - e^{0.01 i π})} . \end{matrix}$ Numerically, it is easy to see that for $γ$ not near 1 (e.g. below 0.9), $\operatorname{Re} [Q (p)]$ is maximized at $p = 1$; but for $γ$ near 1 (e.g. above 0.99), $\operatorname{Re} [Q (p)]$ is maximized for $p$ near 0. This makes sense, considering that it takes about 20 steps for the results of the second action to begin being worse by the original $θ$ than it would have been to take the first action instead, so the bribes are worthwhile for a steep discount rate but not a shallow one.
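The numerical claim about $\operatorname{Re} [Q (p)]$ can be reproduced with a simple grid scan. The sketch below assumes a uniform grid over $p \in [0, 1]$; the function names and grid size are illustrative choices, not part of the original argument.

```python
import cmath


def Q(p: float, gamma: float, step: float = 0.01) -> complex:
    """Closed-form limiting Q for the mixed strategy with bribe probability p."""
    rot = cmath.exp(1j * cmath.pi * step)
    return (0.9 + 0.1 * p) / (1 - gamma + gamma * p * (1 - rot))


def best_p(gamma: float, grid: int = 1001) -> float:
    """Grid point in [0, 1] at which Re[Q(p)] is largest."""
    ps = [i / (grid - 1) for i in range(grid)]
    return max(ps, key=lambda p: Q(p, gamma).real)


for gamma in (0.5, 0.8, 0.9, 0.99, 0.999):
    p_star = best_p(gamma)
    print(gamma, p_star, Q(p_star, gamma).real)
```

On this grid the maximizer sits at $p = 1$ for $γ = 0.8$ and very close to $0$ for $γ = 0.99$, with the transition happening in between: around $γ = 0.9$ the maximum lands strictly inside the interval.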

Acknowledgments

Connor Flexman helped me write out the first draft of this post. I've had conversations on this topic with Jessica Taylor, Scott Garrabrant, Sam Eisenstat, Tsvi Benson-Tilsen, and others; and Jan Leike turned out to be independently working on some similar topics.
