Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction'

TurnTrout

The safety of artificial intelligence applications involving reinforcement learning is a topic that deserves careful attention.

Foreword

Let's get down to business.

Reinforcement Learning

1: Introduction

2: Multi-armed Bandits

Bandit basics, including nonstationarity, the value of optimism for incentivizing exploration, and upper-confidence-bound action selection.

Some explore/exploit results are relevant to daily life - I highly recommend reading Algorithms to Live By: The Computer Science of Human Decisions.

3: Finite Markov Decision Processes

The framework.

4: Dynamic Programming

Policy evaluation, policy improvement, policy iteration, value iteration, generalized policy iteration. What a nomenclative nightmare.

5: Monte Carlo Methods

Prediction, control, and importance sampling.

Importance Sampling

After gathering data with our behavior policy $b$ , we then want to approximate the value function for the target policy $π$ . In off-policy methods, the policy we use to gather the data is different from the one whose value $v_{π}$ we're trying to learn; in other words, the distribution of states we sample is different. This gives us a skewed picture of $v_{π}$ , so we must overcome this bias.

If $b$ can take all of the actions that $π$ can (i.e., $\forall a, s : π (a | s) > 0 ⟹ b (a | s) > 0$ ), we can overcome by adjusting the return $G_{t}$ of taking a series of actions $A_{t}, \dots, A_{T - 1}$ using the importance-sampling ratio $ρ_{t : T - 1} := \prod_{k = t}^{T - 1} \frac{π (A_{k} | S_{k})}{b (A_{k} | S_{k})}$ . This cleanly recovers $v_{π}$ by the definition of expectation:

\begin{matrix} ρ_{t : T - 1} E_{b} [G_{t} | S_{t} = s] & = (T - 1 \prod k = t \frac{π (A_{k} | S_{k})}{b (A_{k} | S_{k})}) E_{b} [G_{t} | S_{t} = s] = E_{π} [G_{t} | S_{t} = s] = v_{π} (s) . \end{matrix}

Then after observing some set of returns (where ${G_{t}}_{t \in T (s)}$ are the relevant returns for state $s$ ), we define the state's value as the average adjusted observed return

V (s) := \frac{\sum_{t \in T (s)} ρ_{t : T (t) - 1} G_{t}}{| T (s) |} .

However, the $ρ_{t : T (t) - 1}$ 's can be arbitrarily large (suppose $π (A | S) = .5$ and $b (A | S) = 1 \times 10^{- 10}$ ; $\frac{.5}{1 \times 10^{- 10}} = .5 \times 10^{10}$ ), so the variance of this estimator can get pretty big. To get an estimator whose variance converges to 0, try

V (s) := \frac{\sum_{t \in T (s)} ρ_{t : T (t) - 1} G_{t}}{\sum_{t \in T (s)} ρ_{t : T (t) - 1}} .

Death and Discounting

Instead of viewing the discount factor $γ$ as measuring how much we care about the future, think of it as encoding the probability we don't terminate on any given step. That is, we expect with $1 - γ$ probability to die on the next turn, so we discount rewards accordingly.

This intuition hugs pre-theoretic understanding much more closely; if you have just 80 years to live, you might dive that big blue sky. If, however, you imagine there's a non-trivial chance that humanity can be more than just a flash in the pan, you probably care more about the far future.

6: Temporal-Difference Learning

The tabular triple threat: $TD(0)$ , $SARSA$ , and $Q$ -learning.

Learning TD Learning

V (S_{t}) \leftarrow V (S_{t}) + α [TD target      R_{t + 1} + γ V (S_{t + 1}) - V (S_{t})      TD error]

$_{TD(0)} ⋆_{Q}^{SARSA}$

$TD(0)$ is one-step bootstrapping of state values $v_{π}$ . It helps you learn the value of a given policy, and is not used for control.
$SARSA$ is on-policy one-step bootstrapping of action values $q_{π}$ using the quintuples $(S, A, R, S^{'}, A^{'})$ .

As in all on-policy methods, we continually estimate $q_{π}$ for the policy $π$ , and at the same time change $π$ toward greediness with respect to $q_{π}$ .

$Q$ -learning is off-policy one-step bootstrapping of action values $q_{π}$ .

You take an action using $π$ and then use the maximal action value at the next state in your TD target term.

$max$ imization bias

With great branching factors come great biases, and optimistic bias is problematic for $Q$ -learning.

Refresher: the $Q$ -learning update rule for state $S_{t}$ , action $A_{t}$ , and new state $S_{t + 1}$ is

Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [R_{t + 1} + γ max a Q (S_{t + 1}, a) - Q (S_{t}, A_{t})] .

Suppose the rewards for all 100 actions at $S_{t + 1}$ are distributed according to $N (0, 1)$ . All of these actions have a true (expected) value of 0, but the probability that none of their estimates are greater than .8 after 1 draw each is $Φ (.8)^{100} \approx 4.6 \times 10^{- 11}$ . The more actions you have, the higher the probability is that at the maximum is just an outlier. See: regression toward the mean and mutual funds.

To deal with this, we toss another $Q$ -learner into the mix; at any given update, one has their value updated and the other greedifies the policy. The double $Q$ -learning scheme works because both $Q$ -learners are unlikely to be biased in the same way. For some reason, this wasn't discovered until 2010.

7: $n$ -step Bootstrapping

$n$ -step everything.

8: Planning and Learning with Tabular Methods

Models, prioritized sweeping, expected vs. sample updates, MCTS, and rollout algorithms.

Roles Models Play

Distribution models include the full range of possible futures and their probabilities. For example, a distribution model for two fair coins: ${HH: .25, HT: .5, TT: .25}$ .

Sample models just let you, well, sample. $HH, HT, HT, HT, HH, TT, HH, HT$ ... You could sample thousands of times and gain a high degree of confidence that the above distribution model is generating the data, but this wouldn't be your only hypothesis (granted, it might have the lowest Kolmogorov complexity).

Note that a distribution model also can function as a sample model.

9: On-policy Prediction with Approximation

We finally hit the good stuff: value-function approximators and stochastic-/semi-gradient methods.

10: On-policy Control with Approximation

11: Off-policy Methods with Approximation

The $d e a d l y t r i a d$ of function approximation, bootstrapping, and off-policy training converge to sow d̴iv̢erge͏nc̸e͟ and i͟nstab̕il̡i̡t̷y in your methods. Metal.

12: Eligibility Traces

I was told to skip this chapter; the first half was my favorite part of the book.

$TD (λ)$ uses a backwards-view eligibility trace to update features, which I find elegant. Suppose you have some feature extraction function $ϕ : S \to R^{d}$ . Then you apply your TD update not only based on the features relevant at the current state, but also to the time-decaying traces of the features of previously-visited states. $λ \in [0, 1]$ sets how quickly this eligibility decay happens; $λ = 0$ recovers $TD(0)$ , while $λ = 1$ recovers Monte-Carlo return methods.

When I was a kid, there was a museum I loved going to - it had all kinds of wacky interactive devices for kids. One of them took sounds and "held them in the air" at a volume which slowly decreased as a function of how long ago the sound was made. State features are sounds, and volume is eligibility.

13: Policy Gradient Methods

The policy gradient theorem, REINFORCE, and actor-critic.

14: Psychology

Creating a partial mapping between reinforcement learning and psychology.

Mental, Territorial

There was a word I was looking for that "mental model" didn't quite seem to fit: "the model with respect to which we mentally simulate courses of action". CFAR's "inner sim" terminology didn't quite map either, as to me, that points to the system-in-motion more than that-on-which-the-system-runs. The literature dubs this a cognitive map.

Individual place cells within the hippocampus correspond to separate locations in the environment with the sum of all cells contributing to a single map of an entire environment. The strength of the connections between the cells represents the distances between them in the actual environment. The same cells can be used for constructing several environments, though individual cells' relationships to each other may differ on a map by map basis. The possible involvement of place cells in cognitive mapping has been seen in a number of mammalian species, including rats and macaque monkeys. Additionally, in a study of rats by Manns and Eichenbaum, pyramidal cells from within the hippocampus were also involved in representing object location and object identity, indicating their involvement in the creation of cognitive maps.

Wikipedia, Cognitive map

In light of my work on whitelisting, I've been thinking about why we're so "object-oriented" in our mental lives. An off-the-cuff hypothesis: to better integrate with the rest of our mental models, the visual system directly links up to our object schema. One such object is then recognized and engraved as a discrete "thing" in our map. Hence, we emotionally "know" that the world "really is" made up of objects, and isn't just a collection of particles.

Most of the information used by people for the cognitive mapping of spaces is gathered through the visual channel (Lynch, 1960).

Lahav and Mioduser's research somewhat supports this idea, suggesting that blind people not only have lower-fidelity and more declarative (as opposed to procedural / interactive) cognitive maps, they're also less likely to provide object-to-object descriptions.

Epistemic status: I made a not-obviously-wrong hypothesis and found two pieces of corroborating evidence. If anyone who knows more about this wants to chime in, please do!

15: Neuroscience

The reward prediction error hypothesis, dopamine, neural actor-critic, hedonistic neurons, and addiction.

16: Applications and Case Studies

From checkers to checkmate.

17: Frontiers

Temporal abstraction, designing reward signals, and the future of reinforcement learning. In particular, the idea I had for having a whitelist-enabled agent predict expected object-level transitions is actually one of the frontiers: general value functions. Rad.

Pandora's AI Boxing

The rapid pace of advances in AI has led to warnings that AI poses serious threats to our societies, even to humanity itself.

This chapter talks a fair amount about the serious challenges in AI alignment (not sure if you all have heard of that problem), which is heartening.

As to safety, hazards possible with reinforcement learning are not completely different from those that have been managed successfully for related applications of optimization and control methods.

I'm not sure about that. Admittedly, I'm not as familiar with those solutions, but the challenge here seems to be of a qualitatively different caliber. Conditional on true AI's achievement, we'd want to have extremely high confidence that it pans out before we flip the switch. The authors acknowledge that

it may be impossible for the agent to achieve the designer's goal no matter what its reward signal is.

I don't think it's impossible, but it's going to be extremely hard to get formal probabilistic guarantees. I mean, if you don't know an agent's rationality, you can't learn their utility function. If you do know their rationality but not their probability distribution over outcomes, you still can't learn their utility function.

This is just one of the many open problems in alignment theory. If that's not enough, there are always the unknown unknowns...

Final Thoughts

I read the "nearly-complete" draft of the second edition, available here. It was pretty good, but I did find most of the exercises either too easy or requiring considerable programming investment to set up the environment described. The former makes sense, as I've already taken a course on this, and I'm probably a bit above the introductory level.

Some graphs could have been more clearly designed, and some math in the proof of Linear $TD(0)$ 's convergence (p. 206-207) is underspecified:

In general, $w_{t}$ will be reduced toward zero whenever $A$ is positive definite, meaning $y^{⊤} A y > 0$ for real vector $y$ .

An additional precondition: $y$ can't be the zero vector.

For a key matrix of this type, positive definiteness is assured if all of its columns sum to a nonnegative number.

Unless I totally missed what "this type" entails, this is false if taken at face value:

(\begin{matrix} - 2 & 5 5 & - 2 \end{matrix})

has nonnegative column sums and is also indefinite, having eigenvalues of 3 and -7.

However, the claim is true in the subtle way they use it - they're actually showing that since the matrix is symmetric, real, and strictly diagonally dominant with positive diagonal entries, it's also positive definite. This could be made clearer.

In all, reading this book was definitely a positive experience.

Forwards

I'll be finishing Analysis II before moving on to Jaynes's Probability Theory in preparation for a class in the fall.

Dota 2

Recently, OpenAI made waves with their OpenAI Five Dota 2 bot. To REINFORCE what I just learned and solidified, I might make a post in the near future breaking down how Five differs from the Alpha(Go) Zero approach, quantifying my expectations for The International for calibration.

No Longer a Spectator

Four months and one week ago, I started my journey through the MIRI reading list. In those dark days, attempting a proof induced a stupor similar to that I encountered approaching a crush in grade school, my words and thoughts leaving me.

Six textbooks later and with a little effort, I'm able to prove things like the convergence of Monte Carlo integration to the Riemann integral, threading together lessons from All of Statistics and Analysis I; target in mind, words and thoughts remaining firmly by my side.

The rapid pace of advances in artificial intelligence has led to warnings that artificial intelligence poses serious threats to our societies, even to humanity itself. The renowned scientist and artificial intelligence pioneer Herbert Simon anticipated the warnings we are hearing today...

He spoke of the eternal conflict between the promise and perils of any new knowledge, reminding us of the Greek myths of Prometheus, the hero of modern science, who stole fire from the gods for the benefit of mankind, and Pandora, whose box could be opened by a small and innocent action to release untold perils on the world.

Simon urged us to recognize that as designers of our future and not mere spectators, the decisions we make can tilt the scale in Prometheus' favor.

If you are interested in working together to learn MIRI-relevant math, if you have a burning desire to knock the alignment problem down a peg, if you're in a scale-tilting mood - I would be more than happy to work with you. Messaging me may also net you an invitation to the MIRIx Discord server.

[-]Rafael Harth6y70

You're insanely fast. It's quite humbling. I've been doing a course on Reinforcement Learning which is based on the same book, so I've been reading it, too. Got an exam coming up soon. Didn't expect to see it mentioned on LW; I didn't feel like it was on the same level as stuff in Miri's research guide, though I do agree that it's pretty good.

One thing I found a bit irritating was that, during the Multi-armed bandit chapter, it never mentioned what seems like the obvious approach: to do all the exploring at the beginning, and then only exploit for the remainder of the time (given that the problem is stationary). This should be far better than being $ϵ$ -greedy. My approach would have been to start from that idea and see how it can be optimized. In a talk on AI Alignment, Eliezer actually briefly mentions the setting, and that's exactly the approach he says is best.

It is sort of like the optimistic approach the book mentions at some point, where you assign unrealistically high expected values to all bandits, such that any usage will update their estimate downward and thus greedy selection will alternate between them in the beginning. But that seems like a clunky way to do it.

An aside: your link to the book below "Final Thoughts" is broken, there's a bad ']' at the end.

[-]TurnTrout6y30

So I think that only works in environments which are both stationary and deterministic. Otherwise, you'd need to frontload an infinite number of trials for each arm to ensure you have precise estimates; anything less would incur infinite regret as there would be a nonzero probability you'd spend infinite time exploiting a suboptimal arm. This reminds me of conditional convergence, where you can't always rearrange the components of a series and have it still converge to the same sum / at all. I think interleaving exploration and exploitation such that you minimize your regret is the best way to go here.

[-]codyrioux6y20

This more or less regresses to an offline supervised learning model in which a bunch of samples are collected upfront, a model trained, and then used to predict all future actions. While you might be framing your problem as an MDP, you're not doing reinforcement learning in this case. As TurnTrout mentioned in a sibling to this comment it works only in the stationary & deterministic environments which represent toy problems for the space, but ultimately the goal for RL is to function in non-stationary non-deterministic environments so it makes little sense to focus on this path.

[-]Gram Stone6y10

The Lahav and Mioduser link in section 14 is broken for me. Maybe it's just paywalled?

[-]TurnTrout6y10

Sorry about that - should be fixed now?

Yeah it's fixed.

LESSWRONG
LW

33