The safety of artificial intelligence applications involving reinforcement learning is a topic that deserves careful attention.

# Foreword

Let's get down to business.

# Reinforcement Learning

## 1: Introduction

## 2: Multi-armed Bandits

*Bandit basics, including nonstationarity, the value of optimism for incentivizing exploration, and upper-confidence-bound action selection.*

Some explore/exploit results are relevant to daily life - I highly recommend reading *Algorithms to Live By: The Computer Science of Human Decisions*.* *

## 3: Finite Markov Decision Processes

*The framework.*

## 4: Dynamic Programming

*Policy evaluation, policy improvement, policy iteration, value iteration, generalized policy iteration. What a nomenclative nightmare.*

## 5: Monte Carlo Methods

*Prediction, control, and importance sampling.*

### Importance Sampling

After gathering data with our behavior policy , we then want to approximate the value function for the target policy . In off-policy methods, the policy we use to gather the data is different from the one whose value we're trying to learn; in other words, the distribution of states we sample is different. This gives us a skewed picture of , so we must overcome this bias.

If can take all of the actions that can (*i.e.*, ), we can overcome by adjusting the return of taking a series of actions using the importance-sampling ratio . This cleanly recovers by the definition of expectation:

Then after observing some set of returns (where are the relevant returns for state ), we define the state's value as the average adjusted observed return

However, the 's can be arbitrarily large (suppose and ; ), so the variance of this estimator can get pretty big. To get an estimator whose variance converges to 0, try

### Death and Discounting

Instead of viewing the discount factor as measuring how much we care about the future, think of it as encoding the probability we don't terminate on any given step. That is, we expect with probability to die on the next turn, so we discount rewards accordingly.

This intuition hugs pre-theoretic understanding much more closely; if you have just 80 years to live, you might dive that big blue sky. If, however, you imagine there's a non-trivial chance that humanity can be more than just a flash in the pan, you probably care more about the far future.

## 6: Temporal-Difference Learning

*The tabular triple threat: **, **, and **-learning.*

### Learning TD Learning

- is one-step bootstrapping of
*state values*. It helps you learn the value of a given policy, and is not used for control. - is on-policy one-step bootstrapping
*action values*using the quintuples .

As in all on-policy methods, we continually estimate for the policy , and at the same time change toward greediness with respect to .

- -learning is
*off-policy*one-step bootstrapping of action values . - You take an action using and then use the maximal action value at the next state in your TD target term.

### imization bias

With great branching factors come great biases, and optimistic bias is problematic for -learning.

Refresher: the -learning update rule for state , action , and new state is

Suppose the rewards for all 100 actions at are distributed according to . All of these actions have a true (expected) value of 0, but the probability that none of their estimates are greater than .8 after 1 draw each is . The more actions you have, the higher the probability is that at the maximum is just an outlier. See: regression toward the mean and mutual funds.

To deal with this, we toss another -learner into the mix; at any given update, one has their value updated and the other greedifies the policy. The double -learning scheme works because both -learners are unlikely to be biased in the same way. For some reason, this wasn't discovered until 2010.