
## Foreword

Let's get down to business.

## Reinforcement Learning

## 1: Introduction

## 2: Multi-armed Bandits

Bandit basics, including nonstationarity, the value of optimism for incentivizing exploration, and upper-confidence-bound action selection. Some explore/exploit results are relevant to daily life - I highly recommend reading Algorithms to Live By: The Computer Science of Human Decisions.

## 3: Finite Markov Decision Processes

The framework.

## 4: Dynamic Programming

Policy evaluation, policy improvement, policy iteration, value iteration, generalized policy iteration. What a nomenclative nightmare.

## 5: Monte Carlo Methods

Prediction, control, and importance sampling.

## Importance Sampling

After gathering data with our behavior policy b, we then want to approximate the value function for the target policy π. In off-policy methods, the policy we use to gather the data is different from the one whose value vπ we're trying to learn; in other words, the distribution of states we sample is different. This gives us a skewed picture of vπ, so we must overcome this bias.

If b can take all of the actions that π can (i.e., $\forall a,s: \pi(a \mid s) > 0 \implies b(a \mid s) > 0$), we can correct for this skew by adjusting the return $G_t$ of taking a series of actions $A_t, \dots, A_{T-1}$ using the importance-sampling ratio

$$\rho_{t:T-1} := \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

This cleanly recovers $v_\pi$ by the definition of expectation:

$$\mathbb{E}\left[\rho_{t:T-1} G_t \mid S_t = s\right] = v_\pi(s).$$

Then after observing some set of returns (where $\{G_t\}_{t \in T(s)}$ are the relevant returns for state $s$), we define the state's value as the average adjusted observed return

$$V(s) := \frac{\sum_{t \in T(s)} \rho_{t:T(t)-1} G_t}{|T(s)|}.$$
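As a toy illustration (a hypothetical one-step, two-action setup with made-up policies and rewards, not from the text), ordinary importance sampling might be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step episodic setup: a single state and two actions, so
# each return G is a single reward and each ratio rho is a one-term product.
pi = np.array([0.9, 0.1])       # target policy pi(a|s)
b = np.array([0.5, 0.5])        # behavior policy b(a|s); b > 0 wherever pi > 0
rewards = np.array([1.0, 0.0])  # deterministic reward per action

actions = rng.choice(2, size=100_000, p=b)  # data gathered with b
G = rewards[actions]
rho = pi[actions] / b[actions]              # importance-sampling ratios

# Ordinary importance sampling: average the reweighted returns.
V = np.sum(rho * G) / len(G)
# True value under pi is 0.9 * 1.0 + 0.1 * 0.0 = 0.9; V should land nearby.
```

Even though the data heavily over-samples the action π rarely takes, the reweighted average converges to the target policy's value.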

However, the $\rho_{t:T(t)-1}$'s can be arbitrarily large (suppose $\pi(A \mid S) = .5$ and $b(A \mid S) = 1 \times 10^{-10}$; then $\frac{.5}{1 \times 10^{-10}} = .5 \times 10^{10}$), so the variance of this estimator can get pretty big. To get an estimator whose variance converges to 0, try the weighted average

$$V(s) := \frac{\sum_{t \in T(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in T(s)} \rho_{t:T(t)-1}}.$$
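In the same spirit as above (again a made-up one-step setup, chosen so the ratios get large), weighted importance sampling just swaps the denominator from the sample count to the sum of the ratios:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup where the ratios blow up: b almost never takes the
# action that pi favors half the time.
pi = np.array([0.5, 0.5])
b = np.array([0.01, 0.99])
rewards = np.array([1.0, 0.0])  # true value under pi: 0.5

actions = rng.choice(2, size=10_000, p=b)
G = rewards[actions]
rho = pi[actions] / b[actions]  # 50.0 for action 0, ~0.505 for action 1

ordinary = np.sum(rho * G) / len(G)       # unbiased, but high-variance
weighted = np.sum(rho * G) / np.sum(rho)  # biased, variance shrinks to 0
```

One nice property of the weighted estimator: it is a convex combination of the observed returns, so it can never leave their range, no matter how extreme the individual ratios are.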

## Death and Discounting

Instead of viewing the discount factor γ as measuring how much we care about the future, think of it as encoding the probability we don't terminate on any given step. That is, we expect with 1−γ probability to die on the next turn, so we discount rewards accordingly.
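A quick sanity check of this reading (with a made-up constant-reward stream, not from the text): discounting at γ gives the same expected total as collecting undiscounted reward while dying with probability 1−γ each step.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, r = 0.9, 1.0  # survival probability per step, constant per-step reward

# Discounted value of the infinite reward stream: r * sum(gamma^t) = r/(1-gamma)
discounted = r / (1 - gamma)

def lifetime_return():
    """Collect undiscounted reward until death (probability 1-gamma per step)."""
    total = 0.0
    while True:
        total += r
        if rng.random() > gamma:
            return total

estimate = np.mean([lifetime_return() for _ in range(50_000)])
# discounted == 10.0, and the Monte Carlo estimate should land close to it.
```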

This intuition hugs pre-theoretic understanding much more closely; if you have just 80 years to live, you might dive that big blue sky. If, however, you imagine there's a non-trivial chance that humanity can be more than just a flash in the pan, you probably care more about the far future.

## 6: Temporal-Difference Learning

The tabular triple threat: TD(0), SARSA, and Q-learning.

## Learning TD Learning

| Algorithm | What it learns |
| --- | --- |
| TD(0) | state values vπ; it helps you learn the value of a given policy, and is not used for control |
| SARSA | action values qπ, using the quintuples (S,A,R,S′,A′) |
| Q-learning | action values qπ, via off-policy one-step bootstrapping |

## Maximization Bias

With great branching factors come great biases, and optimistic bias is problematic for Q-learning.

Refresher: the Q-learning update rule for state St, action At, and new state St+1 is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right].$$
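In code, one such update step might look like this (a minimal tabular sketch with made-up state/action counts and hyperparameters):

```python
import numpy as np

n_states, n_actions = 5, 2   # hypothetical tiny MDP
alpha, gamma = 0.1, 0.9      # step size and discount
Q = np.zeros((n_states, n_actions))

def q_update(Q, s, a, r, s_next):
    """One Q-learning step: bootstrap off the greedy value at the next state."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(Q, s=0, a=1, r=1.0, s_next=2)
# Q[0, 1] moves from 0 toward the target 1.0: 0 + 0.1 * (1.0 - 0) = 0.1
```

The max over next-state actions is exactly where the bias discussed below sneaks in.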

Suppose the rewards for all 100 actions at St+1 are distributed according to N(0,1). All of these actions have a true (expected) value of 0, but the probability that none of their estimates are greater than .8 after one draw each is $\Phi(.8)^{100} \approx 4.6 \times 10^{-11}$. The more actions you have, the higher the probability that the maximum is just an outlier. See: regression toward the mean and mutual funds.
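This is easy to simulate (illustrative numbers only): draw one N(0,1) estimate per action and look at the value of the apparent best action.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 actions, each with true value 0, each estimated from a single N(0,1) draw.
n_trials, n_actions = 10_000, 100
draws = rng.standard_normal((n_trials, n_actions))
max_estimates = draws.max(axis=1)

# The greedy "best" estimate averages around 2.5 -- pure optimistic bias,
# since every action's true value is 0.
bias = max_estimates.mean()
```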

To deal with this, we toss another Q-learner into the mix; at any given update, one has its value updated and the other greedifies the policy. The double