Foreword
Let's get down to business.
Reinforcement Learning
1: Introduction
2: Multi-armed Bandits
Bandit basics, including nonstationarity, the value of optimism for incentivizing exploration, and upper-confidence-bound action selection.
Some explore/exploit results are relevant to daily life - I highly recommend reading Algorithms to Live By: The Computer Science of Human Decisions.
3: Finite Markov Decision Processes
The framework.
4: Dynamic Programming
Policy evaluation, policy improvement, policy iteration, value iteration, generalized policy iteration. What a nomenclative nightmare.
5: Monte Carlo Methods
Prediction, control, and importance sampling.
Importance Sampling
After gathering data with our behavior policy $b$, we then want to approximate the value function for the target policy $\pi$. In off-policy methods, the policy we use to gather the data is different from the one whose value $v_\pi$ we're trying to learn; in other words, the distribution of states we sample is different. This gives us a skewed picture of $v_\pi$, so we must overcome this bias.
If $b$ can take all of the actions that $\pi$ can (i.e., $\forall a,s: \pi(a|s)>0 \implies b(a|s)>0$), we can overcome this bias by adjusting the return $G_t$ of taking a series of actions $A_t,\dots,A_{T-1}$ using the importance-sampling ratio
$$\rho_{t:T-1} := \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}.$$
This cleanly recovers $v_\pi$ by the definition of expectation:
$$\mathbb{E}\left[\rho_{t:T-1}\, G_t \mid S_t = s\right] = v_\pi(s).$$
Then after observing some set of returns (where $\{G_t\}_{t\in\mathcal{T}(s)}$ are the relevant returns for state $s$), we define the state's value as the average adjusted observed return
$$V(s) := \frac{\sum_{t\in\mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}.$$
However, the $\rho_{t:T(t)-1}$'s can be arbitrarily large (suppose $\pi(A|S)=.5$ and $b(A|S)=1\times10^{-10}$; then $\frac{.5}{1\times10^{-10}} = .5\times10^{10}$), so the variance of this estimator can get pretty big. To get an estimator whose variance converges to 0, try the weighted average
$$V(s) := \frac{\sum_{t\in\mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t\in\mathcal{T}(s)} \rho_{t:T(t)-1}}.$$
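To see the two estimators side by side, here's a minimal sketch on a made-up one-state, one-step problem (the action names, reward means, and policies below are invented for illustration, not from the book):

```python
# Sketch: ordinary vs. weighted importance sampling on a toy one-step problem.
# The setup (two actions, their reward means, the two policies) is made up.
import random

random.seed(0)

# One state, two actions; the return is just the immediate reward.
REWARD_MEAN = {"left": 1.0, "right": 0.0}

def behavior(action):   # b: uniform over both actions
    return 0.5

def target(action):     # pi: always takes "left"
    return 1.0 if action == "left" else 0.0

def sample_episode():
    action = random.choice(["left", "right"])
    reward = random.gauss(REWARD_MEAN[action], 1.0)
    return action, reward

ratios, returns = [], []
for _ in range(10_000):
    a, g = sample_episode()
    ratios.append(target(a) / behavior(a))   # rho for a length-1 episode
    returns.append(g)

ordinary = sum(r * g for r, g in zip(ratios, returns)) / len(returns)
weighted = sum(r * g for r, g in zip(ratios, returns)) / sum(ratios)

print(f"true v_pi(s) = {REWARD_MEAN['left']:.3f}")
print(f"ordinary IS  = {ordinary:.3f}")
print(f"weighted IS  = {weighted:.3f}")
```

Both estimates land near the true value here; the difference shows up when the ratios get extreme, where the weighted version trades a little bias for far less variance.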
Death and Discounting
Instead of viewing the discount factor $\gamma$ as measuring how much we care about the future, think of it as encoding the probability we don't terminate on any given step. That is, we expect with probability $1-\gamma$ to die on the next turn, so we discount rewards accordingly.
This intuition hugs pre-theoretic understanding much more closely; if you have just 80 years to live, you might dive that big blue sky. If, however, you imagine there's a non-trivial chance that humanity can be more than just a flash in the pan, you probably care more about the far future.
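To make the "death" reading precise (a quick sketch, not from the book): suppose the agent always receives $R_{t+1}$, and then must survive each further step, independently of the rewards, with probability $\gamma$ per step in order to collect the next reward. The expected total reward collected before death is then
$$\sum_{k=0}^{\infty} \Pr[\text{survived } k \text{ more steps}]\;\mathbb{E}[R_{t+k+1}] = \sum_{k=0}^{\infty} \gamma^{k}\,\mathbb{E}[R_{t+k+1}] = \mathbb{E}[G_t],$$
which is exactly the expected discounted return.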
6: Temporal-Difference Learning
The tabular triple threat: TD(0), SARSA, and Q-learning.
Learning TD Learning
TD(0): $V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$
SARSA: $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\right]$
Q-learning: $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\right]$
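Restated as code, the three tabular updates differ only in their bootstrap target (a sketch using plain dicts as the tables; the step size and function names are mine, not the book's):

```python
# Tabular updates for TD(0), SARSA, and Q-learning. The only difference is
# what each algorithm bootstraps from at the next state.

ALPHA, GAMMA = 0.1, 0.99  # placeholder step size and discount

def td0_update(V, s, r, s_next):
    """TD(0): move V(s) toward the one-step bootstrapped target."""
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """SARSA: bootstrap from the action actually taken next (on-policy)."""
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions):
    """Q-learning: bootstrap from the best next action (off-policy)."""
    best = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])
```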
Maximization Bias
With great branching factors come great biases, and optimistic bias is problematic for Q-learning.
Refresher: the Q-learning update rule for state $S_t$, action $A_t$, and new state $S_{t+1}$ is
$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\right].$$
Suppose the rewards for all 100 actions at $S_{t+1}$ are distributed according to $\mathcal{N}(0,1)$. All of these actions have a true (expected) value of 0, but the probability that none of their estimates are greater than $.8$ after one draw each is $\Phi(.8)^{100} \approx 4.6\times10^{-11}$. The more actions you have, the higher the probability that the maximum is just an outlier. See: regression toward the mean and mutual funds.
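A quick simulation of the effect (the 100 actions and single draw per action mirror the example above; the code itself is just an illustrative sketch):

```python
# Maximization bias: 100 actions whose true values are all 0, each estimated
# from a single N(0, 1) sample. The max of the estimates sits well above 0.
import random
import statistics

random.seed(0)
NUM_ACTIONS, NUM_TRIALS = 100, 10_000

max_estimates = []
for _ in range(NUM_TRIALS):
    estimates = [random.gauss(0.0, 1.0) for _ in range(NUM_ACTIONS)]
    max_estimates.append(max(estimates))

print("true value of every action: 0.0")
print(f"average max estimate:       {statistics.mean(max_estimates):.2f}")  # ~2.5
print(f"fraction of trials with max > .8: "
      f"{sum(m > 0.8 for m in max_estimates) / NUM_TRIALS:.4f}")  # ~1.0
```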
To deal with this, we toss another Q-learner into the mix; at any given update, one learner picks the greedy action and is updated using the other learner's estimate of that action's value. The double Q-learning scheme works because both Q-learners are unlikely to be biased in the same way. For some reason, this wasn't discovered until 2010.
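Here's a sketch of the double update in tabular form (dict-based tables and names are mine; a coin flip decides which table learns on a given step):

```python
# Double Q-learning update: one table picks the argmax action, the other
# scores it, so a single table's upward noise can't both select and
# evaluate the same action.
import random

ALPHA, GAMMA = 0.1, 0.99  # placeholder step size and discount

def double_q_update(Q1, Q2, s, a, r, s_next, actions):
    # Flip a coin to decide which table gets updated this step.
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1
    # The updated table (now "Q1") selects the greedy action;
    # the other table ("Q2") supplies the value estimate for it.
    a_star = max(actions, key=lambda a_next: Q1[(s_next, a_next)])
    Q1[(s, a)] += ALPHA * (r + GAMMA * Q2[(s_next, a_star)] - Q1[(s, a)])
```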
7: n-step Bootstrapping
n-step everything.
8: Planning and Learning with Tabular Methods
Models, prioritized sweeping, expected vs. sample updates, MCTS, and rollout algorithms.
Roles Models Play
Distribution models include the full range of possible futures and their probabilities. For example, a distribution model for two fair coins assigns probability $\frac{1}{4}$ to each of the outcomes HH, HT, TH, and TT.
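A tiny sketch of what that looks like in code, contrasted with a sample model, which only ever produces one possible future at a time (the dict representation is mine, not the book's):

```python
# Two fair coins: a distribution model enumerates every outcome with its
# probability; a sample model just generates outcomes one at a time.
import random

# Distribution model: the full range of possible futures and their probabilities.
distribution_model = {
    ("H", "H"): 0.25,
    ("H", "T"): 0.25,
    ("T", "H"): 0.25,
    ("T", "T"): 0.25,
}
assert abs(sum(distribution_model.values()) - 1.0) < 1e-12

# Sample model: returns a single possible future, drawn with the right odds.
def sample_model():
    return random.choice("HT"), random.choice("HT")

print(distribution_model)
print(sample_model())
```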