Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction'

by TurnTrout · 8 min read · 5th Jul 2018 · 6 comments


Tags: Reinforcement Learning, Machine Learning, Book Reviews, Summaries
The safety of artificial intelligence applications involving reinforcement learning is a topic that deserves careful attention.


Let's get down to business.

Reinforcement Learning

1: Introduction

2: Multi-armed Bandits

Bandit basics, including nonstationarity, the value of optimism for incentivizing exploration, and upper-confidence-bound action selection.

Some explore/exploit results are relevant to daily life - I highly recommend reading Algorithms to Live By: The Computer Science of Human Decisions.

3: Finite Markov Decision Processes

The framework.

4: Dynamic Programming

Policy evaluation, policy improvement, policy iteration, value iteration, generalized policy iteration. What a nomenclative nightmare.

5: Monte Carlo Methods

Prediction, control, and importance sampling.

Importance Sampling

After gathering data with our behavior policy $b$, we then want to approximate the value function $v_\pi$ for the target policy $\pi$. In off-policy methods, the policy we use to gather the data is different from the one whose value we're trying to learn; in other words, the distribution of states we sample is different. This gives us a skewed picture of $v_\pi$, so we must overcome this bias.

If $b$ can take all of the actions that $\pi$ can (i.e., $\pi(a \mid s) > 0$ implies $b(a \mid s) > 0$), we can overcome the bias by adjusting the return $G_t$ of taking a series of actions using the importance-sampling ratio

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

This cleanly recovers $v_\pi$ by the definition of expectation:

$$\mathbb{E}\left[\rho_{t:T-1} G_t \mid S_t = s\right] = v_\pi(s).$$

Then after observing some set of returns $\{G_t\}_{t \in \mathcal{T}(s)}$ (where $\mathcal{T}(s)$ are the time steps with relevant returns for state $s$), we define the state's value as the average adjusted observed return

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}.$$

However, the $\rho_{t:T-1}$'s can be arbitrarily large (suppose $\pi(a \mid s) = .5$ and $b(a \mid s) = .001$; then $\frac{.5}{.001} = 500$), so the variance of this estimator can get pretty big. To get an estimator whose variance converges to 0, try weighted importance sampling:

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}.$$
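The two estimators are easy to compare numerically. A minimal sketch, assuming a hypothetical one-step problem with two actions (the policies and rewards below are illustrative, not from the book):

```python
import random

random.seed(0)

pi = {"left": 0.9, "right": 0.1}   # target policy (assumed for illustration)
b  = {"left": 0.5, "right": 0.5}   # behavior policy; covers pi's support
reward = {"left": 1.0, "right": 0.0}

ratios, returns = [], []
for _ in range(10_000):
    a = random.choices(list(b), weights=list(b.values()))[0]
    ratios.append(pi[a] / b[a])    # one-step importance-sampling ratio rho
    returns.append(reward[a])      # one-step return G

# Ordinary importance sampling: average of rho * G (unbiased, high variance).
ordinary = sum(r * g for r, g in zip(ratios, returns)) / len(returns)
# Weighted importance sampling: normalize by the ratios (biased, lower variance).
weighted = sum(r * g for r, g in zip(ratios, returns)) / sum(ratios)

true_value = sum(pi[a] * reward[a] for a in pi)  # 0.9 * 1.0 + 0.1 * 0.0 = 0.9
print(ordinary, weighted, true_value)
```

Both estimates land near the true value; the weighted version trades a little bias for much better behavior when the ratios blow up.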

Death and Discounting

Instead of viewing the discount factor $\gamma$ as measuring how much we care about the future, think of it as encoding the probability we don't terminate on any given step. That is, we expect with probability $1-\gamma$ to die on the next turn, so we discount rewards accordingly.
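A quick simulation of this reading (the setup is mine, for illustration): an agent that earns reward 1 per step and survives each step with probability $\gamma$ should accumulate, on average, the discounted value $\gamma/(1-\gamma)$ of the infinite reward stream.

```python
import random

random.seed(0)
gamma = 0.9  # survival probability per step (assumed for illustration)

def lifetime_reward() -> int:
    """Reward 1 per step survived; death occurs with probability 1 - gamma."""
    total = 0
    while random.random() < gamma:  # survive into the next step
        total += 1                  # collect that step's reward
    return total

n = 100_000
mean_reward = sum(lifetime_reward() for _ in range(n)) / n

# Discounted value of reward 1 forever, starting from step 1:
# sum over t >= 1 of gamma^t = gamma / (1 - gamma) = 9.
discounted_value = gamma / (1 - gamma)
print(mean_reward, discounted_value)
```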

This intuition hugs pre-theoretic understanding much more closely; if you have just 80 years to live, you might dive that big blue sky. If, however, you imagine there's a non-trivial chance that humanity can be more than just a flash in the pan, you probably care more about the far future.

6: Temporal-Difference Learning

The tabular triple threat: TD(0), Sarsa, and Q-learning.

Learning TD Learning

  • TD(0) is one-step bootstrapping of state values $V(s)$. It helps you learn the value of a given policy, and is not used for control.
  • Sarsa is on-policy one-step bootstrapping of action values $Q(s,a)$ using the quintuples $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
As in all on-policy methods, we continually estimate $q_\pi$ for the policy $\pi$, and at the same time change $\pi$ toward greediness with respect to $q_\pi$.
  • Q-learning is off-policy one-step bootstrapping of action values $Q(s,a)$.
    • You take an action using the behavior policy and then use the maximal action value at the next state in your TD target term.
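The three updates differ only in their TD target. A minimal tabular sketch (function and variable names are my own, not the book's; `V` and `Q` are plain dicts):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0): bootstrap state values toward r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Sarsa: on-policy; bootstrap on the action actually taken at s'."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Q-learning: off-policy; bootstrap on the best action value at s'."""
    best = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```

Sarsa's target uses the next action the behavior actually chose, while Q-learning's target maximizes over next actions regardless of what the behavior does — which is exactly why it's off-policy.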

Maximization Bias

With great branching factors come great biases, and optimistic bias is problematic for Q-learning.

Refresher: the Q-learning update rule for state $s$, action $a$, reward $r$, and new state $s'$ is

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right].$$

Suppose the rewards for all 100 actions at $s'$ are distributed according to $\mathcal{N}(0,1)$. All of these actions have a true (expected) value of 0, but the probability that none of their estimates is greater than .8 after 1 draw each is $\Phi(.8)^{100} \approx 4.5 \times 10^{-11}$. The more actions you have, the higher the probability that the maximum is just an outlier. See: regression toward the mean and mutual funds.
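The bias is easy to see in simulation (a sketch of the setup just described):

```python
import random
import statistics

random.seed(0)

def max_of_estimates(n_actions: int = 100) -> float:
    """Estimate each action's value from a single N(0, 1) draw, take the max."""
    return max(random.gauss(0, 1) for _ in range(n_actions))

# Every true value is 0, yet the max of the estimates is wildly optimistic.
trials = [max_of_estimates() for _ in range(2_000)]
print(statistics.mean(trials))  # roughly 2.5, far above the true value of 0
```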

To deal with this, we toss another Q-learner into the mix; at any given update, one has its value updated and the other greedifies the policy. The double Q-learning scheme works because both Q-learners are unlikely to be biased in the same way. For some reason, this wasn't discovered until 2010.
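A minimal sketch of that scheme (names are mine, not the book's): each update, a coin flip decides which table is written; the written table selects the argmax action and the *other* table evaluates it, so one table's upward noise can no longer both pick an action and score it.

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha, gamma):
    """One double Q-learning step on two independent tabular estimates."""
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1                                   # update Q2 instead
    best = max(actions, key=lambda a2: Q1[(s_next, a2)])  # Q1 selects...
    target = r + gamma * Q2[(s_next, best)]               # ...Q2 evaluates
    Q1[(s, a)] += alpha * (target - Q1[(s, a)])
```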

7: n-step Bootstrapping