My impression is that steps and episodes are both time periods in a training process, and that these terms are somewhat common in RL. An episode is larger than a step and usually contains many steps.
Is this correct?
Some related questions:
- Are rewards issued at the end of steps, at the end of episodes, or could it be either or both, depending on the particular training process being used?
- Is a step the smallest unit of time in which any observable activity happens? Or could there be a lot of things going on within a single "step"?
- Do these units of time persist in deployment, or are they only relevant during training?
I'd appreciate answers to any of these questions if you know them. (It's ok if you don't know all the answers or don't have time to include them all in a single answer.)
What's the context?
I have seen these terms come up in discussions about myopia in AI.
For example, in Evan Hubinger's post about AI safety via market making, he invokes the terms "per-step" as well as "per-episode" myopia:
Before I talk about the importance of per-step myopia, it's worth noting that debate is fully compatible with per-episode myopia—in fact, it basically requires it. If a debater is not per-episode myopic, then it will try to maximize its reward across all debates, not just the single debate—the single episode—it's currently in. Such per-episode non-myopic agents can then become deceptively aligned, as they might choose to act deceptively during training in order to defect during deployment. Per-episode myopia, however, rules this out. Unfortunately, in my opinion, per-episode myopia seems like a very difficult condition to enforce—once your agents are running multi-step optimization algorithms, how do you tell whether that optimization passes through the episode boundary or not? Enforcing per-step myopia, on the other hand, just requires detecting the existence of multi-step optimization, rather than its extent, which seems considerably easier. Thus, since AI safety via market making is fully compatible with per-step myopia verification, it could be significantly easier to prevent the development of deceptive alignment.
Understanding these terms more clearly will help me understand proposals about myopia, which is relevant to some research I'm doing right now. These also seem like basic or generally important concepts in ML which I'd like to understand better.
This is correct. At least in the quote above, though, the most important distinction is that most RL algorithms propagate credit back across steps within an episode, but not across episodes.
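To make the credit-assignment point concrete, here is a minimal sketch of the "return-to-go" computation that many Monte Carlo style RL methods use. The function names and the toy reward lists are my own, for illustration only; they are not taken from the quoted post or from any particular library. The thing to notice is that the running sum is reset at the start of each episode, so a reward can raise the credit given to earlier steps in the same episode but never to steps in a different episode.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each step to the end of ONE episode."""
    returns = [0.0] * len(rewards)
    running = 0.0  # starts at zero for every episode: nothing leaks in from other episodes
    for t in reversed(range(len(rewards))):
        # Credit for step t includes the reward at t plus (discounted) later rewards.
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def credit_per_episode(episodes: List[List[float]], gamma: float = 1.0) -> List[List[float]]:
    """Apply returns_to_go to each episode independently (hypothetical helper)."""
    return [returns_to_go(episode, gamma) for episode in episodes]


# Toy data (assumed): each inner list holds the per-step rewards of one episode.
episodes = [
    [0.0, 0.0, 1.0],  # episode 1: reward only at the final step
    [0.0, 5.0],       # episode 2: larger reward at its final step
]
credit = credit_per_episode(episodes)
print(credit)
```

In this toy example, every step of episode 1 gets credit for episode 1's final reward, but the 5.0 earned in episode 2 does not flow back into episode 1. This also illustrates one answer to the reward-timing question: rewards are attached to individual steps, and "reward at the end of the episode" is just the special case where every step except the last has reward zero.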