Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I don't think finite-state MDPs are a particularly powerful conceptual tool for designing strong RL algorithms. I'll consider the case of no function approximation first.

It is certainly easier to do RL in a finite-state MDP. The benefit of modeling an environment as a finite-state MDP, and then using an MDP-inspired RL algorithm, is that when the agent searches for plans to follow, it doesn't evaluate the same plans twice.

Instead, it caches the (approximate) "value" for each possible "state", and then if a plan would take it to a state that it has already evaluated, it doesn't have to re-evaluate how the plan would play out from that point on. It already knows, more or less, how much utility it could get thereafter. Compare that to the naïve approach of using a world-model to do a full expectimax search at each timestep.
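To make the contrast concrete, here is a minimal sketch (a toy illustration, not from any actual RL library; the MDP, the names, and the constants are all made up) of naive expectimax-style search versus the same recursion with cached values, which is essentially finite-horizon dynamic programming:

```python
# Toy illustration: naive search vs. caching values per (state, depth).
from functools import lru_cache

# Toy MDP: transitions[state][action] -> list of (next_state, probability)
transitions = {
    "s0": {"a": [("s1", 1.0)], "b": [("s2", 1.0)]},
    "s1": {"a": [("s0", 0.5), ("s2", 0.5)]},
    "s2": {},  # terminal
}
rewards = {"s0": 0.0, "s1": 1.0, "s2": 5.0}
GAMMA = 0.9

def naive_value(state, depth):
    """Re-evaluates the same states over and over as the search tree grows."""
    if depth == 0 or not transitions[state]:
        return rewards[state]
    return rewards[state] + GAMMA * max(
        sum(p * naive_value(s2, depth - 1) for s2, p in outcomes)
        for outcomes in transitions[state].values()
    )

@lru_cache(maxsize=None)
def cached_value(state, depth):
    """Same recursion, but each (state, depth) pair is evaluated only once."""
    if depth == 0 or not transitions[state]:
        return rewards[state]
    return rewards[state] + GAMMA * max(
        sum(p * cached_value(s2, depth - 1) for s2, p in outcomes)
        for outcomes in transitions[state].values()
    )

# Same answer either way; only the amount of repeated work differs.
print(naive_value("s0", 10), cached_value("s0", 10))
```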

The model-the-environment-as-finite-state-MDP-then-do-dynamic-programming approach, or just "the MDP approach" for short, is, I think, all about not searching the same region of the planning search space twice. This is clearly a good thing, but I don't think the MDP approach in RL contains much more conceptual progress toward AGI than that. If I were to try to do a pre-natum of a fairly advanced RL agent, that is, if I tried to anticipate a response to "things went well; why did that happen?", my guess would be that a big part of the answer would be:

It avoids searching much of the planning search space even once, let alone twice.

The MDP approach with function approximation is more powerful, depending on how good the function approximation is. There's no upper bound on how good the MDP approach with function approximation could be, because buried inside the function approximation (whether that's approximation of the value, or the optimal policy, or both) could be some clever RL algorithm that does most of the work on its own. A good function approximator that is able to generate accurate predictions of the value and/or the optimal policy might appear to us to "generalize" well across "similar states". But it's not clear to me to what extent it is a useful abstraction to say that the function approximator thinks in terms of the agent bouncing around a set of states that it classifies as more or less similar to each other.

I don't mean to say that the MDP approach is useless. I'm certainly not against using a TD-style update instead of a full Monte Carlo rollout for training a function approximator; it's better than not using one and effectively searching parts of the planning search space many times over. I just don't think it's a very big deal conceptually.
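For concreteness, here is a toy sketch of the two kinds of update target being compared (illustrative only; the tabular value dictionary, the episode format, and the constants are assumptions, not anything from the post):

```python
# Toy illustration: full Monte Carlo return vs. TD(0) bootstrapped target.
GAMMA, ALPHA = 0.9, 0.1

def monte_carlo_update(value, episode):
    """Update each visited state toward the full observed return."""
    ret = 0.0
    for state, reward, _ in reversed(episode):
        ret = reward + GAMMA * ret
        value[state] += ALPHA * (ret - value[state])

def td0_update(value, episode):
    """Update each visited state toward reward + gamma * V(next state)."""
    for state, reward, next_state in episode:
        target = reward + GAMMA * value.get(next_state, 0.0)
        value[state] += ALPHA * (target - value[state])

value_mc = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
value_td = dict(value_mc)
episode = [("s0", 0.0, "s1"), ("s1", 1.0, "s2"), ("s2", 5.0, None)]
monte_carlo_update(value_mc, episode)
td0_update(value_td, episode)
print(value_mc, value_td)
```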

I think this is one small, very disputable argument against defaulting to a finite-state MDP formalism in AGI safety work. A natural alternative is to consider the agent's entire interaction history as the state, and suppose that the agent is still somehow using clever, efficient heuristics for approximating expectimax planning, with or without built-in methods for caching plans that have already been evaluated. None of this says that there's any cost to using a finite-state MDP formalism for AGI safety work, only that the benefits don't seem so great as to make it a "natural choice".
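As a minimal sketch of that alternative (illustrative only; the types and names are made up), the "state" is just the interaction history so far, and the "transition" merely appends to it, which makes the environment trivially Markov in this state:

```python
# Toy illustration: treating the entire interaction history as the state.
from typing import Tuple

History = Tuple[Tuple[str, str], ...]  # sequence of (action, observation) pairs

def next_history(history: History, action: str, observation: str) -> History:
    """The 'transition': append the latest action/observation to the history."""
    return history + ((action, observation),)

h: History = ()
h = next_history(h, "a0", "o0")
h = next_history(h, "a1", "o1")
print(h)  # (('a0', 'o0'), ('a1', 'o1'))
```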

Comments

IMO there are two reasons why finite-state MDPs are useful.

First, proving regret bounds for finite-state MDPs is just easier than for infinite-state MDPs (of course any environment can be thought of as an infinite-state MDP), so it serves as a good warm-up even if you want to go beyond it. Certainly many problems can be captured already within this simple setting. Moreover, some algorithms and proof techniques for finite-state MDPs can be generalized to e.g. continuous MDPs (which is already a far more general setting).
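(For orientation, the standard form of such a bound, not specific to this comment: after $T$ steps the regret is

$$\mathrm{Regret}(T) \;=\; T\rho^{*} \;-\; \sum_{t=1}^{T} r_t,$$

where $\rho^{*}$ is the optimal average reward, and e.g. the classic UCRL2 analysis of Jaksch, Ortner and Auer gives $\mathrm{Regret}(T) = \tilde{O}(DS\sqrt{AT})$ with high probability for a finite-state MDP with $S$ states, $A$ actions and diameter $D$.)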

Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where "features" in this case corresponds to a mapping from histories to states. Now, of course there needn't be any projection into a finite state space that preserves the exact dynamics of the environment. However, if your algorithm can work with approximate models (as it must anyway), for example using my quasi-Bayesian approach, then such MDP models can be powerful.

Regarding regret bounds, I don't think regret bounds are realistic for an AGI, unless it queries an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time it picks an action on its own, we cannot be sure it is acting optimally.

Certainly many problems can be captured already within this simple setting.

Definitely. But I think many of the difficulties with general intelligence are not captured in the simple setting. I certainly don't want to say there's no place for MDPs.

continuous MDPs

I don't quite know what to think of continuous MDPs. I'll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it's not a whole lot more powerful than the finite-state MDP formalism.
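To gesture at what that conjecture would mean in practice, a rough sketch (illustrative only; the toy dynamics, the grid size, and the function names are all assumptions): discretize the compact state space into a uniform grid and run the finite-state machinery on the grid points, accepting a per-step error on the order of the grid spacing times the Lipschitz constant.

```python
# Toy illustration: approximating a Lipschitz continuous MDP on [0, 1]
# by a uniform finite grid of states.
import numpy as np

N_BINS = 100
grid = np.linspace(0.0, 1.0, N_BINS)          # finite "states" covering [0, 1]

def continuous_step(x, a):
    """Toy Lipschitz dynamics on [0, 1]; purely illustrative."""
    return np.clip(0.9 * x + 0.1 * a, 0.0, 1.0)

def to_bin(x):
    """Project a continuous state onto the nearest grid point."""
    return int(round(x * (N_BINS - 1)))

def discretized_step(i, a):
    """Finite-state surrogate: step from the grid point, snap back to the grid."""
    return to_bin(continuous_step(grid[i], a))

# The gap between the true next state and the surrogate's grid point is
# bounded by roughly (Lipschitz constant + 1) * grid spacing, which
# shrinks to zero as N_BINS grows.
x, i = 0.37, to_bin(0.37)
print(continuous_step(x, 1.0), grid[discretized_step(i, 1.0)])
```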

Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where "features" in this case corresponds to a mapping from histories to states.

Yeah, I think there's been some good progress on this. But the upshot of those MDP techniques is mainly to not search through the same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there's a good chance that we'll get the don't-evaluate-plans-twice behavior for free.

Regarding regret bounds, I don't think regret bounds are realistic for an AGI, unless it queries an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time it picks an action on its own, we cannot be sure it is acting optimally.

First, you can have a subjective regret bound which doesn't require all actions to be recoverable (it does require some actions to be approximately recoverable, which is indeed the case in the real world).

Second, dealing rationally with non-recoverable actions should still translate into mathematical conditions, some of which might still look like a sort of regret bound, and in any case finite MDPs are a natural starting point for analyzing them.

Third, checking regret bounds for priors in which all actions are recoverable serves as a sanity test for candidate AGI algorithms. It is not a sufficient desideratum, but I do think it is necessary.

But I think many of the difficulties with general intelligence are not captured in the simple setting

I agree that some of the difficulties are not captured. I am curious whether you have more concrete examples in mind than what you wrote in the post?

I don't quite know what to think of continuous MDPs. I'll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it's not a whole lot more powerful than the finite-state MDP formalism.

This seems wrong to me. Can you elaborate on what you mean by "powerful" in this context? Continuous MDPs definitely describe a large variety of environments that cannot be captured by a finite-state MDP, at least not without approximations. Solving continuous MDPs can also be much more difficult than solving finite-state MDPs. For example, any POMDP can be made into a continuous MDP by treating beliefs as states, and finding the optimal policy for a POMDP is PSPACE-hard (as opposed to the case of finite-state MDPs, which is P-easy).
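For concreteness, the belief-MDP reduction mentioned here looks like the following sketch (the standard construction in toy form; the array shapes and numbers are illustrative, not from the comment):

```python
# Toy illustration: the belief-MDP reduction of a POMDP. The "state" of the
# continuous MDP is a distribution over hidden states, updated by Bayes' rule.
import numpy as np

def belief_update(belief, T, O, action, observation):
    """
    belief: shape (S,) distribution over hidden states
    T: shape (A, S, S), T[a, s, s'] = P(s' | s, a)
    O: shape (A, S, O), O[a, s', o] = P(o | s', a)
    Returns the posterior belief, a point in the continuous simplex.
    """
    predicted = belief @ T[action]                   # P(s' | history, action)
    unnormalized = predicted * O[action][:, observation]
    return unnormalized / unnormalized.sum()

T = np.array([[[0.9, 0.1], [0.2, 0.8]]])            # one action, two hidden states
O = np.array([[[0.7, 0.3], [0.1, 0.9]]])            # two observations
print(belief_update(np.array([0.5, 0.5]), T, O, action=0, observation=1))
```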

But the upshot of those MDP techniques is mainly to not search through the same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there's a good chance that we'll get the don't-evaluate-plans-twice behavior for free.

I guess that you might be thinking exclusively of algorithms that have something like a uniform prior over transition kernels. In this case there is obviously no way to learn about a state without visiting it. But we can also consider algorithms with more sophisticated priors and get much faster learning rates (if the environment is truly sampled from this prior, of course). The best example is, I think, the work of Osband and Van Roy where a regret bound is derived that scales with a certain dimension parameter of the hypothesis space (that can be much smaller than the number of states and actions), work on which I continued to build.
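As a compressed illustration of the posterior-sampling idea (this is only a generic Thompson-sampling-style sketch with a Dirichlet prior over transition kernels, resampling every step for brevity; it is not Osband and Van Roy's actual algorithm or regret analysis):

```python
# Toy illustration: sample an MDP from the posterior, plan in the sample,
# act, update the posterior.
import numpy as np

S, A, GAMMA, HORIZON = 3, 2, 0.9, 50
rng = np.random.default_rng(0)
true_T = rng.dirichlet(np.ones(S), size=(S, A))      # unknown environment
rewards = rng.random((S, A))
counts = np.ones((S, A, S))                          # Dirichlet prior pseudo-counts

def plan(T):
    """Value iteration on the sampled model; returns a greedy policy."""
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(200):
        Q = rewards + GAMMA * T @ V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

state = 0
for t in range(HORIZON):
    sampled_T = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                          for s in range(S)])
    policy = plan(sampled_T)
    action = policy[state]
    next_state = rng.choice(S, p=true_T[state, action])
    counts[state, action, next_state] += 1           # Bayesian posterior update
    state = next_state
```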

If I understand correctly, your issue is with the Markov property of MDPs? It simplifies the computation of the policy by not requiring knowledge of the path by which the agent arrived at a given state, but it also discards any information about the history that is not written down in the state itself.

Not sure if you know it or if it is that useful, but this section of "Reinforcement Learning: An Introduction" discusses ways to go beyond MDPs and the Markov property.

I don't really go into the potential costs of a finite-state-Markov assumption here. The point of this post is mostly to claim that it's not a hugely useful framework for thinking about RL.

The short answer for why I think there are costs to it is that the world is not finite-state Markov, certainly not fully observable finite state Markov. So yes, it could "remove information" by oversimplifying.

That section of the textbook seems to describe the alternative I mentioned: treating the whole interaction history as the state. It's not finite-state anymore, but you can still treat the environment as fully observable without losing any generality, so that's good. So if I were to take issue more strongly here, my issue would not be with the Markov property, but the finite state-ness.

The point of this post is mostly to claim that it's not a hugely useful framework for thinking about RL.

Even though I agree it's unrealistic, MDPs are still easier to prove things in, and I still think that they can give us important insights. For example, if I had started with more complex environments when I was investigating instrumental convergence, I would've spent a ton of extra time grappling with the theorems for little perceived benefit. That is, the MDP framework let me more easily cut to the core insights. Sometimes it's worth thinking about more general computable environments, but probably not always.