Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I don't think finite-state MDPs are a particularly powerful conceptual tool for designing strong RL algorithms. I'll consider the case of no function approximation first.

It is certainly easier to do RL in a finite-state MDP. The benefit of modeling an environment as a finite-state MDP, and then using an MDP-inspired RL algorithm, is that when the agent searches for plans to follow, it doesn't evaluate the same plans twice.

Instead, it caches the (approximate) "value" for each possible "state", and then if a plan would take it to a state that it has already evaluated, it doesn't have to re-evaluate how the plan would play out from that point on. It already knows, more or less, how much utility it could get thereafter. Compare that to the naïve approach of using a world-model to do a full expectimax search at each timestep.
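To make the contrast concrete, here is a minimal sketch (a toy illustration, not from any actual RL library; the MDP, the names, and the constants are all made up) of naive expectimax-style search versus the same recursion with cached values, which is essentially finite-horizon dynamic programming:

```python
# Toy illustration: naive search vs. caching values per (state, depth).
from functools import lru_cache

# Toy MDP: transitions[state][action] -> list of (next_state, probability)
transitions = {
    "s0": {"a": [("s1", 1.0)], "b": [("s2", 1.0)]},
    "s1": {"a": [("s0", 0.5), ("s2", 0.5)]},
    "s2": {},  # terminal
}
rewards = {"s0": 0.0, "s1": 1.0, "s2": 5.0}
GAMMA = 0.9

def naive_value(state, depth):
    """Re-evaluates the same states over and over as the search tree grows."""
    if depth == 0 or not transitions[state]:
        return rewards[state]
    return rewards[state] + GAMMA * max(
        sum(p * naive_value(s2, depth - 1) for s2, p in outcomes)
        for outcomes in transitions[state].values()
    )

@lru_cache(maxsize=None)
def cached_value(state, depth):
    """Same recursion, but each (state, depth) pair is evaluated only once."""
    if depth == 0 or not transitions[state]:
        return rewards[state]
    return rewards[state] + GAMMA * max(
        sum(p * cached_value(s2, depth - 1) for s2, p in outcomes)
        for outcomes in transitions[state].values()
    )

# Same answer either way; only the amount of repeated work differs.
print(naive_value("s0", 10), cached_value("s0", 10))
```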

The model-the-environment-as-finite-state-MDP-then-do-dynamic-programming approach, or just "the MDP approach" for short, is, I think, all about not searching the same region of the planning search space twice. This is clearly a good thing, but I don't think the MDP approach in RL contains much more conceptual progress toward AGI than that. If I were to try to do a pre-natum of a fairly advanced RL agent, that is, if I tried to anticipate a response to "things went well; why did that happen?", my guess would be that a big part of the answer would be:

It avoids searching much of the planning search space even once, let alone twice.

The MDP approach with function approximation is more powerful, depending on how good the function approximation is. There's no upper bound on how good the MDP approach with function approximation could be, because buried inside the function approximation (whether that's approximation of the value, or the optimal policy, or both) could be some clever RL algorithm that does most of the work on its own. A good function approximator that is able to generate accurate predictions of the value and/or the optimal policy might appear to us to "generalize" well across "similar states". But it's not clear to me to what extent it is a useful abstraction to say that the function approximator thinks in terms of the agent bouncing around a set of states that it classifies as more or less similar to each other.

I don't mean to say that the MDP approach is useless. I'm certainly not against using a TD-style update instead of a full Monte Carlo rollout for training a function approximator; it's better than not using one and effectively searching parts of the planning search space many times over. I just don't think it's a very big deal conceptually.
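For concreteness, here is a toy sketch of the two kinds of update target being compared (illustrative only; the tabular value dictionary, the episode format, and the constants are assumptions, not anything from the post):

```python
# Toy illustration: full Monte Carlo return vs. TD(0) bootstrapped target.
GAMMA, ALPHA = 0.9, 0.1

def monte_carlo_update(value, episode):
    """Update each visited state toward the full observed return."""
    ret = 0.0
    for state, reward, _ in reversed(episode):
        ret = reward + GAMMA * ret
        value[state] += ALPHA * (ret - value[state])

def td0_update(value, episode):
    """Update each visited state toward reward + gamma * V(next state)."""
    for state, reward, next_state in episode:
        target = reward + GAMMA * value.get(next_state, 0.0)
        value[state] += ALPHA * (target - value[state])

value_mc = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
value_td = dict(value_mc)
episode = [("s0", 0.0, "s1"), ("s1", 1.0, "s2"), ("s2", 5.0, None)]
monte_carlo_update(value_mc, episode)
td0_update(value_td, episode)
print(value_mc, value_td)
```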

I think this is one small, very disputable argument against defaulting to a finite-state MDP formalism in AGI safety work. A natural alternative is to consider the agent's entire interaction history as the state, and suppose that the agent is still somehow using clever, efficient heuristics for approximating expectimax planning, with or without built-in methods for caching plans that have already been evaluated. None of this says that there's any cost to using a finite-state MDP formalism for AGI safety work, only that the benefits don't seem so great as to make it a "natural choice".
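As a minimal sketch of that alternative (illustrative only; the types and names are made up), the "state" is just the interaction history so far, and the "transition" merely appends to it, which makes the environment trivially Markov in this state:

```python
# Toy illustration: treating the entire interaction history as the state.
from typing import Tuple

History = Tuple[Tuple[str, str], ...]  # sequence of (action, observation) pairs

def next_history(history: History, action: str, observation: str) -> History:
    """The 'transition': append the latest action/observation to the history."""
    return history + ((action, observation),)

h: History = ()
h = next_history(h, "a0", "o0")
h = next_history(h, "a1", "o1")
print(h)  # (('a0', 'o0'), ('a1', 'o1'))
```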

Comments

IMO there are two reasons why finite-state MDPs are useful.

First, proving regret bounds for finite-state MDPs is just easier than for infinite-state MDPs (of course any environment can be thought of as an infinite-state MDP), so it serves as a good warm-up even if you want to go beyond it. Certainly many problems can be captured already within this simple setting. Moreover, some algorithms and proof techniques for finite-state MDPs can be generalized to e.g. continuous MDPs (which is already a far more general setting).
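(For orientation, the standard form of such a bound, not specific to this comment: after $T$ steps the regret is

$$\mathrm{Regret}(T) \;=\; T\rho^{*} \;-\; \sum_{t=1}^{T} r_t,$$

where $\rho^{*}$ is the optimal average reward, and e.g. the classic UCRL2 analysis of Jaksch, Ortner and Auer gives $\mathrm{Regret}(T) = \tilde{O}(DS\sqrt{AT})$ with high probability for a finite-state MDP with $S$ states, $A$ actions and diameter $D$.)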

Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where "features" in this case corresponds to a mapping from histories to states. Now, of course there needn't be any projection into a finite state space that preserves the exact dynamics of the environment. However, if your algorithm can work with approximate models (as it must anyway), for example using my quasi-Bayesian approach, then such MDP models can be powerful.

Regarding regret bounds, I don't think regret bounds are realistic for an AGI, unless it queries an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time it picks an action on its own, we cannot be sure it is acting optimally.

Certainly many problems can be captured already within this simple setting.

Definitely. But I think many of the difficulties with general intelligence are not captured in the simple setting. I certainly don't want to say there's no place for MDPs.

continuous MDPs

I don't quite know what to think of continuous MDPs. I'll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it's not a whole lot more powerful than the finite-state MDP formalism.
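To gesture at what that conjecture would mean in practice, a rough sketch (illustrative only; the toy dynamics, the grid size, and the function names are all assumptions): discretize the compact state space into a uniform grid and run the finite-state machinery on the grid points, accepting a per-step error on the order of the grid spacing times the Lipschitz constant.

```python
# Toy illustration: approximating a Lipschitz continuous MDP on [0, 1]
# by a uniform finite grid of states.
import numpy as np

N_BINS = 100
grid = np.linspace(0.0, 1.0, N_BINS)          # finite "states" covering [0, 1]

def continuous_step(x, a):
    """Toy Lipschitz dynamics on [0, 1]; purely illustrative."""
    return np.clip(0.9 * x + 0.1 * a, 0.0, 1.0)

def to_bin(x):
    """Project a continuous state onto the nearest grid point."""
    return int(round(x * (N_BINS - 1)))

def discretized_step(i, a):
    """Finite-state surrogate: step from the grid point, snap back to the grid."""
    return to_bin(continuous_step(grid[i], a))

# The gap between the true next state and the surrogate's grid point is
# bounded by roughly (Lipschitz constant + 1) * grid spacing, which
# shrinks to zero as N_BINS grows.
x, i = 0.37, to_bin(0.37)
print(continuous_step(x, 1.0), grid[discretized_step(i, 1.0)])
```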

Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where "features" in this case corresponds to a mapping from histories to states.

Yeah, I think there's been some good progress on this. But the upshot of those MDP techniques is mainly to not search through the same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there's a good chance that we'll get the don't-evaluate-plans-twice behavior for free.

Regarding regret bounds, I don't think regret bounds are realistic for an AGI, unless it queries an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time it picks an action on its own, we cannot be sure it is acting optimally.

First, you can have a subjective regret bound which doesn't require all actions to be recoverable (it does require some actions to be approximately recoverable, which is indeed the case in the real world).

Second, dealing rationally with non-recoverable actions should still translate into mathematical conditions, some of which might still look like a sort of regret bound, and in any case finite MDPs are a natural starting point for analyzing them.

Third, checking regret bounds for priors in which all actions are recoverable serves as a sanity test for candidate AGI algorithms. It is not a sufficient desideratum, but I do think it is necessary.

But I think many of the difficulties with general intelligence are not captured in the simple setting

I agree that some of the difficulties are not captured. I am curious whether you have more concrete examples in mind than what you wrote in the post?

I don't quite know what to think of continuous MDPs. I'll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it's not a whole lot more powerful than the finite-state MDP formalism.

This seems wrong to me. Can you elaborate on what you mean by "powerful" in this context? Continuous MDPs definitely describe a large variety of environments that cannot be captured by a finite-state MDP, at least not without approximations. Solving continuous MDPs can also be much more difficult than solving finite-state MDPs. For example, any POMDP can be made into a continuous MDP by treating beliefs as states, and finding the optimal policy for a POMDP is PSPACE-hard (as opposed to the case of finite-state MDPs, which is P-easy).
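For concreteness, the belief-MDP reduction mentioned here looks like the following sketch (the standard construction in toy form; the array shapes and numbers are illustrative, not from the comment):

```python
# Toy illustration: the belief-MDP reduction of a POMDP. The "state" of the
# continuous MDP is a distribution over hidden states, updated by Bayes' rule.
import numpy as np

def belief_update(belief, T, O, action, observation):
    """
    belief: shape (S,) distribution over hidden states
    T: shape (A, S, S), T[a, s, s'] = P(s' | s, a)
    O: shape (A, S, O), O[a, s', o] = P(o | s', a)
    Returns the posterior belief, a point in the continuous simplex.
    """
    predicted = belief @ T[action]                   # P(s' | history, action)
    unnormalized = predicted * O[action][:, observation]
    return unnormalized / unnormalized.sum()

T = np.array([[[0.9, 0.1], [0.2, 0.8]]])            # one action, two hidden states
O = np.array([[[0.7, 0.3], [0.1, 0.9]]])            # two observations
print(belief_update(np.array([0.5, 0.5]), T, O, action=0, observation=1))
```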

But the upshot of those MDP techniques is mainly to not search through the same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there's a good chance that we'll get the don't-evaluate-plans-twice behavior for free.

I guess that you might be thinking exclusively of algorithms that have something like a uniform prior over transition kernels. In this case there is obviously no way to learn about a state without visiting it. But we can also consider algorithms with more sophisticated priors and get much faster learning rates (if the environment is truly sampled from this prior, of course). The best example is, I think, the work of Osband and Van Roy where a regret bound is derived that scales with a certain dimension parameter of the hypothesis space (that can be much smaller than the number of states and actions), work on which I continued to build.
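As a compressed illustration of the posterior-sampling idea (this is only a generic Thompson-sampling-style sketch with a Dirichlet prior over transition kernels, resampling every step for brevity; it is not Osband and Van Roy's actual algorithm or regret analysis):

```python
# Toy illustration: sample an MDP from the posterior, plan in the sample,
# act, update the posterior.
import numpy as np

S, A, GAMMA, HORIZON = 3, 2, 0.9, 50
rng = np.random.default_rng(0)
true_T = rng.dirichlet(np.ones(S), size=(S, A))      # unknown environment
rewards = rng.random((S, A))
counts = np.ones((S, A, S))                          # Dirichlet prior pseudo-counts

def plan(T):
    """Value iteration on the sampled model; returns a greedy policy."""
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(200):
        Q = rewards + GAMMA * T @ V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

state = 0
for t in range(HORIZON):
    sampled_T = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                          for s in range(S)])
    policy = plan(sampled_T)
    action = policy[state]
    next_state = rng.choice(S, p=true_T[state, action])
    counts[state, action, next_state] += 1           # Bayesian posterior update
    state = next_state
```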

If I understand correctly, your issue is with the Markov property of MDPs? It simplifies the computation of the policy by not requiring knowledge of the path by which the agent arrived at a given state, but it also discards any information about the history that is not written down in the state itself.

Not sure if you know it or if it is that useful, but this section of "Reinforcement Learning: An Introduction" discusses ways to go beyond MDPs and the Markov property.

I don't really go into the potential costs of a finite-state-Markov assumption here. The point of this post is mostly to claim that it's not a hugely useful framework for thinking about RL.

The short answer for why I think there are costs to it is that the world is not finite-state Markov, certainly not fully observable finite state Markov. So yes, it could "remove information" by oversimplifying.

That section of the textbook seems to describe the alternative I mentioned: treating the whole interaction history as the state. It's not finite-state anymore, but you can still treat the environment as fully observable without losing any generality, so that's good. So if I were to take issue more strongly here, my issue would not be with the Markov property, but the finite state-ness.

The point of this post is mostly to claim that it's not a hugely useful framework for thinking about RL.

Even though I agree it's unrealistic, MDPs are still easier to prove things in, and I still think that they can give us important insights. For example, if I had started with more complex environments when I was investigating instrumental convergence, I would've spent a ton of extra time grappling with the theorems for little perceived benefit. That is, the MDP framework let me more easily cut to the core insights. Sometimes it's worth thinking about more general computable environments, but probably not always.