Epistemological Status: Correct, but lazily written.
Basically, a while ago John made a post attempting to fix the Good Regulator Theorem. I've refrained from writing this simply because I know it may come off as "why does anyone care, isn't this trivial?" However, more recently John has set up a research program centered around creating more results of this type, so perhaps it's worth introducing a new opinion.
I guess I'm also slowly becoming a cynical grad student. Since I'm at UIUC, where Conant and Ashby were affiliated, I simply went around and asked a few ECE friends about the internal model principle. While it's a skewed sample, the friends and teacher I asked didn't recognize it. When I explained it further, my teacher told me the result is trivial, and that's probably why no one knows about it.
I personally felt like the result is obviously just a corollary of the minimal map theorem. The internal model principle is a special case. There is no contradiction between the internal model principle and the existence of model-free RL because we only care about modeling outcomes. I'll spend the rest of this post trying to articulate these intuitions. While I'm going to use some math, I personally didn't think these results justified rigor so you may notice holes in some of the reasoning.
I'll suppose we have some sort of decision process that depends on history arbitrarily. I'll abstract most of that away and propose something like the following: we'll take in an initial state, and then our agent will propose a policy $\pi$ that acts in the environment as a function of its observation history. This ultimately will yield some terminal state. Specifically, we'll suppose the terminal state is equivalent to a finite collection of outcome features $X = (X_1, \ldots, X_n)$ that can be evaluated using a utility function $u(X)$. Note, I've restricted to a finite outcome space out of laziness, not necessity.
Minimal Model Lemma: Up to isomorphism, any information in collectively observing the outcome features $X$ for a given policy $\pi$ is equivalent to the distribution $P[X \mid \pi]$. Moreover, this measure is minimal in the sense that any other representation is either equivalent in information content or lossy.
This is a simple corollary of the Minimal Map Lemma, so I'm not going to prove it here. It is just an alternative way of saying that the probability distribution $P[X \mid \pi]$ is a minimal sufficient statistic for describing outcomes. When these outcomes are thought of as sequences, it makes sense to consider this distribution as a model.
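As a toy illustration of this screening-off, here is a minimal Python sketch (the outcome space, probabilities, and utilities are invented for illustration, not from the lemma itself): two policies implemented however differently are indistinguishable to any evaluation of outcomes as soon as they induce the same $P[X \mid \pi]$.

```python
# Hypothetical three-outcome space with an arbitrary utility function.
outcomes = ["win", "draw", "loss"]
utility = {"win": 1.0, "draw": 0.0, "loss": -1.0}

# Two policies with different internals but the same induced
# outcome distribution P[X | pi].
p_a = {"win": 0.5, "draw": 0.3, "loss": 0.2}  # P[X | pi_A]
p_b = {"win": 0.5, "draw": 0.3, "loss": 0.2}  # P[X | pi_B]

def expected_utility(p):
    # Expected utility depends on the policy only through P[X | pi].
    return sum(p[x] * utility[x] for x in outcomes)
```

Any evaluation that is a function of outcomes factors through the distribution, so no utility function can separate the two policies.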
The (Minimal) Sufficient Regulator: Modeling $P[X \mid \pi]$ is sufficient to maximize the expected utility of $u(X)$. Modeling the expected utility $\mathbb{E}[u(X) \mid \pi]$ alone is minimally sufficient to maximize expected utility.
Proof: Assuming the evaluation is carried out over a finite collection of outcome features $X$, we may invoke the minimal model lemma to conclude that $P[X \mid \pi]$ is a minimal model of the outcome features $X$. This immediately implies a condition on the mutual information between the variables, $$I(X; \pi) = I(X; P[X \mid \pi]).$$ Therefore, $P[X \mid \pi]$ is a minimal sufficient statistic for $X$. However, when we try to perform utility maximization on the outcome features induced by a policy we are solving $$\max_{\pi} \mathbb{E}[u(X) \mid \pi].$$ This time the minimal model lemma implies that $\mathbb{E}[u(X) \mid \pi]$ is a sufficient statistic.
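One way to make the screening-off step explicit (my notation, simply spelling out the expectation over the finite outcome space):

```latex
\mathbb{E}[u(X) \mid \pi] \;=\; \sum_{x} u(x)\, P[X = x \mid \pi]
```

so the expected utility depends on $\pi$ only through the induced distribution $P[X \mid \pi]$, and for ranking policies it suffices to retain only the scalar $\mathbb{E}[u(X) \mid \pi]$.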
Example 1: In the MDP setting, this indicates that finding a policy with a good value function at the initial states is all that really matters. If you can compute the value function efficiently, then you can find a policy efficiently using something like policy improvement. The model of the environment itself does not seem so important.
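To make this concrete, here is a minimal policy-iteration sketch on a hypothetical two-state, two-action MDP (the transition and reward numbers are invented for illustration). The search operates entirely through value functions and greedy improvement, never through any richer representation of the environment than what value computation needs.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[a, s, t] = probability of moving s -> t under action a.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
# R[s, a] = expected immediate reward.
R = np.array([[1.0, 0.0], [0.0, 2.0]])

policy = np.zeros(n_states, dtype=int)
for _ in range(50):
    # Policy evaluation: solve (I - gamma * P_pi) v = r_pi for v.
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    r_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to the value function.
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    new_policy = np.argmax(q, axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```

Each iteration touches the environment only through the linear value equations; once greedy improvement leaves the policy unchanged, the value function certifies optimality.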
Example 2: In optimal control this would reduce to something like the internal model principle. Our basic setup is $$e(t) = r(t) - y(t), \qquad y(t) = (g * u)(t), \qquad u(t) = (c * e)(t),$$ where $e$ is an error signal we'd like to minimize, $r$ is a reference signal, $y$ is an output signal, and $u$ is a control input based on the error signal. This is a closed loop: the scalar error of the controlled system from the reference point is fed back to the controller. The system output is formed through a transfer function $G$ which itself accepts input from a control $C$ that is generated by observing the error. The product $G(s)C(s)$ is the open loop for the system.
The objective is simply to have $e(t) \to 0$. Our previous theorem tells us that modeling the error as a function of our controller is sufficient to produce our desired outcome, if it's possible at all. This is equivalent to modeling the reference signal. However, what is minimally sufficient is to model just the parts of the reference signal relevant to driving the error to zero. We can achieve this last goal directly since the system is linear. We may take the Laplace transform and obtain $$E(s) = \frac{R(s)}{1 + G(s)C(s)}.$$ The main utility of this is that we can analyze what form the controller must take under certain conditions. In fact, assuming every pole of $E(s)$ is either in the open left half-plane or at the origin, and that $sE(s)$ has no poles at the origin, the final value theorem states $$\lim_{t \to \infty} e(t) = \lim_{s \to 0} sE(s) = \lim_{s \to 0} \frac{sR(s)}{1 + G(s)C(s)}.$$ Thus, to ensure the error goes to zero we must choose the controller so that this limit vanishes. This is the Internal Model Principle.
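The final-value-theorem calculation can be checked symbolically. A minimal sympy sketch (the plant and controller choices are my own, purely illustrative): with a step reference $R(s) = 1/s$, a proportional controller leaves a nonzero steady-state error, while a controller with a pole at the origin, i.e. an internal model of the step, drives the error to zero.

```python
import sympy as sp

s = sp.symbols("s")
G = 1 / (s + 1)  # illustrative plant
R = 1 / s        # step reference: pole at the origin

def err(C):
    # Closed-loop error transfer function E(s) = R(s) / (1 + G(s)C(s)).
    return R / (1 + G * C)

# Proportional controller: no pole at the origin, so no internal model
# of the step reference -> nonzero steady-state error.
e_prop = sp.limit(s * err(sp.Integer(10)), s, 0)

# Integral controller: contains the reference's pole at the origin
# (an internal model of the step) -> zero steady-state error.
e_int = sp.limit(s * err(10 / s), s, 0)
```

The two limits are exactly the final-value-theorem expression $\lim_{s \to 0} sR(s)/(1 + G(s)C(s))$ evaluated for each controller.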
Theorem (Internal Model Principle): Assume that the reference signal and the plant transfer function have distinct poles and that the error signal's poles are in the open left half-plane. Then the open-loop controller must contain a model of the reference signal.
Proof: To ensure the error goes to zero we must have $$G(s)C(s) = R(s)\,Q(s)$$ for some polynomial $Q(s)$. Since the plant transfer function and the reference signal share no poles, the open-loop control has the reference signal as a factor through the controller $C(s)$.
Is it interesting that the open-loop controller literally incorporates a model of the reference system? From the previous theorem, we saw it was possible to strip down the controller to the point of having a model of the reference signal and no further. Here the reference signal appears as a factor. We've really shown that one notion of 'containing a model' is equivalent to another. Perhaps that's interesting in its own right; I'm not sure.