**Followup to:** Reinforcement Learning: A Non-Standard Introduction, Reinforcement, Preference and Utility

A reinforcement-learning agent interacts with its environment through the perception of observations and the performance of actions. A very abstract and non-standard description of such an agent is in two parts. The first part, the *inference policy*, tells us what states the agent can be in, and how these states change when the agent receives new input from its environment. The second part, the *action policy*, tells us what action the agent chooses to perform on the environment, in each of its internal states.

There are two special choices for the inference policy, marking two extremes. One extreme is for the agent to remain absolutely oblivious to the information coming its way. The transition from a past internal state to a current one is made independent of the observation, and no entanglement is formed between the agent and the environment. A rock, for example, comes close to being this little alive.

This post focuses on the other extreme, where the agent updates perfectly for the new information.

Keeping track of all the information is easy, on paper. All the agent has to do is maintain the sequence of past observations, the *observable history* O_{1}, O_{2}, ..., O_{t}. As each new observation is perceived, it can simply be appended to the list. Everything the agent can possibly know about the world, anything it can possibly hope to use in choosing actions, is in the observable history - there's no clairvoyance.

But this is far from practical, for several related reasons. Extracting the useful information from the raw observations can be a demanding task. The number of observations to remember grows indefinitely with time, putting a strain on the resources of an agent attempting longevity. The number of possible histories, and with it the number of possible agent states, grows exponentially with time, making it difficult to even specify (let alone decide) what action to take in each one.

Clearly we need some sort of compression when producing the agent's memory state from the observable history. The compression process must meet two requirements. One, as per the premise of this post, is that it preserves all information about the world. The other is that it can be computed sequentially: when computing M_{t}, the agent only has access to the new observation O_{t} and the previous compression M_{t-1}. The explicit values of all previous observations are forever lost.
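To make the sequential requirement concrete, here is a minimal sketch of an inference policy as a plain update function applied one observation at a time. All the names (`run_inference_policy`, `oblivious`, `remember_everything`) are made up for illustration. The two example policies are the two extremes mentioned earlier: the rock that ignores its input, and the agent that appends every observation to its history.

```python
from typing import Callable, Iterable, TypeVar

Memory = TypeVar("Memory")
Observation = TypeVar("Observation")


def run_inference_policy(
    update: Callable[[Memory, Observation], Memory],
    initial_memory: Memory,
    observations: Iterable[Observation],
) -> Memory:
    """Apply M_t = update(M_{t-1}, O_t) once per observation, in order."""
    memory = initial_memory
    for observation in observations:
        # The update sees only the previous memory and the new observation.
        memory = update(memory, observation)
    return memory


def oblivious(memory, observation):
    """The rock: the new memory state ignores the observation entirely."""
    return memory


def remember_everything(memory, observation):
    """The other extreme, uncompressed: append to the observable history."""
    return memory + (observation,)


observations = ["o1", "o2", "o3"]
rock_memory = run_inference_policy(oblivious, "initial", observations)
full_history = run_inference_policy(remember_everything, (), observations)
```

Any compression we settle on has to fit the same `update(memory, observation)` shape; the only question is what it keeps.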

This is a good moment to introduce proper terminology. A function of the observable history is called a *statistic*. Intuitively, applying a function to the data can only decrease, never increase, the amount of information we end up having about the world. This intuition is solid, and the Data Processing Inequality makes it precise. If the function does not lose any information about the world, if looking at the agent's memory is enough and there's nothing more relevant in the observations themselves, then the memory state is a *sufficient statistic* of the observable history, for the world state. The things the agent does forget about past perceptions are not at all informative for the present. Ultimately, when nothing further can be forgotten this way, we are left with a *minimal sufficient statistic*.
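A toy example may help with sufficiency. Suppose, purely for illustration, that the world state is the unchanging bias of a coin, drawn from a small set of candidates, and that each observation is one flip. The sketch below (the names and numbers are my own assumptions) computes the posterior once from the full history and once from only the counts of heads and tails. The two agree, so the order of the flips can be forgotten without losing anything about the world.

```python
from math import isclose

# Hypothetical candidate world states: each is a possible Pr(heads).
BIASES = [0.2, 0.5, 0.8]


def posterior_from_history(history):
    """Posterior over BIASES given the full sequence of flips (1 = heads)."""
    weights = [1.0 / len(BIASES)] * len(BIASES)  # uniform prior
    for flip in history:
        weights = [
            w * (b if flip == 1 else 1.0 - b) for w, b in zip(weights, BIASES)
        ]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_from_counts(heads, tails):
    """Same posterior, computed from the counts alone."""
    weights = [(b ** heads) * ((1.0 - b) ** tails) / len(BIASES) for b in BIASES]
    total = sum(weights)
    return [w / total for w in weights]


# Two different histories with the same counts: 3 heads, 2 tails.
history_a = [1, 1, 0, 1, 0]
history_b = [0, 1, 0, 1, 1]

from_a = posterior_from_history(history_a)
from_b = posterior_from_history(history_b)
from_counts = posterior_from_counts(3, 2)

same = all(
    isclose(x, y) and isclose(y, z) for x, y, z in zip(from_a, from_b, from_counts)
)
```

Here `same` comes out true: the pair (heads, tails) is a sufficient statistic of the history for the bias. This is a static world, which is simpler than the changing world the post is about, but the idea of forgetting only what is uninformative is the same.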

If I tell you the observable history of the agent, what will you know about the world state? If you know the dynamics of the world and how observations are generated, you'll have the *Bayesian belief*: the posterior distribution, which assigns to each world state its probability given everything observed so far:

B_{t}(W_{t}) = Pr(W_{t}|O_{1},...,O_{t})

(where Pr stands for "probability"). Importantly, this can be computed sequentially from B_{t-1} and O_{t}, using Bayes' theorem. (The gory details of how to do this are below in a comment.)
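For a rough idea of what the sequential computation looks like, here is a minimal sketch under simplifying assumptions of my own: a finite set of world states, a transition probability Pr(W_{t}|W_{t-1}) that does not depend on actions (we haven't discussed those yet), and an observation O_{t} that depends only on the current state W_{t}. Under those assumptions the update is

B_{t}(w') ∝ Pr(O_{t}|w') · Σ_{w} Pr(w'|w) · B_{t-1}(w)

followed by normalization. The function name and the example numbers are hypothetical.

```python
def belief_update(belief, observation, transition, emission):
    """One sequential step: from B_{t-1} and O_t to B_t.

    belief[w]        = B_{t-1}(w)
    transition[w][v] = Pr(W_t = v | W_{t-1} = w)   (assumed action-free)
    emission[v][o]   = Pr(O_t = o | W_t = v)
    """
    n = len(belief)
    # Predict: push the old belief through the world dynamics.
    predicted = [
        sum(belief[w] * transition[w][v] for w in range(n)) for v in range(n)
    ]
    # Correct: weight each state by how well it explains the observation.
    unnormalized = [predicted[v] * emission[v][observation] for v in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]


# A made-up two-state world with two possible observations.
PRIOR = [0.5, 0.5]
TRANSITION = [[0.9, 0.1], [0.2, 0.8]]
EMISSION = [[0.8, 0.2], [0.3, 0.7]]

belief = PRIOR
for observation in [0, 1]:
    # Only the previous belief and the new observation are used.
    belief = belief_update(belief, observation, TRANSITION, EMISSION)
```

Note that the loop never looks back at earlier observations: the previous belief carries everything needed to compute the next one.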

Aha! So the Bayesian belief is an expression for *precisely everything* the agent can possibly know about the world. Why not have the agent's memory represent exactly that?

M_{t} = B_{t}

As it turns out, the Bayesian belief is indeed a minimal sufficient statistic of the observable history for the world state. For the agent, it is the truth, the whole truth, and nothing but the truth - and a methodical way to remember the truth, to boot.

Thus we've compressed into the agent's memory all and only the information from its past that is relevant for the present. We've discarded any information that is an artifact of the senses, and is not real. We've discarded any information that *used to be* real, but isn't anymore, because the world has since changed.

The observant reader will notice that we haven't discussed actions yet. We're getting there. The question of what information is relevant *for future actions* is deep enough to justify this meticulous exposition. For the moment, just note that keeping a sufficient statistic for the current world state is also sufficient for the *controllable future*, since the future is independent of the past given the present.

What we have established here is an "ultimate solution" for how a reinforcement-learning agent should maintain its memory state. It should update a Bayesian belief of what the current world state is. This inference policy is so powerful and natural that standard reinforcement learning doesn't even make a distinction between the Bayesian belief and the agent's memory state, ignoring anything else we could imagine the latter to be.

**Continue reading:** Point-Based Value Iteration