**Followup to:** Reinforcement Learning: A Non-Standard Introduction, Reinforcement, Preference and Utility

A reinforcement-learning agent interacts with its environment through the perception of observations and the performance of actions. A very abstract and non-standard description of such an agent is in two parts. The first part, the *inference policy*, tells us what states the agent can be in, and how these states change when the agent receives new input from its environment. The second part, the *action policy*, tells us what action the agent chooses to perform on the environment, in each of its internal states.

There are two special choices for the inference policy, marking two extremes. One extreme is for the agent to remain absolutely oblivious to the information coming its way. The transition from a past internal state to a current one is made independent of the observation, and no entanglement is formed between the agent and the environment. A rock, for example, comes close to being this little alive.

This post focuses on the other extreme, where the agent updates perfectly for the new information.

Keeping track of all the information is easy, on paper. All the agent has to do is maintain the sequence of past observations, the *observable history* O_{1}, O_{2}, ..., O_{t}. As each new observation is perceived, it can simply be appended to the list. Everything the agent can possibly know about the world, anything it can possibly hope to use in choosing actions, is in the observable history - there's no clairvoyance.

But this is far from practical, for many related reasons. Extracting the useful information from the raw observations can be a demanding task. The number of observations to remember grows indefinitely with time, putting a strain on the resources of an agent attempting longevity. The number of possible agent states grows exponentially with time, making it difficult to even specify (let alone decide) what action to take in each one.

Clearly we need some sort of compression when producing the agent's memory state from the observable history. Two requirements for the compression process: one, as per the premise of this post, is that it preserves all information about the world; the other is that it can be computed sequentially - when computing M_{t} the agent only has access to the new observation O_{t} and the previous compression M_{t-1}. The explicit value of all previous observations is forever lost.

This is a good moment to introduce proper terminology. A function of the observable history is called a *statistic*. Intuitively, applying a function to the data can only decrease, never increase the amount of information we end up having about the world. This intuition is solid, as the Data Processing Inequality proves. If the function does not lose any information about the world, if looking at the agent's memory is enough and there's nothing more relevant in the observations themselves, then the memory state is a *sufficient statistic* of the observable history, for the world state. The things the agent does forget about past perceptions are not at all informative for the present. Ultimately, when nothing further can be forgotten this way, we are left with a *minimal sufficient statistic*.

If I tell you the observable history of the agent, what will you know about the world state? If you know the dynamics of the world and how observations are generated, you'll have the *Bayesian belief*, assigning to each world state the posterior distribution:

B_{t}(W_{t}) = Pr(W_{t}|O_{1},...,O_{t})

(where Pr stands for "probability"). Importantly, this can be computed sequentially from B_{t-1} and O_{t}, using Bayes' theorem. (The gory details of how to do this are below in a comment.)

Aha! So the Bayesian belief is an expression for *precisely everything* the agent can possibly know about the world. Why not have the agent's memory represent exactly that?

M_{t} = B_{t}

As it turns out, the Bayesian belief is indeed a minimal sufficient statistic of the observable history for the world state. For the agent, it is the truth, the whole truth, and nothing but the truth - and a methodical way to remember the truth, to boot.

Thus we've compressed into the agent's memory all and only the information from its past that is relevant for the present. We've discarded any information that is an artifact of the senses, and is not real. We've discarded any information that *used to be* real, but isn't anymore, because the world has since changed.

The observant reader will notice that we haven't discussed actions yet. We're getting there. The question of what information is relevant *for future actions* is deep enough to justify this meticulous exposition. For the moment, just note that keeping a sufficient statistic for the current world state is also sufficient for the *controllable future*, since the future is independent of the past given the present.

What we have established here is an "ultimate solution" for how a reinforcement-learning agent should maintain its memory state. It should update a Bayesian belief of what the current world state is. This inference policy is so powerful and natural, that standard reinforcement learning doesn't even make a distinction between the Bayesian belief and the agent's memory state, ignoring anything else we could imagine the latter to be.

**Continue reading:** Point-Based Value Iteration

If you're a devoted Bayesian, you probably know how to update on evidence, and even how to do so repeatedly on a sequence of observations. What you may not know is how to update in a changing world. Here's how:

As usual with Bayes' theorem, we only need to calculate the numerator for different values of , and the denominator will normalize them to sum to 1, as probabilities do. We know as part of the dynamics of the system, so we only need ). This can be calculated by introducing the other variables in the process:

An important thing to notice is that, given the observable history, the world state and the action are independent - the agent can't act on unseen information. We continue:

Recall that the agent's belief is a function of the observable history, and that the action only depends on the observable history through its memory . We conclude:

I'm not seeing how this lets the agent update itself. The formula requires knowledge of sigma, pi, and p. (BTW, could someone add to the comment help text instructions for embedding Latex?) pi is part of the agent but sigma and p are not. You say

But all the agent knows, as you've described it so far, is the sequence of observations. In fact, it's stretching it to say that we know sigma or p -- we have just given these names to them. sigma is a complete description of how the world state determines what the agent senses, and p is a complete description of how the agent's actions affect the world. As the designer of the agent, will you be explicitly providing it with that information in some future instalment?

Everything you say is essentially true.

Technically, we don't need to provide the agent with p and sigma explicitly. We use these parameters when we build the agent's memory update scheme, but the agent is not necessarily "aware" of the values of the parameters from inside the algorithm.

Let's take for example an autonomous rover on Mars. The gravity on Mars is known at the time of design, so the rover's software, and even hardware, is built to operate under these dynamics. The wind velocity at the time and place of landing, on the other hand, is unknown. The rover may need to take measurements to determine this parameter, and encode it in its memory, before it can take it into account in choosing further actions.

But if we are thoroughly Bayesian, then

somethingis known about the wind prior to experience. Is it likely to change every 5 minutes or can the rover wait longer before measuring again? What should be the operational range of the instruments? And so on. In this case we would include this prior in p, while the actual wind velocity is instead hidden in the world state (only to be observed occasionally and partially).Ultimately, we could include all of physics in our belief - there's always some Einstein to tell us that Newtonian physics is wrong. The problem is that a large belief space makes learning harder. This is why most humans struggle with intuitive understanding of relativity or quantum mechanics - our brains are not made to represent this part of the belief space.

This is also why reinforcement learning gives special treatment to the case where there are unknown but unchanging parameters of the world dynamics: the "unknown" part makes the belief space large enough to make special algorithms necessary, while the "unchanging" part makes these algorithms possible.

For LaTeX instructions, click "Show help" and then "More Help" (or go here).

The usual reasons for a lossless compressor not compressing maximally are that it is expensive and slow to compress - and for many applications you need rapid random access to the data.

It is perfectly legal under the bayes to learn nothing from your observations. Or learn in the wrong direction, or sideways, or whatever. All depending on prior and inductive bias. There is no unique "Bayesian belief". If you had the "right" prior, you would find that would have to do very little updating, because the right prior is already right.

Right, in degenerate cases, when there's nothing to be learned, the two extremes of learning nothing and everything coincide.

To the extent that I understand your navigational metaphor, I disagree with this statement. Would you kindly explain?

If you mean to say that there's no unique justifiable prior, I agree. The prior in our setting is basically what you assume you know about the dynamics of the system - see my reply to RichardKennaway.

However, given that prior and the agent's observations, there

isa unique Bayesian belief, the one I defined above. That's pretty much the whole point of Bayesianism, the existence of a subjectively objective probability.This is true in a constant world, or with regard to parts of the world which are constant. And mind you, it's true only with high probability: there's always the slight chance that the sky is not, after all, blue.

But in a changing world, where part of the change is revealed to you through new observations, you have to keep pace. The right prior was right yesterday, today there's new stuff to know.

In the case where your prior says "the past is not informative about the future". You learn nothing. A degenerate prior, not degenerate situation.

Imagine a bowl of jellybeans. you put in ten red and ten white. You take out 3, all of which are red, the probability of getting a red on the next draw is 7/17.

Take another boal, have a monkey toss in red beans and white beans with 50% probability. You draw 3 red, the draw probability is now 50% (becuase you had a maxentropy prior).

Take another boal. Beans were loaded in with unknown probabilitities. You draw 3 red, your draw probability is 4/5 red.

See how depening on your assumptions, you learn in different directions with the same observations? Hence you can learn in the wrong direction with a bad prior.

Learning sideways is a bit of metaphor-stretching, but if you like you can imagine observing 3 red beans proves the existence of god under some prior.

Yes yes. I was being pedantic because your post didn't talk about priors and inductive bias.

I thought of that. I didn't think enough. "very little" was the wrong phrasing. It's not that you do less updating, it's that your updates are on concrete things like "who took the cookies" instead of "does gravity go as the squre or the cube" because your prior already encodes correct physics. Very little updating on physics.

Allow me to suggest a simpler thought experiment, that hopefully captures the essence of yours, and shows why your interpretation (of the correct math) is incorrect.

There are 100 recording studios, each recording each day with probability 0.5. Everybody knows that.

There's a red light outside each studio to signal that a session is taking place that day, except for one rogue studio, where the signal is reversed, being off when there's a session and on when there isn't. Only persons B and C know that.

A, B and C are standing at the door of a studio, but only C knows that it's the rogue one. How do their beliefs that there's a session inside change by observing that the red light is on? A keeps the 50-50. B now thinks it's 99-1. Only C knows that there's no session.

So your interpretation, as I understand it, would be to say that A and B updated in the "wrong direction". But wait! I practically gave you the same prior information that C has - of course you agree with her! Let's rewrite the last paragraph:

A, B and C are standing at the door of a studio. For some obscure reason, C secretly believes that it's the rogue one. Wouldn't you now agree with B?

And now I can do the same for A, by not revealing to you, the reader, the significance of the red lights. My point is that as long as someone runs a Bayesian update, you can't call that the "wrong direction". Maybe they now believe in things that

youjudge less likely, based on the information thatyouhave, but that doesn't make you right and them wrong. Reality makes them right or wrong, unfortunately there's no one around who knows reality in any other way than through their subjective information-revealing observations.No matter what, someone is still updating in the wrong direction, even if we don't know who it is.

...

inverse, with whatever relativistic corrections you would know that I woudn't

I am not clear on that cube thing, actually.

What? is gravity not inverse quadratic?

Yes, the Newtonian force a mass exerts on another mass far away from the first one falls off as the square of the distance. It's the word "cube" that confuses me.

I think the quote was an image for a mental question, which could be rephrased as:

Is this power a 2 or a 3?

2, but his original statement was 2x3 :)

This doesn't establish when the information can actually be compressed to any sensible amount...

What you've got is probably complicated enough already, but (generalizing from one example) I think the way humans work is that the compressed memories aren't necessarily lost, they just get harder to access.

When it's necessary to make a major update, memories might get reexamined (and possibly remade, but perhaps not entirely so) and different features of the memories get high-lighted.

You might enjoy Crutchfield's epsilon machines, and Shalizi's CSSR algorithm for learning them:

http://masi.cscs.lsa.umich.edu/~crshalizi/notabene/computational-mechanics.html