Would readers be interested in a sequence of posts offering an intuitive explanation of my thesis in progress, on the application of information theory to reinforcement learning? Please also feel free to comment on the quality of my presentation.

In this first post I offer a high-level description of the Perception-Action Cycle as an intuitive explanation of reinforcement learning.



Imagine that the world is divided into two parts: one we shall call the agent and the rest - its environment. Imagine that the two interact in turns. One moment the agent receives information from its environment in the form of an observation. Then the next moment the agent sends out information to its environment in the form of an action. Then it makes another observation, then another action, and so on.

To break down the cycle, we start with the agent having a belief about the state of its environment. This is actually the technical term: the belief is the probability that the agent assigns, implicitly, to each possible state of the environment. The cycle then proceeds in 4 phases.

In the first phase, the agent makes an observation. Since the observation conveys information about the environment, the agent needs to update its belief, ideally using Bayes' theorem. The agent now has more information about the environment.
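To make the first phase concrete, here is a minimal sketch of a Bayesian belief update over a finite set of states. The two-state weather world, the sensor probabilities, and the name `bayes_update` are all assumptions chosen for illustration, not part of the formal setting.

```python
def bayes_update(belief, likelihood):
    """Return the posterior belief after one observation.

    belief:     {state: prior probability of that state}
    likelihood: {state: P(the observation actually made | state)}
    """
    # Bayes' theorem: posterior is proportional to prior times likelihood.
    unnormalized = {s: belief[s] * likelihood[s] for s in belief}
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / evidence for s, p in unnormalized.items()}


# Hypothetical example: the agent sees wet ground.
prior = {"rain": 0.3, "dry": 0.7}
wet_ground = {"rain": 0.9, "dry": 0.2}  # assumed sensor model
posterior = bayes_update(prior, wet_ground)
```

In this made-up example the probability of rain rises from 0.3 to roughly 0.66, which is the sense in which the observation narrows down the state of the environment.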

In the second phase, the agent uses this new information to update its plan. Note the crucial underlying principle that information about the environment is useful in making better plans. This gives a desired fusion between Bayesian updates and decision making.

In the third phase, the agent executes a step of its plan - a single action. This changes the environment. Some of the things that the agent knew about the previous state of the environment may no longer be true, and the agent is back to having less information.
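As a rough sketch of the third phase, the belief can be pushed through a model of what the action does to the environment. The two-state world, the noisy "push" action, and the function names below are hypothetical; the entropy of the belief is used here as one simple measure of how uncertain the agent is.

```python
import math


def entropy_bits(belief):
    """Shannon entropy of a belief, in bits."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def predict(belief, transition):
    """Belief over the next state, given transition[s][s2] = P(s2 | s, action)."""
    nxt = {}
    for s, p in belief.items():
        for s2, q in transition[s].items():
            nxt[s2] = nxt.get(s2, 0.0) + p * q
    return nxt


# The agent is fairly sure the object is on the left.
belief = {"left": 0.9, "right": 0.1}
# Assumed action model: a push moves the object to the other side 80% of the time.
push = {
    "left": {"left": 0.2, "right": 0.8},
    "right": {"left": 0.8, "right": 0.2},
}
after = predict(belief, push)

uncertainty_before = entropy_bits(belief)
uncertainty_after = entropy_bits(after)
```

In this particular example the belief after the action is more spread out than before it, so its entropy is higher: some of what the agent knew about the old state no longer pins down the new one.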

In the fourth phase, the agent makes a prediction about future observations. The importance of making a prediction before a scientific experiment is well understood by philosophers of science. But the importance of constantly making predictions of all of our sensory inputs as a functional part of our cognition, is only now dawning on neuroscientists and machine learning researchers.
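One simple way to picture the fourth phase, assuming the agent has a model of how states produce observations, is to average that model over the current belief. The sensor table and the name `predict_observation` below are illustrative assumptions.

```python
def predict_observation(belief, sensor):
    """Predicted distribution over the next observation.

    belief: {state: probability}
    sensor: sensor[s][o] = P(observation o | state s)
    """
    prediction = {}
    for s, p in belief.items():
        for o, q in sensor[s].items():
            prediction[o] = prediction.get(o, 0.0) + p * q
    return prediction


# Hypothetical belief and sensor model.
belief = {"rain": 0.3, "dry": 0.7}
sensor = {
    "rain": {"wet": 0.9, "not wet": 0.1},
    "dry": {"wet": 0.2, "not wet": 0.8},
}
prediction = predict_observation(belief, sensor)
```

The result is a probability for each possible observation, made before the observation arrives, so the next observation can be compared against it.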



The Perception-Action Cycle is an intuitive explanation of the technical setting of reinforcement learning. Reinforcement learning is a powerful model of machine learning, in which decision making, learning and evaluation occur simultaneously and somewhat implicitly while a learner interacts with its environment. This can be used to describe a wide variety of real-life scenarios, including biological and artificial agents. It is so general, in fact, that our work is still ahead of us if we want it to have any explanatory power, and solving it in the most general form is a computationally hard problem.

But the Perception-Action Cycle still offers symmetries to explore, analogies to physics to draw, practical learning algorithms to develop; all of which improve its Occam's razor prior score as a good model of intelligence. And to use it to actually explain things, we can narrow it down further. Not everything that it makes possible is equally probable. By applying information theory, a collection of statistical concepts, theorems and methods implied by strong Bayesianism, we can get a better picture of what intelligence is and isn't.

12 comments

I'm interested in this - my PhD and postdoc work has all been in motor control, which is of course very much tied up with perception and action. I'm less interested in motor control now and more interested in beliefs, but this analysis demonstrates that the two systems are very much intertwined. You need to have beliefs about the world, which come from perception, before you can generate a useful motor command, for example.

Only thing I'd take issue with is that linking this process solely to reinforcement learning is a little simplistic. Motor learning is a rich field in its own right and learning can (and does) proceed without the presence of a reinforcing stimulus.

"...the importance of constantly making predictions of all of our sensory inputs as a functional part of our cognition, is only now dawning on neuroscientists and machine learning researchers."

This sounds a lot like Emo Todorov's work.


P.S. Somebody please tell me how to blockquote and link. Can I use HTML here?

When commenting, click the "Show help" button at the bottom right of the text box to see the formatting commands. Block quotes are lines starting with >

It seems to me that actions can increase the amount of information the agent has as well as decrease it. For example, if I don't know where my laptop is and I ask my friend to go find it and leave it on my desk, then my amount of information (about where my laptop is) goes up as a result of an action (talking to my friend).

Excellent point. It will be a few posts (if the audience is interested) before I can answer you in a way that is both intuitive and fully convincing.

The technical answer is that the belief update caused by an action is deterministically contractive. It never increases the amount of information.
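One standard result that fits this description is the data-processing inequality: if two different beliefs are pushed through the same action model, the KL divergence between them cannot grow. That this is the intended measure is an assumption, since the answer above does not name one. The beliefs, the transition table, and the function names in this sketch are hypothetical.

```python
import math


def kl_bits(p, q):
    """KL divergence D(p || q) in bits; assumes q[s] > 0 wherever p[s] > 0."""
    return sum(p[s] * math.log2(p[s] / q[s]) for s in p if p[s] > 0)


def push_through(belief, transition):
    """Belief over the next state, given transition[s][s2] = P(s2 | s, action)."""
    nxt = {}
    for s, prob in belief.items():
        for s2, t in transition[s].items():
            nxt[s2] = nxt.get(s2, 0.0) + prob * t
    return nxt


informed = {"a": 0.9, "b": 0.1}    # a belief that knows something
ignorant = {"a": 0.5, "b": 0.5}    # a reference belief that knows nothing
action = {                         # assumed action model
    "a": {"a": 0.7, "b": 0.3},
    "b": {"a": 0.4, "b": 0.6},
}

gap_before = kl_bits(informed, ignorant)
gap_after = kl_bits(push_through(informed, action), push_through(ignorant, action))
```

Here `gap_after` is smaller than `gap_before`: the action brings the two beliefs closer together, so the informed belief is less distinguishable from the ignorant one than it was.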

A more intuitive answer (but perhaps not yet convincing) is that, proximally, your action of asking your friend did not change the location of your laptop, only your friend's mental state. And the effect it had on your friend's mental state is that you are now less sure of what it is. You took some of the things you knew about it ("she was probably going to keep sitting around") and made them no longer true.

Regarding information about the future (of the location of your laptop), it is always included in information about the present. Your own mental state is independent of the future given the present. Put another way, you can't update without more observations. In this case, the location of your laptop merely becomes entangled with information that you already had on your friend's mental state ("this is where she thinks my desk is").

(if the audience is interested)

I'm interested.

I'm interested too, mostly because I don't find royf's current explanation convincing.

What is 'amount of information'? If I do not know whether a coin is heads or tails, then I have 0 bits of information about the state of the coin; if I open my eyes and see it is heads, I have 1 bit. The information is in the narrowing of the possibilities. That is the conventional meaning. edit: though of course the information is not increased until the next perception - perhaps that is what you meant? edit: still, there is a counterexample - you can have an axially magnetized coin, and an electromagnet that can make the coin flip to heads up when it's powered. You initially don't know which way up the coin is, but if the action is to power the electromagnet, the coin will be heads up. (Of course the overall entropy of the world still went up, but mostly in the form of heat.) One could say that it doesn't increase knowledge of the environment, but decreases the entropy in environment.

You are expressing a number of misconceptions here. I may address some in future posts, but in short:

By information I mean the Shannon information (see also links in OP). Your example is correct.
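As a quick check of the coin example, here is a minimal sketch that computes Shannon entropy in bits. It assumes a fair coin before looking; the name `entropy_bits` and the dictionary representation of a belief are illustrative choices.

```python
import math


def entropy_bits(belief):
    """Shannon entropy, in bits, of a belief given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


before_looking = {"heads": 0.5, "tails": 0.5}  # assumed fair coin
after_looking = {"heads": 1.0, "tails": 0.0}   # eyes open, coin shows heads

uncertainty_before = entropy_bits(before_looking)
uncertainty_after = entropy_bits(after_looking)
information_gained = uncertainty_before - uncertainty_after
```

The uncertainty drops from 1 bit to 0 bits, so the observation carries 1 bit of information about the coin.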

By the action of powering the electromagnet you are not increasing your information on the state of the world. You are increasing your information on the state of the coin, but through making it dependent on the state of the electromagnet which you already knew. This point is clearly worth a future post.

There is no "entropy in environment". Entropy is subjective to the viewer.

I think it is mostly a matter of definitions. I am not familiar with your terminology. Also, if I have an atom that is in an unknown alignment, and I align it using a magnetic field, then take away the resulting heat, then the entropy (number of states) of that subsystem decreases, and this is used to attain extremely low temperatures. I am more familiar with the physical notion of entropy.

edit: Also, after powering the electromagnet, I know that the direction of the coin and the direction of the electromagnet relate in a particular way, which I did not know before. At the same time, I have physically restricted the number of states that the environment can be in - the coin cannot now be the other way around. It's in this sense that the entropy of the environment (as seen on a large scale) decreases. (And it is of course subjective that the number of states the environment can be in decreases. It does not decrease according to an agent that already knows which way up the coin is.)

To pick a trivial case: A blind person with acute hearing taps a cane on the floor in order to ascertain, from echoes, the relative positions of nearby objects.

The issue is that "action" and "observation" can be entangled; your description of observation makes it into a passive process, ignoring the role of activity in observation. "Step one of my plan: Figure out where the table is so I don't run into it." Which is to say, your pattern is overly rigid.

You might argue that the tapping of the cane is itself an observation, in which case you'd also have to treat walking into a room to see what's in it as an observation; the former removes no information, but the latter reduces your certainty of the positions of objects in the room you've just left, meaning either actions can generate information, or observations can reduce it. You could preserve the case that actions cannot generate information if you instead treat hearing the echoes as a secondary observation, but this still leaves you with the case that an action did not, in fact, eliminate any information.

I realize now that an example would be helpful, and yours is a good one.

Any process can be described on different levels. The trick is to find a level of description that is useful. We make an explicit effort to model actions and observation so as to separate the two directions of information flow between the agent and the environment. Actions are purely "active" (no information is received by the agent) while observations are purely "passive" (no information is sent by the agent). We do this because these two aspects of the process have very different properties, as I hope to make clear in future posts.

The process of "figuring out where the table is" involves information flowing in both directions, and so is neither an action nor an observation. Some researchers call such a thing "a subgoal". We should break it down further, for example we could have taps as actions and echoes as observations, as you suggest.

If you want to argue that no information is lost by tapping, then fine, I won't be pedantic and point out the tiny bits of information that do get lost. The point is that some information being lost is a representative feature of the process of taking an action. Over time, if you don't take in new information through observations, you will have less and less information about the world, even if some actions you take have a high probability of not losing any information.

This would be true regardless of whether you engaged in any action at all, however. The passing of time since your last verification of a piece of information is that by which information is lost.

I'm assuming this model is AI-related, so my responses are going to be in line with information modeling with that in mind. If this isn't accurate, let me know.

I would, indeed, suggest time since last verification as the mechanism in your model for information contraction, rather than action; assigning a prior probability that your information will remain accurate does a good job of completing the model. Imagine memorizing a room, closing your eyes, and firing a cannon into the room. Contemporaneous with your action, your information is still valid. Shortly thereafter, it ceases to be in a rather dramatic way. Importantly for your model, I think, this is so regardless of whether you fire the cannon, or another agent does. If it's a soundproof room, and you close the door with another agent inside, your information about the state of the room can contract quite violently through no action of your own.