Reinforcement Learning: A Non-Standard Introduction (Part 1)

29th Jul 2012


19 comments

I'm having a hard time understanding what the arrows from W-node to W-node and M-node to M-node represent in the chess example, given the premise that the world and memory states take turns changing.

If I understand correctly, W is the board state at the start of the player's turn, and M is the state of the memory containing the model of the board and possible moves/outcomes. W(t) is the state that precedes M(t), and likewise the action resulting from the completion of remodelling the memory at M(t), plus the opposing player's action, results in new world state W(t+1).

This interpretation seems to suggest a simple, linear, linked list of alternating W and M nodes instead of the idea that, for example, the W(t-1) node is the direct precursor to W(t). The reason being, it seems that one could generate W(t) simply from the memory model in M(t-1), regardless of what W(t-1) was.. and the same goes for M(t) and W(t-1).

Perhaps it's that the arrow from one W-node to another does not represent the causal/precursor relationship that a W-node to M-node arrow represents, but a different relationship? If so, what is that relationship? Sorry if this seems picky, but I do think that the model is causing some confusion as to whether I properly understand your point.

The arrows all mean the same thing, which is roughly 'causes'.

Chess is a perfect-information game, so you could build the board entirely from the player's memory of the board, but in general, the state of the world at time t-1, together with the player, causes the state of the world at time t.

Ah, so what we're really talking about here is situations where the world state keeps changing as the memory builds its model.. or even just a situation where the memory has an incomplete subset of the world information. Reading the second article's example, which makes the limitations of the memory explicit, I understand. I'd say the chess example is a bit misleading in this case, as the discrepancies between the memory and world are a big part of the discussion -- and as you said, chess is a perfect-information game.

I'm just wondering whether it's true that the Markov property holds for minds. I'm thinking that a snapshot of the world is not enough, but you also need to know something about the rate at which the world is changing. Presumably this information would require the knowledge of states further back.

Also, isn't there an innate element of randomness when it comes to decision making and how our minds work? Neurons are so small that presumably there are some sort of quantum effects, and wouldn't this again mean that information from one step previous isn't enough?

I don't know, but just some thoughts.

(Assuming Mind=Brain, i.e. the entire mind is just the physical brain and no "soul" is involved. Also, Neurons aren't really all that small, they're quite macroscopic -- though the processes in the neurons like chemical interactions need quantum mechanics for their description)

In Newtonian mechanics, it is sufficient to know the positions and velocities (i.e. derivatives of position) of particles to determine future states. So, the world is Markov given this information.

In Schrodinger's equation, you again only need to know \Psi and its time derivative to know all future states. I think the quantum properties of the brain are adequately described just with Schrodinger's equation. You do need to include nuclear forces etc. in a description of the brain. You *may* need quantum electrodynamics, but I think Schrodinger's equation is sufficient.

My physics education stopped before I got here, but Dirac's equation which may be necessary to model the brain *seems* to require the second time-derivative of the wavefunction -- so you may need the second order time-derivatives to make the model Markov. Can someone who knows a bit more quantum physics chime in here?

EDIT: Reading the wiki article more carefully, it seems Dirac's equation is also first order.

In the model there's the distribution p, which determines how the world is changing. In the chess example this would include: a) how the agent's action changes the state of the game + b) some distribution we assume (but which we may or may not actually know) about the opponent's action and the resulting state of the game. In a physics example, p should include the relevant laws of physics, together with constants which tell the rate (and manner) in which the world is changing. Any changing parameters should be part of the state.

It seems that you're saying that it may be difficult to know what p is. Then you are very much correct. You probably couldn't infer the laws of physics from the current wave function of the universe, or the rules of chess from the current state of the game. But at this point we're only assuming that such laws exist, not that we know how to learn them.

p and q are probability distributions, which is where we allow for randomness in the process. But note that randomness becomes a tricky concept if you go deep enough into physics.

As for the "quantum mind" theory, as far as I can tell it's fringe science at best. Personally, I'm very skeptical. Regardless, such a model can still have the Markov property, if you include the wave function in your state.

I never found a way to do it using LW's implementation of Markdown, but I *have* successfully used this Unicode strikethrough tool before (a̶n̶ ̶e̶x̶a̶m̶p̶l̶e̶).

If LW's markdown is like reddit's, double tilde before and after will strike through text. Let's see if that works

Edit: It doesn't. Does anyone know how I would go about fixing this?

Edit2: The issue tracker suggests it's been fixed, but it doesn't seem to be.

Imagine that the world is divided into two parts: one we shall call the agent and the rest - its environment. Imagine you could describe in full detail the state of both the agent and the environment. The state of the agent is denoted M: it could be a Mind if you're a philosopher, a Machine if you're researching machine learning, or a Monkey if you're a neuroscientist. Anyway, it's just the Memory of the agent. The state of the rest of the World (or just World, for short) is denoted W.

These states change over time. In general, when describing the dynamics of a system, we specify how each state is determined by the previous states. So we have probability distributions for the states W_{t} and M_{t} of the world and the agent at time t:

p(W_{t} | W_{t-1}, M_{t-1})

q(M_{t} | W_{t-1}, M_{t-1})

This gives us the probabilities that the world is currently in state W_{t}, and the agent in state M_{t}, given that they previously were in states W_{t-1} and M_{t-1}. This can be illustrated in the following Bayesian network (see also):

[Figure: Bayesian network in which W_{t-1} and M_{t-1} each have an arrow to both W_{t} and M_{t}.]

Bayesian networks look like they represent causation: that the current state is "caused" by the immediately previous state. But what they *really* represent is statistical *independence*: that the current joint state (W_{t}, M_{t}) depends *only* on the immediately previous joint state (W_{t-1}, M_{t-1}), and not on any earlier state. So the power of Bayesian networks is in what they *don't* show: in this case there's no arrow from, say, W_{t-2} to W_{t}.

The current joint state of the world and the agent represents everything we need to know in order to continue the dynamics forward. Given this state, the past is independent of the future. This property is so important that it has a name, borrowed from one of its earliest researchers, Markov.
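To make the general model concrete, here is a minimal Python sketch of these dynamics. Everything specific in it is an assumption made up for illustration: the two-state world ("sunny"/"rainy"), the two-state agent ("calm"/"alert"), and all the probabilities. The only point is structural: `step` receives nothing but the previous joint state, which is exactly the Markov property.

```python
import random

# Hypothetical toy state spaces; names and numbers are illustrative only.
WORLD_STATES = ["sunny", "rainy"]
AGENT_STATES = ["calm", "alert"]

def p(w_prev, m_prev):
    """Distribution over W_t given (W_{t-1}, M_{t-1}), as {state: probability}."""
    stay = 0.8 if m_prev == "calm" else 0.6
    other = "rainy" if w_prev == "sunny" else "sunny"
    return {w_prev: stay, other: 1.0 - stay}

def q(w_prev, m_prev):
    """Distribution over M_t given (W_{t-1}, M_{t-1}), as {state: probability}."""
    if w_prev == "rainy":
        p_alert = 0.9
    else:
        p_alert = 0.5 if m_prev == "alert" else 0.2
    return {"alert": p_alert, "calm": 1.0 - p_alert}

def sample(dist, rng):
    """Draw one state from a {state: probability} dictionary."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states])[0]

def step(w_prev, m_prev, rng):
    """One step of the general model: both new states are drawn using
    only the previous joint state, never anything earlier."""
    return sample(p(w_prev, m_prev), rng), sample(q(w_prev, m_prev), rng)

def simulate(w0, m0, n_steps, seed=0):
    """Return the trajectory [(W_0, M_0), ..., (W_n, M_n)]."""
    rng = random.Random(seed)
    trajectory = [(w0, m0)]
    for _ in range(n_steps):
        trajectory.append(step(*trajectory[-1], rng))
    return trajectory

for w, m in simulate("sunny", "calm", 5):
    print(w, m)
```

Note that `step` draws W_{t} and M_{t} separately: in this network there is no arrow between the two current states, so they are independent of each other once the previous joint state is given.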

The Markov property is not enough for our purposes. We are going to make a further assumption, which is that the states of the world and the agent don't both change together. Rather, they take turns changing, and while one does the other remains the same. This gives us the dynamics:

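p(W_{t} | W_{t-1}, M_{t-1})

q(M_{t} | M_{t-1}, W_{t})

and the Bayesian network:

[Figure: Bayesian network in which W_{t-1} and M_{t-1} have arrows to W_{t}, and M_{t-1} and W_{t} have arrows to M_{t}.]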
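The turn-taking can be sketched in Python as follows. The thermostat-like world ("warm"/"cold"), the agent states ("heater_on"/"heater_off") and all probabilities are hypothetical, chosen only to have something to sample from. What matters is the order of the two draws: the world changes while the agent is held fixed at M_{t-1}, and then the agent changes while the world is held fixed at the new W_{t}.

```python
import random

def p(w_prev, m_prev):
    """World's turn: distribution over W_t given (W_{t-1}, M_{t-1})."""
    if m_prev == "heater_on":
        return {"warm": 0.9, "cold": 0.1}
    if w_prev == "warm":
        return {"warm": 0.3, "cold": 0.7}
    return {"warm": 0.05, "cold": 0.95}

def q(m_prev, w_now):
    """Agent's turn: distribution over M_t given (M_{t-1}, W_t).
    Note that it conditions on the *new* world state W_t."""
    if w_now == "cold":
        return {"heater_on": 0.95, "heater_off": 0.05}
    if m_prev == "heater_on":
        return {"heater_on": 0.2, "heater_off": 0.8}
    return {"heater_on": 0.02, "heater_off": 0.98}

def sample(dist, rng):
    """Draw one state from a {state: probability} dictionary."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states])[0]

def turn_based_step(w_prev, m_prev, rng):
    """One full time step: first the world moves, then the agent."""
    w_now = sample(p(w_prev, m_prev), rng)  # agent frozen at m_prev
    m_now = sample(q(m_prev, w_now), rng)   # world frozen at w_now
    return w_now, m_now

def joint_transition_prob(w_prev, m_prev, w_now, m_now):
    """Probability of moving from (w_prev, m_prev) to (w_now, m_now):
    the product of the two turns."""
    return p(w_prev, m_prev)[w_now] * q(m_prev, w_now)[m_now]

rng = random.Random(0)
state = ("cold", "heater_off")
for _ in range(5):
    state = turn_based_step(*state, rng)
    print(state)
```

The joint process is still Markov: `joint_transition_prob` depends only on the previous and current joint states, and for any previous state it sums to 1 over all current states.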

Sometimes this assumption can be readily justified. For example, let's use this model to describe a chess player.

Suppose that at time t the game has reached state W_{t} where it is our agent's turn to play. Our agent has also reached a decision of what to do next, and its mind is now in state M_{t}, including memory, plan, general knowledge of chess, and all.

Our agent takes its turn, and then enters stasis: we are going to assume that it's not thinking off-turn. This is true of most existing artificial chess players, and, disregarding time constraints, their play is not worse off for it. They are not missing out on anything other than time to think. So the agent keeps its state until the opponent has taken its turn. This completes the change of the state of the game from W_{t} to W_{t+1}.

Now the agent takes a look at the board, and starts thinking up a new strategy to counter the last move of the opponent. It reaches a decision, and commits to its next action. This completes the change of the agent's state from M_{t} to M_{t+1}.

Chess is a turn-based game. But even in other scenarios, when such division of the dynamics into turns is not a good approximation of the process, our assumption can still be justified. If the length of each time step is taken to be smaller and smaller, the state of each of the parties remains more and more the same during each step, with increasing probability and accuracy. In the limit where we describe a continuous change of state over time, the turn-based assumption disappears, and we are back to the general model.
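A rough numerical illustration of this limit, under assumptions that are ours and not part of the model: suppose that within a step of length dt the world changes with probability rate_w * dt and the agent with probability rate_m * dt, independently of each other. Then the chance that both change within the same step shrinks like dt squared, and so does the expected number of such "simultaneous" steps over a fixed stretch of time, even though the total amount of change stays roughly the same.

```python
def simultaneous_change_stats(rate_w, rate_m, horizon, dt):
    """Assumed toy model: per step of length dt, the world changes with
    probability rate_w * dt and the agent with probability rate_m * dt,
    independently. Returns (probability both change in one step,
    expected number of steps where both change,
    expected number of steps where at least one changes)."""
    p_w = rate_w * dt
    p_m = rate_m * dt
    if not (0.0 <= p_w <= 1.0 and 0.0 <= p_m <= 1.0):
        raise ValueError("dt is too large for these rates")
    n_steps = round(horizon / dt)
    p_both = p_w * p_m
    expected_both = n_steps * p_both
    expected_any = n_steps * (p_w + p_m - p_both)
    return p_both, expected_both, expected_any

for dt in (0.1, 0.01, 0.001):
    p_both, expected_both, expected_any = simultaneous_change_stats(1.0, 2.0, 10.0, dt)
    print(f"dt={dt}: P(both change in a step)={p_both:.6f}, "
          f"expected simultaneous steps={expected_both:.3f}, "
          f"expected steps with any change={expected_any:.3f}")
```

As dt shrinks, steps in which both parties change become vanishingly rare, so treating each small step as belonging to only one of them loses less and less.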

This is the first part of an intuitive and highly non-standard introduction to reinforcement learning. The sense of the term used here is more typical of what neuroscientists mean when they use it. As we move forward, we will get closer to its meaning in machine learning (but not too close).

In following posts we will continue to assume the Markov property in its turn-based variant. We will describe the model in further detail and explore its decision-making aspect.

Continue reading: Part 2