**Followup to:** The Bayesian Agent

This post explains one interesting and influential algorithm for finding high-value action policies for a Bayesian agent, called Point-Based Value Iteration (original paper). Its main premise resembles a notion of internal availability.

A reinforcement-learning agent chooses its actions based on its internal memory state. The memory doesn't have to include an exact account of all past observations - it's sufficient for the agent to keep track of a belief of the current state of the world.

This mitigates the problem of having the size of the memory state grow indefinitely. The memory-state space is now the set of all distributions over the world state. Importantly, this space doesn't grow exponentially with time like the space of observable histories. Unfortunately, it still has volume exponential in the number of world states.
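The update that takes the agent from one belief state to the next is standard Bayesian filtering: propagate the old belief through the world dynamics, weight by the likelihood of the new observation, and renormalize. Here is a minimal sketch; the array names, shapes, and example numbers are my own illustration, not from the post:

```python
import numpy as np

def belief_update(b, a, o, p, sigma):
    """Bayes filter over world states.

    b     : belief over world states, shape (W,)
    p     : transition probabilities, p[a, w, w2] = p(w2 | w, a)
    sigma : observation probabilities, sigma[w2, o] = sigma(o | w2)
    """
    predicted = b @ p[a]                 # sum_w b(w) * p(w2 | w, a)
    unnormalized = predicted * sigma[:, o]
    return unnormalized / unnormalized.sum()

# Tiny 2-state example (made-up numbers): one action, two observations.
p = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
sigma = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
b = np.array([0.5, 0.5])
b_new = belief_update(b, a=0, o=1, p=p, sigma=sigma)
```

Note that the new belief depends only on the old belief, the action, and the observation, which is exactly why the belief can replace the full observable history as the agent's memory state.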

So we moved away from specifying the action to take after each observable history, in favor of specifying the action to take in each belief state. This was justified by the belief being a sufficient statistic for the world state, and motivated by the belief space being much smaller (for long processes) than the history space. But for the purpose of understanding this algorithm, we should go back to describing the action policy in terms of the action to take after each observable history, π(A_{t}|O_{1},...,O_{t}).

Now join us as we watch the dynamics of the process unfold. We're at time t, and the agent has reached Bayesian belief B_{t}, the distribution of the world state to the best of the agent's knowledge. What prospects does the agent still have for future rewards? What is the *value-to-go* of being in this belief state?

If we know what action the agent will take after each observable history (if and when it comes to pass), it turns out that the expected total of future rewards is linear in B_{t}. To see why this is so, consider a specific future (and present):

w_{t}, o_{t}, a_{t}, w_{t+1}, o_{t+1}, a_{t+1}, ..., w_{n}, o_{n}, a_{n}

The reward of this future is:

R(w_{t},a_{t}) + R(w_{t+1},a_{t+1}) + ... + R(w_{n},a_{n})

The probability of this future, given that the agent already knows o_{1},...,o_{t}, is:

B_{t}(w_{t})·π(a_{t}|o_{1},...,o_{t})·p(w_{t+1}|w_{t},a_{t})·σ(o_{t+1}|w_{t+1})·π(a_{t+1}|o_{1},...,o_{t+1})···

···p(w_{n}|w_{n-1},a_{n-1})·σ(o_{n}|w_{n})·π(a_{n}|o_{1},...,o_{n})

We arrive at this by starting with the agent's belief B_{t}, which summarizes the distribution of w_{t} given the observable history, and then multiplying in the probability of each new variable given the earlier variables that play a part in causing it.

Note how B_{t} linearly determines the probability for w_{t}, while the dynamics of the world and the agent determine the probability for the rest of the process. Now, to find the expected reward we sum, over all possible futures, the product of the future's reward and probability. This is all linear in the belief B_{t} at time t.

If the value of choosing a specific action policy is a linear function of B_{t}, the value of choosing the *best* action policy is the maximum of many linear functions of B_{t}. How many? If there are 2 possible observations, the observable history that will unfold in the next n-t steps can be any of 2^{n-t} possible "future histories". If there are 2 possible actions to choose for each one, there are 2^{2^{n-t}} possible policies. A scary number if the end is still some time ahead of us.

Of course, we're not interested in the value of just any belief, only of those that can actually be reached at time t. Since the belief B_{t} is determined by the observable history o_{1},...,o_{t}, there are at most 2^{t} such values. Less scary, but still not practical for long processes.

The secret of the algorithm is that it limits both of the figures above to a fixed, manageable number. Suppose that we have a list of 100 belief states which are the most interesting to us, perhaps because they are the most likely to be reached at time t. Suppose that we also have a list Π_{t} of 100 policies: for each belief on the first list, we have in the second list a policy that gets a good reward if indeed that belief is reached at time t. Now what is the value, at time t, of choosing the best action policy of these 100? It's the maximum of, not many, but |Π_{t}|=100 linear functions of B_{t}. Much more manageable.
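In the standard POMDP vocabulary, each of these linear functions is represented by an "alpha-vector": the value-to-go of policy i at belief b is the dot product α_{i}·b, and the value of choosing the best stored policy is just the maximum of those dot products. A sketch with made-up numbers (three stored policies instead of 100, two world states):

```python
import numpy as np

# Each row is an alpha-vector: the value-to-go of one stored policy,
# as a linear function of the belief (illustrative numbers).
alphas = np.array([[10.0, 2.0],
                   [4.0, 8.0],
                   [6.0, 6.0]])

def value(b, alphas):
    """Value of the best stored policy at belief b: max_i (alpha_i . b)."""
    return np.max(alphas @ b)

b = np.array([0.3, 0.7])
v = value(b, alphas)
# max(10*0.3 + 2*0.7, 4*0.3 + 8*0.7, 6*0.3 + 6*0.7) = max(4.4, 6.8, 6.0)
```

The resulting value function is piecewise-linear and convex in the belief, and evaluating it costs only one matrix-vector product, however long the process is.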

Ok, we're done with time t. What about time t-1? We are going to consider 100 belief states which can be reached at time t-1. For each such belief B_{t-1}, we are going to consider each possible action A_{t-1}. Then we are going to consider each possible observation O_{t} that can come next - we can compute its probability from B_{t-1} and the dynamics of the system. Given the observation, we will have a new belief B_{t}. And the previous paragraph tells us what to do in this case: choose one of the policies in Π_{t} which is best for B_{t}, to use for the rest of the process.

(Note that this new B_{t} doesn't have to be one of the 100 beliefs that were used to construct the 100 policies in Π_{t}.)

So this is how the recursive step goes:

- For each of 100 possible values for B_{t-1}:
  - For each possible A_{t-1}:
    - For each possible O_{t}:
      - We compute B_{t}
      - We choose for B_{t} the best policy in Π_{t}, and find its value (it's a linear function of B_{t})
    - We go back and ask: how good is it really to choose A_{t-1}? What is the expected value before knowing O_{t}?
  - We choose the best A_{t-1}
  - This gives us the best policy for B_{t-1}, under the restriction that we are only willing to use one of the policies in Π_{t} from time t onward
- After going over the 100 beliefs for time t-1, we now have a new stored list, Π_{t-1}, of 100 policies for time t-1.
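This backward step for a single belief point is known as a point-based backup. The sketch below follows the loop structure above; the array names and shapes are my own assumptions, and the tiny example model at the end is made up for illustration:

```python
import numpy as np

def backup(b, alphas_t, R, p, sigma):
    """One point-based backup at belief b.

    alphas_t : (K, W) array, the linear value functions of the policies in Pi_t
    R        : (A, W) array, R[a, w] = reward for action a in world state w
    p        : (A, W, W) array, p[a, w, w2] = p(w2 | w, a)
    sigma    : (W, O) array, sigma[w2, o] = sigma(o | w2)
    Returns the alpha-vector (and first action) of the best policy at b that
    continues with some policy from Pi_t after the next observation.
    """
    A, W = R.shape
    O = sigma.shape[1]
    best_val, best_alpha, best_action = -np.inf, None, None
    for a in range(A):                     # for each possible A_{t-1}
        alpha_a = R[a].copy()              # immediate reward term
        for o in range(O):                 # for each possible O_t
            # g[k, w] = sum_w2 p(w2|w,a) * sigma(o|w2) * alphas_t[k, w2]
            g = alphas_t @ (p[a] * sigma[:, o]).T
            # choose the continuation policy in Pi_t that is best for b
            k = np.argmax(g @ b)
            alpha_a += g[k]
        val = alpha_a @ b                  # expected value before knowing O_t
        if val > best_val:
            best_val, best_alpha, best_action = val, alpha_a, a
    return best_alpha, best_action

# Tiny illustration (made-up model): 2 states, 2 actions, 2 observations.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.5, 0.5]]])
sigma = np.array([[0.8, 0.2],
                  [0.2, 0.8]])
alphas_t = np.zeros((1, 2))        # a single all-zero continuation policy
b = np.array([0.9, 0.1])
alpha, action = backup(b, alphas_t, R, p, sigma)
```

Running this backup once for each of the 100 stored beliefs yields the new list Π_{t-1} of 100 alpha-vectors (policies) for time t-1.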

We continue backward in time, until we reach the starting time t_{0}, where we choose the best policy in Π_{t0} for the initial (prior) belief B_{t0}.

We still need to say how to choose those special 100 beliefs to optimize for. Actually, this is not a fixed set. Most point-based algorithms start with only one or a few beliefs, and gradually expand the set to find better and better solutions.

The original algorithm chooses beliefs in a similar way to an internal availability mechanism that may be responsible for conjuring up images of likely futures in human brains. It starts with only one belief (the prior B_{t0}) and finds a solution where only one policy is allowed in each Π_{t}. Then it adds some beliefs which seem likely to be reached in this solution, and finds a new and improved solution. It repeats this computation until a good-enough solution is reached, or until it runs out of time.
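The expansion step can be sketched as follows: from each stored belief, simulate one step of the process forward (sample a world state, an action, a successor, and an observation) and add the resulting posterior belief to the set. This is only the spirit of the mechanism; the original PBVI additionally keeps, for each belief, the sampled successor farthest from the current set, and all names here are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def expand(beliefs, p, sigma, n_actions):
    """Grow the belief set by simulating one step forward from each point.

    p     : (A, W, W) transition probabilities, p[a, w, w2] = p(w2 | w, a)
    sigma : (W, O) observation probabilities, sigma[w2, o] = sigma(o | w2)
    """
    new = list(beliefs)
    for b in beliefs:
        a = rng.integers(n_actions)                  # or the current policy's action
        w = rng.choice(len(b), p=b)                  # sample a world state
        w2 = rng.choice(len(b), p=p[a, w])           # sample a successor state
        o = rng.choice(sigma.shape[1], p=sigma[w2])  # sample an observation
        predicted = b @ p[a]                         # Bayes update to the new belief
        bn = predicted * sigma[:, o]
        new.append(bn / bn.sum())
    return new

# One expansion from the prior alone (made-up model numbers).
p = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
sigma = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
b0 = np.array([1.0, 0.0])
belief_set = expand([b0], p, sigma, n_actions=1)
```

Alternating expansion with the backward sweep of backups gives the anytime behavior described above: each round of expansion roughly doubles the belief set and lets the next sweep find a better solution.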

Point-based algorithms are today among the best for finding good policies for Bayesian agents. They drive some real-world applications, and contribute to our understanding of how intelligent agents can act and react.