Followup to: The Bayesian Agent

This post explains one interesting and influential algorithm for choosing high-utility actions for a Bayesian agent: Point-Based Value Iteration (original paper). Its main premise resembles a notion of internal availability.

A reinforcement-learning agent chooses its actions based on its internal memory state. The memory doesn't have to include an exact account of all past observations - it's sufficient for the agent to keep track of a belief about the current state of the world.

This mitigates the problem of having the size of the memory state grow indefinitely. The memory-state space is now the set of all distributions over the world state. Importantly, this space doesn't grow exponentially with time like the space of observable histories. Unfortunately, it is still a continuous space whose dimension grows with the number of world states, so it remains far too large to sweep over exhaustively.
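
To make this concrete, here is a minimal sketch of the belief as a data structure; the world states and numbers are made up for illustration:

```python
import numpy as np

# A belief is just a probability distribution over world states: a nonnegative
# vector with one entry per state, summing to 1. (The states here are made up.)
world_states = ["sunny", "rainy", "foggy"]
belief = np.array([0.5, 0.3, 0.2])
assert np.isclose(belief.sum(), 1.0)

# However long the process runs, the memory stays this size: a single point
# in the probability simplex over the world states.
```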

So we moved away from specifying the action to take after each observable history, in favor of specifying the action to take in each belief state. This was justified by the belief being a sufficient statistic for the world state, and motivated by the belief space being much smaller (for long processes) than the history space. But for the purpose of understanding this algorithm, we should go back to describing the action policy in terms of the action to take after each observable history, π(A_t | O_1,...,O_t).

Now join us as we're watching the dynamics of the process unfold. We're at time t, and the agent has reached the Bayesian belief B_t, the distribution of the world state to the best of the agent's knowledge. What prospects does the agent still have for future rewards? What is its value-to-go in this belief state?

If we know what action the agent will take after each observable history (if and when it comes to pass), it turns out that the expected total of future rewards is linear in B_t. To see why this is so, consider a specific future (and present):

w_t, o_t, a_t, w_{t+1}, o_{t+1}, a_{t+1}, ..., w_n, o_n, a_n

The reward of this future is:

R(w_t, a_t) + R(w_{t+1}, a_{t+1}) + ... + R(w_n, a_n)

The probability of this future, given that the agent already knows o_1,...,o_t, is:

B_t(w_t) · π(a_t | o_1,...,o_t) · p(w_{t+1} | w_t, a_t) · σ(o_{t+1} | w_{t+1}) · π(a_{t+1} | o_1,...,o_{t+1}) ··· p(w_n | w_{n-1}, a_{n-1}) · σ(o_n | w_n) · π(a_n | o_1,...,o_n)

We arrive at this by starting with the agent's belief B_t, which summarizes the distribution of w_t given the observable history, and then multiplying in the probability of each new variable, conditioned on the earlier variables that play a part in causing it.

Note how B_t linearly determines the probability for w_t, while the dynamics of the world and the agent determine the probability for the rest of the process. Now, to find the expected reward we sum, over all possible futures, the product of the future's reward and probability. This is all linear in the belief B_t at time t.
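
In the POMDP literature this linear function is usually written out explicitly as an "alpha vector": one expected-future-reward number per world state, so that the value at belief B_t is just a dot product. Here is a rough sketch of that computation for one fixed policy, written as a small policy tree; the array names (T, Sigma, R) and the tree encoding are my own choices for illustration, not anything from the paper:

```python
import numpy as np

# A fixed policy for the rest of the process, written as a policy tree:
# take `action`, then after observing o continue with children[o].
# At the final step, children is None and only the immediate reward counts.
# Conventions (mine, for illustration):
#   T[a][w, w2]  = p(w2 | w, a)      -- world dynamics
#   Sigma[w2, o] = sigma(o | w2)     -- observation probabilities
#   R[w, a]      = reward for taking action a in world state w
def alpha_vector(tree, T, Sigma, R):
    action, children = tree
    alpha = R[:, action].astype(float).copy()   # immediate reward in each world state
    if children is not None:
        for o, child in enumerate(children):
            child_alpha = alpha_vector(child, T, Sigma, R)
            # expected future reward from each current state w:
            # sum over w2 of p(w2 | w, action) * sigma(o | w2) * child_alpha[w2]
            alpha += T[action] @ (Sigma[:, o] * child_alpha)
    return alpha

# The expected total of future rewards at belief B_t is then linear in B_t:
#   value = belief @ alpha_vector(tree, T, Sigma, R)
```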

If the value of choosing a specific action policy is a linear function of B_t, the value of choosing the best action policy is the maximum of many linear functions of B_t. How many? If there are 2 possible observations, the observable history that will have unfolded in the next n-t steps can be any of 2^(n-t) possible "future histories". If there are 2 possible actions to choose for each one, there are 2^(2^(n-t)) possible policies. A scary number if the end is still some time ahead of us.

Of course, we're not interested in the value of just any belief, only of those that can actually be reached at time t. Since the belief B_t is determined by the observable history o_1,...,o_t, there are at most 2^t such values. Less scary, but still not practical for long processes.

The secret of the algorithm is that it limits both of the figures above to a fixed, manageable number. Suppose that we have a list of 100 belief states which are the most interesting to us, perhaps because they are the most likely to be reached at time t. Suppose that we also have a list Π_t of 100 policies: for each belief on the first list, we have in the second list a policy that gets a good reward if indeed that belief is reached at time t. Now what is the value, at time t, of choosing the best action policy of these 100? It's the maximum of, not many, but |Π_t| = 100 linear functions of B_t. Much more manageable.
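
With the alpha-vector representation from the sketch above, "the maximum of |Π_t| linear functions" is a couple of lines of code (again an illustrative sketch, with names of my own choosing):

```python
import numpy as np

# `alphas` holds one alpha vector per policy in Pi_t.
# The best value achievable at belief B, restricted to these policies,
# is the maximum of |Pi_t| dot products; we also note which policy wins.
def restricted_value(B, alphas):
    values = [float(B @ alpha) for alpha in alphas]
    best = int(np.argmax(values))
    return values[best], best
```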

Ok, we're done with time t. What about time t-1? We are going to consider 100 belief states which can be reached at time t-1. For each such belief B_{t-1}, we are going to consider each possible action A_{t-1}. Then we are going to consider each possible observation O_t that can come next - we can compute its probability from B_{t-1} and the dynamics of the system. Given the observation, we will have a new belief B_t. And the previous paragraph tells us what to do in this case: choose the policy in Π_t which is best for B_t, and use it for the rest of the process.

(Note that this new B_t doesn't have to be one of the 100 beliefs that were used to construct the 100 policies in Π_t.)
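
Computing the new belief B_t from B_{t-1}, the action and the observation is the usual Bayesian update. A sketch, using the same made-up array conventions as before:

```python
import numpy as np

# Bayesian belief update: from belief B, take action a, then observe o.
# The new belief is proportional to sigma(o | w2) * sum_w p(w2 | w, a) * B(w).
# The normalizing constant is exactly the probability of observing o,
# which the recursive step below also needs.
def belief_update(B, a, o, T, Sigma):
    unnormalized = Sigma[:, o] * (B @ T[a])
    prob_o = float(unnormalized.sum())
    if prob_o == 0.0:
        return None, 0.0   # this observation cannot occur from belief B
    return unnormalized / prob_o, prob_o
```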

So this is how the recursive step goes (a code sketch follows the list):

  • For each of the 100 possible values for B_{t-1}
    • For each possible A_{t-1}
      • For each possible O_t
        • We compute B_t
        • We choose for B_t the best policy in Π_t, and find its value (it's a linear function of B_t)
      • We go back and ask: how good is it really to choose A_{t-1}? What is the expected value before knowing O_t?
    • We choose the best A_{t-1}
    • This gives us the best policy for B_{t-1}, under the restriction that we are only willing to use one of the policies in Π_t from time t onward
  • After going over the 100 beliefs for time t-1, we now have a new stored list, Π_{t-1}, of 100 policies for time t-1.
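
Here is the whole recursive step as a code sketch, in the same made-up conventions as above. This is my own compressed rendering of the idea, not the paper's pseudocode; in particular it folds the belief update and the value lookup into alpha-vector arithmetic:

```python
import numpy as np

# One backward step: from the alpha vectors of Pi_t and the ~100 beliefs chosen
# for time t-1, build the alpha vectors of Pi_{t-1}.
#   T[a][w, w2] = p(w2 | w, a),  Sigma[w2, o] = sigma(o | w2),  R[w, a] = reward.
def recursive_step(beliefs, alphas_t, T, Sigma, R):
    num_states, num_actions = R.shape
    num_observations = Sigma.shape[1]
    new_alphas = []
    for B in beliefs:                                # each of the 100 values for B_{t-1}
        best_alpha, best_value = None, -np.inf
        for a in range(num_actions):                 # each possible A_{t-1}
            alpha_a = R[:, a].astype(float).copy()
            for o in range(num_observations):        # each possible O_t
                # The (unnormalized) new belief B_t after taking a and seeing o;
                # normalizing doesn't change which policy in Pi_t is best for it.
                predicted = Sigma[:, o] * (B @ T[a])
                best_next = max(alphas_t, key=lambda alpha: float(predicted @ alpha))
                # Fold that policy's (linear) value-to-go back into alpha_a.
                alpha_a += T[a] @ (Sigma[:, o] * best_next)
            value_a = float(B @ alpha_a)             # expected value of a, before knowing O_t
            if value_a > best_value:                 # keep the best A_{t-1}
                best_alpha, best_value = alpha_a, value_a
        new_alphas.append(best_alpha)                # best restricted policy for this B_{t-1}
    return new_alphas                                # the alpha vectors of Pi_{t-1}
```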

We continue backward in time, until we reach the starting time t_0, where we choose the best policy in Π_{t_0} for the initial (prior) belief B_{t_0}.

We still need to say how to choose those special 100 beliefs to optimize for. Actually, this is not a fixed set. Most point-based algorithms start with only one or a few beliefs, and gradually expand the set to find better and better solutions.

The original algorithm chooses beliefs in a way similar to an internal-availability mechanism that may be responsible for conjuring up images of likely futures in human brains. It starts with only one belief (the prior B_{t_0}) and finds a solution where only one policy is allowed in each Π_t. Then it adds some beliefs which seem likely to be reached in this solution, and finds a new and improved solution. It repeats this computation until a good-enough solution is reached, or until it runs out of time.
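
A rough sketch of one way the belief set can grow, loosely following the paper's expansion heuristic (simulate one step forward from each belief already in the set, and keep the successor that lies farthest from the set). The details here are simplified and the names are mine:

```python
import numpy as np

# Grow the belief set by forward simulation: from each belief we already have,
# try each action, sample one observation, compute the resulting belief, and
# keep the candidate farthest (in L1 distance) from the beliefs collected so far.
def expand_beliefs(beliefs, T, Sigma, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    num_states = T[0].shape[0]
    num_observations = Sigma.shape[1]
    expanded = list(beliefs)
    for B in beliefs:
        candidates = []
        for a in range(len(T)):
            # simulate one step: sample a world state, its successor, an observation
            w = rng.choice(num_states, p=B)
            w2 = rng.choice(num_states, p=T[a][w])
            o = rng.choice(num_observations, p=Sigma[w2])
            unnormalized = Sigma[:, o] * (B @ T[a])
            if unnormalized.sum() > 0:
                candidates.append(unnormalized / unnormalized.sum())
        if candidates:
            distance = lambda c: min(float(np.abs(c - b).sum()) for b in expanded)
            expanded.append(max(candidates, key=distance))
    return expanded
```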

Point-based algorithms are today some of the best ones for finding good policies for Bayesian agents. They drive some real-world applications, and contribute to our understanding of how intelligent agents can act and react.
