This post is an informal explainer of our paper, which can be found on arXiv. This work was funded by the Advanced Research + Invention Agency (ARIA) Safeguarded AI Programme through project code MSAI-SE01-P005.
Introduction
There is an intuition that a powerful agent might have to contain some kind of world model as part of its structure in order to achieve its goals[1]. Part of this intuition comes from the fact that the actions taken by a successful agent often themselves contain information about the world. Since the actions taken by the agent are somehow generated by its internal structure, the information that is present in the actions must somehow correspond to something happening inside the agent. In this way, external, behavioural actions can give us some insight into the internal structure of an agent. We are interested in quantifying the information contained in the actions of an agent since it might provide a way to bridge the gap between observed behaviour and internal structure.
One way to operationalize this question is to ask: 'if I lack knowledge about the environment but can observe an optimal agent's actions, how much information do I gain about the environment?'. Imagine a mouse who has learned that yellow boxes always contain cheese and that blue boxes are always empty. If we observe that the mouse always chews through a yellow box and ignores a blue box, the mouse's actions contain information about the true state of the world. If you didn't know what colour boxes contained cheese, but you did know that the mouse always successfully finds the cheese, observing the mouse's actions would give you information about the world (ie. the location of the cheese).
In this sense, we can say that the actions of an optimal agent (the mouse) contain information about the world (the location of the cheese). From this claim, we might want to ask 'for what kind of goals is this true?' and 'how much information do the actions of the mouse contain about the world?'.
This post contains an informal presentation of some results contained in our paper which gives some answers to these questions. In particular, we consider a world characterised by a Controlled Markov Process and a goal characterised by a reward function over states. We show that the policies of optimal reward-maximizing agents have mutual information with their environments equal to $n \log_2 m$ bits, where $n$ is the number of states in the environment and $m$ is the number of actions the agent can take. This is true for all reward functions (except trivial constant ones) and all common methods of reward aggregation (including time-discounted, time-averaged and finite-time). Here, we have tried to present the ideas informally and intuitively. Readers who enjoy finicky details are encouraged to read the paper instead.
Background
This kind of result has previously been explored in the Touchette and Lloyd theorem which quantified the difference between 'open loop' and 'closed loop' controllers in terms of mutual information between a controller's action and the environmental state. The result was suggestive, but only considered a simplified, single-timestep setup with entropy reduction as its goal.
The paper 'General Agents Need World Models' by Richens et al. considered an agent in a similar setup to ours. The authors show that if an agent is general enough to be able to succeed at a wide variety of goals then one can extract approximations to each of the individual transition probabilities of the environment from the agent's actions. As the number of goals the agent is able to achieve grows, the approximation becomes more exact. The authors require that the 'wide variety of goals' for which the agent is successful includes a set of goals which involve jumping between each pair of states in the environment a large number of times. See this footnote[2] for more details.
Our result considers a class of goals which we think is more conventional in AI theory than either entropy reduction or the goals considered in the 'General Agents' paper. Namely, we consider the goals which can be expressed as reward maximization for a reward function over states.
Setup
We consider an environment characterised by a Controlled Markov Process (CMP) with $n$ states and $m$ actions. The set of states is labelled $S = \{s_1, \dots, s_n\}$ and the set of actions is labelled $A = \{a_1, \dots, a_m\}$. The 'world' is fully characterised by the set of transition probabilities $P(s_j \mid s_i, a_k)$, which give the probability of transitioning to state $s_j$ when action $a_k$ is taken in state $s_i$. There is one such transition probability for every combination of $s_i$, $a_k$ and $s_j$. Since there are $m$ possible actions and $n$ possible initial states, the environment lives in a space characterized by the product of $nm$ simplexes, each with dimension $n - 1$. We call this space $\mathcal{E}$.
We assume that the policy taken by an agent in this environment is Markovian and deterministic. We also assume that the environment is fully observable, meaning that the policy takes in the full environmental state as its input (as opposed to a limited-information 'observation'). This means that the policy can be represented by a function $\pi : S \to A$ which maps states to actions. Note that there are $m^n$ possible functions of this form. When we place a policy in a world, it will take actions depending on what state it is in at each timestep. This results in the world moving from state to state like a stationary Markov chain.
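To make this setup concrete, here is a minimal NumPy sketch (ours, not code from the paper): we sample an environment from $\mathcal{E}$ by drawing one next-state distribution per (state, action) pair, fix a deterministic policy, and read off the Markov chain it induces.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3  # n states, m actions, matching the notation in the post

# A random environment: for each (state, action) pair, a distribution over
# next states, i.e. one point in each of the n*m simplexes making up E.
P = rng.dirichlet(np.ones(n), size=(n, m))  # P[s, a] sums to 1 over next states

# A deterministic Markovian policy: one action chosen per state.
pi = np.array([2, 0, 1, 2])

# The induced Markov chain: row s of T is the next-state distribution P[s, pi[s]].
T = P[np.arange(n), pi]
```

The policy only enters through which row of `P[s]` it selects, which is exactly the 'policy influences outcomes only through the transition matrix' idea used later.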
If we start with full ignorance of the environment, we can represent this subjectively using a uniform prior over all of $\mathcal{E}$. If we learn that a particular policy is optimal for a particular task, this causes us to update our beliefs about what the environment is. Obviously, how our beliefs change depends on the kind of goal for which the policy is optimal. If a policy is optimal for a trivial goal (such as 'go to any state') then learning the policy will not reveal any information about the environment since any policy will be successful for that goal. Alternatively, if the policy is optimal for a very detailed goal (such as the goals considered in the General Agents paper) then knowing the policy causes our subjective distribution over possible environments to narrow, representing the fact that we have learned information about the environment.
We can quantify the amount of information gained about the environment using the differential entropy formulation of mutual information. If we use $E$ to denote the random variable corresponding to the environment and $\Pi$ to denote the random variable corresponding to the optimal policy for some task, the mutual information between them is

$$I(E; \Pi) = h(E) - h(E \mid \Pi).$$
This is the expected change in differential entropy of our subjective distribution over possible environments[3]. This quantifies the number of bits of information which we have obtained about the environment. Since the environment is a list of real numbers, it is fully specified by an infinite number of bits. So any finite mutual information represents partial information about the environment.
Proof Sketch
We have structured our result around one lemma and four theorems. We'll briefly go through them here, leaving the proofs for the paper. The first lemma (Lemma 3.1 in the paper) is a straightforward information-theoretic result. We'll quote it (almost) directly from the paper and then explain it.
Lemma 3.1 Assume a goal represented by the maximization of a value function $V^{\pi}(E)$. We say that a policy $\pi^*$ is optimal for this value function in environment $E$ if $V^{\pi^*}(E) \geq V^{\pi}(E)$ for all policies $\pi$. Let $\mathcal{E}_{\pi} = \{E \in \mathcal{E} : \pi \text{ is the unique optimal policy in } E\}$. If:

1. $\{\mathcal{E}_{\pi}\}$ is a partition of $\mathcal{E}$ for almost every environment - i.e. the set of environments belonging to no $\mathcal{E}_{\pi}$ has measure zero. In other words, there is exactly one optimal policy for almost all environments.

2. $\mathcal{E}_{\pi}$ and $\mathcal{E}_{\pi'}$ have the same volume for all $\pi, \pi'$.

Then:

$$h(E \mid \Pi) = h(E) - n \log_2 m,$$

i.e. knowing which policy is optimal reduces the differential entropy of the environment from $h(E)$ to $h(E) - n \log_2 m$. The mutual information between the environment and the optimal policy is therefore:

$$I(E; \Pi) = n \log_2 m.$$
This sounds a bit complicated but it is really just expressing a straightforward property of entropy. If you start with a uniform distribution over a space and partition this space up into $K$ equal-sized chunks, learning which of the chunks the true value is in gives you $\log_2 K$ bits of information. So here we are saying that, if you split up the space of possible environments into $m^n$ equal-sized chunks (one for each policy), you get $\log_2(m^n) = n \log_2 m$ bits of information. If we can prove that each policy is uniquely optimal (for a particular task) in an equal-sized chunk of the space of possible environments, then knowing the optimal policy will give us $n \log_2 m$ bits of information. The one subtlety in this lemma is that we don't actually require that every possible environment has a unique optimal policy. We only require that all environments, except a measure zero subset, have a unique optimal policy. There in fact exist environments where multiple policies are optimal for a given task (for example, the environment where all actions result in the same outcome). But provided that these environments are measure zero[4] in the space of possible environments, the result that the mutual information equals $n \log_2 m$ still holds.
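The counting behind this can be checked in a couple of lines (a sketch of ours, not code from the paper):

```python
import math

# Learning which of K equal-sized chunks contains the truth gives log2(K) bits.
K = 8
assert math.log2(K) == 3.0

# Here there is one chunk per deterministic policy, so K = m**n and the
# information gained is n * log2(m) bits.
n, m = 5, 4                    # 5 states, 4 actions
num_policies = m ** n          # 1024 deterministic policies
assert math.log2(num_policies) == n * math.log2(m) == 10.0
```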
In order to apply Lemma 3.1 to a particular goal, we need to prove that the goal has a unique optimal policy in all but a measure zero set of environments, and we need to prove that the 'chunks' for which each possible policy is optimal have equal volume. This last requirement is addressed by Theorem 3.2.
Theorem 3.2 Let $V^{\pi}(E)$ be a value function which is only influenced by the policy through its effect on the transition matrix $T_{\pi}$. Let $\mathcal{E}_{\pi}$ be the subset of environments for which policy $\pi$ is the unique optimal policy for maximising this value function. Then:

$$\text{vol}(\mathcal{E}_{\pi}) = \text{vol}(\mathcal{E}_{\pi'}) \text{ for all } \pi, \pi',$$

ie. all policies are uniquely optimal in an equal volume of environments.
This theorem identifies a broad class of tasks (as encoded by value functions) for which each possible policy is optimal in an equal volume of environments, which is one of the requirements we needed in order to apply Lemma 3.1. In particular, we show that this is true for any value function 'which is only influenced by the policy through its effect on the transition matrix $T_{\pi}$'. What does this mean?
If you have a deterministic policy acting in an environment characterized by a Controlled Markov Process then the whole system behaves as a Markov chain, ie. the probability distribution over states at timestep $t+1$ depends only on the system state at timestep $t$. The standard way to represent the transition probabilities in such a Markov chain is using a transition matrix $T_{\pi}$ whose entries $(T_{\pi})_{ij}$ correspond to the probability of transitioning from state $s_i$ to state $s_j$. By taking different actions when in different states, the choice of policy can affect the transition matrix. If this is the only way in which the policy can affect the outcome of the value function, we say that the value function 'is only influenced by the policy through its effect on the transition matrix $T_{\pi}$' and Theorem 3.2 applies. This is not true for all goals. If I give an agent the goal 'visit state $s_1$ and take action $a_1$, then visit state $s_2$ and take action $a_2$', this task depends on the actions taken and cannot be expressed solely in terms of the elements of the transition matrix. As a result, we cannot apply Theorem 3.2 to this goal.
But it turns out that most reward maximization goals are of this type, so we can apply Theorem 3.2 to them. Before going on to our 'headline' theorems, we will now introduce the three kinds of reward maximization that we considered in the paper.
Types of reward maximization
We consider an arbitrary reward function over states. Let $S$ be the set of states; then the reward function is any function of the form $r : S \to [0, 1]$, ie. it is a function which takes in a state and outputs a reward between 0 and 1. The only restriction that we put on $r$ is that it is non-constant, ie. there are at least two states which give different rewards. The agent gets a reward $r(s)$ for being in state $s$ at a timestep.
In reinforcement learning, agents are expected to maximize expected reward summed over some number of timesteps. There are a few different ways of aggregating reward over time. We refer to these as 'value functions'. A policy is optimal for a particular value function and environment if it achieves a value at least as high as any other policy. The three that we consider in our paper are time-discounted, finite-time, and time-averaged.
Time-discounted reward maximization is common in reinforcement learning. For an environment $E$, it is characterised by a value function of the form

$$V_{\gamma}^{\pi}(E) = \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{E}[r(S_t)],$$

where $S_t$ is the random variable corresponding to the environment state at timestep $t$ and $\gamma \in (0, 1)$ is the discount rate. In words, this kind of value function sums the expected value of the reward at each timestep but values each subsequent timestep less than the previous one (by a factor of $\gamma$).
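As an illustration (our sketch, not code from the paper), the discounted value of each starting state under a fixed policy can be computed by solving the linear system $v = r + \gamma T_{\pi} v$; the scalar value of a policy is then an expectation of $v$ over the starting distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, gamma = 4, 3, 0.9

P = rng.dirichlet(np.ones(n), size=(n, m))  # a random CMP
r = rng.random(n)                           # reward for each state
pi = np.array([0, 2, 1, 0])                 # a deterministic policy
T = P[np.arange(n), pi]                     # induced Markov chain

# v[s] = expected discounted reward starting from state s:
#   v = r + gamma * T @ v   =>   v = (I - gamma*T)^{-1} r
v = np.linalg.solve(np.eye(n) - gamma * T, r)
```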
Finite-time reward maximization is similar and characterised by value functions of the form:

$$V_{T}^{\pi}(E) = \sum_{t=0}^{T} \gamma^{t}\, \mathbb{E}[r(S_t)].$$

In this case, the expected reward is only summed up to a finite horizon $T$. Since we are not summing over infinite timesteps, we can also consider finite-time reward maximization with no discounting (ie. $\gamma = 1$) without the sum diverging.
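A finite-horizon value can be computed by simply propagating the state distribution forward one timestep at a time (again a sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, horizon, gamma = 4, 6, 1.0          # undiscounted (gamma = 1) finite horizon
T = rng.dirichlet(np.ones(n), size=n)  # an induced Markov chain
r = rng.random(n)
p = np.full(n, 1.0 / n)                # starting distribution over states

# Accumulate expected reward over timesteps t = 0, ..., horizon.
value, discount = 0.0, 1.0
for _ in range(horizon + 1):
    value += discount * (p @ r)        # E[r(S_t)] under the current distribution
    p = p @ T                          # advance the distribution one timestep
    discount *= gamma
```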
Finally, time-averaged reward maximization allows us to consider an infinite number of timesteps with no discount rate:

$$V_{\text{avg}}^{\pi}(E) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[r(S_t)].$$
Time-averaged reward maximization assigns reward based on the average proportion of time spent in each state over an infinite number of timesteps.
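For an ergodic chain, this average equals the expected reward under the stationary distribution $\mu$, which satisfies $\mu T_{\pi} = \mu$. A quick sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
T = rng.dirichlet(np.ones(n), size=n)  # an induced Markov chain (ergodic a.s.)
r = rng.random(n)

# mu is a left eigenvector of T with eigenvalue 1, normalised to sum to 1.
eigvals, eigvecs = np.linalg.eig(T.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu = mu / mu.sum()

avg_reward = mu @ r  # the time-averaged reward of this policy
```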
The Main Theorems
The main results of our paper come from applying Lemma 3.1 to these three types of reward maximization. These are the three 'headline' theorems of the paper.
Theorem 3.7. An optimal deterministic policy for maximising the discounted reward over states, for any non-constant reward function $r$ and any discount rate $\gamma \in (0, 1)$, contains $n \log_2 m$ bits of information about its environment.
Theorem 3.10. An optimal deterministic policy for maximising the summed reward over states for a finite number of timesteps, with any discount rate and any non-constant reward function, contains $n \log_2 m$ bits of information about its environment, provided that the horizon is long enough or the starting distribution over states has full support.
Theorem 3.14. An optimal deterministic policy for maximising the time-averaged reward over states for any non-constant reward function contains $n \log_2 m$ bits of information about its environment.
The finicky bit, which takes up the bulk of the paper, is proving that the conditions for Lemma 3.1 apply for these kinds of value functions. In particular, for each kind of value function, we needed to prove that there is a unique optimal policy in all environments except a measure zero subset. We also needed to prove that each of these value functions is 'only influenced by the policy through its effect on the transition matrix', meaning that we can apply Theorem 3.2 to prove that each policy is optimal in an equal volume of possible environments. If you want to see how we did this, read the paper!
In summary, our result is that for all common types of reward maximization, each of the $m^n$ policies is uniquely optimal in the same volume of possible environments and that all environments (except a measure zero subset) have a unique optimal policy. This means that you can divide up the space of possible environments into $m^n$ equal-sized chunks, each one corresponding to a set of worlds for which a particular policy is uniquely optimal. Thus, finding out that a particular policy is optimal results in your subjective distribution over possible environments changing from a uniform distribution over $\mathcal{E}$ to a uniform distribution over a volume which is $1/m^n$ as large. This corresponds to learning $n \log_2 m$ bits of information about the environment.
An intuition for the number $n \log_2 m$ is as follows. In almost all environments, for all (non-constant) reward functions, every action you take matters. To be an optimal policy you need to take the correct action in every possible state. If your policy is such that you take the wrong action in just one state, your policy is not optimal (in all environments except a measure zero subset). In a particular state, picking the correct action requires picking one action out of $m$ possible actions. Doing this requires $\log_2 m$ bits. And there are $n$ states where you need to decide which action to take, so you get $n$ lots of $\log_2 m$ bits in an optimal policy.
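The equal-volume claim can be sanity-checked numerically in the smallest non-trivial case ($n = m = 2$): sample environments uniformly from $\mathcal{E}$, find the discounted-optimal policy in each, and check that each of the $m^n = 4$ deterministic policies wins in roughly a quarter of them. A Monte Carlo sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, gamma, trials = 2, 2, 0.9, 10000
r = np.array([0.0, 1.0])  # a non-constant reward over the two states
policies = [(0, 0), (0, 1), (1, 0), (1, 1)]  # all m**n = 4 deterministic policies

counts = np.zeros(len(policies))
for _ in range(trials):
    P = rng.dirichlet(np.ones(n), size=(n, m))  # environment drawn uniformly
    values = []
    for pi in policies:
        T = P[np.arange(n), list(pi)]                      # induced chain
        v = np.linalg.solve(np.eye(n) - gamma * T, r)      # discounted values
        values.append(v.mean())  # value under a uniform starting distribution
    counts[int(np.argmax(values))] += 1

print(counts / trials)  # each policy is optimal in roughly 1/4 of environments
```

Each policy winning a quarter of the time corresponds to learning $\log_2 4 = 2 = n \log_2 m$ bits from the identity of the optimal policy.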
Why are we interested?
We are interested in this kind of result because we want to quantify lower bounds on the amount of information that a world model must contain. Alternatively, we could view this as an upper bound on the size of an internal world model that could possibly be deduced by observing an agent's actions (in the particular setup we considered). A lot of goals can be executed without full information about the world, so we cannot expect to deduce the existence of a full internal world model based on an agent's performance at these goals.
But since an agent's actions originate somewhere from its internal structure, information contained in its actions must have some internal counterpart. If information about the world is present in the actions, it seems reasonable to conjecture that it will also be present in the architecture which generates the action. Of course, what we have presented here is a long way from demonstrating that the structure or type of this internal information corresponds to something that we might want to call a 'world model', but we believe that this result is an interesting step in the right direction.
The authors consider the following class of goals, which involves steering the environment to a state $s_i$ and attempting to transition to state $s_j$ by taking an action $a$. The agent is given $N$ attempts to do this (making its way back to $s_i$ after each attempt) and must achieve the transition at least $k$ times. The agent is also given a choice about whether it pursues this goal or whether it pursues a different goal which involves first taking action $b$ in state $s_i$ and then achieving the $s_i$ to $s_j$ transition (using action $a$) less than $k$ times. If the agent successfully achieves one of these goals with high probability then it must implicitly have a model of the transition probability $P(s_j \mid s_i, a)$. For example, if we set $k$ very high but $P(s_j \mid s_i, a)$ is very low, then a successful agent will not attempt to achieve the $s_i$ to $s_j$ transition $k$ times using action $a$ and will instead choose the task involving action $b$. By querying the agent and seeing whether it takes action $a$ or action $b$ for different values of $k$, we can obtain an estimate of $P(s_j \mid s_i, a)$ just by looking at the agent's output. As $N$ increases, this estimate becomes more accurate.
The claim that 'General Agents Need World Models' comes from the fact that if an agent is 'general' to the point where it can achieve a wide class of goals which includes all goals of this form, then it must contain enough information to estimate $P(s_j \mid s_i, a)$ for all combinations of $s_i$, $s_j$ and $a$.
For an explainer on differential entropy see Alex's post. People are sometimes wary about using differential entropy because it is coordinate-dependent (see eg. Cover and Thomas Elements of Information Theory Ch.9). Thankfully this problem is avoided when using mutual information. Since mutual information can be expressed as a KL-divergence, it is invariant, even if the two differential entropies on which it is based are not.
An intuition for why these environments have zero measure: they require that two (or more) different actions (taken by different policies) result in the same outcome. This requires that the probability distributions over the next state induced by taking each of the actions are exactly the same. But the probabilities are real numbers between 0 and 1. A pair of probabilities defines a point in a 2-dimensional space, but the set where the two probabilities are exactly equal is only a 1-dimensional line in this space.
This post is an informal explainer of our paper which can be found on arxiv. This work was funded by the Advanced Research + Invention Agency (ARIA) Safeguarded AI Programme through project code MSAI-SE01-P005.
Introduction
There is an intuition that a powerful agent might have to contain some kind of world model as part of its structure in order to achieve its goals[1]. Part of this intuition comes from the fact that the actions taken by a successful agent often themselves contain information about the world. Since the actions taken by the agent are somehow generated by its internal structure, the information that is present in the actions must somehow correspond to something happening inside the agent. In this way, external, behavioural actions can give us some insight into the internal structure of an agent. We are interested in quantifying the information contained in the actions of an agent since it might provide a way to bridge the gap between observed behaviour and internal structure.
One way to operationalize this question is to ask: 'if I lack knowledge about the environment but can observe an optimal agent's actions, how much information do I gain about the environment?'. Imagine a mouse who has learned that yellow boxes always contain cheese and that blue boxes are always empty. If we observe that the mouse always chews through a yellow box and ignores a blue box, the mouse's actions contain information about the true state of the world. If you didn't know what colour boxes contained cheese, but you did know that the mouse always successfully finds the cheese, observing the mouse's actions would give you information about the world (ie. the location of the cheese).
In this sense, we can say that the actions of an optimal agent (the mouse) contain information about the world (the location of the cheese). From this claim, we might want to ask 'for what kind of goals is this true?' and 'how much information does the actions of the mouse contain about this world?'.
This post contains an informal presentation of some results contained in our paper which gives some answers to these questions. In particular, we consider a world characterised by a Controlled Markov Process and a goal characterised by a reward function over states. We show that the policies of optimal reward-maximizing agents have mutual information with their environments equal to bits where is the number of states in the environment and is the number of actions the agent can take. This is true for all reward functions (except trivial constant ones) and all common methods of reward aggregations (including time-discounted, time-averaged and finite time). Here, we have tried to present the ideas informally and intuitively. Readers who enjoy finicky details are encouraged to read the paper instead.
Background
This kind of result has previously been explored in the Touchette and Lloyd theorem which quantified the difference between 'open loop' and 'closed loop' controllers in terms of mutual information between a controller's action and the environmental state. The result was suggestive, but only considered a simplified, single-timestep setup with entropy reduction as its goal.
The paper 'General Agents need world Models' by Richens et al. considered an agent in a similar setup to ours. The authors show that if an agent is general enough to be able to succeed at wide variety of goals then one can extract approximations to each of the individual transition probabilities of the environment from the agent's actions. As the number of goals the agent is able to achieve grows, the approximation becomes more exact. The authors require that the 'wide variety of goals' for which the agent is successful includes a set of goals which involve jumping between each pair of states in the environment a large number of times. See this footnote[2] for more details.
Our result considers a class of goals which we think is more conventional in AI theory than either entropy reduction or the goals considered in the 'General Agents' paper. Namely, we consider the goals which can be expressed as reward maximization for a reward function over states.
Setup
We consider an environment characterised by a Controlled Markov Process (CMP) with states and actions. The set of states is labelled and the set of actions is labelled . The 'world' is fully characterised by the set of transition probabilities which give the probability of transitioning to state when action is taken in state . There is one such transition probability for every such combination of and . Since there are possible actions and possible initial states, the environment lives in a space characterized by the product of simplexes each with dimension . We call this space .
We assume that a policy taken by an agent in this environment is Markovian and deterministic. We also assume that the environment is fully observable, meaning that the policy takes in the full environmental state as its input (as opposed to a limited-information 'observation'). This means that the policy can be represented by a function which maps states to actions . Note that there are possible functions of this form. When we place a policy in world, it will take actions depending on what state it is in at each timestep. This results in the world moving from state to state like a stationary Markov chain.
If we start with full ignorance of the environment, we can represent this subjectively using a uniform prior over all of . If we learn that a particular policy is optimal for a particular task this causes us to update our beliefs about what the environment is. Obviously, how our beliefs change depends on the kind of goal for which the policy is optimal. If a policy is optimal for a trivial goal (such as 'go to any state') then learning the policy will not reveal any information about the environment since any policy will be successful for that goal. Alternatively, if the policy is optimal for a very detailed goal (such as the goals considered in the General Agents paper) then knowing the policy causes our subjective distribution over possible environments to narrow, representing the fact that we have learned information about the environment.
We can quantify the amount of information gained about the environment using the differential entropy formulation of mutual information. If we use to denote the random variable corresponding to the environment and to denote the random variable corresponding to the optimal policy for some task, the mutual information between them is
This is the expected change in differential entropy of our subjective distribution over possible environments[3]. This quantifies the number of bits of information which we have obtained about the environment. Since the environment is a list of real numbers, it is fully specified by an infinite number of bits. So any finite mutual information represents partial information about the environment.
Proof Sketch
We have structured our result around one lemma and four theorems. We'll briefly go through them here, leaving the proofs for the paper. The first lemma (Lemma 3.1 in the paper) is a straightforward information-theoretic result. We'll quote it (almost) directly from the paper and then explain it.
This sounds a bit complicated but it is really just expressing straightforward property of entropy. If you start with a uniform distribution over a space and partition this space up into equal-sized chunks, learning which of the chunks the true value is in gives you bits of information. So here we are saying that, if you split up the space of possible environments into equal-sized chunks (one for each policy) you get bits of information. If we can prove that each policy is uniquely optimal (for a particular task) in an equal-sized chunk of the space of possible environments then knowing the optimal policy will give us bits of information. The one subtlety in this lemma is that we don't actually require that every possible environment has a unique optimal policy. We only require that all environments, except a measure zero subset have a unique optimal policy. There in fact exist environments where multiple policies are optimal for a given task (for example, the environment where all actions result in the same outcome). But provided that these environments are measure zero[4] in the space of possible environments, the result that the mutual information equals still holds.
In order to apply Lemma 3.1, to a particular goal, we need to prove that the goal has an optimal policy in all but a measure zero set of environments and we need to prove that the 'chunks' for which each possible policy is optimal have equal volume. This last requirement which is addressed by Theorem 3.2.
This theorem classifies a broad class of tasks (as encoded by value functions) for which each possible policy is optimal in an equal volume of environments which is one of the requirements we needed in order to apply Lemma 3.1. In particular, we show that this is true for any value function 'which is only influenced by the policy through its effect on the transition matrix '. What does this mean?
If you have a deterministic policy acting in an environment characterized by a Controlled Markov Process then the whole system behaves as a Markov chain ie. the probability distribution over states at timestep depends only on the system state at time . The standard way to represent the transition probabilities in such a Markov chain is using a transition matrix whose entries correspond to the probability of transitioning from state to . By taking different actions when in different states, the choice of policy can affect the transition matrix. If the only way in which the policy can affect the outcome of the value function, we that the value function 'is only influenced by the policy through its effect on the transition matrix ' and Theorem 3.2 applies. This is not true for all goals. If I give an agent the goal 'visit state and take action , then visit state and take action ', this task depends on the actions taken, and cannot be expressed solely in terms of the elements of the transition matrix. As a result, we cannot apply Theorem 3.2 to this goal.
But it turns out that most reward maximization goals are of this type, so we can apply Theorem 3.2 to them. Before going on to our 'headline' theorems, we will now introduce the three kinds of reward maximization that we considered in the paper.
Types of reward maximization
We consider an arbitrary reward function over states. Let be the set of states, then the reward function is any function of the form ie. it is a function which takes in a state and outputs a reward between 0 and 1. The only restriction that we put on is that it is non-constant ie. there are at least two states which give different reward. The agent gets a reward for being in state at a timestep.
In reinforcement learning agents are expected to maximize expected reward summed over some number of timesteps. There are few different ways of aggregating reward over time. We refer to these as 'value functions'. A policy is optimal for a particular value function and environment if it achieves a higher value than any other policy. The three that we consider in our paper are time-discounted, finite time, and time-averaged.
Time-discounted reward maximization is common in reinforcement learning. For an environment . It is characterised by a value function of the form
where is the random variable corresponding to the environment state at timestep and is the discount rate. In words, this kind of value function sums the expected value of the reward at each timestep but values each subsequent timestep less than the previous one (by a factor of ).
Finite-time reward maximization is similar and characterised by value functions of the form:
In this case, the expected reward is only summed over the first timesteps. Since we are not summing over infinite timesteps, we can also consider finite-time reward maximization with no discount rate (ie. ) without the sum diverging.
Finally, time-averaged reward maximization allows us to consider an infinite number of timesteps with no discount rate:
Time-averaged reward maximization assigns reward based on the average proportion of time spent in each state over an infinite number of timesteps.
The Main Theorems
The main results of our paper come from applying Lemma 3.1 to these three types of reward maximization. These are the three 'headline' theorems of the paper.
The finicky bit, which takes up the bulk of the paper, is proving that the conditions for Lemma 3.1 apply for these kinds of value functions. In particular, for each kind of value function, we needed to prove that there is a unique optimal policy in all environments except a measure-zero subset. We also needed to prove that each of these value functions is 'only influenced by the policy through its effect on the transition matrix', meaning that we can apply Theorem 3.2 to prove that each policy is optimal in an equal volume of possible environments. If you want to see how we did this, read the paper!
In summary, our result is that for all common types of reward maximization, each policy is uniquely optimal in the same volume of possible environments, and all environments (except a measure-zero subset) have a unique optimal policy. This means that you can divide up the space of possible environments into equal-sized chunks, each one corresponding to the set of worlds for which a particular policy is uniquely optimal. Thus, finding out that a particular policy is optimal results in your subjective distribution over possible environments changing from a uniform distribution over the whole space to a uniform distribution over a volume which is $1/|A|^{|S|}$ as large, where $|S|$ is the number of states and $|A|$ is the number of actions. This corresponds to learning $\log_2\left(|A|^{|S|}\right) = |S| \log_2 |A|$ bits of information about the environment.
An intuition for the number $|S| \log_2 |A|$ is as follows. In almost all environments, for all (non-constant) reward functions, every action you take matters. To be an optimal policy you need to take the correct action in every possible state. If your policy takes the wrong action in just one state, your policy is not optimal (in all environments except a measure-zero subset). In a particular state, picking the correct action requires picking one action out of $|A|$ possible actions. Doing this requires $\log_2 |A|$ bits. And there are $|S|$ states where you need to decide which action to take, so an optimal policy pins down $|S| \log_2 |A|$ bits.
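The counting argument above can be checked in a couple of lines. The state and action counts here are made-up example values: with $|S|$ states and $|A|$ actions there are $|A|^{|S|}$ deterministic policies, and learning which one is optimal is worth the log of that count in bits.

```python
import math

# Hypothetical sizes: an environment with 10 states and 4 actions per state.
num_states, num_actions = 10, 4

# There are |A|^|S| deterministic policies; learning that one particular
# policy is uniquely optimal distinguishes it from all the others,
# which is worth log2(|A|^|S|) = |S| * log2(|A|) bits.
num_policies = num_actions ** num_states
bits = math.log2(num_policies)

print(bits)  # 20.0
```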
Why are we interested?
We are interested in this kind of result because we want to quantify lower bounds on the amount of information that a world model must contain. Alternatively, we could view this as an upper bound on the size of an internal world model that could possibly be deduced by observing an agent's actions (in the particular setup we considered). A lot of goals can be executed without full information about the world, so we cannot expect to deduce the existence of a full internal world model based on an agent's performance at these goals.
But since an agent's actions originate somewhere in its internal structure, information contained in its actions must have some internal counterpart. If information about the world is present in the actions, it seems reasonable to conjecture that it will also be present in the architecture which generates the actions. Of course, what we have presented here is a long way from demonstrating that the structure or type of this internal information corresponds to something that we might want to call a 'world model', but we believe that this result is an interesting step in the right direction.
See discussion here and here on the 'Agent-like Structure problem'.
The authors consider the following class of goals, which involves steering the environment to a state $s$ and attempting to transition to a state $s'$ by taking an action $a$. The agent is given $n$ attempts to do this (making its way back to $s$ after each attempt) and must achieve the transition at least $m$ times. The agent is also given a choice about whether it pursues this goal or whether it pursues a different goal, which involves first taking an action $b$ in state $s$ and then achieving the $s$ to $s'$ transition (using action $a$) fewer than $m$ times. If the agent successfully achieves one of these goals with high probability then it must implicitly have a model of the transition probability $P(s' \mid s, a)$. For example, if we set $m$ very high but $P(s' \mid s, a)$ is very low, then a successful agent will not attempt to achieve the $s$ to $s'$ transition many times using action $a$ and will instead choose the task involving action $b$. By querying the agent and seeing whether it takes action $a$ or action $b$ for different values of $m$, we can obtain an estimate of $P(s' \mid s, a)$ just by looking at the agent's output. As $n$ increases, this estimate becomes more accurate.
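This querying procedure can be sketched as follows. Everything here is an idealized illustration, not the authors' construction: the agent is modelled as an oracle that, given the required success count `m` out of `n` attempts, picks whichever of the two tasks it is more likely to complete, and that choice flips at roughly `m = n * p` for the unknown transition probability `p`. Sweeping `m` and finding the switch point recovers `p` from behaviour alone.

```python
# Hypothetical idealized agent: task A requires >= m successful transitions
# in n attempts (each succeeding with unknown probability p); task B requires
# fewer than m.  For large n, P(Binomial(n, p) >= m) exceeds 1/2 roughly
# when m <= n * p, so a competent agent's choice flips near that threshold.
def agent_choice(m, n, p):
    return "A" if m <= n * p else "B"

def estimate_p(n, p):
    # Query the agent at every threshold m and count how long it sticks
    # with task A; the switch point divided by n estimates p.
    switch = sum(1 for m in range(1, n + 1) if agent_choice(m, n, p) == "A")
    return switch / n

print(estimate_p(1000, 0.37))  # close to 0.37
```

The estimate's resolution is $1/n$, which matches the remark that the estimate becomes more accurate as $n$ increases.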
The claim that 'General Agents need World Models' comes from the fact that if an agent is 'general' to the point where it can achieve a wide class of goals which includes all goals of this form, then it must contain enough information to estimate $P(s' \mid s, a)$ for all $(s, a, s')$ combinations.
For an explainer on differential entropy see Alex's post. People are sometimes wary about using differential entropy because it is coordinate-dependent (see eg. Cover and Thomas Elements of Information Theory Ch.9). Thankfully this problem is avoided when using mutual information. Since mutual information can be expressed as a KL-divergence, it is invariant, even if the two differential entropies on which it is based are not.
An intuition for why these environments have zero measure: they require that two (or more) different actions (taken by different policies) result in exactly the same outcome. This requires that the probability distributions over the next state induced by taking each of the actions be exactly the same. But the probabilities are real numbers between 0 and 1. A pair of probabilities defines a point in a 2-dimensional space, while the set where the two probabilities are exactly equal is only a 1-dimensional line in this space.
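This intuition can be illustrated with a quick Monte Carlo sketch (a toy demonstration, not part of the paper's argument): sampling pairs of probabilities uniformly from the unit square essentially never lands on the diagonal line where the two are equal, because that line has zero area.

```python
import random

# Sample pairs (p, q) uniformly from the unit square.  The "degenerate"
# environments correspond to the diagonal p == q, which is a 1-dimensional
# line with zero area, so random draws essentially never hit it.
random.seed(0)
hits = sum(1 for _ in range(1_000_000)
           if random.random() == random.random())
print(hits)  # expect 0: the diagonal has measure zero in the square
```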