It might be relevant to note that the meaningfulness of this coherence definition depends on the chosen environment. For instance, take a deterministic forest MDP, where an agent at a state s can never return to s and there is at most one path between any two states. Suppose we have a deterministic policy π and let s1=π(s), s2=π(s1), etc. Then for the zero-current-payoff Bellman equations, we only need that V(s1)>V(s′) for any other successor s′ of s, V(s2)>V(s′′) for any other successor s′′ of s1, etc. We can achieve this easily by, for example, letting all values except the V(si) be near-zero; since sj is a successor of si iff j=i+1 (as otherwise there would be a cycle), this fits our criterion. Thus, every π is coherent in this environment. (I haven't done the explicit math here, but I suspect that this also works for non-deterministic π and stochastic MDPs.)
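To make the construction concrete, here is a minimal Python sketch of the deterministic case only. Everything in it is an assumption for illustration: the names `build_tree`, `random_policy`, `values_for_policy` and `is_greedy` are made up, the forest is a single complete tree, and "coherent" is reduced to the strict inequality described above (the chosen successor has a higher value than every other successor), not a full Bellman-consistency check.

```python
import random


def build_tree(depth, branching):
    """Return {state: [children]} for a complete tree.

    States are tuples of action indices, so each non-root state
    has exactly one parent (the tuple with its last entry removed).
    """
    children = {}
    frontier = [()]
    for _ in range(depth):
        next_frontier = []
        for s in frontier:
            kids = [s + (a,) for a in range(branching)]
            children[s] = kids
            next_frontier.extend(kids)
        frontier = next_frontier
    for leaf in frontier:
        children[leaf] = []
    return children


def random_policy(children, rng):
    """Pick an arbitrary successor at every non-leaf state."""
    return {s: rng.choice(kids) for s, kids in children.items() if kids}


def values_for_policy(children, policy, eps=1e-6):
    """Give every policy-chosen successor value 1 and everything else eps.

    This is well-defined because no state is the successor of two
    different states in a forest.
    """
    V = {s: eps for s in children}
    for chosen in policy.values():
        V[chosen] = 1.0
    return V


def is_greedy(children, policy, V):
    """Check V(pi(s)) > V(s') for every other successor s' of every s."""
    for s, kids in children.items():
        for other in kids:
            if other != policy.get(s) and not V[policy[s]] > V[other]:
                return False
    return True


if __name__ == "__main__":
    tree = build_tree(depth=3, branching=3)
    pi = random_policy(tree, random.Random(0))
    print(is_greedy(tree, pi, values_for_policy(tree, pi)))
```

The point of the sketch is that `values_for_policy` never looks at which policy it was given beyond marking its choices, so the same recipe rationalizes any deterministic policy on such a tree.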

Importantly, under the common formulation of language models in an RL setting, each state is a sequence of tokens and each action appends a token to a sequence of length t to produce a sequence of length t+1. That environment is a deterministic forest, as there is only one way to "go between" two sequences (if one is a prefix of the other, choose the remaining tokens in order). Thus, any language model is coherent, which seems unsatisfying. We could try using a different environment, but this risks losing stochasticity (as the output logits of an LM are determined by its input sequence) and gets complicated pretty quickly (use natural abstractions/world model as states?).
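As a quick sanity check of the forest claim, here is a small Python sketch that enumerates the token-sequence environment for a toy vocabulary. The helper names (`step`, `all_states`, `in_degrees`, `unique_path`) and the two-token vocabulary are assumptions made up for this illustration; no real LM is involved.

```python
from itertools import product


def step(state, token):
    """Deterministic transition: append one token to the sequence."""
    return state + (token,)


def all_states(vocab, max_len):
    """All token sequences of length 0..max_len over vocab."""
    return [tuple(seq)
            for n in range(max_len + 1)
            for seq in product(vocab, repeat=n)]


def in_degrees(vocab, max_len):
    """Count how many (state, action) pairs lead into each state."""
    states = all_states(vocab, max_len)
    deg = {s: 0 for s in states}
    for s in states:
        if len(s) < max_len:
            for tok in vocab:
                deg[step(s, tok)] += 1
    return deg


def unique_path(start, goal):
    """Return the only action sequence from start to goal.

    Returns None when start is not a prefix of goal, in which case
    goal is unreachable from start.
    """
    if goal[:len(start)] != start:
        return None
    return list(goal[len(start):])


if __name__ == "__main__":
    deg = in_degrees(("a", "b"), 3)
    print(max(deg.values()))
    print(unique_path(("a",), ("a", "b", "a")))
```

Every transition increases the sequence length by one, so no state can be revisited, and every non-empty sequence is reached only from its own prefix of length t-1. Those two facts together are what make the environment a forest.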

Right, I think this somewhat corresponds to "how long it takes a policy to reach a stable loop" (the "distance to loop" metric), which we used in our experiments.
