I've defined generalised modelsM as being given by F, a set of features, and an (unnormalised) probability distribution Q over W=2¯¯¯¯F, the set of possible worlds defined by the all values of those features.
To make these into a category, a morphism r between M0=(F0,Q0) and M1=(F1,Q1) was defined to be a relation r between W0 and W1 (ie a subset of W0×W1), that obeyed certain conditions with respect to the Q's.
If r obeys those conditions, we can construct the "underlying model" for the morphism, a generalised model Mr=(F0⊔F1,Qr), with Qr non-zero only on r∈W0×W1.
That underlying model result basically says:
There is an underlying reality Mr of which M0 and M1 are different, consistent, facets.
That's well and good, but we also want to allow imperfect correspondences, where Q0 or Q1 (or both) might be known to be in error. After all, when we transitioned from Newtonian mechanics to relativity, it wasn't because there was an underlying reality that both were facets of. Instead, we realised that Newtonian mechanics and relativity were approximately though not perfectly equivalent in low-energy situations, and that when they diverged, relativity was overall more accurate.
We'd want to include these cases as generalised model, and measure how "imperfect" the morphism between them is, i.e. how much Q0 and Q1 diverge from being in perfect correspondence. We'll also look at how the Q and the feature sets are related - how much information Q caries relative to F.
Imperfect morphisms
So we'll loosen the definition of "morphisms" so that the Q0 and Q1 need not correspond exactly to each other or an underlying reality. If we have a relation r between W0 and W1, here are different Q-consistency requirements that we could put on r to make it into a morphism (the previous requirement has been renamed "Q-preserving", condition 5):
General binary relations r; no connection assumed between r and the Qs.
Q-relational: for all w0∈W0 with Q0(w0)>0, there exists at least one w1∈W1 with both Q1(w1)>0 and (w0,w1)∈r.
Q-functional: for all w0∈W0 with Q0(w0)>0, there exists a uniquew1∈W1 with both Q1(w1)>0 and (w0,w1)∈r.
Q-birelational: r and r−1 are both Q-relational.
Q-preserving: the same conditions as presented here. For all w0∈W0 and w1∈W1, Q0(w0)≤Q1(r(w0)) and Q1(w1)≤Q0(r−1(w1)).
Q-isomorphic: r is Q-preserving; both r and r−1 are Q-functional.
The general results about these morphisms classes are (proof found in this footnote[1]):
All the above conditions are associative (hence define classes of morphisms for different categories with the same objects): if r and p are Q-relational, Q-functional, Q-relational, Q-birelational, Q-preserving, Q-isomorphic or disconnected from Q entirely, then so is pr=p∘r.
Every morphism fulfils the conditions of the morphisms above it on the list, except that Q-preserving and Q-birelational need not imply Q-functional.
If r is Q-isomorphic, then we can pair up each non-measure zero elements w0∈W0 and w1∈W0 so that (w0,w1)∈r and Q0(w0)=Q1(w1).
Examples of morphisms
Here are four relations:
A coarsening is when multiple worlds get related to a single world, thus losing details. A refinement is the opposite: a single world gets related to multiple worlds, thus adding more details. An inclusion adds more worlds to the set. Its opposite, a restriction, removes worlds.
In terms of morphisms, if we assume that all the worlds in those sets have non-zero probability, then coarsenings, refinements, and inclusions are all Q-relational and Q-functional. Restrictions are neither. Coarsenings and refinements are Q-birelational, while inclusions and restrictions are not.
As for Q-preserving, coarsenings are Q-preserving if, for all w1∈W1, the sum of Q0(w0) for all (w0,w1)∈r, is equal to Q1(w1). Similarly, refinements are Q-preserving if, for all w0∈W0, the sum of the Q1(w1) for all (w0,w1)∈r, is equal Q0(w0). None of these four relations are Q-isomorphic (unless some of the worlds in the diagrams are of measure 0).
What about Bayesian updates? If we start with M0=({f}⊔F,Q0) and want to update on f=c for some constant c, then this corresponds to relating M0 to M1={F,Q1} such that (w0,w1)∈r if w0=(w1,f=c). We'll also require Q0(w0)=Q1(w1) on any such worlds - and assume that that defines Q1 entirely (we're ignoring renormalisation here, since we don't assume that Q0 and Q1 have measure 1).
This r is clearly a restriction, and hence does not meet any of the Q-consistency conditions. However, we can define Bayesian updates as relations r such that r−1 is Q-functional and injective. They do form a category when seen this way (since (rq)−1=q−1r−1).
Comparing the Q's
Given two generalised models M0=(F0,Q0) and M1=(F1,Q1), with a relation r between them, we'll now compare Q0 and Q1. We'll do this by defining a length operator L that gives the "length" of r, which is a measure of the divergence between Q0 and Q1 "along" the relation r.
Let (Q′0,Q′1) be a pair of probability distributions on W1 and W1, respectively. We'll say the pair is r-compatible if r is a Q-preserving morphism between (F0,Q′0) and (F1,Q′1).
Since Qi and Q′i are distributions over the same Wi, we can compare their l1 norm, defined as: ||Qi−Q′i||1=∑wi∈Wi|Qi(wi)−Q′i(wi)|. Then define L(r,Q′0,Q′1) as the sum of the l1-distances to Q0 and Q1:
L(r,Q′0,Q′1)=||Q0−Q′0||1+||Q1−Q′1||1.
We'll define L(r) to be the minimum[2] value of this norm among all the r-compatible (Q′0,Q′1). Because of its definition, it's immediately obvious that L(r)=L(r−1). The other key properties - proved here[3] - are that:
If r is Q-relational, then there exists a Q′1 such that L(r)=L(r,Q0,Q′1); ie we can use Q0 itself rather than finding a Q′0.
If r and p are Q-birelational, the L is a sensible length operator, in that L(pr)≤L(r)+L(p).
r is Q-preserving iff L(r)=0.
It's that last property that makes L such a useful distance metric: it measures the extent to which M0 and M1 fail to be aspects of the same underlying reality.
Relating features and probability distributions
In the above we've been looking at the relationship between r and Q, but have looked little at the features. Here we'll look at some of the relations between features and probability distributions. The idea is to measure how related the features are to each other.
There are several candidates for a measure of this types, but the most interesting seems to be a generalisation of mutual information. For any feature F∈F, we have the marginal distribution QF of Q over the feature F. Then if H is the informational entropy of a probability distribution/random variable, we can define a measure over Q given F as:
−H(Q)+∑F∈FH(QF).
If we define QF as the product distribution ∏F∈FQF, then that can also be defined as DKL(Q||QF), where DKL is the KL-divergence from QF to Q.
Example: gas laws
As an illustration, consider the ideal gas laws: PV=nRT, where P is pressure, V is volume, n is the amount of substance, R is the ideal gas constant, and T is the temperature. We'll consider a simple model with constant amount of substance, setting nR=1, so the law reduces to:
PV=T
We'll allow these variables to take a few values: P and V are integers that range from 1 to 4, while T is an integer that ranges from 1 to 16. The probability distribution Q will give uniform probability[4]1/16 to each (P=p,V=v,T=pv).
In this case, Q is uniform among the 16 worlds where it is non-zero; hence
H(Q)=log2(16)=4.
This characterises Q, but doesn't touch the features F. So, let QP, QV, and QP be the marginal distributions over the features. Both QP and QV are uniform over 4 elements, so H(QP)=H(QV)=2. As for QT, it is 1/16 over {1,9,16}, 2/16 over {2,3,6,8,12}, and 3/16 over {4}. Some calculations then establish that H(QT) is (54−3log2(3))/16. Then the KL-divergence from QF=QVQPQT to Q is:
Let us add another variable T′ to the feature set, which is just equal to T, but with another name, and see how things change. Then H(Q) is unchanged, and DKL(Q||QF) adds another 54−3log2(3)16, corresponding to H(T′).
So now assume that r:M0→M1 and p:M1→M2 are both Q-relational. Let w0∈W0 be such that Q0(w0)>0. Then, because r is Q-relational, there exists a w1∈W0 with Q1(w1)>0 and (w0,w1)∈r. Then since p is Q-relational, there exists a w2∈W2 with Q2(w2)>0 and (w1,w2)∈p. Combining the two gives (w0,w2)∈pr. Thus Q-relational morphisms are associative. Applying the same argument to r−1 shows that Q-birelational morphisms are also associative. A variant of the same argument with "there exists a unique w1∈W0" instead of "there exists a w1∈W0" shows that Q-functional morphisms are also associative. Hence Q-isomorphic r are also functional.
Now assume that r is Q-preserving and let w0∈W0 be such that Q0(w0)>0. Then there exists an underlying model Mr=(F0⊔F1,Qr) such that Q0(w0)=∑(w0,w1)∈rQr(w0,w1). Since this sum is greater than zero, there exists a w1 such that Qr(w0,w1)>0. Then since Q1(w
I've defined generalised models M as being given by F, a set of features, and an (unnormalised) probability distribution Q over W=2¯¯¯¯F, the set of possible worlds defined by the all values of those features.
To make these into a category, a morphism r between M0=(F0,Q0) and M1=(F1,Q1) was defined to be a relation r between W0 and W1 (ie a subset of W0×W1), that obeyed certain conditions with respect to the Q's.
If r obeys those conditions, we can construct the "underlying model" for the morphism, a generalised model Mr=(F0⊔F1,Qr), with Qr non-zero only on r∈W0×W1.
That underlying model result basically says:
That's well and good, but we also want to allow imperfect correspondences, where Q0 or Q1 (or both) might be known to be in error. After all, when we transitioned from Newtonian mechanics to relativity, it wasn't because there was an underlying reality that both were facets of. Instead, we realised that Newtonian mechanics and relativity were approximately though not perfectly equivalent in low-energy situations, and that when they diverged, relativity was overall more accurate.
We'd want to include these cases as generalised model, and measure how "imperfect" the morphism between them is, i.e. how much Q0 and Q1 diverge from being in perfect correspondence. We'll also look at how the Q and the feature sets are related - how much information Q caries relative to F.
Imperfect morphisms
So we'll loosen the definition of "morphisms" so that the Q0 and Q1 need not correspond exactly to each other or an underlying reality. If we have a relation r between W0 and W1, here are different Q-consistency requirements that we could put on r to make it into a morphism (the previous requirement has been renamed "Q-preserving", condition 5):
The general results about these morphisms classes are (proof found in this footnote[1]):
Examples of morphisms
Here are four relations:
A coarsening is when multiple worlds get related to a single world, thus losing details. A refinement is the opposite: a single world gets related to multiple worlds, thus adding more details. An inclusion adds more worlds to the set. Its opposite, a restriction, removes worlds.
In terms of morphisms, if we assume that all the worlds in those sets have non-zero probability, then coarsenings, refinements, and inclusions are all Q-relational and Q-functional. Restrictions are neither. Coarsenings and refinements are Q-birelational, while inclusions and restrictions are not.
As for Q-preserving, coarsenings are Q-preserving if, for all w1∈W1, the sum of Q0(w0) for all (w0,w1)∈r, is equal to Q1(w1). Similarly, refinements are Q-preserving if, for all w0∈W0, the sum of the Q1(w1) for all (w0,w1)∈r, is equal Q0(w0). None of these four relations are Q-isomorphic (unless some of the worlds in the diagrams are of measure 0).
What about Bayesian updates? If we start with M0=({f}⊔F,Q0) and want to update on f=c for some constant c, then this corresponds to relating M0 to M1={F,Q1} such that (w0,w1)∈r if w0=(w1,f=c). We'll also require Q0(w0)=Q1(w1) on any such worlds - and assume that that defines Q1 entirely (we're ignoring renormalisation here, since we don't assume that Q0 and Q1 have measure 1).
This r is clearly a restriction, and hence does not meet any of the Q-consistency conditions. However, we can define Bayesian updates as relations r such that r−1 is Q-functional and injective. They do form a category when seen this way (since (rq)−1=q−1r−1).
Comparing the Q's
Given two generalised models M0=(F0,Q0) and M1=(F1,Q1), with a relation r between them, we'll now compare Q0 and Q1. We'll do this by defining a length operator L that gives the "length" of r, which is a measure of the divergence between Q0 and Q1 "along" the relation r.
Let (Q′0,Q′1) be a pair of probability distributions on W1 and W1, respectively. We'll say the pair is r-compatible if r is a Q-preserving morphism between (F0,Q′0) and (F1,Q′1).
Since Qi and Q′i are distributions over the same Wi, we can compare their l1 norm, defined as: ||Qi−Q′i||1=∑wi∈Wi|Qi(wi)−Q′i(wi)|. Then define L(r,Q′0,Q′1) as the sum of the l1-distances to Q0 and Q1:
L(r,Q′0,Q′1)=||Q0−Q′0||1+||Q1−Q′1||1.
We'll define L(r) to be the minimum[2] value of this norm among all the r-compatible (Q′0,Q′1). Because of its definition, it's immediately obvious that L(r)=L(r−1). The other key properties - proved here[3] - are that:
It's that last property that makes L such a useful distance metric: it measures the extent to which M0 and M1 fail to be aspects of the same underlying reality.
Relating features and probability distributions
In the above we've been looking at the relationship between r and Q, but have looked little at the features. Here we'll look at some of the relations between features and probability distributions. The idea is to measure how related the features are to each other.
There are several candidates for a measure of this types, but the most interesting seems to be a generalisation of mutual information. For any feature F∈F, we have the marginal distribution QF of Q over the feature F. Then if H is the informational entropy of a probability distribution/random variable, we can define a measure over Q given F as:
−H(Q)+∑F∈FH(QF).
If we define QF as the product distribution ∏F∈FQF, then that can also be defined as DKL(Q||QF), where DKL is the KL-divergence from QF to Q.
Example: gas laws
As an illustration, consider the ideal gas laws: PV=nRT, where P is pressure, V is volume, n is the amount of substance, R is the ideal gas constant, and T is the temperature. We'll consider a simple model with constant amount of substance, setting nR=1, so the law reduces to:
PV=T
We'll allow these variables to take a few values: P and V are integers that range from 1 to 4, while T is an integer that ranges from 1 to 16. The probability distribution Q will give uniform probability[4] 1/16 to each (P=p,V=v,T=pv).
In this case, Q is uniform among the 16 worlds where it is non-zero; hence
H(Q)=log2(16)=4.
This characterises Q, but doesn't touch the features F. So, let QP, QV, and QP be the marginal distributions over the features. Both QP and QV are uniform over 4 elements, so H(QP)=H(QV)=2. As for QT, it is 1/16 over {1,9,16}, 2/16 over {2,3,6,8,12}, and 3/16 over {4}. Some calculations then establish that H(QT) is (54−3log2(3))/16. Then the KL-divergence from QF=QVQPQT to Q is:
DKL(Q||QF)=−4+2+2+54−3log2(3)16=54−3log2(3)16≈3.08.
Let us add another variable T′ to the feature set, which is just equal to T, but with another name, and see how things change. Then H(Q) is unchanged, and DKL(Q||QF) adds another 54−3log2(3)16, corresponding to H(T′).
We've already shown that Q-preserving morphisms are associative. The composition of general binary relations is known to be associative too.
So now assume that r:M0→M1 and p:M1→M2 are both Q-relational. Let w0∈W0 be such that Q0(w0)>0. Then, because r is Q-relational, there exists a w1∈W0 with Q1(w1)>0 and (w0,w1)∈r. Then since p is Q-relational, there exists a w2∈W2 with Q2(w2)>0 and (w1,w2)∈p. Combining the two gives (w0,w2)∈pr. Thus Q-relational morphisms are associative. Applying the same argument to r−1 shows that Q-birelational morphisms are also associative. A variant of the same argument with "there exists a unique w1∈W0" instead of "there exists a w1∈W0" shows that Q-functional morphisms are also associative. Hence Q-isomorphic r are also functional.
Now assume that r is Q-preserving and let w0∈W0 be such that Q0(w0)>0. Then there exists an underlying model Mr=(F0⊔F1,Qr) such that Q0(w0)=∑(w0,w1)∈rQr(w0,w1). Since this sum is greater than zero, there exists a w1 such that Qr(w0,w1)>0. Then since Q1(w