This post is my capstone project for the AI Safety Fundamentals programme. I would like to thank the organizers of the programme for putting together the resources and community which have broadened my horizons in the field. Thanks to my cohort and facilitator @sudhanshu_kasewa for the encouragement. Thanks also to @adamShimi, Brady C and @DavidHolmes for helpful discussion about the contents of a more technical version of this post which may appear in the future.

As the title suggests, the purpose of this post is to take a close look at Stuart Armstrong's category of generalized models. I am a category theorist by training, and my interest lies in understanding how category theory might be leveraged on this formalism in order to yield results about model splintering, which is the subject of Stuart's research agenda. This turns out to be hard, not because the category is especially hard to analyse, but because a crucial aspect of the formalism (that of which transformations qualify as morphisms) is not sufficiently determined to provide a solid foundation for deeper analysis.

Stuart Armstrong is open about the fact that he uses this category-theoretic formulation only as a convenient mental tool. He is as yet unconvinced of the value of a categorical approach to model splintering. I hope this post can be a first step to testing the validity of that scepticism.

## A little background on categories

A category is a certain type of mathematical structure; this should not be confused with the standard meaning of the term! A category in this mathematical sense essentially consists of a collection of objects and a collection of morphisms (aka arrows), which behave like transformations in the sense that they can be composed.^{[1]} The reason this structure is called a category is that if I consider a category (in the usual sense) of structures studied in maths, such as sets, groups, vector spaces, algebras and so on, these typically come with a natural notion of transformation which makes these structures the objects of a category.
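To make the definition a little more tangible, here is a minimal sketch in Python of one familiar category: finite sets with functions between them. The names `Fn`, `make_fn`, `identity` and `compose` are hypothetical and chosen purely for illustration; nothing here comes from Stuart's posts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fn:
    """A morphism of finite sets: a function with explicit domain and codomain."""
    dom: frozenset
    cod: frozenset
    table: tuple  # sorted (input, output) pairs, one per element of dom

    def __call__(self, x):
        return dict(self.table)[x]


def make_fn(dom, cod, mapping):
    """Build a morphism, checking that `mapping` really is a function dom -> cod."""
    dom, cod = frozenset(dom), frozenset(cod)
    assert set(mapping) == dom and set(mapping.values()) <= cod
    return Fn(dom, cod, tuple(sorted(mapping.items())))


def identity(obj):
    """The identity morphism on an object (here, a finite set)."""
    return make_fn(obj, obj, {x: x for x in obj})


def compose(g, f):
    """Return 'g after f'; only defined when the codomain of f is the domain of g."""
    if f.cod != g.dom:
        raise ValueError("codomain of f must equal domain of g")
    return make_fn(f.dom, g.cod, {x: g(f(x)) for x in f.dom})


A, B, C = {1, 2}, {"a", "b", "c"}, {"yes", "no"}
f = make_fn(A, B, {1: "a", 2: "c"})
g = make_fn(B, C, {"a": "yes", "b": "no", "c": "yes"})
h = make_fn(C, A, {"yes": 1, "no": 2})

# Unit laws: composing with an identity does nothing.
assert compose(f, identity(A)) == f
assert compose(identity(B), f) == f
# Associativity: the bracketing of a composite does not matter.
assert compose(h, compose(g, f)) == compose(compose(h, g), f)
```

The point of the sketch is only that the objects (sets) and morphisms (functions) have to be pinned down before composition, identities and the laws relating them can even be stated.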

There are a number of decent posts introducing category theory here on LessWrong, and increasingly many domain-relevant introductions proliferating both online and in print, so I won't try to give a comprehensive introduction. In any case, in this post we'll mostly be examining what the objects and morphisms are in Stuart's category of generalized models.

John Wentworth likes to think about categories in terms of graphs, and that works for the purposes of visualization and getting a feel for how the structure of a generic category works. However, when we study a specific category, we naturally do so with the intention of learning something about the objects and morphisms that make it up, and we cannot make much progress or extract useful information until we nail down exactly what these objects and morphisms are supposed to represent.

## Objects: Generalised models

In his original post describing the category of generalized models, Stuart takes the objects of the category to be triples (F,E,Q), where F is a set of 'features', E is a collection of subsets of possible values of features called 'environments', and Q is a partial probability distribution on E. In subsequent posts, the E part is dropped, so we end up with a generalized model being a pair (F,Q), although we need some auxiliary definitions to construct Q, as we shall see below.

I’m going to break these objects down for you, because part of the conceptual content that Stuart is attempting to capture here is contained in a separate (and much longer) post, and understanding the ingredients will be crucial to deciding what the full structure should be.

Each feature f∈F represents a piece of information about the world, and comes equipped with a set V(f) of possible values that the feature can take. As examples, features could include the temperature of a gas, the colour of an object, the location of an object, and so on.

Stuart defines a 'possible world' to be an assignment of subsets of V(f) to each feature f∈F. He constructs the set of possible worlds by first constructing the disjoint union $\overline{F} := \coprod_{f \in F} V(f)$ and then taking the powerset $W = 2^{\overline{F}}$. Concretely, each element of this set consists of a choice of a set of values for each feature; from the point of view of the generalized model, a world is completely characterized by its features.
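For a finite toy example, the construction can be sketched in Python as follows. The features and their values are hypothetical, and the helper names `powerset` and `as_assignment` are my own; the sketch only illustrates the "disjoint union, then powerset" recipe described above.

```python
from itertools import chain, combinations

# Hypothetical toy features f with their value sets V(f).
V = {
    "colour": {"red", "blue"},
    "switch": {"on", "off"},
}

# Disjoint union of the V(f): tag each value with the feature it belongs to.
F_bar = sorted((f, v) for f, values in V.items() for v in values)


def powerset(items):
    """All subsets of `items`, as frozensets."""
    items = list(items)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]


# The set of possible worlds is the powerset of the disjoint union.
W = powerset(F_bar)


def as_assignment(world):
    """View a world as an assignment of a subset of V(f) to every feature f."""
    return {f: {v for (g, v) in world if g == f} for f in V}


assert len(W) == 2 ** len(F_bar)
```

A world such as `{("colour", "red"), ("switch", "on")}` picks out exactly one value per feature, but `W` also contains worlds in which a feature is assigned several values or none at all.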

Finally, the partial probability distribution Q is intended to capture the model’s “beliefs” (heavy quote marks, because "belief" is a very loaded term whose baggage I want to avoid) about how the world works, in the form of conditional probabilities derived from relationships between features. Stuart appeals to physical theories like the ideal gas laws to illustrate this: if I have some distribution over the values of pressure and volume of a gas, I can derive from these a distribution over the possible temperatures. The distribution is only defined on a subset of the possible worlds for two reasons: one is that the full powerset of an infinite set is too big, so realistic models will only be able to define their distributions on a sensible collection of subsets of worlds; the other is that the relationships determining the distribution may not be universally valid, so it makes sense to allow for conservative models which only make predictions within a limited range of feature values.
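The ideal gas illustration can be sketched for a finite toy distribution. The numbers below are invented, the function name `temperature_distribution` is hypothetical, and I assume units in which nR = 1 so that the law PV = nRT reads T = PV.

```python
from collections import defaultdict
from fractions import Fraction

N_R = 1  # assumption: units chosen so that n*R = 1 in PV = nRT

# Hypothetical joint distribution over (pressure, volume) pairs.
Q_pv = {
    (1, 2): Fraction(1, 4),
    (2, 1): Fraction(1, 4),
    (2, 2): Fraction(1, 2),
}


def temperature_distribution(q_pv):
    """Push the (P, V) distribution forward along T = P*V / (n*R)."""
    q_t = defaultdict(Fraction)
    for (p, v), prob in q_pv.items():
        q_t[Fraction(p * v, N_R)] += prob
    return dict(q_t)


Q_t = temperature_distribution(Q_pv)
assert sum(Q_t.values()) == 1
```

Pairs (P, V) that do not appear in the table are simply assigned no probability at all, which is one crude way of picturing a distribution that is only partially defined.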

It is interesting that the partial probability distribution is an extensional way of capturing the idea of a model. That is, while we might informally think of a model as carrying data such as equations, these are only present implicitly in the partial distribution, and another set of equations which produces the same predictions will produce identical generalized models. The distribution only carries the input-output behaviour of the model, rather than any explicit representation of the model itself. I think this is a wise choice, since any explicit representation would require an artificial choice of syntax and hence some constraints on what types of models could be expressed, which is all baggage that would get in the way of tackling the issues being targeted with this formalism.

## Morphisms: updates?

A morphism (F,Q)→(F′,Q′), in Stuart’s posts, essentially consists of a relation between the respective sets of worlds. The subtlety is how this relation interacts with the partial distributions.

When we compare two world models, we base this comparison on the assumption that they are two models of the same ‘external’ world, which can be described in terms of features from each of the models. This is where the underlying model of a morphism comes from: it’s intended to be a model which carries all of the features of the respective models. Out of the set of all possible worlds for these features, we select a subset describing the compatible worlds. That subset $R \subseteq 2^{\overline{F} \sqcup \overline{F'}} \cong 2^{\overline{F}} \times 2^{\overline{F'}}$ is what we mean by a relation.

The first case that Stuart considers is the easiest case, in which the two generalized models are genuinely compatible models of a common world. In that case, the probability assigned to a given set of worlds in one world model should be no greater than the probability assigned to the set of all compatible worlds in the other model. This appears to cleanly describe extensions of models which do not affect our expectations about the values of the existing features, or world models with independent features.
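For finite toy models this condition can be checked by brute force. The sketch below is one reading of the sentence above, not a transcription of Stuart's definition: worlds are abbreviated to labels, the distributions are total rather than partial, and I check the inequality in both directions, which is my assumption. All names and numbers are hypothetical.

```python
from fractions import Fraction
from itertools import chain, combinations

# Point masses on worlds for two finite toy models.
Q0 = {"cold": Fraction(1, 3), "hot": Fraction(2, 3)}
Q1 = {
    ("cold", "dry"): Fraction(1, 6),
    ("cold", "wet"): Fraction(1, 6),
    ("hot", "dry"): Fraction(1, 2),
    ("hot", "wet"): Fraction(1, 6),
}

# Two worlds are related when they agree on the temperature feature.
R = {(w0, w1) for w0 in Q0 for w1 in Q1 if w1[0] == w0}


def prob(q, event):
    """Probability of a set of worlds under point masses q."""
    return sum((q[w] for w in event), Fraction(0))


def image(rel, event):
    """All worlds of the second model related to some world in `event`."""
    return {w1 for (w0, w1) in rel if w0 in event}


def preimage(rel, event):
    """All worlds of the first model related to some world in `event`."""
    return {w0 for (w0, w1) in rel if w1 in event}


def events(worlds):
    """Every set of worlds (fine for toy sizes only)."""
    worlds = list(worlds)
    return [set(c) for c in chain.from_iterable(
        combinations(worlds, r) for r in range(len(worlds) + 1))]


def is_compatible(q0, q1, rel):
    """Check the inequality for every event, in both directions."""
    forward = all(prob(q0, e) <= prob(q1, image(rel, e)) for e in events(q0))
    backward = all(prob(q1, e) <= prob(q0, preimage(rel, e)) for e in events(q1))
    return forward and backward


assert is_compatible(Q0, Q1, R)
```

Here the second model extends the first with a humidity feature without changing the probabilities of "cold" and "hot", so the check passes; shifting mass between "cold" and "hot" in only one of the models makes it fail.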

But there's a discrepancy here: in the models, the relationships between features are captured by the partial probability distribution, allowing for approximate or probabilistic relationships which are more flexible than strict ones. On the other hand, a relation between sets of possible worlds must determine in a much stricter yes/no sense which feature values in the respective models are compatible. This will typically mean that there are several possible probabilistic relationships which extend this relation of compatibility (some conditions will be formally compatible but very unlikely, say). As such, it is no surprise that when Stuart builds an underlying model for a morphism, the partial distribution it carries is not uniquely defined. A possible fix I would suggest here, which simultaneously resolves the discrepancy, would be to take the morphisms themselves to be partial distributions over the possible worlds for the disjoint union of the features, subject to a condition ensuring that Q and Q′ can be recovered as marginal distributions^{[2]}. This eliminates relations completely from the data of the morphism.
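Both the non-uniqueness and the suggested fix can be illustrated with finite, totally defined distributions; the partially defined case is exactly where the difficulties lie, so this is only a sketch under that simplifying assumption. The worlds, numbers and the helper `marginals` are hypothetical.

```python
from fractions import Fraction

# Two toy models with unrelated features; every pair of worlds is compatible.
Q = {"cold": Fraction(1, 2), "hot": Fraction(1, 2)}
Q_prime = {"dry": Fraction(1, 2), "wet": Fraction(1, 2)}


def marginals(joint):
    """Recover the two marginal distributions from a joint distribution."""
    left, right = {}, {}
    for (w, w_prime), p in joint.items():
        left[w] = left.get(w, Fraction(0)) + p
        right[w_prime] = right.get(w_prime, Fraction(0)) + p
    return left, right


# Candidate morphism 1: the features are independent.
independent = {(w, v): Q[w] * Q_prime[v] for w in Q for v in Q_prime}

# Candidate morphism 2: the features are perfectly correlated.
correlated = {
    ("cold", "wet"): Fraction(1, 2),
    ("hot", "dry"): Fraction(1, 2),
}

# Both joint distributions recover Q and Q' as marginals, yet they differ.
assert marginals(independent) == (Q, Q_prime)
assert marginals(correlated) == (Q, Q_prime)
assert independent != correlated
```

Both joint distributions are supported on the same "everything is compatible" relation, so the relation alone cannot tell them apart, whereas a morphism given as a joint distribution records the difference directly.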

Setting that suggestion aside, we now come to the problem that the above compatibility relation is not the only kind of transformation we might wish to consider. After all, if the distribution represents our belief about the rules governing the world, we also want to be able to update that distribution in order to reflect changing knowledge about the world, even without changing the features or their sets of possible values. This leads Stuart to consider “imperfect morphisms”.
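As a minimal sketch of the kind of update in question, here is a Bayesian conditioning step on a finite toy model. The features, worlds and set of worlds stay fixed and only the distribution changes. The numbers and the function name `condition` are hypothetical.

```python
from fractions import Fraction

# Hypothetical prior over worlds described by (temperature, humidity).
prior = {
    ("cold", "dry"): Fraction(1, 6),
    ("cold", "wet"): Fraction(1, 6),
    ("hot", "dry"): Fraction(1, 2),
    ("hot", "wet"): Fraction(1, 6),
}


def condition(q, evidence):
    """Bayesian update: restrict q to the worlds in `evidence` and renormalise."""
    total = sum((p for w, p in q.items() if w in evidence), Fraction(0))
    if total == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return {w: (p / total if w in evidence else Fraction(0)) for w, p in q.items()}


# Evidence: we learn that the world is wet.
wet = {w for w in prior if w[1] == "wet"}
posterior = condition(prior, wet)

assert sum(posterior.values()) == 1
```

If we relate each world only to itself, the world ("hot", "dry") has prior probability 1/2 but posterior probability 0, so the inequality described earlier for compatible models fails in this example; that is at least consistent with the need for a broader notion of morphism.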

For these, Stuart still keeps a relation between respective sets of possible worlds around. The interpretation of this relation is no longer clear-cut to me, since it makes less sense to consider which feature values are compatible between models which contain a fundamental disagreement about some part of the world. Stuart considers various "Q-consistency conditions" on such relations, corresponding to different ways in which the relation can interact with the respective partial distributions. While it’s interesting to consider these various possibilities, it seems that none of them capture the type of relationship that we actually care about, as is illustrated by Stuart's example of a Bayesian update. Moreover, some finiteness/discreteness conditions need to be imposed in order for some of these conditions to make sense in the first place (consider how "Q-functional" requires one to consider the probability of individual states, which for any non-discrete distribution is not going to be meaningful), which restricts the generality of the models to a degree I find frustrating.

## Conclusions

I think it should be possible to identify a sensible class of morphisms between generalized models which captures the kinds of update we would like to have at our disposal for studying model splintering. I'm also certain that this class has not yet been identified.

Why should anyone go to the trouble of thinking about this? Until we have decided what our morphisms should be, there is very little in the way of category theory that we can apply. Of course, we could try to attack the categories obtained from the various choices of morphism that Stuart presents in his piece on "imperfect morphisms", but without concrete interpretations of what these represent, the value of such an attack is limited.

What could we hope to get out of this formalism in the context of AI Safety? Ultimately, the model splintering research agenda can be boiled down to the question of how morphisms should be constructed in our category of generalized models. Any procedure for converting empirical evidence or data into an update of a model should be expressible in terms of constructions in this category. That means that we can extract guarantees of the efficacy of constructions as theorems about this category. Conversely, any obstacle to the success of a given procedure will be visible in this category (it should contain an abstract version of any pathological example out there), and so we could obtain no-go theorems describing conditions under which a procedure will necessarily fail or cannot be guaranteed to produce a safe update.

More narrowly, the language of category theory provides concepts such as universal properties, which could in this situation capture the optimal solution to a modelling problem (the smallest model verifying some criteria, for example). Functors will allow direct comparison between this category of generalized models and other categories, which will make the structure of generalized models more accessible to tools coming from other areas of maths. This includes getting a better handle on pathological behaviour that can contribute to AI risk.

Once I've had some feedback about the preferred solution to the issues I pointed out in this post, I expect to put together a more technical post examining the category of generalized models with tools from category theory.

^{[1]} Here's a bit more detail, although a footnote is really not a good place to be learning what a category is. Each morphism has a domain (aka source) object and a codomain (or target) object, each object has an identity morphism (with domain and codomain that object), and a pair of morphisms in which the codomain of the first coincides with the domain of the second can be composed to produce a morphism from the domain of the first to the codomain of the second. This composition operation is required to be associative and have the aforementioned identity morphisms as units on either side (composing with an identity morphism does nothing).

^{[2]} I do not want to give a misleading impression that this solution is clear-cut, since obtaining marginal distributions requires integrating out variables, which is not going to be generally possible for a distribution/measure which is only partially defined. But I think this could be a guide towards a formal solution.
