Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post has been mainly superseded by this one.


This post aims to formalise the models and model changes/model splintering described in this post. As explained there, the idea is to have a meta-model sufficiently general to be able to directly capture the process of moving from one imperfect model to another.

A note on infinity

For simplicity of exposition, I'll not talk about issues of infinite sets, continuity, convergence, etc... Just assume that any infinite set that comes up is just a finite set, large enough for whatever practical purpose we need it for.

Features, worlds, environments

A model $M$ is defined by three objects: the set $\mathcal{F}$ of features, the set $E$ of environments, and a probability distribution $Q$. We'll define the first two in this section.


Features are things that might be true or not about worlds, or might take certain values in worlds. For example, "the universe is open" is a possible feature about our universe; "the temperature is $x$" is another possible feature, but instead of returning true or false, it returns the temperature value. Adding more details, such as "the temperature is $x$ in room 3, at 12:01", shows that features should also be able to take inputs: features are functions.

But what about "the frequency of white light"? That's something that makes sense in many models - white light is used extensively in many contexts, and light has a frequency. The problem with that statement is that light has multiple frequencies; so we should allow features to be, at least in some cases, multivalued functions.

To top that off, sometimes there will be no correct value for a function; "the height of white light" is something that doesn't mean anything. So features have to include partial functions as well.

Fortunately, multivalued and partial functions are even simpler than functions at the formal level: they are just relations. And since the sets in the relations can consist of a single element, in even more generality, a feature is a predicate on a set. We just need to know which set.

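So, formally, a feature $F$ consists of an (implicit) label defining what $F$ is (eg "open universe", "temperature in some location") and a set on which it is a predicate. Thus, for example, the features above could be:

  1. $F_{\text{open universe}} = \{*\}$ (features which are simply true or false are functions of a single element).
  2. $F_{\text{temperature}} = \mathbb{R}^+$.
  3. $F_{\text{temperature in location at time}} = L \times T \times \mathbb{R}^+$, for some set $L$ of locations and $T$ of possible times.
  4. $F_{\text{frequency of white light}} = \mathbb{R}^+$.
  5. $F_{\text{height of}} = O \times \mathbb{R}^+$ for $O$ a set of objects.

Note these definitions are purely syntactic, not semantic: they don't have any meaning. Indeed, as sets, $F_{\text{temperature}}$ and $F_{\text{frequency of white light}}$ are identical. Note also that there are multiple ways of defining the same things; instead of a single feature $F_{\text{temperature in location at time}}$, we could have a whole collection of $F_{\text{temperature in } l \text{ at } t}$ for all $(l, t) \in L \times T$.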
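As a rough illustration of "a feature is a predicate on a set", here is a minimal Python sketch. The finite sets `LOCATIONS`, `TIMES` and `TEMPERATURES`, and the names `temperature_holds` and `values`, are assumptions made for the example (finite stand-ins, in the spirit of the note on infinity); they are not part of the formalism itself.

```python
from itertools import product

# Hypothetical finite stand-ins for the sets of locations, times and values.
LOCATIONS = ["room 3", "room 4"]
TIMES = ["12:00", "12:01"]
TEMPERATURES = [280, 290, 300]  # finite stand-in for a continuous range

# Syntactic part: a label plus the set on which the feature is a predicate.
features = {
    "open universe": {"*"},
    "temperature in location at time": set(product(LOCATIONS, TIMES, TEMPERATURES)),
}

# One possible predicate on that set: the triples for which the feature holds.
temperature_holds = {
    ("room 3", "12:01", 290),   # single-valued here
    ("room 4", "12:00", 280),   # multivalued here ...
    ("room 4", "12:00", 300),
    # ... and no triple at all for ("room 4", "12:01"): the partial case
}

def values(relation, location, time):
    """Return every value the relation assigns to (location, time)."""
    return {v for (l, t, v) in relation if (l, t) == (location, time)}
```

Calling `values(temperature_holds, "room 4", "12:00")` returns two values, while `values(temperature_holds, "room 4", "12:01")` returns the empty set, so the same relation covers the single-valued, multivalued and partial cases.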


In Abram's "orthodox case against utility functions", he talks about the Jeffrey-Bolker axioms, which allow the construction of preferences from events without needing full worlds at all.

Similarly, this formalism is not focused on worlds, but it can be useful to define the full set of worlds for a model. This is simply the possible values that all features could conceivably take; so, if $\overline{F} = \bigsqcup_{F \in \mathcal{F}} F$ is the disjoint union of all features in $\mathcal{F}$ (seen as sets), the set of worlds $W$ is just $2^{\overline{F}}$, the powerset of $\overline{F}$ - equivalently, the set of all functions from $\overline{F}$ to $\{\text{True}, \text{False}\}$.

So $W$ just consists of all things that could be conceivably distinguished by the features. If we need more discrimination than this - just add more features.
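The construction of the set of worlds can be sketched for tiny finite features. This is only an illustration: the helper names `disjoint_union` and `powerset` and the toy feature sets are assumptions, and the disjoint union is modelled by tagging each element with its feature label so that features which are identical as sets stay distinct.

```python
from itertools import chain, combinations

def disjoint_union(features):
    """Tag every element with its feature label to keep features apart."""
    return {(label, x) for label, elements in features.items() for x in elements}

def powerset(elements):
    """All subsets of a finite collection, as frozensets."""
    items = list(elements)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

# Toy features; the last two are identical as sets but carry different labels.
features = {
    "open universe": {"*"},
    "temperature": {1, 2},
    "frequency of white light": {1, 2},
}

F_bar = disjoint_union(features)   # 1 + 2 + 2 = 5 tagged elements
worlds = powerset(F_bar)           # one world per subset of F_bar
```

Each world is a subset of the tagged elements, which is the same thing as a true/false assignment to every element; the number of worlds doubles with each element added, which is why the sketch keeps the features tiny.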


The set of environments is a subset $E$ of $W$, the set of worlds (though it need not be defined via $W$; it's a set of functions from $\overline{F}$ to $\{\text{True}, \text{False}\}$).

Though this definition is still syntactic, it starts putting some restrictions on what the semantics could possibly be, in the spirit of this post.

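For example, $E$ could restrict to situations where $F_{\text{temperature}}$ is a single-valued function, while $F_{\text{frequency of white light}}$ is allowed to be multivalued. And similarly, $F_{\text{height of}}$ takes no defined values on anything in the domain of $F_{\text{frequency of white light}}$.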


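The simplest way of defining $Q$ is as a probability distribution over $E$.

This means that, if $E_1$ and $E_2$ are subsets of $E$, we can define the conditional probability (whenever $Q(E_2) > 0$)

$$Q(E_1 \mid E_2) = \frac{Q(E_1 \cap E_2)}{Q(E_2)}.$$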

Once we have such a probability distribution, then, if the set of features is rich enough, this puts a lot more restrictions on the meaning that these features could have, going a lot of the way towards semantics. For example, if $Q$ captures the ideal gas laws, then there is a specific relation between temperature, pressure, volume, and amount of substance - whatever those features are labelled.

In general, we'd want $Q$ to be expressible in a simple way from the set of features; that's the point of having those features in the first place.
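For a finite set of environments, the conditional probability of one subset of environments given another is a direct computation. The sketch below is illustrative only: the environment names `e1`, `e2`, `e3`, their weights, and the function names `prob` and `conditional` are all assumptions.

```python
from fractions import Fraction

def prob(Q, event):
    """Total probability of a set of environments under Q."""
    return sum(Q[e] for e in event)

def conditional(Q, E1, E2):
    """Probability of E1 given E2; undefined when E2 has probability zero."""
    denominator = prob(Q, E2)
    if denominator == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(Q, E1 & E2) / denominator

# Hypothetical distribution over three environments.
Q = {"e1": Fraction(1, 2), "e2": Fraction(1, 4), "e3": Fraction(1, 4)}

result = conditional(Q, {"e1"}, {"e1", "e2"})  # (1/2) / (3/4) = 2/3
```

Exact fractions are used instead of floats so that equalities between conditional probabilities can be checked without rounding error.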

Broader definition of "probability"

The plan for this meta-formalism is to allow transition from imperfect models to other imperfect models. So requiring that they have a probability distribution over all of $E$ may be too much to ask.

In practice, all that is needed is expressions of the type $Q(E_1 \mid E_2)$. And these may not be needed for all $E_1$, $E_2$. For example, to go back to the ideal gas laws, it makes perfect sense that we can deduce temperature from the other three features. But what if we just fix the volume - can we deduce the pressure from that?

With $Q$ as a prior over $E$, we can, by getting the pressure and amount of substance from the prior. But many models don't include these priors, and there's no reason to avoid those.

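So, in the more general case, instead of a distribution over all of $E$, define a set $\mathcal{E}_Q \subset 2^E \times 2^E$ of pairs of subsets of $E$, so that, for all $(E_1, E_2) \in \mathcal{E}_Q$, the following probability is defined:

$$Q(E_1 \mid E_2).$$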

To ensure consistency, we can require $Q$ to follow axioms similar to the two-valued probabilities of appendix *IV in Popper's "Logic of Scientific Discovery".

In full generality, we might need an even more general or imperfect definition of $Q$; see this post for a definition of "partial" probability distributions. But I'll leave this aside for the moment, and assume the simpler case where $Q$ is a distribution over $E$.


Here we'll look at how one can improve a model. Obviously, one can get a better $Q$, or a more expansive $E$, or a combination of these. Now, we haven't talked much about the quality of $Q$, and we'll leave this underdefined. Say that $Q^* \geq Q$ means that $Q^*$ is 'at least as good as' $Q$. The 'at least as good' is specified by some mix of accuracy and simplicity.

More expansive $E$ means that the environment of the improvement can be bigger. But in order for something to be "bigger", we need some identification between the two environments (which, so far, have just been defined as subsets of the powerset of feature values).

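So, let $M = \{\mathcal{F}, E, Q\}$ and $M^* = \{\mathcal{F}^*, E^*, Q^*\}$ be models, let $E^*_0$ be a subset of $E^*$, and let $q$ be a surjective map from $E^*_0$ to $E$ (for an $e \in E$, think of $q^{-1}(e)$, the preimage of $e$, as the set of all environments in $E^*$ that correspond to $e$).

We can define $Q^*_0$ on $E$ in the following manner: if $E_1$ and $E_2$ are subsets of $E$, define

$$Q^*_0(E_1 \mid E_2) = Q^*\left(q^{-1}(E_1) \mid q^{-1}(E_2)\right).$$

Then $q$ defines $M^*$ as a refinement of $M$ if:

  • $Q^*_0 \geq Q$.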
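The pulled-back conditional probability, obtained by conditioning in the larger model on preimages under the surjective map between environments, can be sketched as follows. Everything concrete here is an assumption for illustration: the environment names, the weights, and the function names `preimage` and `pulled_back_conditional`. Since 'at least as good as' is left underdefined, the sketch only computes the pulled-back probability and does not test the refinement condition itself.

```python
from fractions import Fraction

def prob(Q, event):
    return sum(Q[e] for e in event)

def conditional(Q, A, B):
    denominator = prob(Q, B)
    if denominator == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(Q, A & B) / denominator

def preimage(q, subset):
    """Environments of the larger model that q sends into `subset`."""
    return {e_star for e_star, e in q.items() if e in subset}

def pulled_back_conditional(Q_star, q, E1, E2):
    """Conditional probability on the coarse model, computed in the fine one."""
    return conditional(Q_star, preimage(q, E1), preimage(q, E2))

# Hypothetical coarse environments and a finer model with one extra,
# unmapped environment (so the domain of q is a strict subset).
E = {"hot", "cold"}
q = {"hot-dry": "hot", "hot-humid": "hot", "cold-dry": "cold", "cold-humid": "cold"}
Q_star = {
    "hot-dry": Fraction(1, 4), "hot-humid": Fraction(1, 4),
    "cold-dry": Fraction(1, 8), "cold-humid": Fraction(1, 8),
    "unclassified": Fraction(1, 4),
}

p_hot = pulled_back_conditional(Q_star, q, {"hot"}, E)  # (1/2) / (3/4) = 2/3
```

The dictionary `q` is defined only on the four mapped environments, which plays the role of the subset on which the map is defined; `"unclassified"` carries probability in the finer model but corresponds to nothing in the coarse one.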

Refinement examples

Here are some examples of different types of refinements:

  1. $Q$-improvement: $\mathcal{F} = \mathcal{F}^*$, $E = E^*$, $Q^* \geq Q$ (eg using the sine of the angle rather than the angle itself for refraction).
  2. Environment extension: $\mathcal{F} = \mathcal{F}^*$, $E \subset E^*$, with $q$ the identity, $Q^* = Q$ on $E$ (eg moving from a training environment to a more extensive test environment).
  3. Natural extension: environment extension where $Q$ is simply defined in terms of $\mathcal{F}$ on $E$, and this extends to $Q^*$ on $E^*$ (eg extending Newtonian mechanics from the Earth to the whole of the solar system).
  4. Non-independent feature extension: $\mathcal{F} \subset \mathcal{F}^*$. Let $\pi_{\mathcal{F}}$ be the map that takes an element of $W^*$ and maps it to $W$ by restricting[1] to features in $\mathcal{F}$. Then $q = \pi_{\mathcal{F}}$ on $E^*$, and $Q^*_0 = Q$ (eg adding electromagnetism to Newtonian mechanics).
  5. Independent feature extension: as a non-independent feature extension, but , and the stronger condition for that for any with (eg non-colliding planets modelled without rotation, changing to modelling them with (mild) rotation).
  6. Feature refinement: (moving from the ideal gas models to the van der Waals equation).
  7. Feature splintering: when there is no single natural projection $E^* \to E$ that extends $q$ (eg Blegg and Rube generalisation, happiness and human smile coming apart, inertial mass in general relativity projected to Newtonian mechanics...)
  8. Reward function splintering: no single natural extension of the reward function $R$ on $E$ from $E^*_0$ to all of $E^*$ (any situation where a reward function, seen as a feature, splinters).

Reward function: refactoring and splintering

Reward function refactoring

Let $M^*$ be a refinement of $M$ (via $q$), and let $R$ be a reward function defined on $E$.

A refactoring of $R$ on $E^*$ is a reward function $R^*$ on $E^*$ such that for all $e^* \in E^*_0$, $R^*(e^*) = R(q(e^*))$. A natural refactoring is a refactoring that satisfies some naturalness or simplicity properties. For example, if $R$ is the momentum of an object in $M$, and if momentum still makes sense in $M^*$, then this should be a natural refactoring.

Reward function splintering

If there does not exist a unique natural refactoring of $R$ on $E^*$, then the refinement from $M$ to $M^*$ splinters $R$.
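The refactoring condition (the new reward function must agree with the old one, via the map between environments, wherever that map is defined) is easy to check in a finite toy case. The sketch below is an assumption-laden illustration: the environments, the rewards, and the name `is_refactoring` are made up, and "naturalness" is not modelled at all, so the code can show that several refactorings exist but cannot say which, if any, is natural.

```python
def is_refactoring(R_star, R, q):
    """True if R_star agrees with R composed with q on the domain of q."""
    return all(R_star[e_star] == R[e] for e_star, e in q.items())

# Hypothetical coarse reward and map from the finer environments.
R = {"hot": 1, "cold": 0}
q = {"hot-dry": "hot", "hot-humid": "hot", "cold-dry": "cold", "cold-humid": "cold"}

# Two candidate rewards on the finer model; they differ only on the
# environment that q does not cover.
base = {"hot-dry": 1, "hot-humid": 1, "cold-dry": 0, "cold-humid": 0}
R_star_a = {**base, "unclassified": 0}
R_star_b = {**base, "unclassified": 1}

# A candidate that disagrees with R on a covered environment.
R_star_bad = {**base, "hot-dry": 0, "unclassified": 0}
```

Both `R_star_a` and `R_star_b` pass the check while disagreeing on `"unclassified"`. If neither is singled out by some naturalness or simplicity criterion, this is the kind of situation the splintering definition describes.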

Feature splintering

Let $R$ be the indicator function for a feature being equal to some element or in some range. If $R$ splinters in a refinement, then so does that feature.

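  1. Note that $W^*$ is the set of all functions from $\overline{F^*}$ to $\{\text{True}, \text{False}\}$. Since $\mathcal{F} \subset \mathcal{F}^*$, $\overline{F} \subset \overline{F^*}$. Then we can project from $W^*$ to $W$ by restricting a function to its values on $\overline{F}$. ↩︎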
