Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post has been mainly superseded by this one.

Introduction

This post aims to formalise the models and model changes/model splintering described in this post. As explained there, the idea is to have a meta-model sufficiently general to be able to directly capture the process of moving from one imperfect model to another.

A note on infinity

For simplicity of exposition, I'll not talk about issues of infinite sets, continuity, convergence, etc. Just assume that any infinite set that comes up is actually a finite set, large enough for whatever practical purpose we need it for.

Features, worlds, environments

A model M is defined by three objects: the set F of features, the set E of environments, and a probability distribution Q. We'll define the first two in this section.

Features

Features are things that might be true or not about worlds, or might take certain values in worlds. For example, "the universe is open" is a possible feature about our universe; "the temperature is 250K" is another possible feature, but instead of returning true or false, it returns the temperature value. Adding more details, such as "the temperature is 250K in room 3, at 12:01", shows that features should also be able to take inputs: features are functions.

But what about "the frequency of white light"? That's something that makes sense in many models - white light is used extensively in many contexts, and light has a frequency. The problem with that statement is that light has multiple frequencies; so we should allow features to be, at least in some cases, multivalued functions.

To top that off, sometimes there will be no correct value for a function; "the height of white light" is something that doesn't mean anything. So features have to include partial functions as well.

Fortunately, multivalued and partial functions are even simpler than functions at the formal level: they are just relations. And since the sets in the relations can consist of a single element, in even more generality, a feature is a predicate on a set. We just need to know which set.

So, formally, a feature F∈F consists of an (implicit) label defining what F is (eg "open universe", "temperature in some location") and a set on which it is a predicate. Thus, for example, the features above could be:

Fopen universe={0} (features that are simply true or false are predicates on a one-element set).

Ftemperature=R+.

Ftemperature at location and time=L×T×R+, for some set L of locations and T of possible times.

Ffrequency of specific light=R+.

Fheight of object=O×R+, for O a set of objects.

Note that these definitions are purely syntactic, not semantic: they don't have any meaning yet. Indeed, as sets, Ftemperature and Ffrequency of specific light are identical. Note also that there are multiple ways of defining the same things; instead of a single feature Ftemperature at location and time, we could have a whole collection of features Ftemperature at l and t for all (l,t)∈L×T.
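As a concrete sketch, these definitions can be mocked up in a few lines of code (a toy encoding with finite stand-ins for R+, in line with the note on infinity; all names and values are illustrative, not part of the formalism):

```python
# A feature is just a label plus the set it is a predicate on.
# Finite stand-in sets, per the post's "note on infinity".

# A feature that is simply true or false: a predicate on a one-element set.
f_open_universe = ("open universe", {0})

# Temperature: a predicate on a finite stand-in for R+.
temps = {250, 300, 350}
f_temperature = ("temperature", temps)

# Temperature at a location and time: a predicate on L x T x R+.
locations = {"room 3"}
times = {"12:01"}
f_temp_loc_time = ("temperature at location and time",
                   {(l, t, k) for l in locations for t in times for k in temps})

# Purely syntactic: two differently-labelled features can be the same set.
f_frequency = ("frequency of specific light", temps)
assert f_temperature[1] == f_frequency[1]
```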

Worlds

In Abram's orthodox case against utility functions, he talks about the Jeffrey-Bolker axioms, which allow the construction of preferences from events, without needing full worlds at all.

Similarly, this formalism is not focused on worlds, but it can be useful to define the full set of worlds for a model. This is simply the set of possible values that all the features could conceivably take; so, if F̄=⊔F F is the disjoint union of all the features in F (seen as sets), the set of worlds W is just W=2^F̄, the powerset of F̄; equivalently, the set of all functions from F̄ to {True,False}.

So W just consists of all things that could be conceivably distinguished by the features. If we need more discrimination than this - just add more features.
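A minimal sketch of this construction, with a finite toy version of the disjoint union F-bar (labels and values are illustrative):

```python
from itertools import combinations

# Finite toy F-bar: the disjoint union of the feature sets, with each
# element tagged by its feature label. Labels and values are invented.
f_bar = [("open universe", 0), ("temperature", 250), ("temperature", 300)]

# W = powerset of F-bar: each world is the set of feature-values it
# marks True (equivalently, a function from F-bar to {True, False}).
worlds = [frozenset(c) for r in range(len(f_bar) + 1)
          for c in combinations(f_bar, r)]

# |W| = 2^|F-bar|.
assert len(worlds) == 2 ** len(f_bar)
```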

Environments

The set of environments is a subset E of W, the set of worlds (though it need not be defined via W; it's a set of functions from F̄ to {True,False}).

Though this definition is still syntactic, it starts putting some restrictions on what the semantics could possibly be, in the spirit of this post.

For example, E could restrict to situations where Ftemperature is a single-valued function, while Ffrequency of specific light is allowed to be multivalued. And similarly, Fheight of object could take no defined values on anything in the domain of Ffrequency of specific light.
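As a toy sketch of such a restriction (illustrative labels and a finite stand-in for the feature sets):

```python
from itertools import combinations

# Toy F-bar with one truth-valued feature and a two-valued "temperature".
f_bar = [("open universe", 0), ("temperature", 250), ("temperature", 300)]
worlds = [frozenset(c) for r in range(len(f_bar) + 1)
          for c in combinations(f_bar, r)]

# Environments: restrict to worlds where temperature is single-valued.
def temperature_values(w):
    return [x for x in w if x[0] == "temperature"]

environments = [w for w in worlds if len(temperature_values(w)) == 1]

# Of the 8 worlds, exactly 4 assign a single temperature value.
assert len(environments) == 4
```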

Probability

The simplest way of defining Q is as a probability distribution over E.

This means that, if E1 and E2 are subsets of E, we can define the conditional probability

Q(E1∣E2)=Q(E1∩E2)/Q(E2).

Once we have such a probability distribution, then, if the set of features is rich enough, this puts a lot more restrictions on the meaning that these features could have, going a lot of the way towards semantics. For example, if Q captures the ideal gas laws, then there is a specific relation between temperature, pressure, volume, and amount of substance - whatever those features are labelled.

In general, we'd want Q to be expressible in a simple way from the set F of features; that's the point of having those features in the first place.
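A minimal sketch of conditioning a toy Q over a handful of environments (the names and probabilities are invented for illustration):

```python
from fractions import Fraction

# Toy distribution Q over four environments.
Q = {"cold, open": Fraction(1, 4), "cold, closed": Fraction(1, 4),
     "hot, open": Fraction(3, 8), "hot, closed": Fraction(1, 8)}

def prob(event):
    # Probability of a subset of E.
    return sum(Q[e] for e in event)

def cond(e1, e2):
    # Q(E1 | E2) = Q(E1 intersect E2) / Q(E2)
    return prob(e1 & e2) / prob(e2)

hot = {"hot, open", "hot, closed"}
open_universe = {"cold, open", "hot, open"}
assert cond(hot, open_universe) == Fraction(3, 5)
```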

Broader definition of "probability"

The plan for this meta-formalism is to allow transition from imperfect models to other imperfect models. So requiring that they have a probability distribution over all of E may be too much to ask.

In practice, all that is needed is expressions of the type Q(E1∣E2). And these may not be needed for all E1, E2. For example, to go back to the ideal gas laws, it makes perfect sense that we can deduce temperature from the other three features. But what if E2 just fixed the volume - can we deduce the pressure from that?

With Q as a prior over E, we can, by getting the pressure and amount of substance from the prior. But many models don't include such priors, and there's no reason to require that they do.

So, in the more general case, instead of E⊂W, define E⊂2^W×2^W, so that, for all (E1,E2)∈E, the following probability is defined:

Q(E1∣E2).

To ensure consistency, we can require Q to follow axioms similar to those for two-valued probabilities in appendix *iv of Popper's "The Logic of Scientific Discovery".

In full generality, we might need an even more general or imperfect definition of Q; see this post for a definition of "partial" probability distributions. But I'll leave this aside for the moment, and assume the simpler case where Q is a distribution over E.
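One way to sketch such a partial Q is as a lookup table over just the (E1,E2) pairs on which it is defined (a toy illustration; the pairs and values are invented):

```python
# A "partial" Q: conditional probabilities stored only for selected
# (E1, E2) pairs, rather than derived from a full prior over E.
Q_pairs = {
    (frozenset({"low pressure"}),
     frozenset({"low pressure", "high pressure"})): 0.5,
}

def q_cond(e1, e2):
    # Q(E1 | E2), defined only where the table has an entry.
    key = (frozenset(e1), frozenset(e2))
    if key not in Q_pairs:
        raise ValueError("Q(E1 | E2) is not defined for this pair")
    return Q_pairs[key]

assert q_cond({"low pressure"}, {"low pressure", "high pressure"}) == 0.5
```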

Refinement

Here we'll look at how one can improve a model. Obviously, one can get a better Q, or a more expansive E, or a combination of these. Now, we haven't talked much about the quality of Q, and we'll leave this underdefined. Say that Q∗⪰Q means that Q∗ is 'at least as good as Q'. The 'at least as good' is specified by some mix of accuracy and simplicity.

A more expansive E means that the environment set of the improved model can be bigger. But in order for something to be "bigger", we need some identification between the two environment sets (which, so far, have just been defined as subsets of the powerset of feature values).

So, let M=(F,E,Q) and M∗=(F∗,E∗,Q∗) be models, let E∗0 be a subset of E∗, and let q be a surjective map from E∗0 to E (for an e∈E, think of q−1(e)⊂E∗0, the preimage of q, as the set of all environments in E∗ that correspond to e).

We can define Q∗0 on E in the following manner: if E1 and E2 are subsets of E, define

Q∗0(E1∣E2)=Q∗(q−1(E1)∣q−1(E2)).

Then q defines M∗ as a refinement of M if:

Q∗0⪰Q.
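The construction of Q∗0 can be sketched with a toy q and Q∗ (all names and numbers are illustrative): events over E are pulled back through q, and Q∗ is evaluated on the preimages.

```python
from fractions import Fraction

# q maps refined environments E*_0 down onto the coarse environments E.
q = {"hot a": "hot", "hot b": "hot", "cold a": "cold", "cold b": "cold"}

# Q*: a toy distribution over the refined environments.
Q_star = {"hot a": Fraction(1, 8), "hot b": Fraction(3, 8),
          "cold a": Fraction(1, 4), "cold b": Fraction(1, 4)}

def preimage(event):
    # q^{-1}(event): all refined environments mapping into the event.
    return {e for e, image in q.items() if image in event}

def prob(dist, event):
    return sum(dist[e] for e in event)

def Q_star_0(e1, e2):
    # Q*_0(E1 | E2) = Q*(q^{-1}(E1) | q^{-1}(E2))
    p1, p2 = preimage(e1), preimage(e2)
    return prob(Q_star, p1 & p2) / prob(Q_star, p2)

assert Q_star_0({"hot"}, {"hot", "cold"}) == Fraction(1, 2)
```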

Refinement examples

Here are some examples of different types of refinements:

Q-improvement: F=F∗, E=E∗, Q∗⪰Q (eg using the sine of the angle rather than the angle itself for refraction).

Environment extension: F=F∗, E⊊E∗, E∗0=E with q the identity, Q∗=Q on E (eg moving from a training environment to a more extensive test environment).

Natural extension: environment extension where Q is simply defined in terms of F on E, and this extends to Q∗ on E∗ (eg extending Newtonian mechanics from the Earth to the whole of the solar system).

Non-independent feature extension: F⊊F∗. Let πF be the map that takes an element of W∗ and maps it to W by restricting^{[1]} to features in F. Then πF=q on E∗0, and Q∗0=Q (eg adding electromagnetism to Newtonian mechanics).

Independent feature extension: as a non-independent feature extension, but E∗0=E∗, and the stronger condition for Q∗ that Q(E1∣E2)=Q∗(q−1(E1)∣E∗2) for any E∗2 with q(E∗2)=E2 (eg non-colliding planets modelled without rotation, changing to modelling them with (mild) rotation).

Feature refinement: F⊊F∗ (eg moving from the ideal gas model to the van der Waals equation).

Feature splintering: when there is no single natural projection E∗→E that extends q (eg Blegg and Rube generalisation, happiness and human smiles coming apart, inertial mass in general relativity projected to Newtonian mechanics...)

Reward function splintering: no single natural extension of the reward function on E from E′=q−1(E) to all of E∗ (any situation where a reward function, seen as a feature, splinters).
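The projection πF used in the feature-extension examples above can be sketched directly (illustrative features and worlds):

```python
# pi_F: project a world over F* down to W by keeping only the values
# of the old features F. Feature names and values are invented.
old_features = {"temperature"}

def project(world_star):
    # world_star is a set of (feature, value) pairs over F*.
    return frozenset(x for x in world_star if x[0] in old_features)

# A refined world mentioning a new feature ("charge") projects down to
# a world mentioning only the old features.
w_star = frozenset({("temperature", 300), ("charge", 1)})
assert project(w_star) == frozenset({("temperature", 300)})
```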

Reward function: refactoring and splintering

Reward function refactoring

Let M∗=(F∗,E∗,Q∗) be a refinement of M=(F,E,Q) (via q), and let R be a reward function defined on E.

A refactoring of R on M∗ is a reward function R∗ on E∗ such that, for all e∗∈E∗, R∗(e∗)=R(q(e∗)). A natural refactoring is a refactoring that satisfies some naturalness or simplicity properties. For example, if R is the momentum of an object in M, and if momentum still makes sense in M∗, then the momentum in M∗ should be a natural refactoring of R.
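The refactoring condition can be checked mechanically on a toy model (the environments, q, and reward values are invented for illustration):

```python
# Check the refactoring condition R*(e*) = R(q(e*)) on every refined
# environment in the domain of q.
q = {"hot a": "hot", "hot b": "hot", "cold a": "cold"}
R = {"hot": 1.0, "cold": 0.0}

def is_refactoring(R_star):
    return all(R_star[e_star] == R[q[e_star]] for e_star in q)

# Agrees with R through q everywhere: a refactoring.
assert is_refactoring({"hot a": 1.0, "hot b": 1.0, "cold a": 0.0})
# Disagrees on "hot b": not a refactoring.
assert not is_refactoring({"hot a": 1.0, "hot b": 0.5, "cold a": 0.0})
```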

Reward function splintering

If there does not exist a unique natural refactoring of R on M∗, then the refinement from M to M∗ splinters R.

Feature splintering

Let R be the indicator function for a feature being equal to some element or in some range. If R splinters in a refinement, then so does that feature.

Note that W∗ is the set of all functions from F̄∗ to {True,False}. Since F⊂F∗, the disjoint union F̄ of the features in F is a subset of the disjoint union F̄∗ of the features in F∗. Then we can project from W∗ to W by restricting a function to its values on F̄. ↩︎
