Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Post written with Rebecca Gorman.

We've written before that model splintering, as we called it then, was a problem with almost all AI safety approaches.

There's a converse to this: solving the problem would help with almost all AI safety approaches. But so far, we've been posting mainly about value extrapolation. In this post, we'll start looking at how other AI safety approaches could be helped.


To clarify, let's make four definitions, distinguishing ideas that we'd previously been grouping together:

Model splintering is when the features and concepts that are valid in one world-model break down when transitioning to another world-model.

Value splintering (or reward splintering) is when the value function (or reward function, or goal, or preference...) becomes invalid due to model splintering.

Concept extrapolation is extrapolating a feature or concept from one world-model to another.

Value extrapolation is concept extrapolation when the particular concept to extrapolate is a value, a preference, a reward function, an agent's goal, or something of that nature.

Thus concept extrapolation is a solution to model splintering, while value extrapolation is a solution to value splintering specifically.


Consider, for example, Turner et al.'s attainable utility. It has a formal definition, but the reason for that definition is that preserving attainable utility is aimed at restricting the "power" of the agent, or at minimising its "side effects".

And it succeeds in typical situations. If you measure the attainable utility of an agent, this will give you an idea of its power and of how many side effects it may be causing. However, when we move to general situations, this breaks down: attainable utility preservation no longer restricts power or reduces side effects. So the concepts of power and side effects have splintered when moving from typical situations to general situations. This is model splintering[1]. If we solve concept extrapolation for this, then we could extend the concepts of power restriction or side effect minimisation to general situations, and thus successfully create low-impact AIs.

Another example is wireheading. We have a reward signal that corresponds to something we desire in the world; maybe the negative of the CO2 concentration in the atmosphere. This is measured by, say, a series of detectors spread over the Earth's surface.

Typically, the reward signal does correspond to what we want. But if the AI hacks its own reward signal, that correspondence breaks down[2]: model splintering. If we can extend the reward properly to new situations, we get concept extrapolation - which, since this is a reward function, is value extrapolation.
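To make the breakdown concrete, here is a minimal toy sketch of the detector setup. Everything in it is an illustrative assumption rather than part of any real system: the function names, the number of detectors, the noise level, and the choice to model "hacking" as forcing every detector to read zero. It only shows that the reward signal tracks the desired quantity in typical situations and stops doing so once the signal is tampered with.

```python
import random


def true_reward(concentration):
    """What we actually care about: a lower concentration is better."""
    return -concentration


def reward_signal(concentration, hacked, n_detectors=100, noise=1.0, seed=0):
    """Reward as the agent receives it: the negative mean detector reading.

    Assumption for this sketch: a hacked detector reports 0.0 regardless
    of the real atmosphere.
    """
    rng = random.Random(seed)
    if hacked:
        readings = [0.0] * n_detectors
    else:
        readings = [concentration + rng.gauss(0, noise) for _ in range(n_detectors)]
    return -sum(readings) / n_detectors


def signal_error(concentration, hacked):
    """Gap between the received signal and the reward we intended."""
    return abs(reward_signal(concentration, hacked) - true_reward(concentration))


for level in (300.0, 400.0, 500.0):
    typical = signal_error(level, hacked=False)
    tampered = signal_error(level, hacked=True)
    print(f"concentration={level}: typical error={typical:.2f}, hacked error={tampered:.2f}")
```

In the typical case the error stays small at every concentration level, so optimising the signal and optimising the intended reward amount to the same thing. In the hacked case the signal sits at its maximum whatever the atmosphere does, and the error grows with the real concentration. Extending the reward properly to the hacked situation would mean recovering something like `true_reward` there, which this sketch simply assumes is available.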

Helping with multiple methods

Hence the concept extrapolation/value extrapolation ideas can help with many different approaches to AI safety, not just the value learning approaches.

  1. Equivalently, we could say that the concepts remain the same, and that it's the correlation between "attainable utility preservation" and "power restriction" that breaks down. ↩︎

  2. There are multiple ways we can see the concepts breaking down. We can see the concept of "measured CO2" breaking down. We can see the correlation between CO2 concentration and the reward breaking down. We can see the correlation between the reward and the reward signal breaking down. The reason there are so many ways of seeing the breakdown is that most descriptive labels describe collections of correlated features, rather than fundamental concepts. So the descriptions/features/concepts break down when the correlations do. ↩︎

1 comment

I appreciate making these notions more precise. Model splintering seems closely related to other popular notions in ML, particularly underspecification ("many predictors f that a pipeline could return with similar predictive risk"), the Rashomon effect ("many different explanations exist for the same phenomenon"), and predictive multiplicity ("the ability of a prediction problem to admit competing models with conflicting predictions"), as well as more general notions of generalizability and out-of-sample or out-of-domain performance. I'd be curious what exactly makes model splintering different. Some example questions: Is the difference just the alignment context? Is it that "splintering" refers specifically to features and concepts within the model failing to generalize, rather than the model as a whole failing to generalize? If so, what does it even mean for the model as a whole to fail to generalize but not features failing to generalize? Is it that the aggregation of features is not a feature? And how are features and concepts different from each other, if they are?