Post written with Rebecca Gorman.
We've written before that model splintering, as we called it then, was a problem with almost all AI safety approaches.
There's a converse to this: solving the problem would help with almost all AI safety approaches. But so far, we've been posting mainly about value extrapolation. In this post, we'll start looking at how other AI safety approaches could be helped.
To clarify, let's make four definitions, distinguishing ideas that we'd previously been grouping together:
Model splintering is when the features and concepts that are valid in one world-model, break down when transitioning to another world-model.
Value splintering (or reward splintering) is when the value function (or reward function, or goal, or preference...) becomes invalid due to model splintering.
Concept extrapolation is extrapolating a feature or concept from one world-model to another.
Value extrapolation is concept extrapolation when the particular concept to extrapolate is a value, a preference, a reward function, an agent's goal, or something of that nature.
Thus concept extrapolation is a solution to model splintering, while value extrapolation is a solution to value splintering specifically.
Consider for example Turner et al's attainable utility. It has a formal definition, but the reason for that definition is that preserving attainable utility is aimed at restricting the "power" of the agent, or at minimising its "side effects".
And it succeeds, in the typical situation. If you measure the attainable utility of an agent, this will give you an idea of its power, and how many side effects it may be causing. However, when we move to general situations, this breaks down: attainable utility preservation no longer restricts power or reduces side effects. So the concepts of power and side effects have splintered when moving from typical situations to general situations. This is the model splintering. If we solve concept extrapolation for this, then we could extend the concepts of power restriction or side effect minimisation, to the general situations. And thus successfully create low impact AIs.
Another example is wireheading. We have a reward signal that corresponds to something we desire in the world; maybe the negative of the concentration in the atmosphere. This is measured by, say, a series of detectors spread over the Earth's surface.
Typically, the reward signal does correspond to what we want. But if the AI hacks its own reward signal, that correspondence breaks down: model splintering. If we can extend the reward properly to new situations, we get concept extrapolation - which, since this is a reward function, is value extrapolation.
Helping with multiple methods
Hence the concept extrapolation/value extrapolation ideas can help with many different approaches to AI safety, not just the value learning approaches.
Equivalently, we could say that the concepts remain the same, but it's the correlation between "attainable utility preservation" and "power restriction" is what breaks down. ↩︎
There are multiple ways we can see the concepts breaking down. We can see the concept of "measured " breaking down. We can see the correlation between concentration and the reward breaking down. We can see the correlation between the reward and the reward signal breaking down. The reason there are so many ways of seeing the breakdown is because most descriptive labels describe collections of correlated features, rather than fundamental concepts. So the descriptions/features/concepts break down when the correlations do. ↩︎