This is a distillation of The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables and some related ideas, that aims to describe the problem without referring to Bayesian expected-utility-maximising agents, DAGs or lazy evaluation of models and utility functions.

This was written for the AI Safety Distillation Contest.

The problem

If we want an AI to optimise for human values, we cannot rely solely on human feedback about which of several possible worlds ranks higher in our preference ordering. If we do, the AI will end up optimising for humans' estimates of their values rather than for the values themselves. It will behave as though "as long as the human cannot see the suffering going on in this world, it must be ok".
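The failure mode above can be illustrated with a toy sketch. Everything in it is a made-up assumption for illustration: the `World` class, its two suffering counts, and the two scoring functions are hypothetical and are not taken from the original post. It only shows that an optimiser selecting on human feedback can prefer a world where suffering is hidden over one where it is actually reduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """A hypothetical candidate world (illustrative only)."""
    name: str
    visible_suffering: int  # suffering the human evaluator can observe
    hidden_suffering: int   # suffering the human evaluator cannot observe


def true_value(world: World) -> int:
    """What the human actually cares about: all suffering counts."""
    return -(world.visible_suffering + world.hidden_suffering)


def human_feedback(world: World) -> int:
    """What the human can report: an estimate based only on what is observed."""
    return -world.visible_suffering


def pick_best(worlds, score):
    """Return the world with the highest score under the given scoring function."""
    return max(worlds, key=score)


worlds = [
    World("honest_improvement", visible_suffering=2, hidden_suffering=0),
    World("suffering_hidden_away", visible_suffering=0, hidden_suffering=10),
]

chosen_by_feedback = pick_best(worlds, human_feedback)
chosen_by_true_value = pick_best(worlds, true_value)

print("optimising human feedback picks:", chosen_by_feedback.name)
print("optimising true values picks:   ", chosen_by_true_value.name)
```

In this sketch the feedback signal and the true value disagree only because part of what the human values is unobservable, which is the situation the rest of this post is concerned with.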

Human values are functions of types of data that are, by their nature, impossible to acquire, such as someone's internal experience of happiness, or the states of beings we don't even know exist. Additionally, human values are based on human world-models, as opposed to complete low-level specifications of reality. (If we turn out to be completely mistaken about quantum physics, we will still value happy beings, whatever their low-level description.)

We can think of the things we value as latent variables in our world-model. In The Pointers Problem: Clarifications/Variations, Abram Demski draws attention to the ambiguity between latent variables as pragmatic tools and latent variables as ontological commitments. We invent latent variables for the convenience of their predictive power, and then ascribe value to specific states of those variables. This presents a difficulty if an AI does not share our ontological commitments. The article mentions "depression" as an example - we use the latent variable we name "depression" to refer to a cluster of symptoms. Diagnosing someone with "depression" provides explanatory power, which allows us to make inferences about how best to help the person, what their behaviour and mood are likely to be, etc. However, we cannot be certain that "depression" as a concept will exist in the AI's ontology.

If an AI could know what the values-relevant latent variables of our world-model corresponded to in the real world and/or in its own world-model, it would be much easier for it to determine to what extent a possible world satisfies our values. This is the "pointers to values problem", a special case of a more general problem, the "pointers problem", which asks "what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model?"
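One way to picture the question is as a lookup that may fail. The sketch below is a hedged illustration, not a proposed solution: the latent variable names, the `feature_17`/`feature_42` variables in the AI's state, and the idea that a pointer is a plain Python function are all assumptions invented for this example. The value function is written over human latent variables, so it cannot be evaluated on the AI's state until every latent variable has a pointer.

```python
from typing import Callable, Dict, List, Optional

# The AI's world-model state, expressed in its own (hypothetical) variables.
AIState = Dict[str, float]

# A pointer is a function of the AI's variables that yields the value of
# one human latent variable. None means no correspondent is known.
Pointer = Callable[[AIState], float]


def human_value(latents: Dict[str, float]) -> float:
    """Human values are stated over human latent variables, not AI variables."""
    return latents["happiness"] - latents["depression"]


def missing_pointers(pointers: Dict[str, Optional[Pointer]]) -> List[str]:
    """Names of human latent variables with no known correspondent."""
    return sorted(name for name, p in pointers.items() if p is None)


def evaluate(state: AIState, pointers: Dict[str, Optional[Pointer]]) -> float:
    """Evaluate human values on an AI state, if every pointer is resolved."""
    missing = missing_pointers(pointers)
    if missing:
        raise LookupError(f"no pointer for latent variables: {missing}")
    latents = {name: p(state) for name, p in pointers.items()}
    return human_value(latents)


state: AIState = {"feature_17": 0.8, "feature_42": 0.3}

# "depression" has no correspondent in this hypothetical AI ontology.
pointers: Dict[str, Optional[Pointer]] = {
    "happiness": lambda s: s["feature_17"],
    "depression": None,
}

print("unresolved latent variables:", missing_pointers(pointers))
```

Calling `evaluate(state, pointers)` here raises `LookupError`. The hard part, which this sketch deliberately leaves out, is finding the pointer functions at all and knowing whether they track what we actually care about.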

The pointers problem is a rephrasing of the outer alignment problem

The pointers problem appears to be a useful way of conceptualising the outer alignment problem, one that highlights its most challenging aspect.

The outer alignment problem is the problem of coming up with an optimisation objective, or criteria for filtering the program search space, that is a good proxy for “what we actually want”. This involves encoding what we want in a form that can be applied to an AI.

Objective functions that we can use to outer-align an AI will take as inputs the state of the AI (e.g. model weights), the state of the world (observable data from sensors), or both. Naive solutions to the problem will cause the AI to optimise for states-that-look-good, and not states-that-are-good. Designing an objective that captures optimising for states-that-are-good would require solving the “pointers to values problem”, that is, resolving the ontology mismatch between us and the AI. As Demski writes, we want to "try to maximize values according to models which, according to human beliefs, track the things we care about well".

Changing values

Another important issue to address is that the AI can change our values. For example, if we are convinced that we are fundamentally mistaken about parts of our ontology, we may choose to modify our value function. This introduces additional complexity: we want to know when we are mistaken about the nature of reality, but we do not want to be manipulated into changing our values for the wrong reasons.


