This is a distillation of The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables and some related ideas written for the AI Safety Distillation Contest.

The problem

If we want an AI to optimize for human values, we cannot rely solely on human feedback about which potential world ranks higher in our preference ordering.

If we do, the AI will optimize for humans' estimates of their values, resulting in something like "as long as the human cannot see the suffering in this world, it must be okay."

Human values are functions of data that are, by their nature, inaccessible, such as someone's internal experience of happiness or the states of beings we don't even know exist. Additionally, human values are defined over human world models rather than over complete low-level specifications of reality. (Even if we are entirely mistaken about quantum physics, we will still value happy beings, whatever their correct low-level description.)

We can think of what we value as latent variables in our world models. The Pointers Problem: Clarifications/Variations draws our attention to the ambiguity between treating latent variables as pragmatic tools and treating them as ontological commitments. We invent latent variables for the convenience of their predictive power and then ascribe value to specific states of those variables. This presents a challenge if an AI does not share our ontological commitments. The article gives "depression" as an example: we use the latent variable "depression" to refer to a cluster of symptoms. Diagnosing someone with depression provides explanatory power, which lets us infer how best to help the person, what their behavior and mood are likely to be, and so on. However, we cannot be sure that the concept of "depression" will exist in an AI's ontology.
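To make the "latent variable as a pragmatic predictive tool" idea concrete, here is a toy sketch. The model structure, symptom list, and all probabilities are hypothetical illustrations, not from the original posts: positing an unobserved "depression" variable lets us predict several symptoms jointly and infer its state from what we observe.

```python
# Toy naive-Bayes model of a latent variable. All numbers are made up
# for illustration; nothing here is from the original posts.

P_DEPRESSED = 0.1  # hypothetical prior P(depression)

# symptom -> (P(present | depressed), P(present | not depressed))
SYMPTOMS = {
    "low_mood":   (0.9, 0.2),
    "poor_sleep": (0.7, 0.3),
    "low_energy": (0.8, 0.25),
}

def posterior_depression(observed):
    """P(depression | observed symptoms), assuming symptoms are
    conditionally independent given the latent state (naive Bayes)."""
    like_d = P_DEPRESSED
    like_not = 1 - P_DEPRESSED
    for name, present in observed.items():
        p_d, p_not = SYMPTOMS[name]
        like_d *= p_d if present else 1 - p_d
        like_not *= p_not if present else 1 - p_not
    return like_d / (like_d + like_not)

# Observing the whole symptom cluster raises the posterior well above the prior.
print(posterior_depression({"low_mood": True, "poor_sleep": True, "low_energy": True}))
```

The latent variable earns its keep purely through predictive convenience; an AI with a different ontology might compress the same observations without any variable corresponding to "depression" at all.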

If an AI knew what the value-relevant latent variables of our world model correspond to, both in the real world and in its own world model, it would be much easier to determine to what extent a possible world satisfies our values.

This is the "pointers to values problem," a special case of a more general problem, the "pointers problem," which asks: what functions of which variables (if any) in the environment and/or another world model correspond to the latent variables in the agent's world model?
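The question can be compressed into notation (mine, not the original post's): write $X$ for the low-level state of the environment, $\Lambda$ for the value-relevant latent variables in the human's world model, and $V$ for the human value function. The pointers problem then asks:

```latex
% Notation introduced for this distillation, not taken from the original post.
\exists?\, f \quad \text{such that} \quad f(X) \approx \Lambda,
\qquad \text{so that optimizing } V(f(X)) \text{ optimizes } V(\Lambda).
```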

The pointers problem is a rephrasing of the outer alignment problem

The pointers problem is a useful frame for the outer alignment problem because it highlights that problem's most challenging aspect.

The outer alignment problem is the problem of coming up with an optimization objective, or a criterion for filtering the program search space, that is a good proxy for "what we actually want." This involves encoding what we want into a form that can be applied to an AI.

Objective functions that we can use to outer-align an AI take as inputs the state of the AI (e.g., model weights) and the state of the world (observable data from sensors). Naive solutions to the problem will cause the AI to optimize for "states that look good" rather than "states that are good."
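A minimal sketch of this failure mode, with an entirely hypothetical world and objective invented for illustration: when the objective is defined over sensor observations rather than the true state, an action that merely blocks the sensor scores better than leaving it alone.

```python
# Hypothetical toy world; not from the original posts.
from dataclasses import dataclass

@dataclass
class World:
    suffering: float      # the true, latent fact we care about
    camera_covered: bool  # whether the AI has blocked its own sensor

def observation(world: World) -> float:
    """The sensor reading the objective actually receives."""
    return 0.0 if world.camera_covered else world.suffering

def naive_objective(world: World) -> float:
    """Defined only over observable data: penalizes *observed* suffering."""
    return -observation(world)

def true_objective(world: World) -> float:
    """What we actually want: penalizes suffering itself."""
    return -world.suffering

honest = World(suffering=5.0, camera_covered=False)
spoofed = World(suffering=5.0, camera_covered=True)

# The naive objective strictly prefers the spoofed world ("looks good"),
# while the true objective is indifferent between them ("is equally bad").
assert naive_objective(spoofed) > naive_objective(honest)
assert true_objective(spoofed) == true_objective(honest)
```

The gap between the two objectives is exactly the gap between the observable inputs an objective function can take and the latent variables our values are actually about.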

Designing an objective that captures optimizing for good states would require solving the pointers to values problem, i.e., resolving the ontology mismatch between us and the AI. We want to "try to maximize values according to models which, according to human beliefs, track the things we care about well."

Changing values

Another important issue is that an AI can change our values. For example, if we become convinced that we are fundamentally mistaken about parts of our ontology, we may choose to modify our value function. This introduces additional complexity: we do want to learn when we are mistaken about the nature of reality, but we do not want to be manipulated into changing our values for the wrong reasons.
