An AI actively trying to figure out what I want might show me snapshots of different possible worlds and ask me to rank them. Of course, I do not have the processing power to examine entire worlds; all I can really do is look at some pictures or video or descriptions. The AI might show me a bunch of pictures from one world in which a genocide is quietly taking place in some obscure third-world nation, and another in which no such genocide takes place. Unless the AI already considers that distinction important enough to draw my attention to it, I probably won’t notice it from the pictures, and I’ll rank those worlds similarly - even though I’d prefer the one without the genocide. Even if the AI does happen to show me some mass graves (probably secondhand, e.g. in pictures of news broadcasts), and I rank them low, it may just learn that I prefer my genocides under-the-radar.
The obvious point of such an example is that an AI should optimize for the real-world things I value, not just my estimates of those things. I don't just want to think my values are satisfied, I want them to actually be satisfied. Unfortunately, this poses a conceptual difficulty: what if I value the happiness of ghosts? I don't just want to think ghosts are happy, I want ghosts to actually be happy. What, then, should the AI do if there are no ghosts?
Human "values" are defined within the context of humans' world-models, and don't necessarily make any sense at all outside of the model (i.e. in the real world). Trying to talk about my values "actually being satisfied" is a type error.
Some points to emphasize here:
- My values are not just a function of my sense data, they are a function of the state of the whole world, including parts I can't see - e.g. I value the happiness of people I will never meet.
- I cannot actually figure out or process the state of the whole world
- … therefore, my values are a function of things I do not know and will not ever know - e.g. whether someone I will never encounter is happy right now
- This isn’t just a problem of limited processing power; I do not have enough data to figure out all these things I value, even in principle.
- This isn’t just a problem of not enough data, it’s a problem of what kind of data. My values depend on what’s going on “inside” of things which look the same - e.g. whether a smiling face is actually a rictus grin
- This isn’t just a problem of needing sufficiently low-level data. The things I care about are still ultimately high-level things, like humans or trees or cars. While the things I value are in principle a function of low-level world state, I don’t directly care about molecules.
- Some of the things I value may not actually exist - I may simply be wrong about which high-level things inhabit our world.
- I care about the actual state of things in the world, not my own estimate of the state - i.e. if the AI tricks me into thinking things are great (whether intentional trickery or not), that does not make things great.
These features make it rather difficult to “point” to values - it’s not just hard to formally specify values, it’s hard to even give a way to learn values. It’s hard to say what it is we’re supposed to be learning at all. What, exactly, are the inputs to my value-function? It seems like:
- Inputs to values are not complete low-level world states (since people had values before we knew what quantum fields were, and still have values despite not knowing the full state of the world), but…
- I value the actual state of the world rather than my own estimate of the world-state (i.e. I want other people to actually be happy, not just look-to-me like they’re happy).
How can both of those intuitions seem true simultaneously? How can the inputs to my values-function be the actual state of the world, but also high-level objects which may not even exist? What things in the low-level physical world are those “high-level objects” pointing to?
If I want to talk about "actually satisfying my values" separate from my own estimate of my values, then I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world.
I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now. This includes alignment of both “principled” and “prosaic” AI. The one major exception is pure human-mimicking AI, which suffers from a mostly-unrelated set of problems (largely stemming from the shortcomings of humans, especially groups of humans).
I have yet to see this problem explained, by itself, in a way that I’m satisfied by. I’m stealing the name from some of Abram’s posts, and I think he’s pointing to the same thing I am, but I’m not 100% sure.
The goal of this post is to demonstrate what the problem looks like for a (relatively) simple Bayesian-utility-maximizing agent, and what challenges it leads to. This has the drawback of defining things only within one particular model, but the advantage of showing how a bunch of nominally-different failure modes all follow from the same root problem: utility is a function of latent variables. We’ll look at some specific alignment strategies, and see how and why they fail in this simple model.
One thing I hope people will take away from this: it’s not the “values” part that’s conceptually difficult, it’s the “pointers” part.
We have a Bayesian expected-utility-maximizing agent, as a theoretical stand-in for a human. The agent’s world-model is a causal DAG over variables X, and it chooses actions A to maximize E[u(X) | do(A)] - i.e. it’s using standard causal decision theory. We will assume the agent has a full-blown Cartesian boundary, so we don’t need to worry about embeddedness and all that. In short, this is a textbook-standard causal-reasoning agent.
One catch: the agent’s world-model uses the sorts of tricks in Writing Causal Models Like We Write Programs, so the world-model can represent a very large world without ever explicitly evaluating probabilities of every variable in the world-model. Submodels are expanded lazily when they’re needed. You can still conceptually think of this as a standard causal DAG, it’s just that the model is lazily evaluated.
In particular, thinking of this agent as a human, this means that our human can value the happiness of someone they’ve never met, never thought about, and don’t know exists. The utility can be a function of variables which the agent will never compute, because the agent never needs to fully compute u in order to maximize it - it just needs to know how u changes as a function of the variables influenced by its actions.
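To make this concrete, here is a minimal sketch of such an agent in code. The causal chain A → X → Y and all the probabilities are invented for illustration; X and Y play the role of latent variables, utility depends only on the latent Y, and the agent picks the action maximizing E[u(Y) | do(A)]:

```python
# A toy causal chain A -> X -> Y. All probabilities are invented for
# illustration; utility depends only on the latent Y.

P_X_GIVEN_A = {0: 0.3, 1: 0.8}   # P(X=1 | do(A=a))
P_Y_GIVEN_X = {0: 0.1, 1: 0.9}   # P(Y=1 | X=x)

def u(y):
    # Utility is a function of the latent Y, not of any observation.
    return 1.0 if y == 1 else 0.0

def expected_utility(a):
    # E[u(Y) | do(A=a)], marginalizing over the latent variables.
    total = 0.0
    for x in (0, 1):
        p_x = P_X_GIVEN_A[a] if x == 1 else 1 - P_X_GIVEN_A[a]
        for y in (0, 1):
            p_y = P_Y_GIVEN_X[x] if y == 1 else 1 - P_Y_GIVEN_X[x]
            total += p_x * p_y * u(y)
    return total

best_action = max((0, 1), key=expected_utility)
```

Note that the agent never needs the realized values of X or Y - only the conditional distributions - which is exactly why its behavior alone can’t pin down how it values latents its actions don’t touch.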
Key assumption: most of the variables in the agent’s world-model are not observables. Drawing the analogy to humans: most of the things in our world-models are not raw photon counts in our eyes or raw vibration frequencies/intensities in our ears. Our world-models include things like trees and rocks and cars, objects whose existence and properties are inferred from the raw sense data. Even lower-level objects, like atoms and molecules, are latent variables; the raw data from our eyes and ears does not include the exact positions of atoms in a tree. The raw sense data itself is not sufficient to fully determine the values of the latent variables, in general; even a perfect Bayesian reasoner cannot deduce the true position of every atom in a tree from a video feed.
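A minimal illustration of that underdetermination, with an invented latent variable and made-up sensor noise: even after a perfect Bayesian update, the posterior over the latent never reaches certainty.

```python
# Invented example: a latent Z ("the tree is healthy") observed only
# through a noisy sensor. Bayes' rule gives P(Z=1 | obs), but no single
# observation can pin Z down exactly.

PRIOR_Z = 0.5                      # P(Z=1)
P_OBS_GIVEN_Z = {0: 0.3, 1: 0.7}   # P(obs=1 | Z=z): sensor is noisy

def posterior(obs):
    # P(Z=1 | obs) by Bayes' rule.
    like1 = P_OBS_GIVEN_Z[1] if obs == 1 else 1 - P_OBS_GIVEN_Z[1]
    like0 = P_OBS_GIVEN_Z[0] if obs == 1 else 1 - P_OBS_GIVEN_Z[0]
    return like1 * PRIOR_Z / (like1 * PRIOR_Z + like0 * (1 - PRIOR_Z))
```

With these numbers, an observation moves the posterior to 0.7 or 0.3 - never to 0 or 1. The sense data constrains the latent without determining it.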
Now, the basic problem: our agent’s utility function is mostly a function of latent variables. Human values are mostly a function of rocks and trees and cars and other humans and the like, not the raw photon counts hitting our eyeballs. Human values are over inferred variables, not over sense data.
Furthermore, human values are over the “true” values of the latents, not our estimates - e.g. I want other people to actually be happy, not just to look-to-me like they’re happy. Ultimately, E[u(X)] is the agent’s estimate of its own utility (thus the expectation), and the agent may not ever know the “true” value of its own utility - i.e. I may prefer that someone who went missing ten years ago lives out a happy life, but I may never find out whether that happened. On the other hand, it’s not clear that there’s a meaningful sense in which any “true” utility-value exists at all, since the agent’s latents may not correspond to anything physical - e.g. a human may value the happiness of ghosts, which is tricky if ghosts don’t exist in the real world.
On top of all that, some of those variables are implicit in the model’s lazy data structure and the agent will never think about them at all. I can value the happiness of people I do not know and will never encounter or even think about.
So, if an AI is to help optimize for u(X), then it’s optimizing for something which is a function of latent variables in the agent’s model. Those latent variables:
- May not correspond to any particular variables in the AI’s world-model and/or the physical world
- May not be estimated by the agent at all (because lazy evaluation)
- May not be determined by the agent’s observed data
… and of course the agent’s model might just not be very good, in terms of predictive power.
As usual, neither we (the system’s designers) nor the AI will have direct access to the model; we/it will only see the agent’s behavior (i.e. input/output) and possibly a low-level system in which the agent is embedded. The agent itself may have some introspective access, but not full or perfectly reliable introspection.
Despite all that, we want to optimize for the agent’s utility, not just the agent’s estimate of its utility. Otherwise we run into wireheading-like problems, problems with the agent’s world model having poor predictive power, etc. But the agent’s utility is a function of latents which may not be well-defined at all outside the context of the agent’s estimator (a.k.a. world-model). How can we optimize for the agent’s “true” utility, not just an estimate, when the agent’s utility function is defined as a function of latents which may not correspond to anything outside of the agent’s estimator?
The Pointers Problem
We can now define the pointers problem - not only “pointers to values”, but the problem of pointers more generally. The problem: what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model? And what does that “correspondence” even mean - how do we turn it into an objective for the AI, or some other concrete thing outside the agent’s own head?
Why call this the “pointers” problem? Well, let’s take the agent’s perspective, and think about what its algorithm feels like from the inside. From inside the agent’s mind, it doesn’t feel like those latent variables are latent variables in a model. It feels like those latent variables are real things out in the world which the agent can learn about. The latent variables feel like “pointers” to real-world objects and their properties. But what are the referents of these pointers? What are the real-world things (if any) to which they’re pointing? That’s the pointers problem.
Is it even solvable? Definitely not always - there probably is no real-world referent for e.g. the human concept of a ghost. Similarly, I can have a concept of a perpetual motion machine, despite the likely-impossibility of any such thing existing. Between abstraction and lazy evaluation, latent variables in an agent’s world-model may not correspond to anything in the world.
That said, it sure seems like at least some latent variables do correspond to structures in the world. The concept of “tree” points to a pattern which occurs in many places on Earth. Even an alien or AI with radically different world-model could recognize that repeating pattern, realize that examining one tree probably yields information about other trees, etc. The pattern has predictive power, and predictive power is not just a figment of the agent’s world-model.
So we’d like to know both (a) when a latent variable corresponds to something in the world (or another world model) at all, and (b) what it corresponds to. We’d like to solve this in a way which (probably among other use-cases) lets the AI treat the things-corresponding-to-latents as the inputs to the utility function it’s supposed to learn and optimize.
To the extent that human values are a function of latent variables in humans’ world-models, this seems like a necessary step not only for an AI to learn human values, but even just to define what it means for an AI to learn human values. What does it mean to “learn” a function of some other agent’s latent variables, without necessarily adopting that agent’s world-model? If the AI doesn’t have some notion of what the other agent’s latent variables even “are”, then it’s not meaningful to learn a function of those variables. It would be like an AI “learning” to imitate grep, but without having any access to string or text data, and without the AI itself having any interface which would accept strings or text.
Let’s look at some example symptoms which can arise from failure to solve specific aspects of the pointers problem.
Let’s go back to the opening example: an AI shows us pictures from different possible worlds and asks us to rank them. The AI doesn’t really understand yet what things we care about, so it doesn’t intentionally draw our attention to certain things a human might consider relevant - like mass graves. Maybe we see a few mass-grave pictures from some possible worlds (probably in pictures from news sources, since that’s how such information mostly spreads), and we rank those low, but there are many other worlds where we just don’t notice the problem from the pictures the AI shows us. In the end, the AI decides that we mostly care about avoiding worlds where mass graves appear in the news - i.e. we prefer that mass killings stay under the radar.
How does this failure fit in our utility-function-of-latents picture?
This is mainly a failure to distinguish between the agent’s estimate E[u(X)] of its own utility, and the “real” value u(X) of the agent’s utility (insofar as such a thing exists). The AI optimizes for our estimate, but does not give us enough data to very accurately estimate our utility in each world - indeed, it’s unlikely that a human could even handle that much information. So, it ends up optimizing for factors which bias our estimate - e.g. the availability of information about bad things.
Note that this intuitive explanation assumes a solution to the pointers problem: it only makes sense to the extent that there’s a “real” value u(X) from which the “estimate” E[u(X)] can diverge.
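Here’s a toy sketch of that failure mode, with invented worlds and an invented “reported” bit standing in for whether evidence reaches the human. Optimizing the human’s estimate cannot distinguish a genuinely-fine world from an under-the-radar one:

```python
# Toy worlds with two hidden bits: bad_event (a mass killing is
# happening) and reported (whether evidence of it reaches the human).
# Both the worlds and the utility functions are made up for the sketch.

worlds = [
    {"bad_event": False, "reported": False},  # genuinely fine
    {"bad_event": True,  "reported": True},   # bad, and visible
    {"bad_event": True,  "reported": False},  # bad, under the radar
]

def true_utility(w):
    # What the human actually values: no bad event, seen or not.
    return 0.0 if w["bad_event"] else 1.0

def estimated_utility(w):
    # The human only sees reports; unreported bad events look fine.
    return 0.0 if (w["bad_event"] and w["reported"]) else 1.0

best_by_estimate = [w for w in worlds if estimated_utility(w) == 1.0]
best_by_truth = [w for w in worlds if true_utility(w) == 1.0]
```

Optimizing the estimate ties the genuinely-fine world with the under-the-radar one; optimizing the true utility does not. The AI ends up pushing on the “reported” bit rather than the “bad_event” bit.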
Not-So-Easy Wireheading Problems
The under-the-radar genocide problem looks roughly like a typical wireheading problem, so we should try a roughly-typical wireheading solution: rather than the AI showing world-pictures, it should just tell us what actions it could take, and ask us to rank actions directly.
If we were ideal Bayesian reasoners with accurate world models and infinite compute, and knew exactly where the AI’s actions fit in our world model, then this might work. Unfortunately, the failure of any of those assumptions breaks the approach:
- We don’t have the processing power to predict all the impacts of the AI’s actions
- Our world models may not be accurate enough to correctly predict the impact of the AI’s actions, even if we had enough processing power
- The AI’s actions may not even fit neatly into our world model - e.g. even the idea of genetic engineering might not fit the world-model of premodern human thinkers
Mathematically, we’re trying to optimize E[u(X) | do(A)], i.e. optimize expected utility given the AI’s actions. Note that this is necessarily an expectation under the human’s model, since that’s the only context in which u is well-defined. In order for that to work out well, we need to be able to fully evaluate that estimate (sufficient processing power), we need the estimate to be accurate (sufficient predictive power), and we need do(A) to be defined within the model in the first place.
The question of whether our world-models are sufficiently accurate is particularly hairy here, since accuracy is usually only defined in terms of how well we estimate our sense-data. But the accuracy we care about here is how well we “estimate” the values of the latent variables X, and therefore of u(X). What does that even mean, when the latent variables may not correspond to anything in the world?
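The third failure - the AI’s actions not fitting in the human’s model - can be sketched directly: the human’s model only defines P(X | do(A)) for actions the human has concepts for, so the expectation is simply undefined for a novel action. All the numbers here are invented:

```python
# The human's model defines P(X=1 | do(A=a)) only for actions it has
# concepts for (0 and 1). A novel AI action - "genetic engineering" to
# a premodern thinker, here action 2 - is outside the model entirely.

HUMAN_P_X_GIVEN_A = {0: 0.3, 1: 0.8}   # human's model: actions 0 and 1 only

def human_expected_utility(a):
    # E[u(X) | do(A=a)] under the human's model, with u(x) = x.
    p_x1 = HUMAN_P_X_GIVEN_A[a]        # KeyError for actions not in the model
    return p_x1 * 1.0 + (1 - p_x1) * 0.0

try:
    human_expected_utility(2)          # a novel action the model lacks
    novel_action_defined = True
except KeyError:
    novel_action_defined = False
```

The point is not that the lookup crashes, but that there is no fact of the matter about E[u | do(A)] for an action the model cannot represent.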
People I Will Never Meet
“Human values cannot be determined from human behavior” seems almost old-hat at this point, but it’s worth taking a moment to highlight just how underdetermined values are from behavior. It’s not just that humans have biases of one kind or another, or that revealed preferences diverge from stated preferences. Even in our perfect Bayesian utility-maximizer, utility is severely underdetermined from behavior, because the agent does not have perfect estimates of its latent variables. Behavior depends only on the agent’s estimate, so it cannot account for “error” in the agent’s estimates of latent variable values, nor can it tell us about how the agent values variables which are not coupled to its own choices.
The happiness of people I will never interact with is a good example of this. There may be people in the world whose happiness will not ever be significantly influenced by my choices. Presumably, then, my choices cannot tell us about how much I value such peoples’ happiness. And yet, I do value it.
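A toy sketch of that underdetermination, with invented probabilities: two utility functions which differ only on an action-decoupled latent Z induce exactly the same behavior, so no amount of observed behavior can distinguish them.

```python
# Two utility functions differing only on a latent Z which the agent's
# actions cannot influence. All probabilities are invented.

P_X_GIVEN_A = {0: 0.2, 1: 0.9}   # P(X=1 | do(A=a)); X is action-coupled
P_Z = 0.5                        # P(Z=1); Z is independent of the action

def eu(a, u):
    # E[u(X, Z) | do(A=a)] for a given utility function u.
    total = 0.0
    for x in (0, 1):
        p_x = P_X_GIVEN_A[a] if x == 1 else 1 - P_X_GIVEN_A[a]
        for z in (0, 1):
            p_z = P_Z if z == 1 else 1 - P_Z
            total += p_x * p_z * u(x, z)
    return total

u_cares = lambda x, z: x + 10 * z    # strongly values the far-away latent Z
u_ignores = lambda x, z: x           # does not value Z at all

# Z is decoupled from A, so both utilities rank actions identically.
policy_cares = max((0, 1), key=lambda a: eu(a, u_cares))
policy_ignores = max((0, 1), key=lambda a: eu(a, u_ignores))
```

Since the Z-term adds the same constant to every action’s expected utility, the two agents’ choices coincide everywhere, even though one of them cares ten times more about Z than about anything its actions touch.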
In Latent Variables and Model Misspecification, jsteinhardt talks about “misspecification” of latent variables in the AI’s model. His argument is that things like the “value function” are latent variables in the AI’s world-model, and are therefore potentially very sensitive to misspecification of the AI’s model.
In fact, I think the problem is more severe than that.
The value function’s inputs are latent variables in the human’s model, and are therefore sensitive to misspecification in the human’s model. If the human’s model does not match reality well, then their latent variables may be wonky things which don’t correspond to anything in the world. And AI designers do not get to pick the human’s model. These wonky variables, not corresponding to anything in the world, are a baked-in part of the problem, unavoidable even in principle. Even if the AI’s world model were “perfectly specified”, it would either be a bad representation of the world (in which case predictive power becomes an issue) or a bad representation of the human’s model (in which case those wonky latents aren’t defined).
The AI can’t model the world well with the human’s model, but the latents on which human values depend aren’t well-defined outside the human’s model. Rock and a hard place.
Within the context of a Bayesian utility-maximizer (representing a human), utility/values are a function of latent variables in the agent’s model. That’s a problem, because those latent variables do not necessarily correspond to anything in the environment, and even when they do, we don’t have a good way to say what they correspond to.
So, an AI trying to help the agent is stuck: if the AI uses the human’s world-model, then it may just be wrong outright (in predictive terms). But if the AI doesn’t use the human’s world-model, then the latents on which the utility function depends may not be defined at all.
Thus, the pointers problem, in the Bayesian context: figure out which things in the world (if any) correspond to the latent variables in a model. What do latent variables in my model “point to” in the real world?