# Ω 19

RationalityAI
Frontpage
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

An AI actively trying to figure out what I want might show me snapshots of different possible worlds and ask me to rank them. Of course, I do not have the processing power to examine entire worlds; all I can really do is look at some pictures or video or descriptions. The AI might show me a bunch of pictures from one world in which a genocide is quietly taking place in some obscure third-world nation, and another in which no such genocide takes place. Unless the AI already considers that distinction important enough to draw my attention to it, I probably won’t notice it from the pictures, and I’ll rank those worlds similarly - even though I’d prefer the one without the genocide. Even if the AI does happen to show me some mass graves (probably secondhand, e.g. in pictures of news broadcasts), and I rank them low, it may just learn that I prefer my genocides under-the-radar.

The obvious point of such an example is that an AI should optimize for the real-world things I value, not just my estimates of those things. I don't just want to think my values are satisfied, I want them to actually be satisfied. Unfortunately, this poses a conceptual difficulty: what if I value the happiness of ghosts? I don't just want to think ghosts are happy, I want ghosts to actually be happy. What, then, should the AI do if there are no ghosts?

Human "values" are defined within the context of humans' world-models, and don't necessarily make any sense at all outside of the model (i.e. in the real world). Trying to talk about my values "actually being satisfied" is a type error.

Some points to emphasize here:

• My values are not just a function of my sense data, they are a function of the state of the whole world, including parts I can't see - e.g. I value the happiness of people I will never meet.
• I cannot actually figure out or process the state of the whole world
• … therefore, my values are a function of things I do not know and will not ever know - e.g. whether someone I will never encounter is happy right now
• This isn’t just a limited processing problem; I do not have enough data to figure out all these things I value, even in principle.
• This isn’t just a problem of not enough data, it’s a problem of what kind of data. My values depend on what’s going on “inside” of things which look the same - e.g. whether a smiling face is actually a rictus grin
• This isn’t just a problem of needing sufficiently low-level data. The things I care about are still ultimately high-level things, like humans or trees or cars. While the things I value are in principle a function of low-level world state, I don’t directly care about molecules.
• Some of the things I value may not actually exist - I may simply be wrong about which high-level things inhabit our world.
• I care about the actual state of things in the world, not my own estimate of the state - i.e. if the AI tricks me into thinking things are great (whether intentional trickery or not), that does not make things great.

These features make it rather difficult to “point” to values - it’s not just hard to formally specify values, it’s hard to even give a way to learn values. It’s hard to say what it is we’re supposed to be learning at all. What, exactly, are the inputs to my value-function? It seems like:

• Inputs to values are not complete low-level world states (since people had values before we knew what quantum fields were, and still have values despite not knowing the full state of the world), but…
• I value the actual state of the world rather than my own estimate of the world-state (i.e. I want other people to actually be happy, not just look-to-me like they’re happy).

How can both of those intuitions seem true simultaneously? How can the inputs to my values-function be the actual state of the world, but also high-level objects which may not even exist? What things in the low-level physical world are those “high-level objects” pointing to?

If I want to talk about "actually satisfying my values" separate from my own estimate of my values, then I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world.

I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now. This includes alignment of both “principled” and “prosaic” AI. The one major exception is pure human-mimicking AI, which suffers from a mostly-unrelated set of problems (largely stemming from the shortcomings of humans, especially groups of humans).

I have yet to see this problem explained, by itself, in a way that I’m satisfied by. I’m stealing the name from some of Abram’s posts, and I think he’s pointing to the same thing I am, but I’m not 100% sure.

The goal of this post is to demonstrate what the problem looks like for a (relatively) simple Bayesian-utility-maximizing agent, and what challenges it leads to. This has the drawback of defining things only within one particular model, but the advantage of showing how a bunch of nominally-different failure modes all follow from the same root problem: utility is a function of latent variables. We’ll look at some specific alignment strategies, and see how and why they fail in this simple model.

One thing I hope people will take away from this: it’s not the “values” part that’s conceptually difficult, it’s the “pointers” part.

# The Setup

We have a Bayesian expected-utility-maximizing agent, as a theoretical stand-in for a human. The agent’s world-model is a causal DAG over variables , and it chooses actions  to maximize  - i.e. it’s using standard causal decision theory. We will assume the agent has a full-blown Cartesian boundary, so we don’t need to worry about embeddedness and all that. In short, this is a textbook-standard causal-reasoning agent.

One catch: the agent’s world-model uses the sorts of tricks in Writing Causal Models Like We Write Programs, so the world-model can represent a very large world without ever explicitly evaluating probabilities of every variable in the world-model. Submodels are expanded lazily when they’re needed. You can still conceptually think of this as a standard causal DAG, it’s just that the model is lazily evaluated.

In particular, thinking of this agent as a human, this means that our human can value the happiness of someone they’ve never met, never thought about, and don’t know exists. The utility  can be a function of variables which the agent will never compute, because the agent never needs to fully compute u in order to maximize it - it just needs to know how u changes as a function of the variables influenced by its actions.

Key assumption: most of the variables in the agent’s world-model are not observables. Drawing the analogy to humans: most of the things in our world-models are not raw photon counts in our eyes or raw vibration frequencies/intensities in our ears. Our world-models include things like trees and rocks and cars, objects whose existence and properties are inferred from the raw sense data. Even lower-level objects, like atoms and molecules, are latent variables; the raw data from our eyes and ears does not include the exact positions of atoms in a tree. The raw sense data itself is not sufficient to fully determine the values of the latent variables, in general; even a perfect Bayesian reasoner cannot deduce the true position of every atom in a tree from a video feed.

Now, the basic problem: our agent’s utility function is mostly a function of latent variables. Human values are mostly a function of rocks and trees and cars and other humans and the like, not the raw photon counts hitting our eyeballs. Human values are over inferred variables, not over sense data.

Furthermore, human values are over the “true” values of the latents, not our estimates - e.g. I want other people to actually be happy, not just to look-to-me like they’re happy. Ultimately,  is the agent’s estimate of its own utility (thus the expectation), and the agent may not ever know the “true” value of its own utility - i.e. I may prefer that someone who went missing ten years ago lives out a happy life, but I may never find out whether that happened. On the other hand, it’s not clear that there’s a meaningful sense in which any “true” utility-value exists at all, since the agent’s latents may not correspond to anything physical - e.g. a human may value the happiness of ghosts, which is tricky if ghosts don’t exist in the real world.

On top of all that, some of those variables are implicit in the model’s lazy data structure and the agent will never think about them at all. I can value the happiness of people I do not know and will never encounter or even think about.

So, if an AI is to help optimize for , then it’s optimizing for something which is a function of latent variables in the agent’s model. Those latent variables:

• May not correspond to any particular variables in the AI’s world-model and/or the physical world
• May not be estimated by the agent at all (because lazy evaluation)
• May not be determined by the agent’s observed data

… and of course the agent’s model might just not be very good, in terms of predictive power.

As usual, neither we (the system’s designers) nor the AI will have direct access to the model; we/it will only see the agent’s behavior (i.e. input/output) and possibly a low-level system in which the agent is embedded. The agent itself may have some introspective access, but not full or perfectly reliable introspection.

Despite all that, we want to optimize for the agent’s utility, not just the agent’s estimate of its utility. Otherwise we run into wireheading-like problems, problems with the agent’s world model having poor predictive power, etc. But the agent’s utility is a function of latents which may not be well-defined at all outside the context of the agent’s estimator (a.k.a. world-model). How can we optimize for the agent’s “true” utility, not just an estimate, when the agent’s utility function is defined as a function of latents which may not correspond to anything outside of the agent’s estimator?

# The Pointers Problem

We can now define the pointers problem - not only “pointers to values”, but the problem of pointers more generally. The problem: what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model? And what does that “correspondence” even mean - how do we turn it into an objective for the AI, or some other concrete thing outside the agent’s own head?

Why call this the “pointers” problem? Well, let’s take the agent’s perspective, and think about what its algorithm feels like from the inside. From inside the agent’s mind, it doesn’t feel like those latent variables are latent variables in a model. It feels like those latent variables are real things out in the world which the agent can learn about. The latent variables feel like “pointers” to real-world objects and their properties. But what are the referents of these pointers? What are the real-world things (if any) to which they’re pointing? That’s the pointers problem.

Is it even solvable? Definitely not always - there probably is no real-world referent for e.g. the human concept of a ghost. Similarly, I can have a concept of a perpetual motion machine, despite the likely-impossibility of any such thing existing. Between abstraction and lazy evaluation, latent variables in an agent’s world-model may not correspond to anything in the world.

That said, it sure seems like at least some latent variables do correspond to structures in the world. The concept of “tree” points to a pattern which occurs in many places on Earth. Even an alien or AI with radically different world-model could recognize that repeating pattern, realize that examining one tree probably yields information about other trees, etc. The pattern has predictive power, and predictive power is not just a figment of the agent’s world-model.

So we’d like to know both (a) when a latent variable corresponds to something in the world (or another world model) at all, and (b) what it corresponds to. We’d like to solve this in a way which (probably among other use-cases) lets the AI treat the things-corresponding-to-latents as the inputs to the utility function it’s supposed to learn and optimize.

To the extent that human values are a function of latent variables in humans’ world-models, this seems like a necessary step not only for an AI to learn human values, but even just to define what it means for an AI to learn human values. What does it mean to “learn” a function of some other agent’s latent variables, without necessarily adopting that agent’s world-model? If the AI doesn’t have some notion of what the other agent’s latent variables even “are”, then it’s not meaningful to learn a function of those variables. It would be like an AI “learning” to imitate grep, but without having any access to string or text data, and without the AI itself having any interface which would accept strings or text.

Let’s look at some example symptoms which can arise from failure to solve specific aspects of the pointers problem.

Let’s go back to the opening example: an AI shows us pictures from different possible worlds and asks us to rank them. The AI doesn’t really understand yet what things we care about, so it doesn’t intentionally draw our attention to certain things a human might consider relevant - like mass graves. Maybe we see a few mass-grave pictures from some possible worlds (probably in pictures from news sources, since that’s how such information mostly spreads), and we rank those low, but there are many other worlds where we just don’t notice the problem from the pictures the AI shows us. In the end, the AI decides that we mostly care about avoiding worlds where mass graves appear in the news - i.e. we prefer that mass killings stay under the radar.

How does this failure fit in our utility-function-of-latents picture?

This is mainly a failure to distinguish between the agent’s estimate of its own utility , and the “real” value of the agent’s utility  (insofar as such a thing exists). The AI optimizes for our estimate, but does not give us enough data to very accurately estimate our utility in each world - indeed, it’s unlikely that a human could even handle that much information. So, it ends up optimizing for factors which bias our estimate - e.g. the availability of information about bad things.

Note that this intuitive explanation assumes a solution to the pointers problem: it only makes sense to the extent that there’s a “real” value of  from which the “estimate” can diverge.

The under-the-radar genocide problem looks roughly like a typical wireheading problem, so we should try a roughly-typical wireheading solution: rather than the AI showing world-pictures, it should just tell us what actions it could take, and ask us to rank actions directly.

If we were ideal Bayesian reasoners with accurate world models and infinite compute, and knew exactly where the AI’s actions fit in our world model, then this might work. Unfortunately, the failure of any of those assumptions breaks the approach:

• We don’t have the processing power to predict all the impacts of the AI’s actions
• Our world models may not be accurate enough to correctly predict the impact of the AI’s actions, even if we had enough processing power
• The AI’s actions may not even fit neatly into our world model - e.g. even the idea of genetic engineering might not fit the world-model of premodern human thinkers

Mathematically, we’re trying to optimize , i.e. optimize expected utility given the AI’s actions. Note that this is necessarily an expectation under the human’s model, since that’s the only context in which  is well-defined. In order for that to work out well, we need to be able to fully evaluate that estimate (sufficient processing power), we need the estimate to be accurate (sufficient predictive power), and we need  to be defined within the model in the first place.

The question of whether our world-models are sufficiently accurate is particularly hairy here, since accuracy is usually only defined in terms of how well we estimate our sense-data. But the accuracy we care about here is how well we “estimate” the values of latent variables and . What does that even mean, when the latent variables may not correspond to anything in the world?

## People I Will Never Meet

“Human values cannot be determined from human behavior” seems almost old-hat at this point, but it’s worth taking a moment to highlight just how underdetermined values are from behavior. It’s not just that humans have biases of one kind or another, or that revealed preferences diverge from stated preferences. Even in our perfect Bayesian utility-maximizer, utility is severely underdetermined from behavior, because the agent does not have perfect estimates of its latent variables. Behavior depends only on the agent’s estimate, so it cannot account for “error” in the agent’s estimates of latent variable values, nor can it tell us about how the agent values variables which are not coupled to its own choices.

The happiness of people I will never interact with is a good example of this. There may be people in the world whose happiness will not ever be significantly influenced by my choices. Presumably, then, my choices cannot tell us about how much I value such peoples’ happiness. And yet, I do value it.

## “Misspecified” Models

In Latent Variables and Model Misspecification, jsteinhardt talks about “misspecification” of latent variables in the AI’s model. His argument is that things like the “value function” are latent variables in the AI’s world-model, and are therefore potentially very sensitive to misspecification of the AI’s model.

In fact, I think the problem is more severe than that.

The value function’s inputs are latent variables in the human’s model, and are therefore sensitive to misspecification in the human’s model. If the human’s model does not match reality well, then their latent variables will be something wonky and not correspond to anything in the world. And AI designers do not get to pick the human’s model. These wonky variables, not corresponding to anything in the world, are a baked-in part of the problem, unavoidable even in principle. Even if the AI’s world model were “perfectly specified”, it would either be a bad representation of the world (in which case predictive power becomes an issue) or a bad representation of the human’s model (in which case those wonky latents aren’t defined).

The AI can’t model the world well with the human’s model, but the latents on which human values depend aren’t well-defined outside the human’s model. Rock and a hard place.

# Takeaway

Within the context of a Bayesian utility-maximizer (representing a human), utility/values are a function of latent variables in the agent’s model. That’s a problem, because those latent variables do not necessarily correspond to anything in the environment, and even when they do, we don’t have a good way to say what they correspond to.

So, an AI trying to help the agent is stuck: if the AI uses the human’s world-model, then it may just be wrong outright (in predictive terms). But if the AI doesn’t use the human’s world-model, then the latents on which the utility function depends may not be defined at all.

Thus, the pointers problem, in the Bayesian context: figure out which things in the world (if any) correspond to the latent variables in a model. What do latent variables in my model “point to” in the real world?

# Ω 19

New Comment

I definitely endorse this as a good explanation of the same pointers problem I was getting at. I particularly like the new framing in terms of a direct conflict between (a) the fact that what we care about can be seen as latent variables in our model, and (b) we value "actual states", not our estimates -- this seems like a new and better way of pointing out the problem (despite being very close in some sense to things Eliezer talked about in the sequences).

What I'd like to add to this post would be the point that we shouldn't be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don't think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don't need to.)

One reason is because finding a correspondence and applying it isn't what the agent should want. In this simple setup, where we suppose a perfect Bayesian agent, it's reasonable to argue that the AI should just use the agent's beliefs. That's what would maximize the expectation from the perspective of the agent -- not using the agent's utility function but substituting the AI's beliefs for the agent's. You mention that the agent may not have a perfect world-model, but this isn't a good argument from the agent's perspective -- certainly not an argument for just substituting the agent's model with some AI world-model.

This can be a real alignment problem for the agent (not just a mistake made by an overly dogmatic agent): if the AI believes that the moon is made of blue cheese, but the agent doesn't trust that belief, then the AI can make plans which the agent doesn't trust even if the utility function is perfect.

And if the agent does trust the AI's machine-learning-based model, then an AI which used the agent's prior would also trust the machine-learning model. So, nothing is lost by designing the AI to use the agent's prior in addition to its utility function.

So this is an argument that prior-learning is a part of alignment just as much as value-learning.

We don't usually think this way because when it comes to humans, well, it sounds like a terrible idea. Human beliefs -- as we encounter them in the wild -- are radically broken and irrational, and inadequate to the task. I think that's why I got a lot of push-back on my post about this:

I mean, I REALLY don't want that or anything like that.

- jbash

But I think normativity gives us a different way of thinking about this. We don't want the AI to use "the human prior" in the sense of some prior we can extract from human behavior, or extract from the brain, or whatever. Instead, what we want to use is "the human prior" in the normative sense -- the prior humans reflectively endorse.

This gives us a path forward on the "impossible" cases where humans believe in ghosts, etc. It's not as if humans don't have experience dealing with things of value which turn out not to be a part of the real world. We're constantly forming and reforming ontologies. The AI should be trying to learn how we deal with it -- again, not quite in a descriptive sense of how humans actually deal with it, but rather in the normative sense of how we endorse dealing with it, so that it deals with it in ways we trust and prefer.

This makes a lot of sense.

I had been weakly leaning towards the idea that a solution to the pointers problem should be a solution to deferral - i.e. it tells us when the agent defers to the AI's world model, and what mapping it uses to translate AI-variables to agent-variables. This makes me lean more in that direction.

What I'd like to add to this post would be the point that we shouldn't be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don't think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don't need to.)

I see a couple different claims mixed together here:

• The metaphilosophical problem of how we "should" handle this problem is sufficient and/or necessary to solve in its own right.
• There probably isn't a general way to find correspondences between models, so we need to operate at the meta-level.

The main thing I disagree with is the idea that there probably isn't a general way to find correspondences between models. There are clearly cases where correspondence fails outright (like the ghosts example), but I think the problem is probably solvable allowing for error-cases (by which I mean cases where the correspondence throws an error, not cases in which the correspondence returns an incorrect result). Furthermore, assuming that natural abstractions work the way I think they do, I think the problem is solvable in practice with relatively few error cases and potentially even using "prosaic" AI world-models. It's the sort of thing which would dramatically improve the success chances of alignment by default.

I absolutely do agree that we still need the metaphilosophical stuff for a first-best solution. In particular, there is not an obviously-correct way to handle the correspondence error-cases, and of course anything else in the whole setup can also be close-but-not-exactly-right . I do think that combining a solution to the pointers problem with something like the communication prior strategy, plus some obvious tweaks like partially-ordered preferences and some model of logical uncertainty, would probably be enough to land us in the basin of convergence (assuming the starting model was decent), but even then I'd prefer metaphilosophical tools to be confident that something like that would work.

With reference specifically to this:

The happiness of people I will never interact with is a good example of this. There may be people in the world whose happiness will not ever be significantly influenced by my choices. Presumably, then, my choices cannot tell us about how much I value such peoples’ happiness. And yet, I do value it.

and without considering any other part of the structure, I have an alternate view:

It is possible to determine if and how much you value the happiness (or any other attribute) of people you will never interact with by calculating

1. What are the various things you, personally, could have done in the past [time period], and how would they have affected each of the people, plants, animals, ghosts, etc. that you might care about?
2. What things did you actually do?
3. How far away from your maximum impact / time were you for each entity you could have affected. (scaled in some way tbd)
4. Derive values and weights from that. For example, if I donate $100 to Clean Water for Africa, that implies that I care about Clean Water & Africa more than I care about AIDS and Pakistan, and the level there depends on how much$100 means to me. If that's ten (or even two) hours of work to earn it that's a different level of commitment than if it represents 17 minutes of owning millions in assets.
5. Run the calculation for all desired moral agents, to average out won't-ever-see-them effects.
1. Derive values and weights from that. For example, if I donate $100 to Clean Water for Africa, that implies that I care about Clean Water & Africa more than I care about AIDS and Pakistan, and the level there depends on how much$100 means to me. If that's ten (or even two) hours of work to earn it that's a different level of commitment than if it represents 17 minutes of owning millions in assets.

This will very quickly lead to incorrect conclusions, because people don’t act according to their values (especially for things that don’t impact their day to day lives like international charity). The fact that you donated $100 to Clean Water for Africa does not mean that you value that more than AIDS in Pakistan. You personally may very well care about about clean water and/or Africa more than AIDS and/or Pakistan, but if you apply this sort of analysis writ large you will get egregiously wrong answers. Scott Alexander’s “Too Much Dark Money in Almonds” describes one facet of this rather well. Another facet is that how goods are bundled matters. Did I spend$15 on almonds because I value a) almonds b) nuts c) food d) sources of protein e) snacks I can easily eat while I drive f) snacks I can put out at parties... etc. And more importantly, which of those things do I care about more than I care about Trump losing the election?

Elizabeth Anscombe’s book Intention does a good job analyzing this. When we make actions, we are not making those actions based on the state of the world we are making those actions based on the state of the world under a particular description. One great example she gives is walking into a room and kissing a woman. Did you intend to a) kiss your girlfriend b) kiss the tallest women in the room c) kiss the woman closest to the door wearing pink d) kiss the person who got the 13th highest mark on her history exam last week e) ...

The answer is (typically) a. You intended to kiss your girlfriend. However to an outside observer who doesn’t already have a good model of humanity at large, if not a model of you in particular, it’s unclear how they’re supposed to tell that. Most people who donate to Clean Water for Africa don’t intend to be choosing that over AIDS in Pakistan. Their actions are consistent with having that intention, but you can’t derive intentionality from brute actions.

I agree with your comment, but I think it's a scale thing. If I analyze every time you walk into a room, and every time you kiss someone, I can derive that you kiss [specific person] when you see them after being apart. And this is already being done in corporate contexts with Deep Learning for specific questions, so it's just a matter of computing power, better algorithms, and some guidance at to the relevant questions and variables.

You've mostly understood the problem-as-stated, and I like the way you're thinking about it, but there's some major loopholes in this approach.

First, I may value the happiness of agents who I cannot significantly impact via my actions - for instance, prisoners in North Korea.

Second, the actions we chose probably won't provide enough data. Suppose there are n different people, and I could give any one of them \$1. I value these possibilities differently (e.g. maybe because they have different wealth/cost of living to start with, or just because I like some of them better). If we knew how much I valued each action, then we'd know how much I valued each outcome. But in fact, if I chose person 3, then all we know is that I value person 3 having the dollar more than I value anyone else having it; that's not enough information to back out how much I value each other person having the dollar. This sort of underdetermination will probably be the usual result, since the choice-of-action contains a lot less bits than a function mapping the whole action space to values.

Third, and arguably most important: "run the calculation for all desired moral agents" requires first identifying all the "desired moral agents", which is itself an instance of the problem in the post. What the heck is a "moral agent", and how does an AI know which ones are "desired"? These are latent variables in your world-model, and would need to be translated to something in the real world.

I was attempting to answer the first point, so let me rephrase: Even though your ability to affect prisoners in North Korea is miniscule, we can still look at how much of it you're doing. Are you spending any time seeking out ways you could be affecting them? Are you voting for and supporting and lobbying politicians who are more likely to use their greater power to affect the NK prisoner's lives? Are you doing [unknown thing that the AI figures out would affect them]? And, also, are you doing anything that is making their situation worse? Or any other of the multiple axis of being, since happiness isn't everything, and even happiness isn't a one-dimentional scale.

"Who counts as a moral agent? (And should they all have equal weights)" Is a question of philosophy, which I am not qualified to answer. But "who gets to decide the values to teach" it's one meta-level up from the question of "how do we teach values", so I take it as a given for the latter problem.

I'm not convinced that we can do nothing if the human wants ghosts to be happy. The AI would simply have to do what would make ghosts happy if they were real. In the worst case, the human's (coherent extrapolated) beliefs are your only source of information on how ghosts work. Any proper general solution to the pointers problem will surely handle this case. Apparently, each state of the agent corresponds to some probability distribution over worlds.

This seems like it's only true if the humans would truly cling to their belief in spite of all evidence (IE if they believed in ghosts dogmatically), which seems untrue for many things (although I grant that some humans may have some beliefs like this). I believe the idea of the ghost example is to point at cases where there's an ontological crisis, not cases where the ontology is so dogmatic that there can be no crisis (though, obviously, both cases are theoretically important).

However, I agree with you in either case -- it's not clear there's "nothing to be done" for the ghost case (in either interpretation).

I don’t understand what the purported ontological crisis is. If ghosts exist, then I want them to be happy. That doesn’t require a dogmatic belief that there are ghosts at all. In fact, it can even be true when I believe ghosts don’t exist!

I mean, that's fair. But what if your belief system justified almost everything ultimately in terms of "making ancestors happy", and relied on a belief that ancestors are still around to be happy/sad? There are several possible responses which a real human might be tempted to make:

• Give up on those values which were justified via ancestor worship, and only pursue the few values which weren't justified that way.
• Value all the same things, just not based on ancestor worship any more.
• Value all the same things, just with a more abstract notion of "making ancestors happy" rather than thinking the ancestors are literally still around.
• Value mostly the same things, but with some updates in places where ancestor worship was really warping your view of what's valuable rather than merely serving as a pleasant justification for what you already think is valuable.

So we can fix the scenario to make a more real ontological crisis.

It also bears mentioning -- the reason to be concerned about ontological crisis is, mostly, a worry that almost none of the things we express our values in terms of are "real" in a reductionistic sense. So an AI could possibly view the world through much different concepts and still be predictively accurate. The question then is, what would it mean for such an AI to pursue our values?

You could learn the pointers by observing how the model is incrementally built over time. There is much more explicit in children learning. Compare to how our modern values being very different from our ancestors'.

I think what you call the pointers problem is mostly the grounding problem applied to values. Philosophy has long tried to find solutions to it and you can google some here.

Really fascinating problem! I like how your examples make me want to say "Well, the AI just has to ask about... wait a minute, that's the problem!". Taken from another point of view, you're asking how and in which context can an AI reveal our utility functions, which means revealing our latent variables.

This problems also feels related to our discussion of the locality of goals. Here you assume a non-local goal (as most human ones are), and I think that a better knowledge of how to detect/measure locality from behavior and assumptions about the agent-model might help with the pointers problem.

Setting up the "locality of goals" concept: let's split the variables in the world model into observables , action variables , and latent variables . Note that there may be multiple stages of observations and actions, so we'll only have subsets  and  of the observation/action variables in the decision problem. The Bayesian utility maximizer then chooses  to maximize

... but we can rewrite that as

Defining a new utility function , the original problem is equivalent to:

In English: given the original utility function on the ("non-local") latent variables, we can integrate out the latents to get a new utility function defined only on the ("local") observation & decision variables. The new utility function yields completely identical agent behavior to the original.

So observing agent behavior alone cannot possibly let us distinguish preferences on latent variables from preferences on the "local" observation & decision variables.

Over the last few posts the recurrent thought I have is "why aren't you talking about compression more explicitly?"

Could you uncompress this comment a bit please?

A pointer is sort of the ultimate in lossy compression. Just an index to the uncompressed data, like a legible compression library. Wireheading is a goodhearting problem, which is a lossy compression problem etc.

I like this post. I have thoughts along the same lines sometimes, and it makes me feel a bit overwhelmed and nihilistic, so then I go back to thinking about easier problems :-P

Is it even solvable? Definitely not always - there probably is no real-world referent for e.g. the human concept of a ghost.

Michael Graziano has another nice example: "pure whiteness".

And then he argues argues that another example is, ummm, the whole idea of conscious experience, which would be a bit problematic for philosophy and ethics if true. See my Book Review: Rethinking Consciousness.

I think that one of the problems in this post is actually easier in the real world than in the toy model.

In the toy model the AI has to succeed by maximizing the agent's True Values, which the agent is assumed to have as a unique function over its model of the world. This is a very tricky problem, especially when, as you point out, we might allow the agent's model of reality to be wrong in places.

But in the real world, humans don't have a unique set of True Values or even a unique model of the world - we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense.

Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans (and we might have desiderata about how it does that modeling and what the end results should contain), and satisfy the modeled values. And in some ways this is actually a bit reassuring, because I'm pretty sure that it's possible to get better final results on this problem than on than learning the toy model agent's True Values - maybe not in the most simple case, but as you add things like lack of introspection, distributional shift, meta-preferences like identifying some behavior as "bias," etc.

This comment seems wrong to me in ways that make me think I'm missing your point.

Some examples and what seems wrong about them, with the understanding that I'm probably misunderstanding what you're trying to point to:

we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense

I have no idea why this would be tied to non-Cartesian-ness.

But in the real world, humans don't have a unique set of True Values or even a unique model of the world

There are certainly ways in which humans diverge from Bayesian utility maximization, but I don't see why we would think that values or models are non-unique. Certainly we use multiple levels of abstraction, or multiple sub-models, but that's quite different from having multiple distinct world-models.

Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans [...] and satisfy the modeled values.

How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say "just pick one set of values/one world model and satisfy that", which seems wrong.

One way to interpret all this is that you're pointing to things like submodels, subagents, multiple abstraction levels, etc. But then I don't see why the problem would be any easier in the real world than in the model, since all of those things can be expressed in the model (or a straightforward extension of the model, in the case of subagents).

Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.

If you don't agree with me on this, why didn't you reply when I spent about six months just writing posts that were all variations of this idea? Here's Scott Alexander making the basic point.

It's like... is there a True rational approximation of pi? Well, 22/7 is pretty good, but 355/113 is more precise, if harder to remember. And just 3 is really easy to remember, but not as precise. And of course there's the arbitrarily large "approximation" that is 3.141592... Depending on what you need to use it for, you might have different preferences about the tradeoff between simplicity and precision. There is no True rational approximation of pi. True Human Values are similar, except instead of one tradeoff that you can make it's approximately one bajillion.

• we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense

I have no idea why this would be tied to non-Cartesian-ness.

If a Cartesian agent was talking about their values, they could just be like "you know, those things that are specified as my values in the logic-stuff my mind is made out of." (Though this assumes some level of introspective access / genre savviness that needn't be assumed, so if you don't want to assume this then we can just say I was mistaken.). When a human talks about their values they can't take that shortcut, and instead have to specify values as a function of how they affect their behavior. This introduces the dependency on how we're breaking down the world into categories like "human behavior."

• Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans [...] and satisfy the modeled values.

How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say "just pick one set of values/one world model and satisfy that", which seems wrong.

Well, if there were unique values, we could say "maximize the unique values." Since there aren't, we can't. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we're going to have to do with that wrong-seeming.

Well, if there were unique values, we could say "maximize the unique values." Since there aren't, we can't. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we're going to have to do with that wrong-seeming.

Before I get into the meat of the response... I certainly agree that values are probably a partial order, not a total order. However, that still leaves basically all the problems in the OP: that partial order is still a function of latent variables in the human's world-model, which still gives rise to all the same problems as a total order in the human's world-model. (Intuitive way to conceptualize this: we can represent the partial order as a set of total orders, i.e. represent the human as a set of utility-maximizing subagents. Each of those subagents is still a normal Bayesian utility maximizer, and still suffers from the problems in the OP.)

Anyway, I don't think that's the main disconnect here...

Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.

Ok, I think I see what you're saying now. I am of course on board with the notion that e.g. human values do not make sense when we're modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.

However, there is a difference between "the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction", and "there is not a unique, well-defined referent of 'human values'". The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human's model.

An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its "goals" do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different "goals" to the robot - e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.

And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. "The utility function I hard-coded into the robot" is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that "the utility function I hard-coded into the robot" would correspond to some unambiguous thing in the real world - although specifying exactly what that thing is, is an instance of the pointers problem.

Does that make sense? Am I still missing something here?