One-step hypothetical preferences

Stuart_Armstrong

Human preferences are time-inconsistent, and also contradictory.

That, by itself, is not a huge problem, but it's also the case that few human preferences are present at any given moment. At the moment, I'm focused on finding the best explanation to get my ideas through to you, the reader; I'm not focused on my moral preferences, personal safety desires, political beliefs, or taste in music.

If anyone asked me about those, I could immediately bring them to mind. My answers to standard questions are kinda in the background, accessible but not accessed. Wei Dai made a similar point about translators: they have a lot of trained knowledge that is not immediately accessible to their introspection. And only by giving them the inputs they were trained on (eg words, sentences,...) can you bring that knowledge to the fore.

In this post, I'll try and formalise these accessible preferences, starting with formalising preferences in general.

Basic preferences setup

This section will formalise the setup presented in Alice's example. Let $W$ be the set of all possible worlds. A human makes use of a model $M$. This model contains a lot of variables $\{P_i\}$, called properties. Each $P_i$ takes values in a domain $D_i$.

A basic set $S$ of states in $M$ is a set of possible values for some of the $P_i$. Thus $S = \{S_i\}$, with $S_i \subset D_i$. The property $P_i$ is unconstrained in $S$ if $S_i = D_i$. A general set of states is a union of basic sets $S$; let $\mathcal{S}$ be the set of all these sets of states.

For example, a human could be imagining four of their friends, and the $P_{ij}$ could be whether friend $i$ is sleeping with friend $j$ ($6$ different Boolean properties, with $P_{ij} = P_{ji}$), and also whether a third friend $k$ believes two others are sleeping together ($12$ different properties, with $P_{ijk} = P_{jik}$, taking values in $\{\text{"sleeping together"}, \text{"not sleeping together"}, \text{"don't know"}\}$).
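As a quick check of the counts above, the following Python sketch enumerates both families of properties for four friends. The friend labels are placeholders chosen for illustration.

```python
from itertools import combinations

# Hypothetical labels for the four friends.
friends = ["A", "X", "Y", "Z"]

# P_ij: one Boolean property per unordered pair (P_ij = P_ji).
pair_props = [frozenset(p) for p in combinations(friends, 2)]

# P_ijk: what friend k believes about the pair {i, j}, with k outside the pair.
belief_props = [
    (pair, k)
    for pair in pair_props
    for k in friends
    if k not in pair
]

print(len(pair_props))    # 6
print(len(belief_props))  # 12
```

Each of the 6 pairs leaves 2 other friends who can hold a belief about it, which gives the 12 belief properties.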

Then a statement of human gossip like "X is sleeping with Y, but A doesn't realise it; in fact, A thinks that Y is sleeping with Z, which is totally not true!" is encoded as:

$S_G = \{P_{XY} = 1,\ P_{YZ} = 0,\ P_{XYA} \in \{\text{"don't know"}, \text{"not sleeping together"}\},\ P_{YZA} = \text{"sleeping together"}\}$, with the other $P$s unconstrained.

It's interesting how unintuitive that formulation is, compared with how our brains instinctively parse gossip.
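For concreteness, the encoding of $S_G$ can be sketched in Python as a mapping from property names to sets of allowed values. This is only an illustration: the string names, the `in_basic_set` helper, and the restriction to the four properties mentioned in the gossip are assumptions, not part of the formalism.

```python
# Domains D_i for the properties used in the gossip example.
BOOL = frozenset({0, 1})
BELIEF = frozenset({"sleeping together", "not sleeping together", "don't know"})

domains = {
    "P_XY": BOOL,
    "P_YZ": BOOL,
    "P_XYA": BELIEF,
    "P_YZA": BELIEF,
}

# A basic set of states: allowed values S_i for the constrained properties.
# Properties that are not listed are unconstrained (S_i = D_i).
S_G = {
    "P_XY": frozenset({1}),
    "P_YZ": frozenset({0}),
    "P_XYA": frozenset({"don't know", "not sleeping together"}),
    "P_YZA": frozenset({"sleeping together"}),
}

def in_basic_set(state, basic_set, domains):
    """Return True if every property value in `state` is allowed by `basic_set`."""
    return all(
        value in basic_set.get(prop, domains[prop])
        for prop, value in state.items()
    )

state = {"P_XY": 1, "P_YZ": 0, "P_XYA": "don't know", "P_YZA": "sleeping together"}
print(in_basic_set(state, S_G, domains))  # True
```

Leaving a property out of the dictionary plays the role of "unconstrained": the lookup falls back to the full domain.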

To be useful, these symbols need to be grounded. This is achieved via a function $g$ that takes a set of states $S$ and maps it to a set of worlds: $g(S) \subset W$.

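Finally, the human expresses a judgement about the states of $M$, mentally categorising one set of states as better than another. This is an anti-symmetric partial function $J : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$, where $\mathcal{S}$ is the set of all sets of states and anti-symmetric means $J(S, S') = -J(S', S)$ wherever $J$ is defined. It must be non-trivial on at least one pair of inputs.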

For example, if $S_G$ is the gossip set above, and $S_G'$ is the same statement with $P_{YZA} = \text{"not sleeping together"}$, then a human that values honesty might judge $J(S_G, S_G') = -1$; ie it is worse if $A$ believes a lie about $Y$ and $Z$.

The sign of $J(S, S')$ indicates which set the human prefers; the magnitude is the difficult-to-define weight or intensity of the preference.
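One way to picture an anti-symmetric partial function is as a lookup table that stores both orderings of each judged pair. The sketch below is a toy illustration; the `Judgement` class and the string labels for sets of states are hypothetical.

```python
class Judgement:
    """Partial, anti-symmetric judgement J over pairs of named sets of states."""

    def __init__(self):
        self._values = {}

    def judge(self, s, s_prime, value):
        if s == s_prime and value != 0:
            raise ValueError("anti-symmetry forces J(S, S) = 0")
        # Store both orderings so that J(S', S) = -J(S, S') always holds.
        self._values[(s, s_prime)] = value
        self._values[(s_prime, s)] = -value

    def __call__(self, s, s_prime):
        # None marks pairs on which the partial function is undefined.
        return self._values.get((s, s_prime))


J = Judgement()
J.judge("S_G", "S_G_prime", -1)
print(J("S_G", "S_G_prime"))  # -1
print(J("S_G_prime", "S_G"))  # 1
print(J("S_G", "S_other"))    # None
```

Pairs the human has never compared simply return `None`, which is what makes the function partial rather than total.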

Hypotheticals posed to the human

Let $M_{J}$ be the set of possible pairs $(M, J)$ defined in the previous section. Humans rarely consider many $(M, J)$ at the same time. We often only consider one, or zero.

A hypothetical is some possible short intervention - a friend asks them a question, they get an email, a TV in the background shows something salient - that will cause a human to mentally use a model $M$ and pass judgement $J$ within it. Note that this is not the same as Paul Christiano's definition of ascription: we don't actually need the human to answer anything, just to think.

So if $H_t$ is the set of possible hypothetical interventions at time $t$, we have a (counterfactual) map $f$ from $H_t$ to $M_J$.

Now, not all moments are ideal for a human to do much reflection (though a lot of instinctive reactions are also very informative). So it might be good to expand the time a bit, to, say, $T$ = a week, and consider all the models that a human could hypothetically be made to consider in that time.

So let $H_t^T$ be the set of hypothetical short interventions from time $t$ to $t + T$, given that this intervention is the first in that time period. Then there is a natural map

$f : H_t^T \to M_J$.
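To make the type of this map concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: the `Hypothetical` class, the prompts, and the hand-written `TABLE` are stand-ins, and strings are used as labels for the pair $(M, J)$.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Hypothetical:
    day: float   # days after t at which the intervention is posed
    prompt: str  # e.g. a question from a friend, or the text of an email


ModelAndJudgement = Tuple[str, str]  # labels standing in for (M, J)

T = 7.0  # window length in days ("a week")

# Toy, hand-written table of reactions, used only to show the shape of the map.
TABLE: Dict[str, ModelAndJudgement] = {
    "Should Neanderthals have human rights?": ("rights model", "J_rights"),
    "Would you pull the lever?": ("trolley model", "J_trolley"),
}


def f(h: Hypothetical) -> ModelAndJudgement:
    """Map a first intervention in [t, t + T] to the (M, J) it triggers."""
    if not 0.0 <= h.day <= T:
        raise ValueError("intervention falls outside [t, t + T]")
    return TABLE[h.prompt]


print(f(Hypothetical(day=2.0, prompt="Would you pull the lever?")))
# ('trolley model', 'J_trolley')
```

The lookup table is the least realistic part: it only serves to show what goes in and what comes out.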

Idealised object

The map $f$ is a highly idealised and counterfactual object - there is no way we can actually test a human on the vast number of possible interventions. So the AI would not be tasked with "use $f$ to establish human preferences", but "estimate $f$ to estimate human preferences".

The map $f$ will also reveal a lot of contradictions, since humans often have different opinions on the same subject, depending on how the intervention or question is phrased. Different phrasings may trigger different internal models of the same issue, or even different judgements within the same model. And, of course, the same intervention at different times (or by different agents) may trigger different reactions.
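As a rough illustration of what "contradiction" could mean operationally, the sketch below flags pairs of interventions whose judgements on the same pair of state sets have opposite signs. The intervention labels and numerical values are invented for the example, and comparing signs is just one possible criterion.

```python
from itertools import combinations

# Hypothetical estimates: intervention -> J(S, S') for one fixed pair of state sets.
estimated = {
    "phrasing A, Monday": 1.0,
    "phrasing B, Monday": -0.5,
    "phrasing A, Friday": 0.8,
}


def contradictions(judgements):
    """Return pairs of interventions whose judgements have opposite signs."""
    return [
        (h1, h2)
        for (h1, j1), (h2, j2) in combinations(judgements.items(), 2)
        if j1 * j2 < 0
    ]


print(contradictions(estimated))
# [('phrasing A, Monday', 'phrasing B, Monday'), ('phrasing B, Monday', 'phrasing A, Friday')]
```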

But dealing with contradictions is just one of the things that we have to sort out with human preferences.

Minimum modification

I mentioned the interventions should be short; that $T$ should be a short period; and that the interventions in $H_{t}^{T}$ should be the first in that time period. The whole idea is to avoid "modifying" the human too much, or giving the AI too much power to change, rather than reflect, the human's values. The human's reaction should be as close as possible to an unvarnished initial reaction.

There may be other ways of reducing the AI's influence, but it is still useful to get these initial reactions.

One-step hypotheticals

In slight contrast with the previous section, it is very valuable to get the human to reflect on new issues they hadn't considered before. For example, we could introduce them to philosophical thought experiments they hadn't seen before (maybe the trolley problem or the repugnant conclusion, or unusual variants of these), or present ideas that cut across their usual political boundaries, or the boundaries of their categories (eg whether Neanderthals should have human rights if a tribe of them were suddenly discovered today).

This is, in a sense, a minimum extrapolation, the very first tentative step of CEV. We are not asking what the human would think if they were smarter, but instead what they would think if they encountered a novel problem for the first time.

These "one-step hypotheticals" are thus different from the human's everyday current judgement, yet don't involve transforming the human into something else.

EDIT: Avturchin asks whether I expect these one-step hypotheticals to reveal hidden preferences, or to force humans to make a choice, knowing that they might have made a different choice in different circumstances.

The answer is... a bit of both. I expect the hypotheticals to sometimes contradict each other, depending on the phrasing and the timing. I expect them to contradict each other more often than more familiar questions ("zero-step hypotheticals") do.

But I don't expect the answers to be completely random, either. There will be a lot of information there. And the pattern of different hypotheticals in $H_t^T$ leading to different or contradictory $J$ is relevant, and not random.

"Finally, the human expresses a judgement about the states of $M$, mentally categorising a set of states as better than another. This is an anti-symmetric partial function $J : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$, a partial function that is non trivial on at least one pair of inputs."

I continue to be unsure if we can even claim anti-symmetry of the preference relation. For example, let $S_A$ be the state "I eat an apple" and $S_O$ the state "I eat an orange". Suppose that today $J(S_A, S_O) > 0$ but tomorrow $J(S_O, S_A) > 0$, seemingly violating anti-symmetry. Now of course maybe I misunderstood $S_A$ and $S_O$, such that they actually included a hidden-to-my-awareness property conditioning them on time or something else, and anti-symmetry is not violated. But the fact that some property on the states that I didn't think about at first may salvage anti-symmetry makes me worry that this model is confused in this and other ways: it was so easy to think of and construct something that seemingly violated the property but, on further reflection, seems not to.

That's not a slam-dunk argument against this formalization. This is more me sharing some thoughts on my reservations about using this type of model. If we can so easily fail to notice something relevant about how we formalize some simple preferences, what else may we be failing to notice? And if so, what happens if we build an AI based in part on this formalization? Will it also fail to account for relevant aspects of how human preferences are calculated because they are not easily visible to us in the model, or is that a failure of humans to understand themselves rather than the model? These are the things I'm wrestling with lately.

I also have some reservations about whether we can even really model humans as having discrete preferences that we can reason about in this way without getting ourselves into trouble and confused. Not to say that I doubt that this model often works, only that I worry that it's missing some important details that are relevant for alignment, and that without accounting for them we will fail to produce aligned AI. I worry about this because there doesn't seem to be anything in the human mind that actually is a preference; preferences are more like reifications of a pattern of action that appears in humans. Getting closer to understanding the mechanism that produces the pattern we interpret as preferences seems valuable to me in this work, because I worry we're missing crucial details when we reason about preferences at the level of detail you pursue here.

I see the orange-apple preference reversal as another example of conditional preferences.

I agree that viewing preferences as conditioned on the environment, up to and including the entire history of the observable universe, is a sensible improvement over many more simplistic models that result in clear violations of preference normativity, and that it eliminates many of those violations. My concern is that this is not so obvious as to be the normal way of thinking about preferences in all fields, and was nonobvious enough that you had to write a post about the point. That makes me cautious about concluding that it is enough to make the value abstraction you currently use sufficient for purposes of AI alignment. I basically view conditionality of preferences as neutral evidence about the explanatory power of the theory (for the purpose of AI alignment).

Valid point, though conditional meta-preferences are things I've already written about, and the issue of being wrong now about what your own preferences would be in the future is also something I've addressed multiple times in different forms. Your example is particularly crisp, though.

Do these "one-step hypotheticals" reveal hidden preferences, or force a human to make a choice, to which she will later stick in order to preserve her consistency? For example, I could give a random answer to a question about Neanderthal tribe rights, but later rationalise why it should be true. I think I have heard of some psychological research which demonstrated such behaviour.

Added an addendum to the post to address this issue. The "later rationalise" is not really relevant here, because we're not thinking of doing all these hypothetical interventions.