In previous posts, I've been assuming that human values are complete and consistent, but, finally, we are ready to deal with actual human values/preferences/rewards - the whole contradictory, messy, incoherent mess of them.

Define "completely resolving human values" as an AI extracting a consistent human reward function from the inconsistent data on the preferences and values of a single human (leaving aside the easier case of resolving conflicts between different humans). This post will look at how such resolutions could be done - or at least propose an initial attempt, to be improved upon.

EDIT: There is a problem with rendering some of the LaTeX, which I don't understand. The draft rendered it fine, but not the published version. So I've replaced some LaTeX with unicode or image files; it generally works, but there are oversized images in section 3.

Adequate versus elegant

Part of the problem with resolving human values is that people have been looking to do it too well and too elegantly. This results in either complete resolutions that ignore vast parts of the values (eg hedonic utilitarianism), or in thorough analyses of a tiny part of the problem (eg all the papers published on the trolley problem).

Incomplete resolutions are not sufficient to guide an AI, and elegant complete resolutions seem to be like most utopias: not any good for real humans.

Much better to aim for an adequate complete resolution. Adequacy means two things here:

  • It doesn't lead to disastrous outcomes, according to the human's current values.
  • If a human has a strong value or meta-value, that will strongly influence the ultimate human reward function, unless their other values point strongly in the opposite direction.

Aiming for adequacy is quite freeing, allowing you to go ahead and construct a resolution, which can then be tweaked and improved upon. It also opens up a whole new space of possible solutions. And, last but not least, any attempt to formalise and write a solution gives a much better understanding of the problem.

Basic framework, then modifications

This post is a first attempt at constructing such an adequate complete resolution. Some of the details will remain to be filled in, others will doubtlessly be changed; nevertheless, this first attempt should be instructive.

The resolution will be built in three steps:

  • a) It will provide a basic framework for resolving low level values, or meta-values of the same "level".
  • b) It will extend this framework to account for some types of meta-values applying to lower level values.
  • c) It will then allow some meta-values to modify the whole framework.

Finally, the post will conclude with some types of meta-values that are hard to integrate into this framework.

1 Terminology and basic concepts

Let H be a human, whose "true" values we are trying to elucidate. Let ℰ be the set of possible environments (including their transition rules), with E ∈ ℰ the actual environment. And let ℋ_t be the set of future histories that the human may encounter, from time t onward (the human's past history is seen as part of the environment).

Let ℛ be a set of rewards. We'll assume that ℛ is closed under many operations - affine transformations (including negation), adding two rewards together, multiplying them together, and so on. For simplicity, assume that ℛ is a real vector space, generated by a finite number of basis rewards.

Then define V to be a set of potential values of H. This is defined to be all the value/preference/reward statements that H might agree to, more or less strongly.

1.1 The role of the AI

The AI's role is to elucidate how much the human actually accepts statements in V (see for instance here and here). For any given v ∈ V, it will compute w(v), the weight of the value v. For mental calibration purposes, assume that w(v) is in the 0 to 1 range, and that if the human has no current opinion on v, then w(v) is zero (the converse is not true: w(v) could be zero because the human has carefully analysed v but found it to be irrelevant or negative).

The AI will also compute θ(v): ℛ ∪ V → ℝ, the endorsement of v. This measures the extent to which v 'approves' or 'disapproves' of a certain reward or value (there is a reward normalisation issue which I'll elide for now).

Object level values are those which are non-zero only on rewards; ie the v ∈ V for which θ(v)(v′) = 0 for all v′ ∈ V. To avoid the most obvious self-referential problem, any value's self-endorsement is assumed to be zero (so θ(v)(v) = 0). As we will see below, positively endorsing a negative reward is not the same as negatively endorsing a positive reward: θ(v)(−R) > 0 does not mean the same thing as θ(v)(R) < 0.

Then this post will attempt to define the resolution function Θ, which maps weights, endorsements, and the environment to a single reward function. So if 𝒟 is the cross product of all possible weight functions, endorsement functions, and environments:

Θ: 𝒟 → ℛ.

In the following, we'll also have need for a more general Θ, and for special distributions over ℋ_t dependent on a given v; but we'll define them as and when they are needed.

2 The basic framework

In this section, we'll introduce a basic framework for resolving rewards. This will involve making a certain number of arbitrary choices, choices that may then get modified in the next section.

This section will deal with the problems with human values being contradictory, underdefined, changeable, and manipulable. As a side effect, this will also deal with the fact that humans can make moral errors (and end up feeling their previous values were 'wrong'), and that they can derive insights from philosophical thought experiments.

As an example, we'll use a classic modern dilemma: whether to indulge in bacon or to keep slim.

So let there be two rewards, R_b, the bacon reward, and R_s, the slimness reward. Assume that if H always indulges, R_b is high and R_s is low, while if they never indulge, R_b is low and R_s is high. There are various tradeoffs and gains from trade for intermediate levels of indulgence, the details of which are not relevant here.

Then define ℛ to be the space of rewards generated by R_b and R_s.

2.1 Contradictory values

Define v_1 by {I like eating bacon}, and v_2 by {I want to keep slim}. Given the right normative assumptions, the AI can easily establish that w(v_1) and w(v_2) are both greater than zero. For example, it can note that the human sometimes does indulge, or desires to do so; but also the human feels sad about gaining weight, shame about their lack of discipline, and sometimes engages in anti-bacon precommitment activities.

The natural thing here is to weight the bacon reward R_b by the weight of v_1 and the endorsement that v_1 gives to R_b (and similarly with v_2 and the slimness reward R_s). This means that

Θ(w, θ, E) = w(v_1) θ(v_1)(R_b) R_b + w(v_2) θ(v_2)(R_s) R_s.

Or, for the more general formula, with implicit uncurrying so as to write θ as a function of two variables:

Θ(w, θ, E) = Σ_{R ∈ ℛ} Σ_{v ∈ V} w(v) θ(v, R) R.

For this post, I'll ignore the issue of whether that sum always converges (which it would almost certainly do, in practice).
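As a rough illustration of the weighting rule in this subsection, here is a minimal Python sketch. The dictionary layout, the function name `resolve_linear`, the value and reward labels, and every number are assumptions invented for the example; the post itself does not specify them. Each reward's coefficient is the sum, over values, of weight times endorsement.

```python
from collections import defaultdict


def resolve_linear(weights, endorsements):
    """Return {reward_name: coefficient}.

    Each coefficient is the sum over values v of
    weight(v) * endorsement(v, reward).
    """
    coeffs = defaultdict(float)
    for value, w in weights.items():
        for reward, e in endorsements.get(value, {}).items():
            coeffs[reward] += w * e
    return dict(coeffs)


# Hypothetical numbers, chosen only for illustration.
weights = {"likes_bacon": 0.6, "wants_slim": 0.8}
endorsements = {
    "likes_bacon": {"R_b": 1.0},  # endorses the bacon reward
    "wants_slim": {"R_s": 1.0},   # endorses the slimness reward
}

coeffs = resolve_linear(weights, endorsements)
print(coeffs)
```

A policy would then be scored by the sum of the rewards it achieves, each multiplied by its coefficient.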

2.2 Unendorsing rewards

I said there was a difference between a negative endorsement of R, and a positive endorsement of −R. A positive endorsement of −R is just a value judgement that sees −R as good, while a negative endorsement of R just doesn't want R to appear at all.

For example, consider v_3 = {I'm not worried about my weight}. Obviously this has a negative endorsement of the slimness reward R_s, but it doesn't have a positive endorsement of −R_s - it explicitly doesn't have a desire to be fat, either. So the weight and endorsement of v_3 are fine when it comes to reducing the positive weight of R_s, but not when making a zero or negative weight more negative. To capture that, rewrite Θ as:

Θ(w, θ, E) = Σ_{R ∈ ℛ} max(0, Σ_{v ∈ V} w(v) θ(v, R)) R.

Then the AI, to maximise H's rewards, simply needs to follow the policy that maximises that reward.
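One hedged reading of the zero floor described in this subsection, as a Python sketch: sum the weighted endorsements for each reward, then clip the total at zero. The function name `resolve_clipped`, the labels, and the numbers are all assumptions made for illustration. The negated reward −R_s would be a separate key with its own total, so an unendorsement of R_s never turns into support for −R_s.

```python
def resolve_clipped(weights, endorsements):
    """Return {reward_name: coefficient}, with each coefficient floored at zero."""
    totals = {}
    for value, w in weights.items():
        for reward, e in endorsements.get(value, {}).items():
            totals[reward] = totals.get(reward, 0.0) + w * e
    # A negative total means "do not count this reward", not "pursue its opposite".
    return {reward: max(0.0, total) for reward, total in totals.items()}


# Hypothetical endorsements: one value endorses slimness, one unendorses it.
endorsements = {
    "wants_slim": {"R_s": 1.0},
    "not_worried_about_weight": {"R_s": -1.0},
}

# A mild unendorsement reduces the coefficient of R_s...
mild = resolve_clipped(
    {"wants_slim": 0.5, "not_worried_about_weight": 0.25}, endorsements
)
# ...while a strong one can only push it down to zero.
strong = resolve_clipped(
    {"wants_slim": 0.5, "not_worried_about_weight": 0.75}, endorsements
)
print(mild, strong)
```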

2.3 Underdefined rewards

Let's now look at the problem of underdefined values. To illustrate, add the option of liposuction to the main model. If H indulges in bacon, and undergoes liposuction, then both the bacon reward R_b and the slimness reward R_s can be set to their high values.

But H might not want to undergo liposuction (assumed, in this model, to be costless). Let R_l be the reward for no liposuction, high if liposuction is avoided, and low if it happens, and let v_4 = {I want to avoid liposuction}. Extend ℛ to include R_l.

Because H hasn't thought much about liposuction, they currently have w(v_4) = 0. But it's possible they may have firm views on it, after some reflection. If so, it would be good to use those views now. When humans haven't thought about values, there are many ways they can develop them, depending on how the issue is presented to them and how it interacts with their categories, social circles, moral instincts, and world models.

For example, assume that the AI can figure out that, if H is given a description of liposuction that starts with "lazy people can cheat by...", then they will be against it: w(v_4) will be greater than zero. However, if they are given a description that starts with "efficient people can optimise by...", then they will be in favour of it, and w(v_4) will be zero.

If is the weight of at future time , given the future history , define the discounted future weight as

for, say, if is denominated in days. If is the history with the "lazy" description, this will be greater than zero. If it's the history with the "efficient" description, it will be close to zero.

We'd like to use the expected value of this discounted weight, but there are two problems with that. The first is that many possible futures might involve no reflection about v on the part of H. We don't care about these futures. The other is that these futures depend on the actions of the AI, so that it can manipulate the human's future values.

So define a set of relevant histories for v, a subset of the set of histories ℋ_t. This subset is defined firstly so that the human will have relevant opinions about v: they won't be indifferent to it. Secondly, these are futures in which the human is allowed to develop their values 'naturally', without undue rigging and influence on the part of the AI (see this for an example of such a distribution). Note that these need not be histories which will actually happen, just future histories which the AI can estimate. Then take the probability distribution of future histories, restricted to this subset (this requires that the AI pick a sensible probability distribution over its own future policy, at least for the purpose of computing this probability distribution).

Note that the exact definitions of this subset and this distribution are vitally important and still need to be fully established. That is a critical problem I'll be returning to in the future.

Laying that aside for the moment, we can define the expected relevant weight:

Then the formula for Θ is as before, but using the expected relevant weight instead of w(v).
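To make the idea of an expected relevant weight concrete, here is a small Python sketch. It rests on several assumptions that are not in the post: a finite horizon, a normalised geometric discount with a hypothetical factor `gamma`, and a list of histories that has already been restricted to futures where the human reflects on the value without being manipulated. The function names and the numbers are invented for the example.

```python
def discounted_weight(weights_over_time, gamma):
    """Normalised geometrically discounted average of future weights.

    Assumption: the normalisation and the finite horizon are choices made
    for this sketch only.
    """
    factors = [gamma ** t for t in range(len(weights_over_time))]
    return sum(f * w for f, w in zip(factors, weights_over_time)) / sum(factors)


def expected_relevant_weight(histories, gamma):
    """Probability-weighted mean of discounted weights.

    `histories` is a list of (probability, weights_over_time) pairs, assumed
    to be already restricted to relevant, unmanipulated futures.
    """
    total_p = sum(p for p, _ in histories)
    return sum(p * discounted_weight(ws, gamma) for p, ws in histories) / total_p


# Hypothetical daily weights for "I want to avoid liposuction".
lazy_framing = [0.0, 1.0, 1.0]       # the human turns against liposuction
efficient_framing = [0.0, 0.0, 0.0]  # the human stays in favour of it
histories = [(0.5, lazy_framing), (0.5, efficient_framing)]

lazy_w = discounted_weight(lazy_framing, gamma=0.5)
expected_w = expected_relevant_weight(histories, gamma=0.5)
print(lazy_w, expected_w)
```

The resulting number would replace the current weight of the value when computing the resolved reward; which histories count as relevant and unmanipulated is exactly the part this sketch leaves open.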

2.4 Moral errors and moral learning

The above was designed to address underdefined values, but it actually does much more than that. It deals with changeable values, and addresses moral errors and moral learning.

An example of moral error is thinking that you want something, but, upon achieving it, you find that you don't. Let us examine v_2, the desire to be slim. People don't generally have a strong intrinsic desire for slimness just for the sake of it; instead, they might strive for this because they think it will make them healthier, happier, might increase their future status, might increase their self-discipline in general, or something similar.

So we could replace v_2 with v_5 = {I desire X}, where X is something that H believes will come out of slimming.

When computing the expected relevant weights of v_2 and v_5, the AI will test how H reacts to achieving slimness, or achieving X, and ultimately compute a low weight for v_2 but a high one for v_5. This would be even more the case if the set of relevant histories is allowed to contain impossible future histories, such as hypotheticals where the human miraculously slims without achieving X, or vice-versa.

The use of expected relevant weights also picks up systematic, predictable moral change. For example, the human may be currently committed to a narrative that sees themselves as a disciplined, stereotypically rational being that will overcome their short term weaknesses. Their current weight w(v_2) is high. However, the AI knows that trying to slim will be unpleasant for H, and that they will soon give up as the pain mounts, and change their narrative to one where they accept and balance their own foibles. So the expected relevant weight of v_2 is low, under most reasonable futures where humans cannot control their own value changes (this has obvious analogies with major life changes, such as loss of faith or changes in political outlooks).

Then there is the third case where strongly held values may end up being incoherent (as I argued is the case of the 'purity' moral foundation). Suppose the human deeply believes that v_6 = {Humans have souls and pigs don't, so it's ok to eat pigs, but not ok to defile the human form with liposuction}. This value would thus endorse R_b and R_l, the bacon and no-liposuction rewards. But it's also based on false facts.

There seem to be three standard ways to resolve this. Replacing "soul" with, say, "mind capable of complex thought and ability to suffer", they may shift to v_7 = {I should not eat pigs}. Or if they go for "humans have no souls, so 'defilement' makes no sense", they may embrace v_8 = {All human enhancements are fine}. Or, as happens often in the real world when people can't justify their values, they may shift their justification but preserve the basic value: v_9 = {It is natural and traditional and therefore good to eat pig, and avoid liposuction}.

Now, I feel v_9 is probably incoherent as well, but there is no lack of coherent-but-arbitrary reasons to eat pigs and avoid liposuction, so some value set similar to that is plausible.

Then a suitably defined set of relevant histories would allow the AI to figure out which way the human wants to update their values for v_6, v_7, v_8, and v_9, as the human moves away from the incorrect first values to one of the other alternatives.

2.5 Automated philosophy and CEV

The use of relevant future histories also allows one to introduce philosophy to the mix. One simply needs to include in those histories the presentation of philosophical thought experiments to H, and H's reaction and updating. Similarly, one can do the initial steps of coherent extrapolated volition, by including futures where H changes themselves in the desired direction. This can be seen as automating some of philosophy (this approach has nothing to say about epistemology and ontology, for instance).

Indeed, you could define philosophers as people with particularly strong philosophical meta-values: that is, putting a high premium on philosophical consistency, simplicity, and logic.

The more weight is given to philosophy or to frameworks like CEV, the more elegant and coherent the resulting resolution is, but the higher the risk of it going disastrously wrong by losing key parts of human values - we risk running into the problems detailed here and here.

2.6 Meta-values

We'll conclude this section by looking at how one can apply the above framework to meta-values. These are values that have non-zero endorsements of other values, ie θ(v)(v′) ≠ 0 for some v′ ∈ V.

The previous v_8 = {All human enhancements are fine} could be seen as a meta-value, one that unendorses the anti-liposuction value v_4, so θ(v_8)(v_4) < 0. Or we might have one that unendorses short-term values: v_10 = {Short-term values are less important}, with θ(v_10)(v_1) < 0 for the bacon value v_1.

The problem comes when values start referring to values that start referring to themselves. This allows indirect self-reference, with all the trouble that that brings.

Now, there are various tools for dealing with self-reference or circular reasoning - Paul Christiano's probabilistic self-reference, and Scott Aaronson's Eigenmorality are obvious candidates.

But in the spirit of adequacy, I'll directly define a method that can take all these possibly self-referential values and resolve them. Those who are not interested in the maths here can skip to the next section; there is no real insight here.

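Let n = |V|, and let σ be an ordering (or a permutation) of V, ie a bijective map from {1, …, n} to V. Then recursively define w^σ by w^σ(σ(1)) = w(σ(1)), and

w^σ(σ(i)) = max(0, w(σ(i)) + Σ_{j<i} w^σ(σ(j)) θ(σ(j))(σ(i))).

Thus each w^σ(σ(i)) is the sum of the actual weight w(σ(i)), plus the w^σ-adjusted endorsements of the values preceding it (in the ordering), with the zero lower bound. By averaging across the set S of all permutations of V, we can then define:

w̄(v) = (1/n!) Σ_{σ ∈ S} w^σ(v).

Then, finally, for resolving the reward, we can use these weights in the standard reward function:

Θ(w, θ, E) = Σ_{R ∈ ℛ} max(0, Σ_{v ∈ V} w̄(v) θ(v, R)) R.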
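The ordering-and-averaging procedure described in this subsection can be sketched in Python as follows. This is a sketch under stated assumptions rather than the post's exact definition: the function names, the nested-dictionary encoding of value-to-value endorsements (`meta`), and the example numbers are all invented for illustration. For each ordering, a value's adjusted weight is its own weight plus the endorsements from earlier values, each scaled by that earlier value's adjusted weight, floored at zero; the final weight is the average over all orderings.

```python
from itertools import permutations


def adjusted_weights(order, weights, meta):
    """Adjusted weights for one ordering of the values.

    `meta[u][v]` is the endorsement that value u gives to value v
    (missing entries count as zero).
    """
    adjusted = {}
    for v in order:
        # Only values earlier in the ordering are in `adjusted` at this point.
        boost = sum(adjusted[u] * meta.get(u, {}).get(v, 0.0) for u in adjusted)
        adjusted[v] = max(0.0, weights[v] + boost)
    return adjusted


def averaged_weights(weights, meta):
    """Average the adjusted weights over every ordering of the values."""
    orders = list(permutations(weights))
    totals = {v: 0.0 for v in weights}
    for order in orders:
        for v, w in adjusted_weights(order, weights, meta).items():
            totals[v] += w
    return {v: total / len(orders) for v, total in totals.items()}


# Hypothetical example: a meta-value that unendorses the anti-liposuction value.
weights = {"avoid_lipo": 0.5, "enhancements_fine": 1.0}
meta = {"enhancements_fine": {"avoid_lipo": -0.25}}

averaged = averaged_weights(weights, meta)
print(averaged)
```

Enumerating every permutation costs n! passes, so this brute-force version is only usable for a handful of values; a practical version would presumably sample orderings instead.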

3. The "wrong" Θ: meta-values for the resolution process

The Θ of the previous section is sufficient to resolve the values of an H which has no strong feelings on how those values should be resolved.

But many H may find it inadequate, filled with arbitrary choices, doing too much by hand/fiat, or doing too little. So the next step is to let H's values affect how Θ itself works.

Define Θ_0 as the framework constructed in the previous section. And consider the set of all such possible resolution frameworks. We now extend θ so that values can endorse or unendorse not only elements of ℛ and V, but also resolution frameworks in that set.

Then we can define

and define itself as

These formulas make sense, since the various resolution frameworks take values in ℛ, which can be summed. Also, because we can multiply a reward by a positive scalar, there is no need for renormalising or weighting in these summing formulas.

Now, this is not a complete transformation of according to 's values - for example, there is no place for these values to change the computation of , which is computed according to the previously defined for . (Note: Those are where the LaTeX errors used to be, and now there are oversized image files which I can't reduce, sorry!)

But I won't worry about that for the moment, though I'll undoubtedly return to it later. First of all, I very much doubt that many humans have strong intuitions about the correct method for resolving contradictions among the different ways of designing a resolution system for mapping most values and meta-values to a reward. And if someone does have such a meta-value, I'd wager it'll be mostly to benefit a specific object level value or reward, so it's more instructive to look at the object level.

But the real reason I won't dig too much into those issues for the moment, is that the next section demonstrates that there are problems with fully self-referential ways of resolving values. I'd like to understand and solve those before getting too meta on the resolution process.

4 Problems with self-referential Θ

Here I'll look at some of the problems that can occur with fully self-referential Θ and/or v. The presentation will be more informal, since I haven't defined the language or the formalism to allow such formulation yet.

4.1 All-or-nothing values, and personal identity

Some values put a high premium on simplicity, or on defining the whole of the relevant part of Θ. For example, the paper "An impossibility theorem in population axiology..." argues that total utilitarianism is the only theory that avoids a series of counter-intuitive problems.

Now, I've disagreed that these problems are actually problems. But some people's intuitions strongly disagree with me, and feel that total utilitarianism is justified by these arguments. Indeed, I get the impression that, for some people, even a small derogation to total utilitarianism is bad: they strongly prefer 100% total utilitarianism to 99.99% total utilitarianism + 0.01% something else.

This could be encoded as a value {I value having a simple population ethics}. This would provide a bonus based on the overall simplicity of the image of Θ. To do this, we have introduced personal identity (an issue which I've argued is unresolved in terms of reward functions), as well as values about the image of Θ.

Population ethics feels like an abstract high-level concept, but here is a much more down-to-earth version. When the AI looks forwards, it extrapolates the weight of certain values based on the expected weight in the future. What if the AI extrapolates that w(v) will be either 0 or 1 in the future, with equal probability? It then reasonably sets the expected weight to 0.5.

But the human will live in one of those futures. The AI will be maximising their 'true goals' which include that expected weight of 0.5, while H is forced into extreme values of w(v) (0 or 1) which do not correspond to the value the AI is currently maximising. So {I want to agree with the reward that Θ computes} is a reasonable meta-value, that would reward closeness between expected future values and actual future values.
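The mismatch in this example can be checked with a few lines of arithmetic. The sketch below assumes the two extreme future weights are 0 and 1 with equal probability; the variable names and the use of absolute distance as the measure of disagreement are choices made for illustration only.

```python
# (probability, future weight) pairs; the 0/1 extremes are assumed here.
futures = [(0.5, 0.0), (0.5, 1.0)]

# The weight the AI would use now: the expectation over futures.
expected_weight = sum(p * w for p, w in futures)

# Expected distance between the weight the AI uses and the weight
# the human actually ends up holding.
expected_gap = sum(p * abs(w - expected_weight) for p, w in futures)

print(expected_weight, expected_gap)
```

Under these assumptions the human disagrees with the AI's weight by the same amount in every future, which is the gap a meta-value of agreement would want reduced.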

In that case, one thing the AI would be motivated to do is to manipulate H so that they have the 'right' weights in the future. But this might not always be possible. And H might see that as a dubious thing to do.

Note here that this is not a problem of desiring personal moral growth in the future. Assuming that such growth can be defined, the AI can then grant it. The problem would be wanting personal moral growth and wanting the AI to follow the values that emerge from this growth.

4.2 You're not the boss of me!

For self-reference, we don't need Gödel or Russell. There is a much simpler, more natural self-reference paradox lurking here, one that is very common in humans: the urge to not be told what to do.

If the AI computes a reward Θ(w, θ, E), there are many humans who would, on principle, declare and decide that their reward was something other than Θ(w, θ, E). This could be a value {Θ incorrectly computes my values}. I'm not sure how to resolve this problem, or even if it's much of a problem (if the human will disagree equally no matter what, then we may as well ignore that disagreement; and if they disagree to different degrees in different circumstances, this gives something to minimise and trade-off against other values). But I'd like to understand and formalise this better.

5 Conclusion: much more work

I hope this post demonstrates what I am hoping to achieve, and how we might start going about it. Combining this resolution project with the means of extracting human values would then allow the Inverse Reinforcement Learning project to succeed in full generality: we could then have the AI deduce human values from observation, and then follow them. This seems like a potential recipe for a Friendly-ish AI.

30 comments

I very much appreciate the amount of time and effort you're putting into this!

That said, as much as I'd like to engage with this post, it feels very hard for me to do. The main problem I'm having is that there are a lot of very specific details where I feel like I don't have enough context to evaluate the details. By "context", I mean that there are a million different ways by which one could choose to formalize human values, and I assume that you've got some very specific reasons for why you've made the specific formalization choices that you have made. And in order to evaluate whether these are good choices, I'd need to understand your goals in making said choices, but you seem to have only given us the end results of your thought process rather than the original goals of it.

For instance, you note that w(v) can be 0 if a human has carefully considered it and found it to be irrelevant or negative. This sentence jumped out at me somewhat, since I would have intuitively assumed that if the human had evaluated something negative, it would be assigned a negative value rather than a 0; at least I wouldn't have expected values that were evaluated as irrelevant, to be assigned the same score as values that were evaluated as negative!

Reading on, I found that you separately define an endorsement of v, which can be negative - so apparently if we have evaluated things as negative, we can maybe still model that by assigning the thing a positive value and then giving it a negative endorsement value? I'm confused as to why these are split into two different variables. "Endorsement" suggests that it's about meta-values, so that the intent of this separation would be to model things which the human likes but doesn't actually endorse liking. But that doesn't capture the possibility that they e.g. dislike pain, and also endorse disliking pain.

Or maybe, since a value v was supposed to be defined as a statement which a human might agree to, we're supposed to model pain avoidance as a positive claim, "pain is to be avoided", which is then given a positive value? That would make sense, but in that case I'm again unclear on what the endorsement thing is meant to model, since apparently it doesn't take things like "liking" into account at all, but rather acts directly on endorsements?

So I mentally tag this as unclear and try to read on, hoping that this will be clarified later in the article, but instead I seem to run into a lot more specific choices and assumptions, and get the feeling that the article's assuming me to already have understood the previous sections in each new section it introduces... at which point I gave up.

What would make this much more readable for me would be something like, each subsection starting with the philosophical motivation and desiderata for the formalization choices made in that section, then having the content that it has now, and then finally giving some practical examples of what these formalizations imply and what kinds of mathematical objects result as a consequence. (Not necessarily always in that order: some mixing might be in order. E.g. for section 1.1, you have the line "Object level values are those which are non-zero only on rewards"; this seems to suggest that there may be values which refer to other values, separately from having the value also contain an endorsement for its assigned reward...? So you could have a value that assigns a positive value to some reward, a negative endorsement of that reward, and then a separate value which treats the outcome of the first value as a positive reward with some weight, and it also assigns a positive or negative endorsement to the result of that computation...? I'm probably misunderstanding this somehow, which a bunch of examples about object-level and non-object-level values would clear up.)

Knowing at least what's the kind of real-world thing that the formalism is trying to capture, would help a lot when I was trying to evaluate whether I'd interpreted something you said correctly.


Ok, I will rework it for improved clarity; but not all the options I chose have deep philosophical justifications. As I said, I was aiming for an adequate resolution, with people's internal meta-values working as philosophical justifications for their own resolution.

As for the specific case that tripped you up: I wanted to distinguish between endorsing a reward or value, endorsing its negative, and endorsing not having it. "I want to be thin" vs "I want to be fat" vs "I don't want to care about my weight". The first one I track as a positive endorsement of R, the second as a positive endorsement of -R, the third as a negative endorsement of R (and of -R).

But I'll work on it more.


>not all the options I chose have deep philosophical justifications.

Just to be clear, when I said that each section would be served by having a philosophical justification, I don't mean that it would necessarily need to be super-deep; just something like "this seems to make sense because X", which e.g. sections 2.4 and 2.5 already have.

Reason why the LaTeX is breaking: We parse each LaTeX block separately, and sometimes out of order (for performance reasons). This means you can't use "newcommand" in one LaTeX block and expect it to work in future LaTeX blocks. The editor loads all LaTeX simultaneously, so you won't run into this problem in the editor, but we will run into this problem when we try to render the LaTeX for other users.

If you want to make sure your LaTeX works, you want to avoid using "newcommand", or redefine the command at the top of the relevant LaTeX blocks.

...that makes newcommand almost useless (though it's always worked previously for me). And in some cases, it was things like \Theta that was not being rendered!

Hum. Any way of getting round this? Is there a way of editing the whole post as text (since then I can run a substitution on the text, replacing all the commands with their full version)?

But thanks for figuring it out!

Huh, then maybe it's something else. Do you have a post in which it worked fine?

See eg

In this current post, where things stopped working, it seemed that the number of LaTeX formulas was relevant? If I added any more LaTeX boxes, no matter how simple, it would fail?

I really like this in that it's approaching an issue I view as currently neglected within AI safety research: how to determine human values to be learned. Like Kaj, I find it a bit hard to engage with specific issues to give feedback, but I look forward to where this goes since I expect us to eventually need more formal approaches to axiology, even if they are only "adequate".

Some typos:

  • "any reward's self-endorsement" -> "any value's self-endorsement"
  • "in favour of if" -> "in favour of it"
  • "denominated it days" -> "denominated in days"
  • "many possible future" -> "many possible futures"
  • "The more weight it given" -> "The more weight is given"
  • "fomalise" -> "formalise"

Additional typo/request for clarification; is w supposed to be v' ?

Object level values are those which are non-zero only on rewards; ie the v∈V for which θ(v)(w)=0 for all v′∈V

Thanks, now corrected.

Now corrected, thanks.

Thanks! Will correct once I have a decent connection.

More interesting post to me now than it was to past-me :) Thanks from the future. Anyhow, typos for the typo god:

"doing to little"->"doing too little"

Also the second link in "many ways" is broken now, I think it was probably to ?

Thanks! Glad you got good stuff out of it.

I won't edit the post, due to markdown and latex issues, but thanks for pointing out the typos.

Glad to see this work on possible structure for representing human values which can include disagreement between values and structured biases.

I had some half-formed ideas vaguely related to this, which I think map onto an alternative way to resolve self reference.

Rather than just having one level of values that can refer to other values on the same level (which potentially leads to a self-reference cycle), you could instead explicitly represent each level of value, with level 0 values referring to concrete reward functions, level 1 values endorsing or negatively endorsing level 0 values, and generally level n values only endorsing or negatively endorsing level n-1 values. This might mean that you have some kinds of values that end up being duplicated between multiple levels. For any n, there's a unique solution to the level of endorsement for every concrete value. We can then consider the limit as n->infinity as the true level of endorsement. This allows for situations where the limit fails to converge (ie. it alternates between different values at odd and even levels), which seems like a way to handle self reference contradictions (possibly also the all-or-nothing problem if it results from a conflict between meta-levels).

I think this maps into the case where we don't distinguish between value levels if we define a function that just adjusts the endorsement of each value by the values that directly refer to it. Then iterating this function n times gives the equivalent of having an n-level meta-hierarchy.

I think there might be interesting work in mapping this strategy into some simple value problem, and then trying to perform bayesian value learning in that setting with some reasonable prior over values/value endorsements.

Imagine a person, simple John, who doesn't have values, but whose behaviour is controlled by a random generator. For example, John randomly chooses between sleeping, collecting flowers and killing cats. However, neither John nor anyone outside knows that he has no values.

The question is: will the outside observer (human or AI described above in your post) recognise that John has no values, or will it construct some model of John's values to better explain and predict John's behaviour? It would be like seeing patterns in random noise.

My example may be applicable to a lot of actual human behaviour which is not controlled by values, and is either random or automatic like a reflex.

Is your theory able to correctly recognise such random behaviour and avoid producing a model of non-existent values?

Because of the no-free lunch theorem in value learning, you can't say anything at all about the values of an irrational agent - without making assumptions.

In practice, these assumptions are the common ones shared by most humans. So the AI would necessarily project human properties on John.

But John doesn't behave anything like a proper human, so the AI would have difficulty interpreting him, and would probably default to something like "crazy basic human".

Does it mean that the AI has to have a model of the human mind and its typical values before reading John's values?

The AI has to have a model of how humans distinguish preferences from irrationality; see eg

and some of the links within those links.

Thanks for the links. By the way, do we have a definition of "human value" about which we agree?

>do we have a definition of "human value" about which we agree?

Of course not; that would make things far too easy! :-)

Though in , I define human values as preferences (which is a lot clearer), with the distinction between values and more normal preferences being due to a human meta-preference.

Ok, what about preferences? Is it correct to call the preference "a probability distribution of expected human choices"? For example, my preference is 70 percent to take coffee and 30 percent to take tea at breakfast.

>Is it correct to call the preference "a probability distribution of expected human choices"

No, because the assumption of irrationality means that preferences don't match up with choices. Preferences are rankings of possible worlds/rewards/outcomes on an ordinal and cardinal scale. The challenge is to infer these preferences from human behaviour.

If preferences will be equal to choices, then predicting preferences will be predicting future choices, which may be a relatively simple task of extrapolating past behaviour, and it could be computed without assuming the existence of two parts of the human mind: constant preferences and noise.

>If preferences will be equal to choices

Unless you are arguing that humans are fully rational in every decision they ever make, this is not the case.

Yes, but this happens only because of the way we define preferences, imho. We define preferences as the purely rational part, then compare this definition with actual humans, and see that there is also another irrational part.

Example: the same way we could say: every human being is six feet tall, plus or minus some noise variable. This may be a useful way to describe humans, but it has obvious limitations.

What I suggest is to look at why we decided that humans have values or preferences at all. It is an idea which appeared somewhere in 20th century psychology or philosophy, and it is only one of several ways to describe human behaviour.

I want to construct/extract/extrapolate/define human preferences (or make a human reward/utility function), in order to have something we can give AI as a goal. Whether we count this as defining or extrapolating doesn't really matter; it's the result that's important.

One of the things that gives me hope is that actual humans overlap considerably in their judgement of what is rational and irrational. Almost everyone agrees that the anchoring bias is a bias, not a preference; almost everyone agrees that people are less rational when drunk (with the caveat that drunkenness can also suppress certain other irrationalities, like social phobia - but again, that more complicated story is also something that people tend to agree on).

And values, and debates over values, date back at least to tribal times; dehumanising foreigners was based a lot around their strange values and untrustworthiness.

I understand it and I think it is an important project.

I will try to write something in the next couple of months where I will check another approach: is it possible to describe AI-human positive relations without extracting or extrapolating values at all. For now I have some gut feeling that it could be an interesting point of view, but I am not ready to formalize it.

Good luck with that! I'm skeptical of that approach, but it would be lovely if it could be worked out...