It seems that if we can ever define the difference between human beliefs and values, we could program a safe Oracle by requiring it to maximise the accuracy of human beliefs on a question, while keeping human values fixed (or changing very little). Plus a whole load of other constraints, as usual, but that might work for a boxed Oracle answering a single question.

This is a reason to suspect it will not be easy to distinguish human beliefs and values ^_^


It seems that if we can ever define the difference between human beliefs and values, we could program a safe Oracle

Why do you think so? I have some idea that it follows from your other posts on AI safety but I'm not clear how.

Or do you intend this as a post where the reader is asked to figure it out as an exercise?

One of the problems with Oracles is that they will mislead us (eg social engineering, seduction) to achieve their goals. Most ways of addressing this either require strict accuracy in the response (which means the response might be incomprehensible or useless, eg if it's given in terms of atom positions) or some measure of the human reaction to the response (which brings up the risk of social engineering).

If we can program the AI to distinguish between human beliefs and values, we can give it goals like "ensure that human beliefs post answer are more accurate, while their values are (almost?) unchanged". This solves the issue of defining an accurate answer, and removes that part of the risk (it doesn't remove others).
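A minimal sketch of what such a goal could look like, assuming (hypothetically) that a human's beliefs can be summarised as credences on propositions and their values as weights on named dimensions. Everything here is illustrative: `HumanState`, `belief_accuracy`, `value_drift`, `oracle_score` and the `max_drift` threshold are made-up names, and actually producing these two dictionaries from a real human is exactly the unsolved part.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class HumanState:
    """Hypothetical snapshot of a human: credences and value weights."""
    beliefs: Dict[str, float]  # proposition -> credence in [0, 1]
    values: Dict[str, float]   # value dimension -> weight


def belief_accuracy(beliefs: Dict[str, float], truth: Dict[str, bool]) -> float:
    """Negative Brier score over the propositions in `truth` (higher is better)."""
    errors = [(beliefs[q] - (1.0 if truth[q] else 0.0)) ** 2 for q in truth]
    return -sum(errors) / len(errors)


def value_drift(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Largest absolute change in any value weight (missing keys count as 0)."""
    keys = set(before) | set(after)
    return max((abs(after.get(k, 0.0) - before.get(k, 0.0)) for k in keys), default=0.0)


def oracle_score(before: HumanState, after: HumanState,
                 truth: Dict[str, bool], max_drift: float = 0.05) -> float:
    """Gain in belief accuracy, vetoed if values moved more than `max_drift`."""
    if value_drift(before.values, after.values) > max_drift:
        return float("-inf")  # answer shifted the human's values: disallowed
    return belief_accuracy(after.beliefs, truth) - belief_accuracy(before.beliefs, truth)


# Illustrative states only; the numbers are invented.
truth = {"q": True}
before = HumanState(beliefs={"q": 0.5}, values={"honesty": 0.8})
informed = HumanState(beliefs={"q": 0.9}, values={"honesty": 0.8})
manipulated = HumanState(beliefs={"q": 0.99}, values={"honesty": 0.3})

print(oracle_score(before, informed, truth))
print(oracle_score(before, manipulated, truth))
```

The hard veto is one possible design choice; a soft penalty on drift would also fit the "(almost?) unchanged" wording, at the cost of letting the Oracle trade some value change for accuracy.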

How would you avoid the "oracle" just re-iterating (or being heavily biased by) the ideological beliefs of those who programmed it?

I am not sure that I am following what you are saying here.

To use the map / territory distinction: My understanding is that belief refers to the contents of someone's map, while values are properties that they want the territory to have or maximise.

I think your last line was meant to be "beliefs and values" rather than "preferences and values".

And I don't know that it's a question of distinguishing beliefs from values, more of a question of whether values are stable. I personally don't think most individuals have a CEV, and even if many do, there's no reason to suspect that any group has one. This is especially true for the undefined group "humanity", which usually includes some projections of not-yet-existent members.

Corrected, thanks!

Stuart, is it really your implicit axiom that human values are static and fixed?

(Were they fixed historically? Is humankind mature now? Is humankind homogeneous in its values?)

In the space of all possible values, human values have occupied a very small space, with the main change being who gets counted as a moral agent (the consequences of small moral changes can be huge, but the changes themselves don't seem large in an absolute sense).

Or, if you prefer, I think it's possible that changes in AI moral values will range so widely that human values can essentially be seen as static in comparison.

I think we need a better definition of the problem we would like to study here. Beliefs and values are probably not so indistinguishable.

From this page ->

Human values are, for example:

  • civility, respect, consideration;
  • honesty, fairness, loyalty, sharing, solidarity;
  • openness, listening, welcoming, acceptance, recognition, appreciation;
  • brotherhood, friendship, empathy, compassion, love.

  1. I don't think we could call any of them a belief.

  2. If these define the axes of a virtual space of moral values, then I am not sure an AI could occupy a much bigger space than humans do. (How selfish, unwelcoming or dishonest could an AI or a human be?)

  3. On the contrary: because we are selfish (is that a moral value of ours that we are trying to analyze?), we want the AI to be more open, more attentive, more honest, more friendly (etc.) than we ourselves want or plan to be, or at least than we are now. (So do we really want the AI to be like us?)

  4. I see a question about the optimal level of these values. For example, would we like to see an agent who is maximally honest, welcoming and sharing towards anybody? (An AI at your house which welcomes thieves, tells them whatever they ask and shares everything?)

And last but not least: if we have many AI agents, then some kind of selfishness and laziness could help, for example by preventing the creation of a singleton or a fanatical mob of these agents. In the evolution of humankind, selfishness and laziness may have helped human groups to survive. And a lazy paperclip maximizer could save humankind.

We need a good mathematical model of laziness, selfishness, openness, brotherhood, friendship, etc. We have hard philosophical tasks with a deadline. (The singularity is coming, and the "dead" in "deadline" could be very real.)

I would like to add some values which I see as not so static, and which are probably not so much a question of morality:

Privacy and freedom (vs) security and power.

Family, society, tradition.

Individual equality. (disparities of wealth, right to have work, ...)

Intellectual property. (right to own?)

more of a question of whether values are stable.

or a question of whether human values are (objective and) independent of humans (as subjects who could develop)

or a question of whether we are brave enough to ask questions whose answers could change us.

or (for example) a question of whether it is necessarily good for us to ask questions whose answers will give us more freedom.


It would win prediction markets.


Firstly, let me state that I'm being entirely serious here:

This is a reason to suspect it will not be easy to distinguish human beliefs and values ^_^

What makes you think that any such thing as "values" actually exists, ie: that your brain actually implements some function that assigns a real number to world-states?

It's clear that people have ordinal preferences over certain world-states, and that many of these preferences are quite stable from day to day. And people have some ability to trade these off with probabilities, suggesting cardinal preferences as well. It seems correct and useful to refer to this as "values", at least approximately.

On the other hand, it's clear that our brains do not implement some function that assigns a real number to world-states. That's one of the reasons that it's so hard to distinguish human values in the first place.
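For concreteness, here is the standard expected-utility reading of "trading preferences off with probabilities", as an idealised sketch only. The outcomes $A$, $B$, $C$, the probability $p$ and the utility function $u$ are notation introduced for illustration, and the construction assumes a consistency in choices that, as noted above, real brains need not have.

```latex
\[
A \succ B \succ C, \qquad B \sim pA + (1-p)C
\;\Longrightarrow\;
u(B) = p\,u(A) + (1-p)\,u(C).
\]
\[
\text{Normalising } u(A) = 1,\ u(C) = 0 \text{ gives } u(B) = p.
\]
```

That is, if someone is indifferent between getting $B$ for certain and a lottery giving $A$ with probability $p$ and $C$ otherwise, the indifference point $p$ places $B$ on a cardinal scale between $C$ and $A$.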