Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TL;DR: An AI that is learning human values may act unethically or be catastrophically dangerous, because it doesn’t yet understand those values.

The main idea is simple: a young AI which is trying to learn human values (which I will call a “value learner”) will have a “chicken and egg” problem. Such an AI must extract human values, but to do so safely, it should already know these values, or at least have some safety rules for value extraction. This idea has been analyzed before (more on those analyses below); here, I will examine different ways in which value learners may cause trouble.

It may be expected that the process of a young AI learning human values will be akin to a nice conversation or perfect observation, but it could easily take the form of a witch trial if appropriate constraints are not specified. Even a human child between the ages of 2 and 18 can do many stupid things simply because she doesn’t understand all societal rules; however, some experimentation with such rules is necessary to learn them, and a (good) school provides a safe sandbox for such active learning.

List of dangers

There are several possible types of failures of AI value learners:

1) Paperclipping. The value-learning process starts to have a very large impact on the world; for example, the AI builds too many computers for modelling values. A special case is a value-learning AI which never finishes its work; in that case, the AI will have to take over the world and keep trying to find human values until the end of the universe. In any case, we expect that good value learning will take a reasonable amount of time, on the order of days to years, and limited resources, like interaction with just a few people.

2) Human torture. A value learner may conduct unethical experiments on humans in order to extract information on values, perhaps including causing pain, performing brain surgery, and putting humans in unpleasant situations. Mindcrime is a special case: the AI may need to run an enormous number of human simulations, or at least to upload a person, in order to understand his preferences.

3) Human value manipulation. The AI may manipulate human values in order to make them simpler or to gain some type of advantage. This may include wireheading, or pressuring humans into faking approval of the AI’s actions.

4) Wrong value extraction. AI may make an error, perhaps based on wrong assumptions about what and how it should learn, and converge to an incorrect model of human values, which may be either completely wrong, or wrong in subtle ways.

Obvious ideas of what could be done

What could be done to make value learners safer? (obviously not a comprehensive list):

• AI capabilities should be limited (artificial stupidity).

• AI should be equipped with the expected structure of human values and expected model of values (that is, we expect that a normal human doesn’t want to kill babies).

• Good ways to extract human values should be whitelisted.

• Actions with bad consequences in the external world should be blacklisted.

• Corrigibility: the AI may be turned off or corrected.

• “Boxing” of the value learner, so it learns only from previously recorded data. One way to create a safer value learner is by not allowing it to actually manipulate humans, but to train it (at least in the beginning) on some prerecorded data like an ethical dataset or description of a legal system.

• Value learner as an Oracle AI: it may ask and receive only short data points from the outside world, which are needed to choose between different models.
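As a rough illustration of how the whitelist, blacklist and boxing ideas above might combine, here is a minimal sketch of an action gate. All names (`ProposedAction`, `ALLOWED_METHODS`, `FORBIDDEN_EFFECTS`, `gate`) and the example entries are hypothetical. The sketch also assumes that the effects of an action can be predicted and labeled in advance, which is itself a large part of the unsolved problem.

```python
from dataclasses import dataclass, field

# Hypothetical whitelist of value-extraction methods considered safe.
ALLOWED_METHODS = {"read_recorded_text", "ask_short_question"}

# Hypothetical blacklist of predicted effects on the external world.
FORBIDDEN_EFFECTS = {"physical_harm", "value_manipulation", "resource_grab"}


@dataclass
class ProposedAction:
    method: str  # how the learner wants to extract values
    # Predicted effects are assumed to be labeled before the action runs.
    predicted_effects: set = field(default_factory=set)


def gate(action: ProposedAction, boxed: bool = True) -> bool:
    """Return True only if the action passes every check."""
    if action.method not in ALLOWED_METHODS:
        return False  # not a whitelisted extraction method
    if action.predicted_effects & FORBIDDEN_EFFECTS:
        return False  # at least one blacklisted effect is predicted
    if boxed and action.method != "read_recorded_text":
        return False  # a boxed learner may only use prerecorded data
    return True
```

A gate like this only moves the difficulty into the lists and the effect predictor: someone still has to decide what belongs in them, which is the chicken-and-egg problem again.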

This all starts to look like a typical naïve set of ideas about the creation of Friendly AI. Thus, the question arises: is the creation of a safe value learner an FAI-complete task?

In other words, is it possible to create safe value learners without first solving the full alignment problem, including the correct representation of human values? If not, the hope that using AI to learn human values will make AI safety simpler is futile.

What others have written about the subject

Dangerous value learners are a special case of the “safe exploration problem” from Concrete Problems in AI Safety. Amodei et al. suggested several instruments for safe exploration: risk-sensitive performance criteria, use of demonstrations of right trajectories, learning in simulation, and human oversight. Most of them, however, are more suitable for a drone learning not to crash into the ground than for a superintelligence trying to learn human values.

A similar idea was explored in a post by J. Maxwell in which he shows that Seed AI will have difficulty understanding human values. He suggests: “The best approach may be to find an intelligence “sweet spot” that’s sufficient to understand human values without being dangerous. Intuitively, it seems plausible that such a sweet spot exists: Individual humans can learn unfamiliar values, such as those of animals they study, but individual humans aren’t intelligent enough to be dangerous.” He later suggests “ontology autogeneration” as a safer way to create a mind model.

Soares also wrote that AI should learn values through data: “smarter-than-human AI systems that can inductively learn what to value from labeled training data”, and refine its conclusions through questions. However, many ideas, like CIRL, may assume more direct interaction between the AI and humans, like observing actual behaviour or active debate.

The distill-amplify approach assumes gradual refinement of an already semi-accurate model of the human value system. A general understanding of some mammalian values (Sarma) may be a starting point for this process.

In a post entitled “Cake, or death!”, Armstrong described a model of an AI which may be uninterested in refining its model of human values.

Commercial home robots and self-driving cars will soon appear, and they will have some capability to act ethically in the outside world (partly hand-coded, partly trained on existing datasets, as in the case of self-driving cars), and such robots could be used as initial human value models for the training of more advanced AI.

On the other hand, some unrestricted learners which use a purely mathematical model of the human reward function (Sezener) may act unethically in the early stages of learning.

Failed value learners may be the most dangerous type of AI, as smaller AIs will be weaker, and successful learners will be safe (for humans). Value learners, by definition, will have at least human-level capabilities, but will not yet be aligned.

Deeper classification of potentially dangerous value learners

First, we can consider that there are three types of human values (they don’t actually exist, but it is a useful classification):

1) Basic human needs (which could also be called “fundamental human rights”). These are survival, avoiding pain, freedom, housing and healthcare. Basically, this is everything that is protected by criminal law (with some caveats: some criminal laws punish people for things that a nation state wants from them, like drug or treason laws). The list may be incomplete, as there may be unknown basic needs, like “not being replaced with a p-zombie”, protection from x-risks and s-risks, or “not being stuck in eternal boredom”. Some advanced Oracle AI (perhaps consisting of the best humans) may be used to list all possible basic needs. Basic needs form a basis for safety. Something is not safe if it destroys or prevents the fulfilment of a basic human need. In other words, a full listing of basic human needs is almost equivalent to the list of value requirements that an AI will need to respect in order to be safe for humans.

2) Personal human preferences. An example is “I like collecting 17th century coins”. These are well-defined and stable personal preferences, but neglecting them is not an existential catastrophe (for a typical human being).

3) One-time wishes. These are minute-long wishes which disappear immediately after they are fulfilled, for example, “I want coffee”.

Obviously, these types of values correspond to different types of value learners, as different learners will learn different types of values:

a) The first type is the zero-knowledge, first-day value learner, which doesn’t have any idea about the outside world or humans but tries to learn about them. This type is the most dangerous: without any constraints, a powerful AI may act most unethically.

b) The second is the personal preference learner, which already knows basic human needs. It is generally safe but will have to learn my personal preferences. This is like a robot which I bring home from a shop: it has a built-in model of basic human needs and could learn my personal values through some pleasant conversation after unpacking (as in the movie “Her”).

c) The third type should guess exactly what I mean based on a single verbal command. The nature of wishes is that a person actually knows that he has a wish: he feels the desire but may have difficulty articulating it correctly. (But the same person may not feel his basic needs until he loses something important.) The third type of learner already knows basic human needs and my personal preferences, which provides a context for guessing what I meant; it could check its understanding by asking “Did I understand correctly that you want coffee in bed?”, perhaps illustrating with images how it would pour coffee in bed (into a cup or onto the sheets).
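The correspondence between the three types of values and the three types of learners can be written down as a small lookup table. This is only a sketch of the classification above; the identifiers (`ValueLayer`, `PRIOR_KNOWLEDGE`, `still_to_learn`, `knows_safety_basis`) are hypothetical names chosen for illustration.

```python
from enum import Enum


class ValueLayer(Enum):
    BASIC_NEEDS = 1
    PERSONAL_PREFERENCES = 2
    ONE_TIME_WISHES = 3


# What each learner type is assumed to know before it starts learning.
PRIOR_KNOWLEDGE = {
    "zero_knowledge": set(),
    "preference_learner": {ValueLayer.BASIC_NEEDS},
    "wish_interpreter": {ValueLayer.BASIC_NEEDS, ValueLayer.PERSONAL_PREFERENCES},
}


def still_to_learn(learner: str) -> set:
    """Value layers the learner has not been given in advance."""
    return set(ValueLayer) - PRIOR_KNOWLEDGE[learner]


def knows_safety_basis(learner: str) -> bool:
    """Basic needs are treated above as the basis for safety."""
    return ValueLayer.BASIC_NEEDS in PRIOR_KNOWLEDGE[learner]
```

Under this encoding, only the zero-knowledge learner starts without the safety basis, which matches the claim that it is the most dangerous type.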

I present here a taxonomy of potentially dangerous and safe value learners, and what could go wrong if an AI of a certain type tries to learn some type of human preferences:


4 comments

I think that given good value learning, safety isn't that difficult. I think even a fairly half-hearted attempt at the sort of naive safety measures discussed will probably lead to non-catastrophic outcomes.

Tell it about mindcrime from the start. Give it lots of hard disks, and tell it to store anything that might possibly resemble a human mind. It only needs to work well enough with a bunch of MIRI people guiding it and answering its questions. Post-singularity, a superintelligence can check whether there are any human minds in the simulations it created when young and dumb. If there are, welcome those minds to the utopia.

As I said in another comment: Learning human values from, say, fixed texts is a good start, but it doesn't solve the "chicken or the egg problem": we start by running a non-aligned AI which is learning human values, but we want the first AI to be already aligned. One possible obstacle: a non-aligned AI could run away before it has finished learning human values from the texts.

I like the main point; I hadn't considered it before with value learning.  Trying to ask myself why I haven't been worried about this sort of failure mode before, I get the following:

It seems all of the harms to humans the value-learner causes come from some direct or indirect interaction with humans, so instead I want to imagine a pre-training step that learns as much as possible about human values from existing sources (internet, books, movies, etc.) without interacting with humans.

Then, as a second step, this value-learner is allowed to interact with humans in order to continue its learning.

So your problem in this scenario comes down to: Is the pre-training step insufficient to cause the value-learner to avoid causing harms in the second step?

I'm not certain, but here it at least seems more reasonable that it might not.  In particular, if the value-learner were sufficiently uncertain about things like harms (given the baseline access to human values from the pre-training step) it might be able to safely continue learning about human values.

Right now I think I'd be 80/20 that a pre-training step that learned human values without interacting with humans from existing media would be sufficient to prevent significant harms during the second stage of value learning.

(This doesn't rule out other superintelligence risks / etc, and is just a statement about the risks incurred during value learning that you list)

Learning human values from, say, fixed texts is a good start, but it doesn't solve the "chicken or the egg problem": we start by running a non-aligned AI which is learning human values, but we want the first AI to be already aligned.

One possible obstacle: a non-aligned AI could run away before it has finished learning human values from the texts.

The chicken-and-egg problem could presumably be solved by some iteration-and-distillation approach. First we give a very rough model of human values (or rules) to a limited AI, and later we increase its power and its access to real humans. But this suffers from all the difficulties of iteration and distillation, like unexpected jumps in capabilities.