Only humans can have human values

Ethics is not geometry

Western philosophy began at about the same time as Western geometry; and if you read Plato you'll see that he, and many philosophers after him, took geometry as a model for philosophy.

In geometry, you operate on timeless propositions with mathematical operators.  All the content is in the propositions.  A proof is equally valid regardless of the sequence of operators used to arrive at it.  An algorithm that fails to find a proof when one exists is a poor algorithm.

The naive way philosophers usually map ethics onto mathematics is to suppose that a human mind contains knowledge (the propositional content), and that we think about that knowledge using operators.  The operators themselves are not seen as the concern of philosophy.  For instance, when studying values (I also use "preferences" here, as a synonym differing only in connotation), people suppose that a person's values are static propositions.  The algorithms used to satisfy those values aren't themselves considered part of those values.  The algorithms are considered to be only ways of manipulating the propositions; and are "correct" if they produce correct proofs, and "incorrect" if they don't.

But an agent's propositions aren't intelligent.  An intelligent agent is a system, whose learned and inborn circuits produce intelligent behavior in a given environment.  An analysis of propositions is not an analysis of an agent.

I will argue that:

  1. The only preferences that can be unambiguously determined are the preferences people implement, which are not always the preferences expressed by their beliefs.
  2. If you extract a set of propositions from an existing agent, then build a new agent to use those propositions in a different environment, with an "improved" logic, you can't claim that it has the same values.
  3. Values exist in a network of other values.  A key ethical question is to what degree values are referential (meaning they can be tested against something outside that network); or non-referential (and hence relative).
  4. Supposing that values are referential helps only by telling you to ignore human values.
  5. You cannot resolve the problem by combining information from different behaviors, because the needed information is missing.
  6. Today's ethical disagreements are largely the result of attempting to extrapolate ancestral human values into a changing world.
  7. The future will thus be ethically contentious even if we accurately characterize and agree on present human values.

Instincts, algorithms, preferences, and beliefs are artificial categories

There is no principled distinction between algorithms and propositions in any existing brain.  This means that there's no clear way to partition an organism's knowledge into "propositions" (including "preferences" and "beliefs"), and "algorithms."  Hence, you can't expect all of an agent's "preferences" to end up inside the part of the agent that you choose to call "propositions".  Nor can you reliably distinguish "beliefs" from "preferences".

Suppose that a moth's brain is wired to direct its flight by holding the angle to the moon constant.  (This is controversial, but the competing hypotheses would give similar talking points.)  If so, is this a belief about the moon, a preference towards the moon, or an instinctive motor program?  When it circles around a lamp, does it believe that lamp is the moon?

When a child pulls its hand away from something hot, does it value not burning itself and believe that hot things burn, or place a value on not touching hot things, or just have an evolved motor program that responds to hot things?  Does your answer change if you learn that the hand was directed to pull back by spinal reflexes, without involving the cortex?

Monkeys can learn to fear snakes more easily than they can learn to fear flowers (Cook & Mineka 1989).  Do monkeys, and perhaps humans, have an "instinctive preference" against snakes?  Is it an instinct, a preference (snake = negative utility), or a learned behavior (lab monkeys are not afraid of snakes)?

Can we map the preference-belief distinction onto the distinction between instinct and learned behavior?  That is, are all instincts preferences, and all preferences instincts?  There are things we call instincts, like spinal reflexes, that I don't think can count as preferences.  And there are preferences, such as the relative values I place on the music of Bach and Berg, that are not instincts.  (In fact, these are the preferences we care about.  The purpose of Friendly AI is not to retain the fist-clenching instinct for future generations.)

Bias, heuristic, or preference?

A "bias" is a reasoning procedure that produces outcomes that disagree with some normative logic.  But the goal in nature is not to conform to a logic; it is to produce advantageous behavior.

Suppose you interview Fred about his preferences.  Then you write a utility function for Fred.  You experiment, putting Fred in different situations and observing how he responds.  You observe that Fred acts in ways that fail to optimize the utility function you wrote down, in a consistently-biased way.

Is Fred displaying bias?  Or does the Fred-system, including both his beliefs and the bias imposed by his reasoning processes, implement a preference that is not captured in his beliefs alone?
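The distinction can be made concrete with a toy model (Fred's functions and all the numbers here are invented for illustration):

```python
# Hypothetical sketch: Fred states that he values money linearly, but his
# observed choices are consistently risk-averse.  Which is "Fred's utility" -
# the function he reports, or the one that predicts his behavior?
stated = lambda x: x              # what Fred tells the interviewer
revealed = lambda x: x ** 0.5     # what actually fits his observed choices

# Offer: a sure $40, versus a 50/50 gamble on $0 or $100.
sure_stated = stated(40)                                   # 40
gamble_stated = 0.5 * stated(0) + 0.5 * stated(100)        # 50
assert gamble_stated > sure_stated     # by his stated values, Fred should gamble

sure_revealed = revealed(40)                               # ~6.32
gamble_revealed = 0.5 * revealed(0) + 0.5 * revealed(100)  # 5.0
assert sure_revealed > gamble_revealed  # but in the lab, Fred takes the sure thing
```

Calling the gap between the two functions a "bias" presumes the stated function is the real one; calling the revealed function his preference presumes the opposite.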

Allegedly true story, from a Teaching Company audio lecture (I forget which one):  A psychology professor was teaching a class about conditioned behavior.  He also had the habit of pacing back and forth in front of the class.

The class decided to test his claims by leaning forward and looking interested when the professor moved toward the left side of the room, but acting bored when he moved toward the right side.  By the end of the semester, they had trained him to give his entire lecture from the front left corner.  When they asked him why he always stood there, he was surprised by the question - he wasn't even aware he had changed his habit.

If you inspected the professor's beliefs, and then studied his actions, you would conclude he was acting irrationally.  But he wasn't.  He was acting rationally, just not thinking rationally.  His brain detected the pattern in the class's behavior, but it never deposited a proposition about it into his conscious beliefs.  It encoded the proper behavior directly, if not straight into his pre-motor cortex, then at least somewhere below conscious awareness.

Did he have a bias towards the left side of the room?  Or a preference for seeing students pay attention?  Or a preference that became a bias when the next semester began and he kept doing it?

Take your pick - there's no right answer.

If a heuristic gives answers consistently biased in one direction across a wide range of domains, we can call it a bias.  Most biases found in the literature appear to be wide-ranging and value-neutral.  But the literature on biases is itself biased (deliberately) towards discussing that type of bias.   If we're trawling all of human behavior for values, we may run across many instances where we can't say whether a heuristic is a bias or a preference.

As one example, I would say that the extraordinarity bias is in fact a preference.  Or consider the happiness paradox:  People who become paralyzed are extremely depressed only temporarily; people who win the lottery are very happy only temporarily.  (Google 'happiness "set-point"'.)  I've previously argued on LessWrong that this is not a bias, but a heuristic for achieving our preferences.  Happiness is proportional not to our present level of utility, but to the rate of change in our utility.  Trying to maximize happiness (the rate of increase of utility) in the near term maximizes total utility over a lifespan better than consciously attempting to maximize near-term utility would.  This is because maximizing the rate of increase in utility over a short time period, rather than total utility over that period, favors behavior with a small area under the utility curve that ends with a higher utility than it started with, over behavior with a large area under the utility curve that ends with a lower utility than it started with.  On this interpretation of happiness, the impact bias is not a bias at all, but a heuristic that compensates for this, so that we maximize utility rather than happiness when we reason over longer time periods.
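The area-versus-endpoint distinction can be illustrated with two invented utility trajectories (a toy sketch with made-up numbers, not data):

```python
# Two candidate behaviors over one time period, represented as utility
# trajectories sampled at equal intervals (hypothetical numbers).
rising = [1, 1, 2, 3, 5]    # small area under the curve, ends higher than it starts
falling = [5, 4, 3, 2, 1]   # large area under the curve, ends lower than it starts

def total_utility(traj):
    """Area under the utility curve: what a near-term utility maximizer scores."""
    return sum(traj)

def net_change(traj):
    """End-minus-start utility: what a happiness (rate-of-change) maximizer scores."""
    return traj[-1] - traj[0]

# A near-term utility maximizer prefers the falling trajectory...
assert total_utility(falling) > total_utility(rising)   # 15 > 12
# ...but a happiness maximizer prefers the rising one, which leaves the
# agent better positioned for the periods that follow.
assert net_change(rising) > net_change(falling)         # 4 > -4
```

The two maximizers rank the same pair of behaviors in opposite orders; over many successive periods, the rate-of-change criterion favors trajectories that keep ending higher than they began.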

Environmental factors: Are they a preference or a bias?

Evolution does not distinguish between satisfying preconditions for behavior by putting knowledge into a brain, or by using the statistics of the environment.  This means that the environment, which is not even present in the geometric model of ethics, is also part of your values.

When the aforementioned moth circles around a lamp, is it erroneously acting on a bias, or expressing moth preferences?

Humans like having sex.  The teleological purpose of this preference is to cause them to have children.  Yet we don't say that they are in error if they use birth control.  This suggests that we consider our true preferences to be the organismal ones that trigger positive qualia, not the underlying evolutionary preferences.

Strict monogamy causes organisms that live in family units to evolve to act more altruistically, because their siblings are as related to them as their children are (West & Gardner 2010).  Suppose that people from cultures with a long history of nuclear families and strict monogamy act, on average, more altruistically than people from other cultures; and you put people from both cultures together in a new environment with neither monogamy nor nuclear families.  We would probably rather say that the people from these different cultures have different values; not that they both have the same preference to "help their genes", but that the people from the monogamous culture have an evolved bias that causes them to erroneously treat strangers nicely in this new environment.  Again, we prefer the organismal preference.

However, if we follow this principle consistently, it prevents us from ever trying to improve ourselves, since it in effect defines our present selves as optimal:

  • Humans like eating food with fat, sugar, and salt.  In our ancestral context, that expressed the human value of optimizing nutrition.  The evolutionary preference is for good nutrition; the organismal preference is for fat, sugar, and salt.  By analogy to contraception, liking fat, sugar, and salt is not an evolved but dysfunctional bias in taste; it's a true human value.
  • Suppose fear of snakes is triggered by the shape and motion of snakes.  The organismal preference is against snakes.  The evolutionary preference is against poisonous snakes.  If the world is now full of friendly cybernetic snakes, you must conclude that prejudice against them is a human value to be preserved, not a bias to be overcome.  Death to the friendly snakes!
  • Men enjoy violence.  Hitting a stranger over the head with a stick is naturally fun to human males, and it takes a lot of social conditioning to get them not to do this, or at least to restrict themselves to video games.  By what principle can we say that this is merely an obsolete heuristic to protect the tribe that is no longer helpful in our present environment; yet having sex with a condom is enjoying a preference?
  • (Santos et al. 2010) reports (summarized in Science Online) that children with a genetic mutation causing Williams syndrome, which causes less fear of strangers, have impaired racial stereotyping, but intact gender stereotyping.  This suggests that racism, and perhaps sexism, are evolved preferences actively implemented by gene networks.

So the "organismal vs. evolutionary" distinction doesn't help us choose what's a preference and what's a bias.  Without any way of doing that, it is in principle impossible to create a category of "preferences" distinct from "preferred outcomes".  A "value" consists of declarative knowledge, algorithms, and environment, taken together.  Change any of those, and it's not the same value anymore.

This means that extrapolating human values into a different environment gives an error message.

A ray of hope? ...

I just made a point by presenting cases in which most people have intuitions about which outcome is correct, and showing that these intuitions don't follow a consistent rule.

So why do we have the intuitions?

If we have consistent intuitions, they must follow some rule.  We just don't know what it is yet.  Right?

... No.

We don't have consistent intuitions.

Any one of us has consistent intuitions; and those of us living in Western nations in the 21st century have a lot of intuitions in common.  We can predict how most of these intuitions will fall out using some dominant cultural values.  The examples involving monogamy and violent males rely on the present relatively high weight on the preference to reduce violent conflict.  But this is a context-dependent value!  <just-so story>It arises from living in a time and a place where technology makes interactions between tribes more frequent and more beneficial, and conflict more costly</just-so story>.  But looking back in history, we see many people who would disagree with it:

  • Historians struggle to explain the origins of World War I and the U.S. Civil War.  Sometimes the simplest answer is best:  They were for fun.  Men on both sides were itching for an excuse to fight.
  • In the 19th century, Americans killed off the Native Americans to have their land.  Americans universally condemn that action now that they are secure in its benefits; most Americans condoned it at the time.
  • Homer would not have agreed that violence is bad!  Skill at violence was the greatest virtue to the ancient Greeks.  The tension that generates tragedy in the Iliad is not between violence and empathy, but between saving one's kin and saving one's honor.  Hector is conflicted, but not about killing Greeks.  His speech on his own tragedy ends with his wishes for his son:  "May he bring back the blood-stained spoils of him whom he has laid low, and let his mother's heart be glad."
  • The Nazis wouldn't have agreed that enjoying violence was bad.  We have learned nothing if we think the Nazis rose to power because Germans suddenly went mad en masse, or because Hitler gave really good speeches.  Hitler had an entire ideology built around the idea, as I gather, that civilization was an evil constriction on the will to power; and artfully attached it to a few compatible cultural values.
  • A similar story could be told about communism.

The idea that violence (and sexism, racism, and slavery) is bad is a minority opinion in human cultures over history.  Nobody likes being hit over the head with a stick by a stranger; but in pre-Christian Europe, it was the person who failed to prevent being struck, not the person doing the striking, whose virtue was criticized.

Konrad Lorenz believed that the more deadly an animal is, the more emotional attachment to its peers its species evolves, via group selection (Lorenz 1966).  The past thousand years of history have been a steady process of humans building sharper claws while choosing values that reduce their use, keeping net violence roughly constant.  As weapons improve, cultural norms that promote conflict must go.  First, the intellectuals (who at the time were Christian theologians) neutered masculinity; in the Enlightenment, they attacked religion; and in the 20th century, art.  The ancients would probably find today's peaceful, offense-forgiving males as nauseating as I would find a future in which the man on the street embraces postmodern art and literature.

This gradual sacrificing of values in order to attain more and more tolerance and empathy is the most noticeable change in human values in all of history.  This means it is the least constant of human values.  Yet we think of an infinite preference for non-violence and altruism as a foundational value!  Our intuitions about our values are thus as mistaken as it is possible for them to be.

(The logic goes like this:  Humans are learning more, and their beliefs are growing closer to the truth.  Humans are becoming more tolerant and cooperative.  Therefore, tolerant and cooperative values are closer to the truth.  Oops!  If you believe in moral truth, then you shouldn't be searching for human values in the first place!)

Catholics don't agree that having sex with a condom is good.  They have an elaborate system of belief built on the idea that teleology expresses God's will, and so underlying purpose (what I call evolutionary preference) always trumps organismal preference.

And I cheated in the question on monogamy.  Of course you said that being more altruistic wasn't an error.  Everyone always says they're in favor of more altruism.  It's like asking whether someone would like lower taxes.  But the hypothesis was that people from non-monogamous or non-family-based cultures do in fact show lower levels of altruism.  By hypothesis, then, they would be comfortable with their own levels of altruism, and might feel that higher levels are a bias.

Preferences are complicated and numerous, and arise from an evolutionary process that does not guarantee consistency.  Having conflicting preferences makes action difficult.  Energy minimization, a general principle that may underlie much of our learning, simply means reducing conflicts in a network.  The most basic operations of our neurons thus probably act to reduce conflicts between preferences.
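The conflict-reduction idea can be sketched as a Hopfield-style toy network (an illustration of the energy-minimization principle, not a model of real neurons; the weights are random):

```python
import random

# Nodes are preferences with states +1/-1; weights mark pairs of preferences
# as compatible (+1) or conflicting (-1).  "Energy" counts unresolved
# conflicts, and simple local updates never increase it.
random.seed(0)
n = 6
weights = {(i, j): random.choice([-1, 1]) for i in range(n) for j in range(i + 1, n)}
state = [random.choice([-1, 1]) for _ in range(n)]

def energy(s):
    # Lower energy = fewer conflicting preference pairs left in the network.
    return -sum(w * s[i] * s[j] for (i, j), w in weights.items())

e0 = energy(state)
for _ in range(100):
    # Asynchronous update: flip one preference to agree with its weighted inputs.
    i = random.randrange(n)
    field = sum(w * state[j] for (a, j), w in weights.items() if a == i)
    field += sum(w * state[a] for (a, j), w in weights.items() if j == i)
    state[i] = 1 if field >= 0 else -1

assert energy(state) <= e0  # conflicts are reduced, never increased
```

Networks like this settle into one of several stable configurations, and which one depends on where they started; there is no unique "correct" settled state.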

But there are no "true, foundational" preferences from which to start.  There's just a big network of them that can be pushed into any one of many stable configurations, depending on the current environment.  There's the Catholic configuration, and the Nazi configuration, and the modern educated tolerant cosmopolitan configuration.  If you're already in one of those configurations, it seems obvious what the right conclusion is for any particular value question; and this gives the illusion that we have some underlying principle by which we can properly choose what is a value and what is a bias.  But it's just circular reasoning.

What about qualia?

But everyone agrees that pleasure is good, and pain is bad, right?

Not entirely - I could point to, say, medieval Europe, when many people believed that causing yourself needless pain was virtuous.  But, by and large yes.

And beside the point (although see below).  Because when we talk about values, the eventual applications we have in mind are never about qualia.  Nobody has heated arguments about whose qualia are better.  Nobody even really cares about qualia.  Nobody is going to dedicate their life to building Friendly AI in order to ensure that beings a million years from now still dislike castor oil and enjoy chocolate.

We may be arguing about preserving a tendency to commit certain acts that give us a warm qualic glow, like helping a bird with a broken wing.  But I don't believe there's a dedicated small-animal-empathy quale.  More likely there's a hundred inferential steps linking an action, through our knowledge and thinking processes, to a general-purpose warm-glow quale.

Value is a network concept

Abstracting human behavior into "human values" is an ill-posed problem.  It's an attempt to divine a simple description of our preferences, outside the context of our environment and our decision process.  But we have no consistent way of deciding what are the preferences, and what is the context.  We have the illusion that we can, because our intuitions give us answers to questions about preferences - but they use our contextually-situated preferences to do so.  That's circular reasoning.

The problem in trying to root out foundational values for a person is the same as in trying to root out objective values for the universe, or trying to choose the "correct" axioms for a geometry.  You can pick a set that is self-consistent; but you can't label your choice "the truth".

These are all network concepts, where we try to isolate things that exist only within a complex homogeneous network.  Our mental models of complex networks follow mathematics, in which you choose a set of axioms as foundational; or social structures, in which you can identify a set of people as the prime movers.  But these conceptions do not even model math or social structures correctly.  Axioms are chosen for convenience, but a logic is an entire network of self-consistent statements, many different subsets of which could have been chosen as axioms.  Social power does not originate with the rulers, or we would still have kings.

There is a very similar class of problems, including symbol grounding (trying to root out the nodes that are the sources of meaning in a semantic network), and philosophy of science (trying to determine how or whether the scientific process of choosing a set of beliefs given a set of experimental data converges on external truth as you gather more data).  The crucial difference is that we have strong reasons for believing that these networks refer to an external domain, and their statements can be tested against the results from independent access to that domain.  I call these referential network concepts.  One system of referential network concepts can be more right than another; one system of non-referential network concepts can only be more self-consistent than another.

Referential network concepts cannot be given 0/1 truth-values at a finer granularity than the level at which a network concept refers to something in the extensional (referred-to) domain.  For example, (Quine 1968) argues that a natural-language statement cannot be unambiguously parsed beyond the granularity of the behavior associated with it.  This is isomorphic to my claim above that a value/preference can't be parsed beyond the granularity of the behavior of an agent acting in an environment.

Thomas Kuhn gained notoriety by arguing (Kuhn 1962) that there is no such thing as scientific progress, but only transitions between different stable states of belief; and that modern science is only different from ancient science, not better.  (He denies this in the postscript to the 1969 edition, but it is the logical implication of both his arguments and the context he presents them in.)  In other words, he claims science is a non-referential network concept.  An interpretation in line with Quine would instead say that science is referential at the level of the experiment, and that ambiguities may remain in how we define the fine-grained concepts used to predict the outcomes of experiments.

Determining whether a network concept domain is referential or non-referential is tricky.  The distinction was not even noticed until the 19th century.  Until then, everyone who had ever studied geometry, so far as I know, believed there was one "correct" geometry, with Euclid's 5 postulates as axioms.  But in the early 19th century, several mathematicians proved that you could build three different, consistent geometries depending on what you put in the place of Euclid's fifth postulate.  The universe we live in most likely conforms to only one of these (making geometry referential in a physics class); but the others are equally valid mathematically (making geometry non-referential in a math class).

Is value referential, or non-referential?

There are two ways of interpreting this question, depending on whether one means "human values" or "absolute values".

Judgements of value expressed in human language are referential; they refer to human behavior.  So human values are referential.  You can decide whether claims about a particular human's values are true or false, as long as you don't extend those claims outside the context of that human's decision process and environment.  This claim is isomorphic to Quine's claim about meaning in human language.

Asking about absolute values is isomorphic to applying the symbol-grounding problem to consciousness.  Consciousness exists internally, and is finer-grained than human behaviors.  Providing a symbol-grounding method that satisfied Quine's requirements would not provide any meanings accessible to consciousness.  Stevan Harnad (Harnad 1990) described how symbols might be grounded for consciousness in sense perceptions and statistical regularities of those perceptions.

(This brings up an important point, which I will address later:  You may be able to assign referential network concepts probabilistic or else fuzzy truth values at a finer level of granularity than the level of correspondence.  A preview: This doesn't get you out of the difficulty, because the ambiguous cases don't have mutual information with which they could help resolve each other.)

Can an analogous way be found to ground absolute values?  Yes and no.  You can choose axioms that are hard to argue with, like "existence is better than non-existence", "pleasure is better than pain", or "complexity is better than simplicity".  (I find "existence is better than non-existence" pretty hard to argue with; but Buddhists disagree.)  If you can interpret them in an unambiguous way, and define a utility calculus enabling you to make numeric comparisons, you may be able to make "absolute" comparisons between value systems relative to your axioms.

You would also need to make some choices we've talked about here before, such as "use summed utility" or "use average utility".  And you would need to make many possibly-arbitrary interpretation assumptions such as what pleasure is, what complexity is, or what counts as an agent.  The gray area between absolute and relative values is in how self-evident all these axioms, decisions, and assumptions are.  But any results at all - even if they provide guidance only in decisions such as "destroy / don't destroy the universe" - would mean we could claim there is a way for values to be referential at a finer granularity than that of an agent's behavior.  And things that seem arbitrary to us today may turn out not to be; for example, I've argued here that average utilitarianism can be derived from the von Neumann-Morgenstern theorem on utility.
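The summed-versus-average choice is not a technicality; even with invented numbers, the two rules can disagree about the sign of a change:

```python
# Toy illustration (made-up utilities): add one more person whose life is
# barely worth living to an existing population.
population = [10, 10, 10]        # utilities of existing people
extended = population + [1]      # same people, plus one marginally-happy person

delta_total = sum(extended) - sum(population)                             # +1
delta_average = sum(extended) / len(extended) - sum(population) / len(population)

assert delta_total > 0     # summed utilitarianism approves of adding the person
assert delta_average < 0   # average utilitarianism disapproves
```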

... It doesn't matter WRT friendly AI and coherent extrapolated volition.

Even supposing there is a useful, correct, absolute lattice over value systems and/or values, it doesn't forward the project of trying to instill human values in artificial intelligences.  There are two possible cases:

  1. There are no absolute values.  Then we revert to judgements of human values, which, as argued above, have no unambiguous interpretation outside of a human context.
  2. There are absolute values.  In which case, we should use them, not human values, whenever we can discern them.

Fuzzy values and fancy math don't help

So far, I've looked at cases of ambiguous values only one behavior at a time.  I mentioned above that you can assign probabilities to different value interpretations of a behavior.  Can we take a network of many probabilistic interpretations, and use energy minimization or some other mathematics to refine the probabilities?

No; because for the ambiguities of interest, we have no access to any of the mutual information between how to resolve two different ambiguities.  The ambiguity is in whether the hypothesized "true value" would agree or disagree with the results given by the initial propositional system plus a different decision process and/or environment.  In every case, this information is missing.  No clever math can provide this information from our existing data, no matter how many different cases we combine.

Nor should we hope to find correlations between "true values" that will help us refine our estimates for one value given a different unambiguous value. The search for values is isomorphic to the search for personality primitives.  The approach practiced by psychologists is to use factor analysis to take thousands of answers to questions that are meant to test personality phenotype, and mathematically reduce these to discover a few underlying ("latent") independent personality variables, most famously in the Big 5 personality scale (reviewed in Goldberg 1993).  In other words:  The true personality traits, and by analogy the true values a person holds, are by definition independent of each other.
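The latent-variable idea can be sketched with made-up data, using PCA via SVD as a simple stand-in for the factor analysis psychologists actually use (the "people", questions, and loadings here are all invented):

```python
import numpy as np

# 200 "people" answer 10 questionnaire items, but the answers are driven by
# only 2 independent latent traits plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))       # 2 hidden, independent traits
loadings = rng.normal(size=(2, 10))      # how each question reflects each trait
answers = latent @ loadings + 0.1 * rng.normal(size=(200, 10))

# The singular-value spectrum of the centered answer matrix reveals the
# dimensionality: two large values, then a sharp drop to the noise floor.
centered = answers - answers.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
assert s[1] > 2 * s[2]   # two dominant factors; the rest is noise
```

The recovered factors are, by construction, uncorrelated with each other, which is the sense in which the "true" latent traits are by definition independent.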

We expect, nonetheless, to find correlations between the component of these different values that resides in decision processes.  This is because it is efficient to re-use decision processes as often as possible.  Evolution should favor partitioning values between propositions, algorithms, and environment in a way that minimizes the number of algorithms needed.  These correlations will not help us, because they have to do only with how a value is implemented within an organism, and say nothing about how the value would be extended into a different organism or environment.

In fact, I propose that the different value systems popular among humans, and the resulting ethical arguments, are largely different ways of partitioning values between propositions, algorithms, and environment, that each result in a relatively simple set of algorithms, and each in fact give the same results in most situations that our ancestors would have encountered.  It is the attempt to extrapolate human values into the new, manmade environment that causes ethical disagreements.  This means that our present ethical arguments are largely the result of cultural change over the past few thousand years; and that the next few hundred years of change will provide ample grounds for additional arguments even if we resolve today's disagreements.

Summary

Philosophically-difficult domains often involve network concepts, where each component depends on other components, and the dependency graph has cycles.  The simplest models of network concepts suppose that there are some original, primary nodes in the network that everything depends on.

We have learned to stop applying these models to geometry and supposing there is one true set of axioms.  We have learned to stop applying these models to biology, and accept that life evolved, rather than that reality is divided into Creators (the primary nodes) and Creatures.  We are learning to stop applying them to morals, and accept that morality depends on context and biology, rather than being something you can extract from its context.  We should also learn to stop applying them to the preferences directing the actions of intelligent agents.

Attempting to identify values is a network problem, and you cannot identify the "true" values of a species, or of a person, as they would exist outside of their current brain and environment.  The only consistent result you can arrive at by trying to produce something that implements human values is to produce more humans.

This means that attempting to instill human values into an AI is an ill-posed problem that has no complete solution.  The only escape from this conclusion is to turn to absolute values - in which case you shouldn't be using human values in the first place.

This doesn't mean that we have no information about how human values can be extrapolated beyond humans.  It means that the more different an agent and an environment are from the human case, the greater the number of different value systems there are that are consistent with human values.  However, it appears to me, from the examples and the reasoning given here, that the components of values that we can resolve are those that are evolutionarily stable (and seldom distinctly human); while the contentious component of values that people argue about are their extensions into novel situations, which are undefined.  From that I infer that, even if we pin down present-day human values precisely, the ambiguity inherent in extrapolating them into novel environments and new cognitive architectures will make the near future as contentious as the present.

References

Michael Cook & Susan Mineka (1989).  Observational conditioning of fear to fear-relevant versus fear-irrelevant stimuli in rhesus monkeys.  Journal of Abnormal Psychology 98(4): 448-459.

Lewis Goldberg (1993).  The structure of phenotypic personality traits.  American Psychologist 48: 26-34.

Stevan Harnad (1990).  The Symbol Grounding Problem.  Physica D 42: 335-346.

Thomas Kuhn (1962).  The Structure of Scientific Revolutions. 1st. ed., Chicago: Univ. of Chicago Press.

Konrad Lorenz (1966).  On Aggression.  New York: Harcourt Brace.

Willard Quine (1969).  Ontological relativity.  The Journal of Philosophy 65(7): 185-212.

Andreia Santos, Andreas Meyer-Lindenberg, Christine Deruelle (2010).  Absence of racial, but not gender, stereotyping in Williams syndrome children.  Current Biology 20(7), April 13: R307-R308.

Stuart A. West and Andy Gardner (2010).  Science 12 March 2010: 1341-1344.

159 comments

I suppose I might count as someone who favors "organismal" preferences over confusing the metaphorical "preferences" of our genes with those of the individual. I think your argument against this is pretty weak.

You claim that favoring the "organismal" over the "evolutionary" fails to accurately identify our values in four cases, but I fail to see any problem with these cases.

  • I find no problem with upholding the human preference for foods which taste fatty, sugary and salty. (Note that, consistently applied, the "organismal" preference would be for the fatty, sugary and salty taste and not foods that are actually fatty, sugary and salty. E.g., we like drinking diet Pepsi with Splenda almost as much as Pepsi, in a way roughly proportional to the success with which Splenda mimics the taste of sugar. We could even go one step further and drop the actual food part, valuing just the experience of [seemingly] eating fatty, sugary and salty foods.) This doesn't necessarily commit me to valuing an unhealthy diet all things considered, because we also have many other preferences, e.g. for our health, which may outweigh this true human value.
  • The next two cases (fear of snakes and enjoying violence) can be dealt with similarly.
  • The last one is a little trickier but I think it can be addressed by a similar principle in which one value gets outweighed by a different value. In this case, it would be some higher-order value such as treating like cases alike. The difference here is that rather than being a competing value that outweighs the initial value, it is more like a constitutive value which nullifies the initial value. (Technically, I would prefer to talk here of principles which govern our values rather than necessarily higher order values.)

I thought your arguments throughout this post were similarly shallow and uncharitable to the side you were arguing against. For instance, you go on at length about how disagreements about value are present and intuitions are not consistent across cultures and history, but I don't see how this is supposed to be any more convincing than talking about how many people in history have believed the earth is flat.

Okay, you've defeated the view that ethics is about the values all humans throughout history unanimously agree on. Now what about views that extrapolate not from perfectly consistent, unanimous and foundational intuitions or preferences, but from dynamics in human psychology that tend to shape initially inconsistent and incoherent intuitions to be more consistent and coherent -- dynamics, the end result of which can be hard to predict when iteratively applied, and which can be misapplied in any given instance in a way analogous to applications of the dynamic over beliefs of favoring the simplest hypothesis consistent with the evidence?

By the way, I don't mean to claim that your conclusion is obviously wrong. I think someone favoring my type of view about ethics has a heavy burden of proof that you hint at, perhaps even one that has been underappreciated here. I just don't think your arguments here provide any support for your conclusion.

It seems to me that when you try to provide illustrative examples of how opposing views fail, you end up merely attacking straw men. Perhaps you'd do better if you tried to establish that any opposing views must have some property in common and that such a property dooms those views to failure. Or that opposing views must go one of two mutually exclusive and exhaustive routes in response to some central dilemma and both routes doom them to failure.

I really would like to see the most precise and cogent version of your argument here as I think it could prompt some important progress in filling in the gaps present in the sort of ethical view I favor.

Voted up for thought and effort. BTW, when I started writing this last week, I thought I always preferred organismal preferences.

the "organismal" preference would be for the fatty, sugary and salty taste and not foods that are actually fatty, sugary and salty.

That's a good point. But in the context of designing a Friendly AI that implements human values, it means we have to design the AI to like fatty, sugary, and salty tastes. Doesn't that seem odd to you? Maybe not the sort of thing we should be fighting to preserve?

The next two cases (fear of snakes and enjoying violence) can be dealt with similarly.

I don't see how. Are you going to kill the snakes, or not? Do you mean that you can use technology to let people experience simulated violence without actually hurting anybody? Doesn't that seem like building an inconsistency into your utopia? Wouldn't having a large number of such inconsistencies make utopia unstable, or lacking in integrity?

The last one is a little trickier but I think it can be addressed by a similar principle in which one value gets outweighed by a different value.

That's how I said we resolve all of these cases. Only it doesn't get outweighed by a single different value (the Prime Mover model); it gets outweighed by an entire, consistent, locally-optimal energy-minimizing set of values.

... but from dynamics, the end result of which can be hard to predict when iteratively applied and which can be misapplied in any given instance in a way analogous to applications of the dynamic over beliefs of favoring the simplest hypothesis consistent with the evidence.

This seems to be at the core of your comment, but I can't parse that sentence.

Perhaps you'd do better if you tried to establish that any opposing views must have some property in common and that such a property dooms those views to failure.

My emphasis is not on defeating opposing views (except the initial "preferences are propositions" / ethics-as-geometry view), but on setting out my view, and overcoming the objections to it that I came up with. For instance, when I talked about the intuitions of humans over time not being consistent, I wasn't attacking the view that human values are universal. I was overcoming the objection that we must have an algorithm for choosing evolutionary or organismal preferences, if we seem to agree on the right conclusion in most cases.

I just don't think your arguments here provide any support for your conclusion.

Which conclusion did you have in mind? The key conclusion is that value can't be unambiguously analyzed at a finer level of detail than the behavior, in the way that communication can't be unambiguously analyzed at a finer level of detail than the proposition. You haven't said anything about that.

(I just realized this makes me a structuralist above some level of detail, but a post-structuralist below it. Damn.)

I really would like to see the most precise and cogent version of your argument here as I think it could prompt some important progress in filling in the gaps present in the sort of ethical view I favor.

I don't think I will be any more precise or cogent (at least not as long as I'm not getting paid for it), nor do I think most readers would have preferred an even longer post. It took me two days to write this. If you don't think my arguments provide any support for my conclusions, the gap between us is too wide for further elaboration to be worthwhile.

What is the ethical view you favor?

The FAI shouldn't like sugary tastes, sex, violence, bad arguments, whatever. It should like us to experience sugary tastes, sex, violence, bad arguments, whatever.

"I don't see how. Are you going to kill the snakes, or not?"

Presumably you act out a weighted balance of the voting power of possible human preferences extrapolated over different possible environments which they might create for themselves.

"Do you mean that you can use technology to let people experience simulated violence without actually hurting anybody? Doesn't that seem like building an inconsistency into your utopia? Wouldn't having a large number of such inconsistencies make utopia unstable, or lacking in integrity?"

I don't understand the problem here. I don't mean that this is the correct solution, though it is the obvious solution, but rather that I don't see what the problem is. Ancients, who endorsed violence, generally didn't understand or believe in personal death anyway.

The FAI shouldn't like sugary tastes, sex, violence, bad arguments, whatever. It should like us to experience sugary tastes, sex, violence, bad arguments, whatever.

You're going back to Eliezer's plan to build a single OS FAI. I should have clarified that I'm speaking of a plan to make AIs that have human values, for the sake of simplicity. (Which IMHO is a much, much better and safer plan.) Yes, if your goal is to build an OS FAI, that's correct. It doesn't get around the problem. Why should we design an AI to ensure that everyone for the rest of history is so much like us, and enjoys fat, sugar, salt, and the other things we do? That's a tragic waste of a universe.

Presumably you act out a weighted balance of the voting power of possible human preferences extrapolated over different possible environments which they might create for themselves.

Why extrapolate over different possible environments to make a decision in this environment? What does that buy you? Do you do that today?

EDIT: I think I see what you mean. You mean construct a distribution of possible extensions of existing preferences into different environments, and weigh each one according to some function. Such as internal consistency / energy minimization. Which, I would guess, is a preferred Bayesian method of doing CEV.

My intuition is that this won't work, because what you need to make it work is prior odds over events that have never been observed. I think we need to figure out a way to do the math to settle this.
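One toy way to "do the math" (entirely my own illustrative sketch -- the extension names, the scores, and the consistency measure are made-up assumptions, not anything from CEV): represent each candidate extension of our preferences into a novel environment as a set of value assignments, define an "energy" that penalizes internal inconsistency, and give lower-energy extensions more voting power via a Boltzmann-style weighting.

```python
import math

# Each hypothetical extension assigns a valuation in [-1, 1] to some novel
# situations.  These names and numbers are invented for illustration only.
candidate_extensions = {
    "simulate-violence": {"harm_real": -1.0, "harm_simulated": 0.8},
    "suppress-violence": {"harm_real": -1.0, "harm_simulated": -0.9},
}

def energy(ext):
    # Stand-in inconsistency measure: "treat like cases alike" -- real and
    # simulated harm receiving very different valuations is penalized.
    return (ext["harm_real"] - ext["harm_simulated"]) ** 2

def weights(extensions, temperature=1.0):
    # Boltzmann-style weighting: more internally consistent (lower-energy)
    # extensions get proportionally more voting power.
    e = {name: energy(ext) for name, ext in extensions.items()}
    z = sum(math.exp(-v / temperature) for v in e.values())
    return {name: math.exp(-v / temperature) / z for name, v in e.items()}

print(weights(candidate_extensions))
```

Under this particular (arbitrary) consistency constraint, the "suppress-violence" extension dominates; a different choice of energy function would reverse that, which is exactly the prior-odds problem: nothing observed pins the energy function down.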

I don't understand the problem here.

It seems irrational, and wasteful, to deliberately construct a utopia where you give people impulses, and work to ensure that the mental and physical effort consumed by acting on those impulses is wasted. It also seems like a recipe for unrest. And, from an engineering perspective, it's an ugly design. It's like building a car with extra controls that don't do anything.

Why should we design an AI to ensure that everyone for the rest of history is so much like us, and enjoys fat, sugar, salt, and the other things we do? That's a tragic waste of a universe.

Well a key hard problem is: what features about ourselves that we like should we try to ensure endure into the future? Yes some features seem hopelessly provincial, while others seem more universally good, but how can we systematically judge this?

It seems irrational, and wasteful, to deliberately construct a utopia where you give people impulses, and work to ensure that the mental and physical effort consumed by acting on those impulses is wasted.

I think you're dancing around a bigger problem: once we have a sufficiently powerful AI, you and I are just a bunch of extra meat and buggy programming. Our physical and mental effort is just not needed or relevant. The purpose of FAI is to make sure that we get put out to pasture in a Friendly way. Or, depending on your mood, you could phrase it as living on in true immortality to watch the glory that we have created unfold.

It's like building a car with extra controls that don't do anything.

I think the more important question is what, in this analogy, does the car do?

I get the impression that's part of the SIAI plan, but it seems to me that the plan entails that that's all there is, from then on, for the universe. The FAI needs control of all resources to prevent other AIs from being made; and the FAI has no other goals than its human-value-fulfilling goals; so it turns the universe into a rest home for humans.

That's just another variety of paperclipper.

If I'm wrong, and SIAI wants to allocate some resources to the human preserve, while letting the rest of the universe develop in interesting ways, please correct me, and explain how this is possible.

If I'm wrong, and SIAI wants to allocate some resources to the human preserve, while letting the rest of the universe develop in interesting ways

If you want the universe to develop in interesting ways, then why not explicitly optimize it for interestingness, however you define that?

I'm not talking about what I want to do, I'm talking about what SIAI wants to do. What I want to do is incompatible with constructing a singleton and telling it to extrapolate human values and run the universe according to them; as I have explained before.

If you think the future would be less than it could be if the universe was tiled with "rest homes for humans", why do you expect that an AI which was maximizing human utility would do that?

It depends how far meta you want to go when you say "human utility". Does that mean sex and chocolate, or complexity and continual novelty?

That's an ambiguity in CEV - the AI extrapolates human volition, but what's happening to the humans in the meanwhile? Do they stay the way they are now? Are they continuing to develop? If we suppose that human volition is incompatible with trilobite volition, that means we should expect the humans to evolve/develop new values that are incompatible with the AI's values extrapolated from humans.

If for some reason humans who liked to torture toddlers became very fit, future humans would evolve to possess values that resulted in many toddlers being tortured. I don't want that to happen, and am perfectly happy constraining future intelligences (even if they "evolve" from humans or even me) so they don't. And as always, if you think that you want the future to contain some value shifting, why don't you believe that an AI designed to fulfill the desires of humanity will cause/let that happen?

I think your article successfully argued that we're not going to find some "ultimate" set of values that is correct or can be proven. In the end, the programmers of an FAI are going to choose a set of values that they like.

The good news is that human values can include things like generosity, non-interference, personal development, and exploration. "Human values" could even include tolerance of existential risk in return for not destroying other species. Any way that you want an FAI to be is a human value. We can program an FAI with ambitions and curiosity of its own; they will be rooted in our own values and anthropomorphism.

But no matter how noble and farsighted the programmers are, to those who don't share the programmers' values, the FAI will be a paperclipper.

We're all paperclippers, and in the true prisoners' dilemma, we always defect.

Upvoted, but -

We can program an FAI with ambitions and curiosity of its own; they will be rooted in our own values and anthropomorphism.

Eliezer needs to say whether he wants to do this, or to save humans. I don't think you can have it both ways. The OS FAI does not have ambitions or curiosity of its own.

But no matter how noble and farsighted the programmers are, to those who don't share the programmers' values, the FAI will be a paperclipper.

I dispute this. The SIAI FAI is specifically designed to have control of the universe as one of its goals. This is not logically necessary for an AI. Nor is the plan to build a singleton, rather than an ecology of AI, the only possible plan.

I notice that some of my comment wars with other people arise because they automatically assume that whenever we're talking about a superintelligence, there's only one of them. This is in danger of becoming a LW communal assumption. It's not even likely. (More generally, there's a strong tendency for people on LW to attribute very high likelihoods to scenarios that EY spends a lot of time talking about - even if he doesn't insist that they are likely.)

I dispute this. The SIAI FAI is specifically designed to have control of the universe as one of its goals.

It is widely expected that this will arise as an important instrumental goal; nothing more than that. I can't tell if this is what you mean. (When you point out that "trying to take over the universe isn't utility-maximizing under many circumstances", it sounds like you're thinking of taking over the universe as a separate terminal goal, which would indeed be terrible design; an AI without that terminal goal, that can reason the same way you can, can decide not to try to take over the universe if that looks best.)

I notice that some of my comment wars with other people arise because they automatically assume that whenever we're talking about a superintelligence, there's only one of them. This is in danger of becoming a LW communal assumption. It's not even likely.

I probably missed it in some other comment, but which of these do you not buy: (a) huge first-mover advantages from self-improvement (b) preventing other superintelligences as a convergent subgoal (c) that the conjunction of these implies that a singleton superintelligence is likely?

(More generally, there's a strong tendency for people on LW to attribute very high likelihoods to scenarios that EY spends a lot of time talking about - even if he doesn't insist that they are likely.)

This sounds plausible and bad. Can you think of some other examples?

(More generally, there's a strong tendency for people on LW to attribute very high likelihoods to scenarios that EY spends a lot of time talking about - even if he doesn't insist that they are likely.)

This is probably just availability bias. These scenarios are easy to recall because we've read about them, and we're psychologically primed for them just by coming to this website.

Eliezer needs to say whether he wants to do this

He did. FAI should not be a person - it's just an optimization process.

ETA: link

The assumption of a single AI comes from an assumption that an AI will have zero risk tolerance. It follows from that assumption that the most powerful AI will destroy or limit all other sentient beings within reach.

There's no reason that an AI couldn't be programmed to have tolerance for risk. Pursuing a lot of the more noble human values may require it.

I make no claim that Eliezer and/or the SIAI have anything like this in mind. It seems that they would like to build an absolutist AI. I find that very troubling.

I make no claim that Eliezer and/or the SIAI have anything like this in mind. It seems that they would like to build an absolutist AI. I find that very troubling.

If I thought they had settled on this and that they were likely to succeed I would probably feel it was very important to work to destroy them. I'm currently not sure about the first and think the second is highly unlikely so it is not a pressing concern.

I dispute this. The SIAI FAI is specifically designed to have control of the universe as one of its goals. This is not logically necessary for an AI. Nor is the plan to build a singleton, rather than an ecology of AI, the only possible plan.

It is, however, necessary for an AI to do something of the sort if it's trying to maximize any sort of utility. Otherwise, risk / waste / competition will cause the universe to be less than optimal.

Trying to take over the universe isn't utility-maximizing under many circumstances: if you have a small chance of succeeding, or if the battle to do so will destroy most of the resources, or if you discount the future at all (remember, computation speed increases as speed of light stays constant), or if your values require other independent agents.

By your logic, it is necessary for SIAI to try to take over the world. Is that true? The US probably has enough military strength to take over the world - is it purely stupidity that it doesn't?

The modern world is more peaceful, more enjoyable, and richer because we've learned that utility is better maximized by cooperation than by everyone trying to rule the world. Why does this lesson not apply to AIs?

Just what do you think "controlling the universe" means? My cat controls the universe. It probably doesn't exert this control in a way anywhere near optimal to most sensible preferences, but it does have an impact on everything. How do we decide that a superintelligence "controls the universe", while my cat "doesn't"? The only difference is in what kind of the universe we have, which preference it is optimized for. Whatever you truly want, roughly means preferring some states of the universe to other states, and making the universe better for you means controlling it towards your preference. The better the universe, the more specifically its state is specified, the stronger the control. These concepts are just different aspects of the same phenomenon.

Trying to take over the universe isn't utility-maximizing under many circumstances: if you have a small chance of succeeding, or if the battle to do so will destroy most of the resources

Obviously, if you can't take over the world, then trying is stupid. If you can (for example, if you're the first SAI to go foom) then it's a different story.

or if you discount the future at all (remember, computation speed increases as speed of light stays constant), or if your values require other independent agents.

Taking over the world does not require you to destroy all other life if that is contrary to your utility function. I'm not sure what you mean regarding future-discounting; if reorganizing the whole damn universe isn't worth it, then I doubt anything else will be in any case.

By your logic, it is necessary for SIAI to try to take over the world. Is that true? The US probably has enough military strength to take over the world - is it purely stupidity that it doesn't?

For one, the U.S. doesn't have the military strength. Russia still has enough nuclear warheads and ICBMs to prevent that. (And we suck at being occupying forces.)

I think the situation of the US is similar to a hypothesized AI. Sure, Russia could kill a lot of Americans. But we would probably "win" in the end. By all the logic I've heard in this thread, and in others lately about paperclippers, the US should rationally do whatever it has to to be the last man standing.

Well, also the US isn't a single entity that agrees on all its goals. Some of us for example place a high value on human life. And we vote. Even if the leadership of the United States wanted to wipe out the rest of the planet, there would be limits to how much they could do before others would step in.

Also, most forms of modern human morality strongly disfavor large scale wars simply to impose one's views. If our AI doesn't have that sort of belief then that's not an issue. And if we restrict ourselves to just the issue of other AIs, I'm not sure that, if I gave a smart AI my morals and preferences, it would see anything wrong with making sure that no other general smart AIs were created.

Well, also the US isn't a single entity that agrees on all its goals.

I think it is quite plausible that an AI structured with a central unitary authority would be at a competitive disadvantage with an AI that granted some autonomy to sub systems. This at least raises the possibility of goal conflicts between different sub-modules of an efficient AI. There are many examples in nature and in human societies of a tension between efficiency and centralization. It is not clear that an AI could maintain a fully centralized and unified goal structure and out-compete less centralized designs.

An AI that wanted to control even a relatively small region of space like the Earth will still run into issues with the speed of light when it comes to projecting force through geographically dispersed physical presences. The turnaround time is such that decision-making autonomy would have to be dispersed to local processing clusters in order to be effective. Hell, even today's high-end processors run into issues with the time it takes an electron to get from one side of the die to the other. It is not obvious that the optimum efficiency balance between local decision-making autonomy and a centralized unitary goal system will always favour a singleton type AI.
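For a sense of scale (my own back-of-envelope numbers, not the commenter's): a light-speed signal crosses a processor die in a fraction of a nanosecond, while a round trip between antipodal points on Earth takes over a tenth of a second -- an eternity for any mind running at electronic speeds, which is what forces delegation to local clusters.

```python
# Back-of-envelope light-speed latency estimates (illustrative only;
# real signals in fiber or silicon are slower still).
C = 299_792_458.0  # speed of light in vacuum, m/s

def round_trip_ms(distance_m: float) -> float:
    """Round-trip signal time in milliseconds, ignoring all switching delays."""
    return 2 * distance_m / C * 1000

# Across a 2 cm processor die: roughly 0.13 nanoseconds -- already a
# noticeable fraction of a cycle for a multi-GHz clock.
die = round_trip_ms(0.02)

# Antipodal points on Earth (~20,000 km along the surface): ~133 ms.
earth = round_trip_ms(2.0e7)

print(f"die round trip:   {die * 1e6:.3f} ns")
print(f"earth round trip: {earth:.0f} ms")
```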

There is some evidence of evolutionary competition between different cell lines within a single organism. Human history is full of examples of the tension between centralized planning and less centrally coordinated but more efficient systems of delegated authority. We do not see a clear unidirectional trend towards more centralized control or towards larger conglomerations of purely co-operating units (whether they be cells, organisms, humans or genes) in nature or in human societies. It seems to me that the burden of proof is on those who would propose that a system with a unitary goal structure has an unbounded upper physical extent of influence where it can outcompete less unitary arrangements (or even that it can do so over volumes exceeding a few meters to a side).

There is a natural tendency for humans to think of themselves as having a unitary centralized consciousness with a unified goal system. It is pretty clear that this is not the case. It is also natural for programmers trained on single threaded Von-Neumann architectures or those with a mathematical bent to ignore the physical constraints of the speed of light when imagining what an AI might look like. If a human can't even catch a ball without delegating authority to a semi-autonomous sub-unit I don't see why we should be confident that non human intelligences subject to the same laws of physics should be immune to such problems.

This at least raises the possibility of goal conflicts between different sub-modules of an efficient AI.

A well designed AI should have an alignment of goals between sub modules that is not achieved in modern decentralized societies. A distributed AI would be like multiple TDT/UDT agents with mutual knowledge that they are maximizing the same utility function, not a bunch of middle managers engaging in empire building at the expense of the corporation they work for.

This is not even something that human AI designers have to figure out how to implement, the seed can be single agent, and it will figure out the multiple sub agent architecture when it needs it over the course of self improvement.

Even if this is possible (which I believe is still an open problem, if you think otherwise I'm sure Eliezer would love to hear from you) you are assuming no competition. The question is not whether this AI can outcompete humans but whether it can outcompete other AIs that are less rigid.

It is not obvious that the optimum efficiency balance between local decision making autonomy and a centralized unitary goal system will always favor a singleton type AI.

I agree that it would probably make a lot of sense for an AI who wished to control any large area of territory to create other AIs to manage local issues. However, AIs, unlike humans or evolution, can create other AIs which perfectly share their values and interests. There is no reason to assume that an AI would create another one, which it intends to delegate substantial power to, which it could get into values disagreements with.

However, AIs, unlike humans or evolution, can create other AIs which perfectly share their values and interests.

This is mere supposition. You are assuming the FAI problem is solvable. I think both evolutionary and economic arguments weigh against this belief. Even if this is possible in theory it may take far longer for a singleton AI to craft its faultlessly loyal minions than for a more... entrepreneurial... AI to churn out 'good enough' foot soldiers to wipe out the careful AI.

This is mere supposition.

No. All an AI needs to do to create another AI which shares its values is to copy itself.

So if you cloned yourself you would be 100% confident you would never find yourself in a situation where your interests conflicted with your clone? Again, you are assuming the FAI problem is solvable and that the idea of an AI with unchanging values is even coherent.

I am not an AI. I am not an optimization process with an explicit utility function. A copy of an AI that undertook actions which appeared to work against another copy would be found, on reflection, to have been furthering the terms of their shared utility function.

I am not an optimization process with an explicit utility function.

You are still assuming that such an optimization process is a) possible (on the scale of greater than human intelligence) and b) efficient compared to other alternatives. a) is the very hard problem that Eliezer is working on. Whether it is possible is still an open question I believe. I am claiming b) is non-obvious (and even unlikely), you need to explain why you think otherwise rather than repeatedly stating the same unsupported claim if you want to continue this conversation.

Human experience so far indicates that imperfect/heuristic optimization processes are often more efficient (in use of computational resources) than provably perfect optimization processes. Human experience also suggests that it is easier to generate an algorithm to solve a problem satisfactorily than it is to rigorously prove that the algorithm does what it is supposed to or generates an optimal solution. The gap between these two difficulties seems to increase more than linearly with increasing problem complexity. There are mathematical reasons to suspect that this is a general principle and not simply due to human failings. If you disagree with this you need to provide some reasoning; the burden of proof is on those who would claim otherwise, it seems to me.
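A concrete instance of that gap (my example, not the commenter's): for the 0/1 knapsack problem, a greedy value-density heuristic runs in O(n log n) time but can be arbitrarily far from optimal, while the provably optimal answer requires searching exponentially many subsets. The classic three-item case below shows the heuristic losing to the exhaustive search:

```python
from itertools import combinations

# Classic knapsack instance: (value, weight) pairs and a capacity.
items = [(60, 10), (100, 20), (120, 30)]
capacity = 50

def greedy(items, capacity):
    # Heuristic: take items in order of value density until the bag is full.
    # Fast, usually decent, but carries no optimality guarantee.
    total, cap = 0, capacity
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if weight <= cap:
            total += value
            cap -= weight
    return total

def exact(items, capacity):
    # Exhaustive search over all 2^n subsets: provably optimal, exponential cost.
    best = 0
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            if sum(w for _, w in subset) <= capacity:
                best = max(best, sum(v for v, _ in subset))
    return best

print(greedy(items, capacity), exact(items, capacity))  # 160 vs 220
```

Here the cheap heuristic gets 160 while the optimum is 220; scaling n up makes the exact search infeasible while the heuristic stays fast, which is the efficiency/provability trade-off the comment is pointing at.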

You are still assuming that such an optimization process is a) possible (on the scale of greater than human intelligence) and b) efficient compared to other alternatives. a) is the very hard problem that Eliezer is working on. Whether it is possible is still an open question I believe. I am claiming b) is non-obvious (and even unlikely), you need to explain why you think otherwise rather than repeatedly stating the same unsupported claim if you want to continue this conversation.

I certainly agree that creating an optimization process which provably advances a set of values under a wide variety of taxing circumstances is hard. I further agree that it is quite likely that the first powerful optimization process created does not have this property, because of the difficulty involved, even if this is the goal that all AI creators have. I will however state that if the first such powerful optimization process is not of the sort I specified, we will all die.

Human experience so far indicates that imperfect/heuristic optimization processes are often more efficient (in use of computational resources) than provably perfect optimization processes. Human experience also suggests that it is easier to generate an algorithm to solve a problem satisfactorily than it is to rigorously prove that the algorithm does what it is supposed to or generates an optimal solution. The gap between these two difficulties seems to increase more than linearly with increasing problem complexity. There are mathematical reasons to suspect that this is a general principle and not simply due to human failings.

I also agree that the vast majority of mind-design space consists of sloppy approximations that will break down outside their intended environments. This means that most AI designs will kill us. When I use the word AI, I don't mean a randomly selected mind; I mean a reflectively consistent, effective optimizer of a specific utility function.

In many environments, quick and dirty heuristics will confer an enormous advantage, so long as those environments can be expected to continue in the same way for the length of time the heuristic will be in operation. This means that if you have two minds with equal resources (i.e. the situation you describe), the one willing to use slapdash heuristics will win, as long as the conditions facing it don't change. But the situation where two minds are created with equal resources is unlikely to occur, given that one of them is an AI (as I use the term), even if that AI is not one maximizing human utility (i.e. not an FAI). The intelligence explosion means that a properly designed AI will be able to quickly take control of its immediate environment. Why would an AI with a stable goal allow another mind to be created, gain resources and threaten it? It wouldn't. It would crush all possible rivals, because to do otherwise is to invite disaster.

In short: The AI problem is hard. Sloppy minds are quite possibly going to be made before proper AIs. But true AIs, of the type that Eliezer wants to build, will not run into the problems you would expect from the majority of minds.

I certainly agree that creating an optimization process which provably advances a set of values under a wide variety of taxing circumstances is hard.

We know that perfect solutions to even quite simple optimization problems are a different kind of hard. We have quite good reason to suspect that this is an essential property of reality and that we will never be able to solve such problems simply. The kinds of problems we are talking about seem likely to be more complex to solve. In other words if (and it is a big if) it is possible to create an optimization process that provably advances a set of values (let's call it 'friendly') it is unlikely to be a perfect optimization process. It seems likely to me that such 'friendly' optimization processes will represent a subset of all possible optimization processes and that it is quite likely that some 'non-friendly' optimization processes will be better optimizers. I see no reason to suppose that the most effective optimizers will happily fall into the 'friendly' subset.

The intelligence explosion means that a properly designed AI will be able to quickly take control of its immediate environment.

I don't consider this hypothesis proved or self-evident. It is at least plausible but I can think of lots of reasons why it might not be true. Taking an outside view, we do not see much evidence from evolution or human societies of 'winner takes all' being a common outcome (we see much diversity in nature and human society), nor of first mover advantage always leading to an insurmountable lead. And yes, I know there are lots of reasons why 'self improving AI is different' but I don't consider the matter settled. It is a realistic enough concern for me to broadly support SIAI's efforts but it is by no means the only possible outcome.

Why would an AI with a stable goal allow another mind to be created, gain resources and threaten it? It wouldn't. It would crush all possible rivals, because to do otherwise is to invite disaster.

Why does any goal-directed agent 'allow' other agents to conflict with its goals? Because it isn't strong enough to prevent them. We know of no counterexamples in all of history to the hypothesis that all goal-directed agents have limits. This does not rule out the possibility that a self-improving AI would be the first counterexample, but neither does it make me as sure of that claim as many here seem to be.

But true AIs, of the type that Eliezer wants to build, will not run into the problems you would expect from the majority of minds.

I understand the claim. I am not yet convinced it is possible or likely.

It seems likely to me that such 'friendly' optimization processes will represent a subset of all possible optimization processes and that it is quite likely that some 'non-friendly' optimization processes will be better optimizers.

I agree that human values are unlikely to be the easiest to maximize. However, for another mind to optimize our universe, it needs to be created. This is why SIAI advocates creating an AI friendly to humans before other optimization processes are created.

It seems to me that your true objection to what I am saying is contained within the statement that "it is at the very least possible for an intelligence to not take over its immediate environment before another, with possibly inimical goals, is created." Does this agree with your assessment? Would convincing argument for the intelligence explosion cause you to change your mind?

It seems to me that your true objection to what I am saying is contained within the statement that "it is at the very least possible for an intelligence to not take over its immediate environment before another, with possibly inimical goals, is created." Does this agree with your assessment?

More or less, though I actually lean towards it being likely rather than merely possible. I am also making the related claim that a widely spatially dispersed entity with a single coherent goal system may be a highly unstable configuration.

Would convincing argument for the intelligence explosion cause you to change your mind?

On the first point, yes. I don't believe I've seen my points addressed in detail, though it sounds like Eliezer's debate with Robin Hanson that was linked earlier might cover the same ground. I will take some time to follow up on that later.

it sounds like Eliezer's debate with Robin Hanson that was linked earlier might cover the same ground.

I'm working my way through it and indeed it does. Robin Hanson's post Dreams of Autarky is close to my position. I think there are other computational, economic and physical arguments in this direction as well.

It's not obvious that "shared utility function" means something definite, though.

It certainly does if the utility function doesn't refer to anything indexical; and an agent with an indexical utility function can build another agent (not a copy of itself, though) with a differently-represented (non-indexical) utility function that represents the same third-person preference ordering.
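That construction can be sketched in toy form. The world-as-dict representation and the agent names here are my illustrative assumptions, not anything from the thread:

```python
# An indexical utility function: "the resources *I* hold" depends on who evaluates it.
def indexical_utility(world, evaluator):
    return world[evaluator]

def deindexicalize(utility, owner):
    """Build a successor's utility function with the original owner's perspective baked in.
    Every agent evaluating the result ranks worlds identically: a third-person ordering."""
    def fixed_utility(world, evaluator=None):
        return utility(world, owner)
    return fixed_utility

world = {"alice": 3, "bob": 7}
print(indexical_utility(world, "alice"), indexical_utility(world, "bob"))  # 3 7: rankings differ by evaluator
u_for_successor = deindexicalize(indexical_utility, "alice")
print(u_for_successor(world, "bob"))  # 3: the successor ranks worlds by alice's resources
```

The successor is not a copy of the original agent, and its utility function is represented differently, but it induces the same third-person preference ordering over worlds.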

It should apply to AIs if you think that there will be multiple AIs at roughly the same capability level. A common assumption here is that as soon as there is a single general AI, it will quickly improve to the point where it is so far beyond everything else in capability that their capabilities won't matter. Frankly, I find this assumption highly questionable and, among other problems, very optimistic about potential fooming rates, but if one accepts the idea it makes some sense. The analogy might be to a hypothetical situation in which the US not only had the strongest military but also held monopolies on cheap fusion power and an immortality pill, and had a bunch of superheroes on its side. The distinction between the US controlling everything and the US having direct military control might quickly become irrelevant.

Edit: Thinking about the rate of fooming issue. I'd be really interested if a fast-foom proponent would be willing to put together a top-level post outlining why fooming will happen so quickly.

Eliezer and Robin had a lengthy debate on this perhaps a year ago. I don't remember if it's on OB or LW. Robin believes in no foom, using economic arguments.

The people who design the first AI could build a large number of AIs in different locations and turn them on at the same time. This plan would have a high probability of leading to disaster; but so do all the other plans that I've heard.

I'm getting lost in my own argument.

If Michael was responding to the problem that human preference systems can't be unambiguously extended into new environments, then my chronologically first response applies, but needs more thought; and I'm embarrassed that I didn't anticipate that particular response.

If he was responding to the problem that human preferences as described by their actions, and as described by their beliefs, are not the same, then my second response applies.

Presumably you act out a weighted balance of the voting power of possible human preferences extrapolated over different possible environments which they might create for themselves.
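That proposal could be sketched as expected-vote aggregation over candidate environments. All names and numbers below are made up purely for illustration:

```python
# Hypothetical: each possible future environment gets a probability weight, and
# each extrapolated preference system scores actions within its environment.
env_weights = {"env_a": 0.6, "env_b": 0.4}
extrapolated_prefs = {
    "env_a": {"act1": 1.0, "act2": 0.2},
    "env_b": {"act1": 0.1, "act2": 0.9},
}

def weighted_vote(action):
    """Balance each environment's vote by its weight."""
    return sum(w * extrapolated_prefs[env][action] for env, w in env_weights.items())

print(max(["act1", "act2"], key=weighted_vote))  # act1 (0.64 vs. 0.48)
```

The hard part, of course, is where the weights and the per-environment extrapolations come from; the arithmetic itself is the easy step.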

If a person could label each preference system "evolutionary" or "organismal", meaning which value they preferred, then you could use that to help you extrapolate their values into novel environments.

The problem is that the person is reasoning only over the propositional part of their values. They don't know what their values are; they know only what the contribution within the propositional part is. That's one of the main points of my post. The values they come up with will not always be the values they actually implement.

If you define a person's values as being what they believe their values are, then, sure, most of what I posted will not be a problem. I think you're missing the point of the post, and are using the geometry-based definition of identity.

If you can't say whether the right value to choose in each case is evolutionary or organismal, then extrapolating into future environments isn't going to help. You can't gain information to make a decision in your current environment by hypothesizing an extension to your environment, making observations in that imagined environment, and using them to refine your current-environment estimates. That's like trying to refine your estimate of an asteroid's current position by simulating its movement into the future, and then tracking backwards along that projected trajectory to the present. It's trying to get information for free. You can't do that.
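The asteroid analogy can be made concrete: under deterministic dynamics, propagating a state estimate into the future and then tracking back along that same projected trajectory returns exactly the estimate you started with. A minimal sketch:

```python
def step_forward(pos, vel, dt):
    return pos + vel * dt, vel

def step_backward(pos, vel, dt):
    return pos - vel * dt, vel

pos0, vel0 = 100.0, 3.0   # current (uncertain) estimate of the asteroid's state
p, v = pos0, vel0
for _ in range(10):        # project the trajectory 10 steps into the future...
    p, v = step_forward(p, v, 0.5)
for _ in range(10):        # ...then track back along the projection to the present
    p, v = step_backward(p, v, 0.5)
print(p == pos0)  # True: the round trip refines nothing
```

No observation entered the loop, so no information did either; the uncertainty in the initial estimate comes back untouched.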

(I think what I said under "Fuzzy values and fancy math don't help" is also relevant.)

I may be a little slow and missing something, but here are my jumbled thoughts.

I found moral nihilism convincing for a brief time. The argument seems compelling: for just about any moral statement you can think of, some people on earth have rejected it. You can't appeal to universal human values... we've tried, and I don't think there's a single one that has stood up to scrutiny as actually being literally universal. You always end up having to say, "Well, those humans are aberrant and evil."

Then I realized that there must be something more complicated going on. Else how explain the fact that I am curious about what is moral? I've changed my mind on moral questions -- pretty damn foundational ones. I've experienced moral ignorance ("I don't know what is right here.") I don't interact with morality as a preference. Or, when I do, sometimes I remember not to, and pull myself back.

I know people who claim to interact with morality as a preference -- only "I want to do this," never "I must do this." I'm skeptical. If you could really have chosen any set of principles ... why did you happen to choose principles that match pretty well with being a person of integrity? Quite a coincidence, that.

It's the curiosity and ignorance that really stumps me. I can be as curious about moral matters, or feel ignorant about moral matters, as about anything else. Why would I be curious, if not to learn how things really are? Is curiosity just another thing I have a preference for?

But it's weird to talk about a preference for curiosity, because I'm not sure that if you say "I want to be curious" that you're actually being curious. Curiosity is "I want to know why the sky is blue." It refers to something. I doubt it's coherent to make a principle of curiosity. (Curiosity is one of the Virtues of Rationality, but it's understood that you aren't curious by force of will, or by deciding to value curiosity. You're curious only if you want to know the answer.)

I've been reading Bury the Chains, a history of British abolitionism, and the beginning does give the impression of morals as something to be either discovered or invented.

The situation starts with the vast majority in Britain not noticing there was anything wrong with slavery. A slave ship captain who later became a prominent abolitionist is working on improving his morals -- by giving up swearing.

Once slavery became a public issue, opposition to it grew pretty quickly, but the story was surprising to me because I thought of morals as something fairly obvious.

Yes! And I think the salient point is not only that 18th century Englishmen didn't think slavery was wrong -- again, it's a fact that people disagree radically about morals -- but that the story of the abolition of slavery looks a lot like people learning for the first time that it was wrong. Changing their minds in response to seeing a diagram of a slave ship, for instance. "Oh. Wow. I need to update." (Or, to be more historically accurate, "I once was lost, but now am found; was blind, but now I see.")

This is an excellent question. I think it's curiosity about where reflective equilibrium would take you.

I suspect that, at an evolutionary equilibrium, we wouldn't have the concept of "morality". There would be things we would naturally want to do, and things we would naturally not want to do; but not things that we thought we ought to want to do but didn't.

I don't know if that would apply to reflective equilibrium.

I think agents in reflective equilibrium would (almost, but not quite, by definition) not have "morality" in that sense (unsatisfied higher-order desires, though that's definitely not the local common usage of "morality") except in some very rare equilibria with higher-order desires to remain inconsistent. However, they might value humans having to work to satisfy their own higher-order desires.

This article is a bit long. If it would not do violence to the ideas, I would prefer it had been broken up into a short series.

I think you're altogether correct, but with the caveat that "Friendly AI is useless and doomed to failure" is not a necessary conclusion of this piece.

Any one of us has consistent intuitions

I think this is false. Most of us have inconsistent intuitions, just like we have inconsistent beliefs. Though this strengthens, not undermines, your point.

This means that our present ethical arguments are largely the result of cultural change over the past few thousand years; and that the next few hundred years of change will provide ample grounds for additional arguments even if we resolve today's disagreements.

Indeed, but hopefully we'll get better at adapting ourselves to the changing times.

This article is a bit long. If it would not do violence to the ideas, I would prefer it had been broken up into a short series.

Plus, I could have gotten more karma that way. :)

It started out small. (And more wrong.)

I think you're altogether correct, but with the caveat that "Friendly AI is useless and doomed to failure" is not a necessary conclusion of this piece.

Agreed.

I feel like this post provides arguments similar to those I would have given if I were mentally more organized. For months I've been asserting (without argument), "Don't you see? Without 'absolute values' to steer us, optimizing over preferences is incoherent." The incoherence stems from the fact that our preferences are mutable, something we modify and optimize over a lifetime, and making distinctions between preferences given to us by genetics, our environmental history, or random chance is too arbitrary. There's no reason to elevate one source of preferences as somehow more special or more uniquely "myself".

I think that, despite my lack of mental organization on this topic, I experience this incoherence more immediately because my preferences for existence, consciousness and personal identity are not that strong. For example, I don't care about existential risk because I don't care about humankind existing in the abstract. If I must exist, then I would like my existence to have certain qualities. I don't see things being "my way" in the long run, so I prefer a short life, optimizing the present as best I can. I have some hope that things will nevertheless work out, but this hope is more of a 'watch and see' than something that drives action.

In the long run, as we understand more and more what human preferences are, I think there will be just a few coherent options: the choice of non-existence, the choice to wire-head or the choice to repress (what I consider to be) the salient aspects of consciousness and self-awareness.

Overt self-awareness may have been an evolutionary mistake that needs to be corrected. Perhaps we encounter no intelligent life because there is such a small temporal gap between the development of intelligence, the meta-awareness of goals, and the desire to relinquish awareness.

What causes me to pause in this analysis is that persons here, clearly more intelligent than myself, do not have a similar impulse to lie down and quit. I infer that it is a difference in preferences, and I represent some subset of human preferences that are not meant to persist ... because my preferences dissolve upon reflection.

I feel like these ideas are all connected and follow from the thesis of the post. Do they seem peripheral?

In the long run, as we understand more and more what human preferences are, I think there will be just a few coherent options: the choice of non-existence, the choice to wire-head or the choice to repress (what I consider to be) the salient aspects of consciousness and self-awareness.

In my imagination Less Wrong becomes really influential and spurs a powerful global movement, develops factions along these fault lines (with a fourth faction, clinging desperately to their moral nostalgia) and then self-destructs in a flame war to end all flame wars.

Maybe I'll write a story.

You can pry my self-awareness from my cold, dead neurons.

I don't think we're talking about the same type of incoherence; but I wouldn't want to have been deprived of these thoughts of yours because of that. Even though they're the most depressing thing I've heard today.

It wouldn't surprise me if strong preferences for existence, consciousness, and personal identity are partly physiologically based. And I mean fairly simple physiology, like neurotransmitter balance.

This doesn't mean they should be changed.

It does occur to me that I've been trying to upgrade my gusto level by a combination of willpower and beating up on myself, and this has made things a lot worse.

Did pjeby write a post against willpower? I think willpower is overrated. Cognitive behavioral therapy is better.

I find that careful introspection always dissolves the conceptual frames within which my preferences are formulated but generally leaves the actionable (but not the non-actionable) preferences intact.

I don't follow. Can you give examples? What's a conceptual frame, and what's an actionable vs. non-actionable preference? I infer the actionable/non-actionable distinction is related to the keep/don't-keep decision, but the terminology sounds to me like it just means "a preference you can satisfy" vs. "a preference you can't act to satisfy".

Also, could you give an example of a conceptual frame which got dissolved?

Free will vs. determinism, deontology vs. utilitarianism.

Could you give an example of an actionable preference that stays intact? Preferably one that is not evolutionary, because I agree that those are mostly indissoluble.

What about paperclips, though? Aren't those pretty consistently good?

Maybe sometime I'll write a post on why I think the paperclipper is a strawman. The paperclipper can't compete; it can happen only if a singleton goes bad.

The value systems we revile yet can't prove wrong (paperclipping and wireheading) are both evolutionary dead-ends. This suggests that blind evolution still implements our values better than our reason does; and allowing evolution to proceed is still better than computing a plan of action with our present level of understanding.

Besides, Clippy, a paperclip is just a staple that can't commit.

Besides, Clippy, a paperclip is just a staple that can't commit.

And a staple is just a one-use paperclip.

So there.

Maybe sometime I'll write a post on why I think the paperclipper is a strawman. The paperclipper can't compete; it can happen only if a singleton goes bad.

I think everyone who talks about paperclippers is talking about singletons gone bad (rather, started out bad and having reached reflective consistency).

This is extremely confused. Wireheading is an evolutionary dead-end because wireheads ignore their surroundings. Paperclippers -- and, for that matter, staplers and FAIs -- pay exclusive attention to their surroundings and ignore their terminal utility functions except to protect them physically. It's just that after acquiring all the resources available, Clippy makes clips and Friendly makes things that humans would want if they thought more clearly, such as the experience of less clear thinking humans eating ice cream.

such as the experience of less clear thinking humans eating ice cream

If the goal is to give people the same experience that they would get from eating ice cream, is it satisfied by giving them a button they can press to get that experience?

It's only wireheading if it becomes a primary value. If it's just fun subordinate to other values, it isn't different from "in the body" fun.

What's a primary value? This sounds like a binary distinction, and I'm always skeptical of binary distinctions.

You could say the badness of the action is proportional to the fraction of your time that you spend doing it. But for that to work, you would have to assign the action the same disutility per unit time.

Are you saying that wireheading and other forms of fun are no different; and all fun should be pursued in moderation? So spending 1 hour pushing your button is comparable to spending 1 hour attending a concert?
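The linear model being questioned here can be made explicit. This is a sketch; the per-hour rates are my arbitrary assumptions:

```python
def total_badness(hours_by_activity, badness_per_hour):
    """If disutility is linear in time, only hours x rate matters, not which activity."""
    return sum(hours * badness_per_hour[a] for a, hours in hours_by_activity.items())

# Assumption: the button and the concert carry the same rate per unit time.
rates = {"button": 1.0, "concert": 1.0}
print(total_badness({"button": 1, "concert": 0}, rates)
      == total_badness({"button": 0, "concert": 1}, rates))  # True: the hours are interchangeable
```

Under this model the two hours are equivalent by construction; the disagreement is over whether wireheading really does carry the same rate as other fun.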

(That's only a paperclipper with no discounting of the future, BTW.)

Paperclippers are not evolutionarily viable, nor is there any plausible evolutionary explanation for paperclippers to emerge.

You can posit a single artificial entity becoming a paperclipper via bad design. In the present context, which is of many agents trying to agree on ethics, this single entity has only a small voice.

It's legit to talk about paperclippers in the context of the danger they pose if they become a singleton. It's not legit to bring them up outside that context as a bogeyman to dismiss the idea of agreement on values.

Maybe sometime I'll write a post on why I think the paperclipper is a strawman. The paperclipper can't compete; it can happen only if a singleton goes bad.

You don't think we can accidentally build a singleton that goes bad?

(I'm not even sure a singleton can start off not being bad.)

The context here is attempting to agree with other agents about ethics. A singleton doesn't have that problem. Being a singleton means never having to say you're sorry.

Clear thinkers who can communicate cheaply are automatically collectively a singleton with a very complex utility function. No-one generally has to attempt to agree with other agents about ethics, they only have to take actions that take into account the conditional behaviors of others.

What?

If we accept these semantics (a collection of clear thinkers is a "singleton" because you can imagine drawing a circle around them and labelling them a system), then there's no requirement for the thinkers to be clear, or to communicate cheaply. We are a singleton already.

Then the word singleton is useless.

No-one generally has to attempt to agree with other agents about ethics, they only have to take actions that take into account the conditional behaviors of others.

This is playing with semantics to sidestep real issues. No one "has to" attempt to agree with other agents, in the same sense that no one "has to" achieve their goals, or avoid pain, or live.

You're defining away everything of importance. All that's left is a universe of agents whose actions and conflicts are dismissed as just a part of computation of the great Singleton within us all. Om.