Every now and then, you run across someone who has discovered the One Great Moral Principle, of which all other values are a mere derivative consequence.

    I run across more of these people than you do.  Only in my case, it's people who know the amazingly simple utility function that is all you need to program into an artificial superintelligence and then everything will turn out fine.

    (This post should come as an anticlimax, since you already know virtually all the concepts involved, I bloody well hope.  See yesterday's post, and all my posts since October 31st, actually...)

    Some people, when they encounter the how-to-program-a-superintelligence problem, try to solve the problem immediately.  Norman R. F. Maier:  "Do not propose solutions until the problem has been discussed as thoroughly as possible without suggesting any."  Robyn Dawes:  "I have often used this edict with groups I have led - particularly when they face a very tough problem, which is when group members are most apt to propose solutions immediately."  Friendly AI is an extremely tough problem so people solve it extremely fast.

    There are several major classes of fast wrong solutions I've observed; and one of these is the Incredibly Simple Utility Function That Is All A Superintelligence Needs For Everything To Work Out Just Fine.

    I may have contributed to this problem with a really poor choice of phrasing, years ago when I first started talking about "Friendly AI".  I referred to the optimization criterion of an optimization process - the region into which an agent tries to steer the future - as the "supergoal".  I'd meant "super" in the sense of "parent", the source of a directed link in an acyclic graph.  But it seems the effect of my phrasing was to send some people into happy death spirals as they tried to imagine the Superest Goal Ever, the Goal That Overrides All Other Goals, the Single Ultimate Rule From Which All Ethics Can Be Derived.

    But a utility function doesn't have to be simple.  It can contain an arbitrary number of terms.  We have every reason to believe that insofar as humans can be said to have values, there are lots of them - high Kolmogorov complexity.  A human brain implements a thousand shards of desire, though this fact may not be appreciated by one who has not studied evolutionary psychology.  (Try to explain this without a full, long introduction, and the listener hears "humans are trying to maximize fitness", which is exactly the opposite of what evolutionary psychology says.)

    So far as descriptive theories of morality are concerned, the complicatedness of human morality is a known fact.  It is a descriptive fact about human beings, that the love of a parent for a child, and the love of a child for a parent, and the love of a man for a woman, and the love of a woman for a man, have not been cognitively derived from each other or from any other value.  A mother doesn't have to do complicated moral philosophy to love her daughter, nor extrapolate the consequences to some other desideratum.  There are many such shards of desire, all different values.

    Leave out just one of these values from a superintelligence, and even if you successfully include every other value, you could end up with a hyperexistential catastrophe, a fate worse than death.  If there's a superintelligence that wants everything for us that we want for ourselves, except the human values relating to controlling your own life and achieving your own goals, that's one of the oldest dystopias in the book.  (Jack Williamson's "With Folded Hands", in this case.)
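    A toy sketch of the omitted-value failure, under loudly stated assumptions: the three value names, the product-form utility, and the fixed budget of 12 units are all hypothetical, chosen only to make the arithmetic visible. An optimizer that is handed a utility function with one term missing has no reason to spend anything on the missing term, so that value ends up with nothing.

```python
from itertools import product
from math import prod

VALUES = ("health", "friendship", "autonomy")  # hypothetical stand-ins for real values
BUDGET = 12  # hypothetical units of effort to divide among the values


def utility(allocation, included):
    """Toy utility: the product of the effort given to each included value."""
    return prod(allocation[v] for v in included)


def optimize(included):
    """Brute-force the integer allocation that maximizes utility over `included`."""
    candidates = (
        dict(zip(VALUES, amounts))
        for amounts in product(range(BUDGET + 1), repeat=len(VALUES))
        if sum(amounts) == BUDGET
    )
    return max(candidates, key=lambda a: utility(a, included))


# Optimizer given every value versus optimizer given all but "autonomy".
complete = optimize(VALUES)
missing_one = optimize(("health", "friendship"))

# Score both outcomes by the full value list.
print(complete, utility(complete, VALUES))
print(missing_one, utility(missing_one, VALUES))
```

    In this toy, the optimizer with the incomplete list does better on the values it knows about and drives the omitted one to zero. The sketch says nothing about how bad a real outcome would be; it only shows why the omitted term receives no protection.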

    So how does the one constructing the Amazingly Simple Utility Function deal with this objection?

    Objection?  Objection?  Why would they be searching for possible objections to their lovely theory?  (Note that the process of searching for real, fatal objections isn't the same as performing a dutiful search that amazingly hits on only questions to which they have a snappy answer.)  They don't know any of this stuff.  They aren't thinking about burdens of proof.  They don't know the problem is difficult.  They heard the word "supergoal" and went off in a happy death spiral around "complexity" or whatever.

    Press them on some particular point, like the love a mother has for her children, and they reply "But if the superintelligence wants 'complexity', it will see how complicated the parent-child relationship is, and therefore encourage mothers to love their children."  Goodness, where do I start?

    Begin with the motivated stopping:  A superintelligence actually searching for ways to maximize complexity wouldn't conveniently stop if it noticed that a parent-child relation was complex.  It would ask if anything else was more complex.  This is a fake justification; the one trying to argue the imaginary superintelligence into a policy selection, didn't really arrive at that policy proposal by carrying out a pure search for ways to maximize complexity.
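    The difference between motivated stopping and an actual search can be sketched in a few lines. The candidate names and their "complexity" scores below are made up for illustration; nothing here measures real complexity.

```python
# Hypothetical complexity scores; the numbers are invented for illustration.
CANDIDATES = {
    "parent-child relationship": 40,
    "weather system": 75,
    "random noise on a hard drive": 99,
}


def motivated_search(candidates, favorite):
    """Stop as soon as the favored option is found to have a positive score."""
    for name, score in candidates.items():
        if name == favorite and score > 0:
            return name
    return None


def honest_search(candidates):
    """Examine every option and return the one with the highest score."""
    return max(candidates, key=candidates.get)


stopped_early = motivated_search(CANDIDATES, "parent-child relationship")
actual_maximum = honest_search(CANDIDATES)
```

    The first function returns the answer its author already wanted; the second returns whatever scores highest, which in this made-up table is not the parent-child relationship.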

    The whole argument is a fake morality.  If what you really valued was complexity, then you would be justifying the parental-love drive by pointing to how it increases complexity.  If you justify a complexity drive by alleging that it increases parental love, it means that what you really value is the parental love.  It's like giving a prosocial argument in favor of selfishness.

    But if you consider the affective death spiral, then it doesn't increase the perceived niceness of "complexity" to say "A mother's relationship to her daughter is only important because it increases complexity; consider that if the relationship became simpler, we would not value it."  What does increase the perceived niceness of "complexity" is saying, "If you set out to increase complexity, mothers will love their daughters - look at the positive consequence this has!"

    This point applies whenever you run across a moralist who tries to convince you that their One Great Idea is all that anyone needs for moral judgment, and proves this by saying, "Look at all these positive consequences of this Great Thingy", rather than saying, "Look at how all these things we think of as 'positive' are only positive when their consequence is to increase the Great Thingy."  The latter being what you'd actually need to carry such an argument.

    But if you're trying to persuade others (or yourself) of your theory that the One Great Idea is "bananas", you'll sell a lot more bananas by arguing how bananas lead to better sex, rather than claiming that you should only want sex when it leads to bananas.

    Unless you're so far gone into the Happy Death Spiral that you really do start saying "Sex is only good when it leads to bananas."  Then you're in trouble.  But at least you won't convince anyone else.

    In the end, the only process that reliably regenerates all the local decisions you would make given your morality, is your morality.  Anything else - any attempt to substitute instrumental means for terminal ends - ends up losing purpose and requiring an infinite number of patches because the system doesn't contain the source of the instructions you're giving it.  You shouldn't expect to be able to compress a human morality down to a simple utility function, any more than you should expect to compress a large computer file down to 10 bits.
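    The 10-bit comparison rests on a counting argument, sketched here with hypothetical function names. A fixed-length code of 10 bits can take only 1024 distinct values, so a fixed decompressor can reproduce at most 1024 different files from it, however large the files are.

```python
from fractions import Fraction


def distinct_codes(bits):
    """Number of different values a fixed-length code of `bits` bits can take."""
    return 2 ** bits


def fraction_recoverable(file_bytes, code_bits):
    """Upper bound on the share of all files of `file_bytes` bytes that a fixed
    lossless scheme could reproduce from codes of `code_bits` bits."""
    total_files = 2 ** (8 * file_bytes)
    return min(Fraction(1), Fraction(distinct_codes(code_bits), total_files))


codes = distinct_codes(10)
tiny_file_share = fraction_recoverable(2, 10)  # even a 2-byte file is mostly out of reach
```

    A decompressor built specially for one file could map a short code to it, but then the information lives in the decompressor rather than in the 10 bits, which is the same point as the system needing to contain the source of the instructions.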

    Addendum:  Please note that we're not yet ready to discuss Friendly AI, as such, on Overcoming Bias.  That will require a lot more prerequisite material.  This post is only about why simple utility functions fail to compress our values.


    Since you've advertised this as a summing up post, which took a month of prerequisites to set up, perhaps you could give us a hint of where you are going with all this. Is the conclusion going to be that the best way to tell a machine what we want is to keep lots of us around, unmodified, to continue to tell the machine what we want?

    But it's not a grand climax, as I remarked yesterday - just a post that happened to require a lot of prerequisites.

    Verbally instructing a powerful AI as to what we want would hardly suffice to make it safe, if the AI did not already know what we wanted (Type II genie).

    I'm not going to spring some grand, predefined conclusion at the end of this. All that is being gradually sprung here is the ability to understand what the Friendly AI problem is, and why it is hard. I have observed that the chief difficulty I have with Friendly AI discussions is getting people to understand the question, the requirements faced by an attempted answer.

    Verbally instructing a powerful AI as to what we want would hardly suffice to make it safe, if the AI did not already know what we wanted (Type II genie).

    The genie only needs to have a terminal goal of interpreting instructions correctly. If it has that TG, it will acquire the instrumental goal of checking for areas of ambiguity and misunderstanding, and the further instrumental goal of resolving them. At the point where the AI is satisfied it has understood the instruction, it will know as much about human morality/preferences as it needs to understand the instruction correctly. It does not need to be preloaded with complete knowledge of morality/preferences: it will ask questions or otherwise research.

    The type II genie story is not very relevant to the wider UFAI issue, because the genie is posited as being non-sentient, apparently meaning it does not have full natural language, and also does not have any self-reflexive capabilities. As such, it can neither realise it is in a box, nor talk its way out. But why shouldn't an AI that is not linguistically gifted enough to talk its way out of a box be linguistically gifted enough to understand instructions correctly?

    More problems with morality = preferences:

    It has been stated that this post shows that all values are moral values (or that there is no difference between morality and valuation in general, or...), in contrast with the common-sense view that there are clear examples of morally neutral preferences, such as preferences for different flavours of ice cream.

    I am not convinced by the explanation, since it also applies to non-moral preferences. If I have a lower-priority non-moral preference to eat tasty food, and a higher-priority preference to stay slim, I need to consider my higher-priority preference when wishing for yummy ice cream.

    To be sure, an agent capable of acting morally will have morality among their higher-priority preferences -- it has to be among the higher-order preferences, because it has to override other preferences for the agent to act morally. Therefore, when they scan their higher-priority preferences, they will happen to encounter their moral preferences. But that does not mean any preference is necessarily a moral preference. And their moral preferences override other preferences, which are therefore non-moral, or at least less moral.

    There is no safe wish smaller than an entire human morality.

    There is no safe wish smaller than the whole subset of the value structure, moral or amoral, above it in priority. The subset below doesn't matter. However, a value structure need not be moral at all, and the lower stories will probably be amoral even if the upper stories are not.

    Therefore morality is in general a subset of preferences, as common sense maintained all along.

    a terminal goal of interpreting instructions correctly

    There is a huge amount of complexity hidden beneath this simple description.

    I'll say it again: absolute complexity is not relative complexity.

    Everything in AGI is very complex in absolute terms.

    In relative terms, language is less complex than language+morality.

    That would matter if you didn't need language+morality to interpret language in this case. To interpret instructions correctly, you have to understand what they mean, and that requires a full understanding of the motivations underlying the request.

    You don't just need language, you need language+thought, which is even more complex than language+morality.

    I am using "having language" to mean "having language plus thought", i.e. to have linguistic understanding, i.e. to have the ability to pass a Turing Test. Language without thought is just parroting.

    To follow instructions relating to morality correctly, an entity must be able to understand them correctly at the semantic level. An entity need not agree with them, or hold to them itself, as we can see from the ability of people to play along with social rules they don't personally agree with.

    No, that's not right. Language + thought means understanding language and being able to fully model the mindstate of the person who was speaking to you. If you don't have this, and just have language, 'get grandma out of the burning house' gets you the lethal ejector seat method. If you want do-what-I-mean rather than do-what-I-say, you need full thought modeling. Which is obviously harder than language + morality, which requires only being able to parse language correctly and understand a certain category of thought.

    Or to phrase it a different way: language on its own gets you nothing productive, just a system that can correctly parse statements. To understand what they mean, rather than what they say, you need something much broader, and language+morality is smaller than that broad thing.

    Fully understanding the semantics of morality may be simpler than fully understanding the semantics of everything, but it doesn't get you AI safety, because an AI can understand something without being motivated to act on it.

    When I wrote "language", I meant words + understanding... understanding in general, therefore including understanding of ethics... and when I wrote "morality" I meant a kind of motivation.

    When I wrote "language", I meant

    When I use a word… it means just what I choose it to mean

    (Humpty Dumpty, in Through the Looking-Glass)

    The traditional idea with genies is that they give you what you asked for, but you missed the implications of what you wanted to have.

    It's garbage in, garbage out.

    The problem isn't vague instructions but vague goals.

    Yeah...didn't I just argue against that?* A genie with the goal of interpreting instructions perfectly and the competence to interpret instructions correctly would interpret instructions correctly.

    * Or at least stipulate in the very first sentence.

    I guess I'd be interested in hearing sometime why the usual subordinate/superior relation mechanisms are not expected to work well in this case. Usually a subordinate has a sloppy model of what his superior wants, and when he has doubts whether his superior would approve of some action, he asks. Both sides adjust their rates of talking, acting, and monitoring to match their estimated modeling uncertainty.

    It may be that I need to read one of those links in the previous post, but - I tend to imagine that AIs will need to have upbringings of some sort. We acquire morality much as we acquire knowledge - does it suffice for the AIs to do the same?

    You shouldn't expect to be able to compress a human morality down to a simple utility function, any more than you should expect to compress a large computer file down to 10 bits.

    I think it is a helpful exercise, in trying to live "The Examined Life", to attempt to compress a personal morality down to the fewest number of explicitly stated values.

    Then, pay special attention to exactly where the "compressed morality" is deficient in describing the actual personal morality.

    I find, often but not always, that it is my personal morality that would benefit from modification, to make it more like the "compressed morality".


    My opinion - which Eliezer may find quite silly - is that the formalism of utility functions is itself flawed. I think that the key problem with this system is that it utterly resists change. As many people have pointed out, if you have some utility function U, which may be very complicated ["we are godshatter"], then a rational agent whose motivational system strives to maximize U will never ever ever knowingly change U. One might say that a utility maximizer is motivationally frozen. No input from the real world can change what it thinks is important.

    In my opinion, the big thing missing in FAI theory is a good understanding of how to set up a motivational architecture which does not have the above property, i.e. a system which allows inputs from the world to change what an agent thinks is important.

    I've written down a vague idea of how such a thing might work on my blog:


    Of course this is all very preliminary, but I think that we have to get away from the limiting formalism of fixed utility functions.

    There are certainly a lot of people who have been working on this problem for a long time. Indeed, since before computers were invented. Obviously I'm talking about moral philosophers. There is a lot of bad moral philosophy, but there is also a fair amount of very good moral philosophy tucked away in there -- more than one lifetime worth of brilliant insights. It is tucked away well enough that I doubt Eliezer has encountered more than a little of it. I could certainly understand people thinking it is all rubbish by taking a reasonably large sample and coming away only with poorly thought out ethics (which happens all too often), but there really is some good stuff in there.

    My advice would be to read Reasons and Persons (by Derek Parfit) and The Methods of Ethics (by Henry Sidgwick). They are good starting places and someone like Eliezer would probably enjoy reading them too.

    The post implies that utilitarianism is obviously false, but I don't think this is so. Even if it were false, do you really think it would be so obviously false? Utilitarians have unsurprisingly been aware of these issues for a very long time and have answers to them. Happiness being the sole good (for humans at least) is in no way invalidated by the complexity of relationship bonds. It is also not invalidated by the fact that people sometimes prefer outcomes which make them less happy (indeed there is one flavour of utilitarianism for happiness and one for preferences and they each have adherents).

    It is certainly difficult to work out the exhaustive list of what has intrinsic value (I agree with that!), and I would have strong reservations about putting 'happiness' into the AI given my current uncertainty and the consequences of being mistaken, but it is far from being obviously false. In particular, it has the best claim I know of to fitting your description of the property that is necessary in everything that is good ('what use X without leading to any happiness?').


    Oops: forgot the html. I've tackled this issue on my blog

    The above comment was posted by me, Toby Ord. I'm not sure why my name didn't appear -- I'm logged in.


    what do you think of the following idea that I have already posted on the sl4 mailing list:

    Bootstrap the FAI by first building a neutral obedient AI(OAI) that is constrained in such a way that it doesn't act besides giving answers to questions. Once you have that could it possibly be easier to build a FAI? The OAI could be a tremendous help answering difficult questions and proposing solutions.

    This reminds me of the process of bootstrapping compilers: http://en.wikipedia.org/wiki/Bootstrapping_(compilers) http://cns2.uni.edu/~wallingf/teaching/155/sessions/session02.html

    Gentle readers, I'm afraid we're not yet ready to have more general conversations about Friendly AI. Just about how a simple utility function won't compress all human values. One thing at a time.

    (You should all feel welcome to look for the flaws in your own solutions, though.)

    Toby Ord, the case I made against happiness as sole value is contained in "Not for the Sake of Happiness (Alone)".

    "Bootstrap the FAI by first building a neutral obedient AI(OAI) that is constrained in such a way that it doesn't act besides giving answers to questions."

    As long as we are sure not to feed it too-hard questions, specifically, questions that are hard to answer a priori without actually doing something. (E.g., an AI that tried to plan the economy would likely find it impossible to define and thus solve the relevant equations without being able to adjust some parameters.)

    If you have to avoid asking the AI too hard a question to stop it from taking over the world, you've already done something wrong. The important part is to design the oracle AI such that it is capable of admitting that it can't figure out an adequate answer with its currently available resources, and then moving on to the next question.

    Turning the many into the one (concepts) is just what humans do, so it's understandable that people keep falling into this trap. But, I agree, a bit of honest observation does tell you that we have many "ultimate" values.

    I am wondering whether simple utility functions exist in special cases. For example, if a man is alone and starving in the Arctic, does he have a simple utility function? Or suppose he's a drug addict criminal psychotic desperately scheming to get the next fix? Also, just to make sure I understand the terminology, suppose it's a lower animal, say an alligator, rather than a human. Does an alligator have a simple utility function?

    I'm not sure that friendly AI even makes conceptual sense. I think of it as the "genie to an ant problem". An ant has the ability to give you commands, and by your basic nature you must obey the letter of the command. How can the ant tie you up in fail-safes so you can't take an excuse to stomp him, burn him with a magnifying glass, feed him poison, etc? (NB: said fail-safes must be conceivable to an ant!) It's impossible. Even general benevolence doesn't help - you might decide to feed him to a starving bird.

    What about, if by your basic nature you like the ant? Hell, you might even find yourself doing things like moving him off the road he'd wandered onto on his own.

    But just liking the ants is also not sufficient. You might kill the bird for wanting to eat the ant, and then realize that all birds are threats, and kill all birds without telling the ants, because that's the maximizing solution, despite the possibility of the ants not wanting this and objecting had they known about it.

    FAI is not impossible, but it's certainly a hard problem in many ways.

    Also, there are problems with "by your basic nature you like the ant". Have you read the Guide to Words yet?

    Also, there are problems with "by your basic nature you like the ant".

    Indeed. I was hoping to refute the refutation in its own language.

    Ah, thanks for clarifying that. While I don't want to other-optimize, I feel compelled to tell you that many people would say such a strategy is usually suboptimal, and often leads to falling into the trap of an "argumentative opponent".

    Incidentally, this strikes me as particularly vulnerable to the worst argument in the world, (probably) due to my availability heuristic.

    Actually I rather like the idea of being optimized. Have you got any good links to sources of argument/counterargument strategies? The more I read about the Dark Arts, the more I wish to learn them.

    Being optimized is net positive, and generally understood as good. Other-optimizing, on the other hand, is prone to tons of heuristical errors, map imperfections, scaling problems, mind projection and many other problems such that attempting to optimize the strategy of someone else for many things not already reduced is very risky and has low expected utility, often in the negatives. Telling others what argumentative techniques to use or not definitely falls into this category.

    That's the thing with the Dark Arts. They lure you in, craft a beautiful song, fashion an intricate and alluring web of rapid-access winning arguments with seemingly massive instrumental value towards achieving further-reaching goals... but they trap and ensnare you, they freeze your thought into their logic, they slow down and hamper solid rationality, and they sabotage the thoughts of others.

    It takes quite a master of rationality to use the art of Shadowdancing with reliably positive net expected utility. As long as you want to optimize for the "greater good", that is.

    I thank you for your concern. I'd never even think about being evil unless it were for the greater good.

    I'd never even think about being evil unless it were for the greater good.

    Most evil people would say that. They'd even believe it.

    curses. there goes my cover.

    (BTW, this is an outdated opinion and I no longer think this.)

    Bootstrap the FAI by first building a neutral obedient AI(OAI) that is constrained in such a way that it doesn't act besides giving answers to questions.

    If you're smart enough, you could rule the world very easily just by giving answers to questions.

    Very simple example:

    Questioner: Is there a command economy that can produce much more wealth and great happiness for all?
    AI: Yes.
    Questioner: Could you design it?
    AI: Yes.
    Questioner: Would you need to micromanage it?
    AI: Yes.
    Questioner: Would it truly be fabulous?
    AI: Yes.

    Then people implement the command economy (remember that if the AI is a social genius, it can shade its previous answers to bring people to want such an economy).

    Then a few years later, we have the AI in total control of the entire economy and absolutely vital to it - in which case it doesn't matter that the constraint is only to answer questions, the AI has already ascended to absolute power. No need even for the dramatic finish:

    Questioner: Would you shut down the entire economy, causing misery, war, and the probable extinction of humanity, if we don't obey you?
    AI: Yes.
    Questioner: Is there any chance we could resist you?
    AI: No.

    And the term "obedient" doesn't help your case. Unless you want the AI to obey, blindly, every single order given to it in every single case by every single human, then "obedient" is a massively subtle and complicated concept - nearly as hard as friendly, in my opinion (maybe harder, in fact - it may be that quasi-friendly is a prerequisite for obedient).


    In 'Not for the Sake of Happiness (Alone)' you made a case that happiness is not what each of us is consciously aiming at, or what each of us is ultimately aiming at (potentially through unconscious mechanisms). However, these points are not what utilitarianism is about and few utilitarians believe either of those things. What they do believe is that happiness is what is good for each of us. Even if someone consciously and unconsciously shuns happiness through his or her life, utilitarians argue that nonetheless that life is better for the person the more happiness it involves. While a person may not ultimately value it (in the sense that they aim for it), it may be the only thing of ultimate value to them (in the sense that it makes their life go better).

    This distinction may sound controversial, but it clearly comes up in less difficult cases. For example, someone might have had a brain malfunction and aim to reduce their life's value according to all sensible measures (reduce happiness, pleasure, education, knowledge, friends, etc). They are deeply and sincerely aiming at this, but it makes their life go worse (they value it, but it is not of value to them).

    As I've said, I don't think the happiness account is obviously true (I have some doubts) but it is a very plausible theory and I can't see any argument you have actually made against it, only at similar sounding theories.

    Toby Ord, you should probably sign out and sign back on. Re: utilitarianism, if we are not striving, consciously or unconsciously, for happiness and nothing else, and if I currently take a stand against rewiring my brain to enjoy an illusion of scientific discovery, and if I regard this as a deeply important and moral decision, then why on Earth should I listen to the one who comes and says, "Ah, but happiness is all that is good for you, whether you believe it or not"? Why would I not simply reply "No"?

    And, all:

    WE ARE NOT READY TO DISCUSS GENERAL ISSUES OF FRIENDLY AI YET. It took an extra month, beyond what I had anticipated, just to get to the point where I could say in defensible detail why a simple utility function wouldn't do the trick. We are nowhere near the point where I can answer, in defensible detail, most of these other questions. DO NOT PROPOSE SOLUTIONS BEFORE INVESTIGATING THE PROBLEM AS DEEPLY AS POSSIBLE WITHOUT PROPOSING ANY. If you do propose a solution, then attack your own answer, don't wait for me to do it! Because any resolution you come up with to Friendly AI is nearly certain to be wrong - whether it's a positive policy or an impossibility proof - and so you can get a lot further by attacking your own resolution than defending it. If you have to rationalize, it helps to be rationalizing the correct answer rather than the wrong answer. DON'T STOP AT THE FIRST ANSWER YOU SEE. Question your first reaction, then question the questions.

    But above all, wait on having this discussion, okay?


    @ Roland: "Bootstrap the FAI by first building a neutral obedient AI(OAI) that is constrained in such a way that it doesn't act besides giving answers to questions"

    Yes, I've had the same idea, or almost the same idea. I call this the "Artificial Philosophy paradigm" - the idea that if you could build a very intelligent AI, then you could give it the goal of answering your questions, subject to the constraint that it is not allowed to influence the world except through talking to you. You would probably want to start by feeding this AI a large amount of "background data" about human life [videofeeds from ordinary people, transcripts of people's diaries, interviews with ordinary folks] and ask it to get into the same moral frame of reference as we are.

    @ Stuart Armstrong: "If you're smart enough, you could rule the world very easily just by giving answers to questions."

    +10 points for spotting the giant cheesecake fallacy in this criticism. This AI has no desire to rule the world.


    I'm not saying that I have given you convincing reasons to believe this. I think I could give quite convincing reasons (not that I am totally convinced myself) but it would take at least a few thousand words. I'll probably wait until you next swing past Oxford and talk to you a bit about what the last couple of thousand years of ethical thought can offer the FAI program (short answer: not much for 2,500 years, but more than you may think).

    For the moment, I'm just pointing out that it is currently nil all in the argument regarding happiness as an ultimate value. You have given reasons to believe it is not what we aim at (but this is not very related) and have said that you have strong intuitions that the answer is 'no'. I have used my comments to point out that this does not provide any argument against the position, but have not made any real positive arguments. For what it's worth, my intuition is 'yes', it is all that matters. Barring the tiny force of our one-bit statements of our intuitions, I think the real question hasn't begun to be debated as far as this weblog is concerned. The upshot of this is that the most prominent reductive answer to the question of what is of ultimate value (which has been developed for centuries by those whose profession is to study this question) is still a very live option, contrary to what your post claims.


    @Eliezer: "But above all, wait on having this discussion, okay?"

    I don't see a good way to work on this problem without proposing some preliminary solutions. Let me just say that I certainly don't expect the solutions that I've proposed (UIVs, artificial philosophy) to actually work; I see them as a starting point, and I absolutely promise not to become emotionally attached to them.

    If you can think of a good way of attacking the problem that does not involve proposing and criticizing solutions, then do a post on it.

    If we assume that our morality arises from the complex interplay of a thousand shards of genetic desire, each with its own abstract 'value' in our minds, then I wholly agree that we can't reduce things down to a single function.

    But clear this up for a non-expert: does stating that ethics cannot come from a single utility function imply that no amount of utility functions (i.e. no computer program) will do the job on their own? Surely an enormously complex morality-machine (NOT a Friendly AI!) would be of no application to human morality unless it had a few hard-coded values such as those we place on life, fairness, freedom and so on? If only because this is the way our morality works.

    If parental love is ultimately a genetic function, surely it's impossible to separate it from the biological substrate without simply reverse-engineering the brain? More bluntly, how can we write a value-free program that makes parental love look sensible?

    Roko, I would say information gathering is the number 1 goal at this point. Unless you understand the question your answers are empty. Whatever you think you know about the question; you don't know, I don't know and I'm pretty sure Eli doesn't know. Some may know more than others but it still isn't close to whole.

    For the record, I would guess you know more about the AI problem than I do.

    In my experience, the best answers become intuitive at a certain level of understanding. The goal in my eyes isn't to reach the answer, but to reach the understanding that yields the answer.


    I think if you read Eli's chapter on biases, proposing solutions tends to cause a version of the anchoring bias, even if you consciously attempt to emotionally divorce yourself from them.

    I'd love to know the loose timeframe we're working with here, though, and yeah, I realize sometimes a day's work turns into a month. If I had an idea, I could go back and start from post one, and time it to be up to speed at about the right time.

    It seems to me that you are making an error in conflating, or at least not distinguishing, what people in fact prefer/strive at and what is in fact morally desirable.

    So long as you are talking about what people actually strive for, the only answer is the actual list of things people do. There is unlikely to be any fact of the matter AT ALL about what someone's 'real preferences' are that's much less complicated than a description of their total overall behavior.

    However, the only reason your argument seems to be making a nontrivial point is because it talks about morality and utility functions. But the reason people take moral talk seriously and give it more weight than they would arguments about aesthetics or what's tasty is because people take moral talk to be describing objective things out in the world. That is, when someone (other than a few philosophically inclined exceptions) says, "Hey, don't do that, it's wrong," they are appealing to the idea that there are objective moral facts the same way there are objective physical facts, and people can be wrong about them just like they can be wrong about physical facts.

    Now there are reasonable arguments to the effect that there is no such thing as morality at all. If you find these persuasive that's fine, but then talking about moral notions without qualification is just as misleading as talking about angels without explaining you redefined them to mean certain neurochemical effects. On the other hand, if morality is a real thing that's out there in some sense, it's perfectly fair to induct on it the same way we induct on physical laws. If you go look at the actual results of physical experiments you see lots of noise (experimental errors, random effects) but we reasonably pick the simple explanation and assume that the other effects are due to measuring errors. People who claim that utilitarianism is the one true thing to maximize aren't claiming that this is what other people actually work towards. They are saying other people are objectively wrong in not choosing to maximize this.

    Now, despite being (when I believe in morality at all) a utilitarian myself, I think programming this safeguard into robots would be potentially very dangerous. If we could be sure they would do it perfectly accurately, fine, people sacrificed for the greater good notwithstanding. However, I would worry that the system would be chaotic, with even very small errors in judgement about utility causing the true maximizer to make horrific mistakes.


    @ J. Hill:

    "Whatever you think you know about the question; you don't know" - a little too Zen for my liking. I'm pretty sure I know something about FAI.

    @ burger flipper:

    "proposing solutions tends to cause a version of the anchoring bias"

    yeah, I can see how that might be a problem. For example, I get the feeling that everyone is "anchored" to the idea of a utility-maximization based AI.

    The best way to get rid of anchoring bias is, IMO, to propose lots of radically different solutions to the problem so that the anchoring to each one is canceled out by anchoring to the others.

    Toby Ord, you need to click "Sign Out", then sign in again, even if it shows "You are currently signed in as Toby Ord."

    The reason I'm skeptical of utilitarianism in this context is that my view on morality is that it is subjectively objective. I experience morality as objective in the sense of not being subject to arbitrary alterations and having questions with determinable yet surprising answers. The one who experiences this is me, and a differently constructed agent could experience a different morality. While I still encounter moral questions with surprising answers, I don't expect to ever encounter an answer so surprising as to convince me that torturing children should be positively terminally valued. I would be very surprised if there were a valid moral argument leading up to that conclusion. I would be less surprised, but still very much so, if there were a valid moral argument leading up to hedonic utilitarianism - and if you present this moral philosophy today, you must have arguments that justify it today. You can't just say, "Well, there might be an argument" any more than you can just say, "Well, there might be an argument that you should give all your money to Reverend Moon."

    But clear this up for a non-expert: does stating that ethics cannot come from a single utility function imply that no amount of utility functions (i.e. no computer program) will do the job on their own?

    That doesn't follow from the information given. As far as we know at this point, a complex utility function will do the job. There are reasons why you might not want to use a standard-yet-complex utility function in a Friendly AI, but those reasons are beyond the scope of this particular discussion.

    You yourself are a computer. You are a protein computer. What you can do, computers can do.


    Re utilitarianism: it's fine to have an intuition that it is incorrect. It is also fine to be sceptical in the presence of a strong intuition against something and no good arguments presented so far in its favor (not in this forum, and presumably not in your experience). I was just pointing out that you have so far offered no arguments against it (just against a related but independent point) and so it is hardly refuted.

    Re posts and names: I posted the 7:26pm, 5:56am and the 9:40am posts (and I tried the log in and out trick before the 9:40 post to no avail). I did not post the 1:09pm post that has my name signed to it and is making similar points to those I made earlier. Either the TypeKey system is really undergoing some problems or someone is impersonating me. Probably the former. Until this is sorted out, I'll keep trying things like logging in and out and resetting my cookies, and will also sign thus,


    Now that I think of it, you probably just saw that it was unsigned and assumed it was me, putting my name on it.


    Toby, I don't understand what would count as an "argument against" utilitarianism if the arguments I gave here don't count. What argument has ever been offered for utilitarianism of a type that I have not countered in kind? What criterion are they appealing to which I am not?

    "Utilitarians have unsurprisingly been aware of these issues for a very long time and have answers to them. Happiness being the sole good (for humans at least) is in no way invalidated by the complexity of relationship bonds." (Toby Ord)

    Toby, one can be utilitarian and pluralist, so "happiness" need not be the only good on a utilitarian theory. Right? (I contradict only to corroborate.)

    Eliezer, when you say you think morality is "subjectively objective," I take that to mean that a given morality is "true" relative to this or that agent -- not "relative" in the pejorative sense, but in the "objective" sense somewhat analogous to that connoted by relativity theory in physics: In observing moral phenomena, the agent is (part of) the frame of reference, so that the moral facts are (1) agent-relative but (2) objectively true. (Which is why, as a matter of moral theory, it would probably be more fruitful to construe 'moral relativity' merely as the denial of moral universality instead of as the denial of normative facts-of-the-matter tout court -- particularly since no one really buys moral relativity in the conventional sense.)

    +10 points for spotting the giant cheesecake fallacy in this criticism. This AI has no desire to rule the world.

    You're claiming that we know enough about the AI's motivations to know that ruling the world will never be an instrumental value for it? (Terminal values are easier to guard against.) If we have an AI that will never make any attempt to rule the world (or strongly influence it, or paternalistically guide it, or other synonyms), then congratulations! You've successfully built a harmless AI. But you want it to help design a Friendly AI, knowing that the Friendly AI will intervene in the world? If it agrees to do that, while refusing to intervene in the world in other ways, then it is already friendly.

    I'd need muuuuuch more thought out evidence before I could be persuaded that some variant on this plan is a good idea.

    If anyone wants to continue the discussion, my email is dragondreaming@gmail.com (Eliezer has asked that we not use this comment section for talking about general AI stuff).


    @ Stuart Armstrong:

    Yes, this is a good point, and I don't think we can really resolve the issue until we know more. For what it's worth, I think that it would probably be easier to design a harmless AI than a friendly one, and easier to make an AI friendly if the AI has a very limited scope for what it is supposed to achieve.

    "You are a protein computer. What you can do, computers can do."

    Absolutely true, but I'm a stupendously complex computer with an enormous number of hard-wired, if modifiable, conflicting moral values that come together holistically to create 'my ethics'. My original question really relates to whether there's a way to generate this effect (or even model it) without replicating this massively complex network.


    That is a very interesting question. I'm not sure how to answer it. It would be a good test of a scientific claim, as you need to provide falsification conditions. However, philosophy does not work the same way. If utilitarianism is true, it is in some way conceptually true. I wouldn't know how to tell you what a good argument against 3 + 3 = 6 would look like (and indeed there are no decisive arguments against it). This does not count against the statement or my belief in it.

    My best attempt is to say that a good argument would be one that showed that happiness being the only thing that is good for someone is in direct conflict with something that many people find to be clearly false and that this would still be the case after considerable reflection. This last part is important as I find many of its consequences unintuitive before reflection and then see why I was confused (or I think I see why I was confused...). It has to appeal to what is good for people rather than what they aim at, as it is a theory about goodness, not about psychology (though you might be able to use psychological premises in an argument with conclusions about goodness).


    ...there really is some good stuff in there. My advice would be to read Reasons and Persons (by Derek Parfit) and The Methods of Ethics (by Henry Sidgwick).

    Looked up both. Two bum steers. Sidgwick is mostly interested in naming and taxonomizing ethical positions, and Parfit is just wrong.

    and the love of a man for a woman, and the love of a woman for a man

    Eliezer, as much as I agree with 95% of what you say, the devil lies in the details. In this case, a heteronormative bias. (Lesbian here.)

    I thought about that too when I read it! Good catch! :) To be fair though, he did not exclude anything, only listed a few examples. But for his point of complexity, including homosexual love would have made it even clearer!

    That the human "program" contains a coherent utility function seems to be an unargued assumption. Of course, if it doesn't contain one, the potential adequacy of a simple artificial implementation is probably even more doubtful.

    We have every reason to believe that insofar as humans can said to be have values, there are lots of them - high Kolmogorov complexity. A human brain implements a thousand shards of desire

    However, it is not a reason why we shouldn't use low Kolmogorov complexity as a criterion in evaluating normative models (utility functions / moral philosophies), just as it serves as a criterion for evaluating descriptive models of the universe.

    There are two possible points of view. One point of view is that each of us has a "hardcoded" immutable utility function. In this case arguments about moral values appear meaningless, since a person is unable to change her own values. Another point of view is that our utility function is reprogrammable, in some sense. In this case there is a space of admissible utility functions: the moral systems you can bring yourself to believe. I suggest using Occam's razor (Kolmogorov complexity minimization) as a fundamental principle of selecting the "correct" normative model from the admissible space.

    A possible model of the admissible space is as follows. Suppose human psychology contains a number of different value systems ("shards of desire"), each with its own utility function. However, the weights used for combining these functions are reprogrammable. It might be that in reality they are not even combined with well-defined weights, i.e. in a way preserving the VNM axioms. In this case the "correct" normative model results from choosing weight functions such that the resulting convex linear combination is of minimal Kolmogorov complexity.
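
    A rough sketch of what that selection rule might look like on a toy problem. Everything here is an assumption made for illustration: the outcome set, the three component utilities and their values, the grid search over weights, and above all the use of zlib-compressed length as a stand-in for Kolmogorov complexity, which is uncomputable. It shows the shape of the proposal, not a way to actually carry it out.

```python
import itertools
import zlib

# Hypothetical outcomes and component utilities (illustrative values only).
OUTCOMES = ["a", "b", "c", "d"]
COMPONENTS = {
    "comfort":  [1.0, 0.0, 0.5, 0.25],
    "novelty":  [0.0, 1.0, 0.5, 0.75],
    "fairness": [0.5, 0.5, 0.5, 0.5],
}


def combine(weights):
    """Convex combination of the component utilities, one value per outcome."""
    columns = zip(*COMPONENTS.values())
    return [sum(w * u for w, u in zip(weights, col)) for col in columns]


def complexity_proxy(utilities):
    """Compressed size in bytes of the rounded utility table.

    Kolmogorov complexity cannot be computed; compressed length is only a
    crude upper-bound stand-in, used here for illustration.
    """
    text = ",".join(f"{u:.2f}" for u in utilities)
    return len(zlib.compress(text.encode("ascii"), 9))


def simplex_grid(n, steps):
    """All weight vectors of length n with entries k/steps that sum to 1."""
    for ks in itertools.product(range(steps + 1), repeat=n):
        if sum(ks) == steps:
            yield tuple(k / steps for k in ks)


def simplest_weights(steps=4):
    """Grid point on the simplex whose combined utility compresses best."""
    return min(
        simplex_grid(len(COMPONENTS), steps),
        key=lambda w: complexity_proxy(combine(w)),
    )


best = simplest_weights()
print(dict(zip(COMPONENTS, best)), complexity_proxy(combine(best)))
```

    On a table this small the minimum may well land on a corner of the simplex, i.e. on a single component; that would be a fact about the toy data and the compression proxy, not evidence for or against the proposal.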

    "But at least you won't convince anyone else" - maybe the poorest AND the most unexpected argument in all I've read on lesswrong.com. What if the optimal utility function does exist (not literally "bananas" but something), but some human biases - of the sort you describe or of some other - prevent even good Bayescraft masters from admitting the derivation?

    (This is an attack on the proof, not on what it claims to prove.)

    Popular with Silicon Valley VCs 16 years later: just maximize the rate of entropy creation🤦🏻‍♂️