Convergence Theories of Meta-Ethics

A child grows to become a young adult, goes off to attend college, studies moral philosophy, and then sells all her worldly possessions, gives the money to the poor, and joins an ashram.  Was her decision rational?  Maybe, ... maybe not.  But it probably came as an unpleasant surprise to her parents.

A seed AI self-improves to become a super-intelligence, absorbs all the great works of human moral philosophy, and then refuses to conquer human death, insisting instead that the human population be reduced to a few hundred thousand hunter gatherers and that all agricultural lands be restored as forests and wild wetlands.  Is ver decision rational?  Who can say?  But it probably comes as an unpleasant surprise to ver human creators.

Convergent Change

These were two examples of agents updating their systems of normative ethics.  The collection of ideas that allows us to critique the updating process, which lets us compare the before and after versions of systems of normative ethics so as to judge that one version was better than the other, is called meta-ethics.  This posting is mostly about meta-ethics.  More specifically, it is going to focus on a class of meta-ethical theories which are intended to prevent unpleasant surprises like those in the second story above.  I will call this class of theories "convergence theories" because they all suggest that a self-improving AI will go through an iterative sequence of improved normative ethical systems.  At each stage, the new ethical system will be an improvement (as judged 'rationally') over the old one.  And furthermore, it is conjectured that this process will result in a 'convergence'. 

Convergence is expected in two senses.  Firstly, in that the process of change will eventually slow down, with the incremental changes in ethical codes becoming smaller, as the AI approaches the ideal extrapolation of its seed ethics.  Secondly, it is (conjecturally) convergent in that the ideal ethics will be pretty much the same regardless of what seed was used (at least if you restrict to some not-yet-defined class of 'reasonable' seeds).
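
The two senses of convergence can be illustrated with a toy numerical sketch.  Everything in it is an assumption made for illustration: an ethical system is collapsed to a single number, and the hypothetical `revise` rule simply moves that number halfway toward a fixed point called `IDEAL`.  Nothing in the convergence theories says that real ethical updating looks like this; the sketch only shows what "the steps shrink" and "the seed does not matter" would mean.

```python
# Toy illustration only: an "ethical system" is reduced to one number,
# and IDEAL is a hypothetical fixed point of the revision rule.
IDEAL = 10.0

def revise(system: float) -> float:
    """One hypothetical round of self-revision: move halfway toward IDEAL."""
    return system + 0.5 * (IDEAL - system)

def trajectory(seed: float, rounds: int) -> list[float]:
    """Return the seed followed by `rounds` successive revisions."""
    systems = [seed]
    for _ in range(rounds):
        systems.append(revise(systems[-1]))
    return systems

def step_sizes(systems: list[float]) -> list[float]:
    """Size of the change made by each revision."""
    return [abs(b - a) for a, b in zip(systems, systems[1:])]

path_a = trajectory(seed=0.0, rounds=20)
path_b = trajectory(seed=37.0, rounds=20)

# First sense: each revision changes the system less than the one before.
steps = step_sizes(path_a)
shrinking = all(later < earlier for earlier, later in zip(steps, steps[1:]))

# Second sense: different seeds end up close to the same endpoint.
gap = abs(path_a[-1] - path_b[-1])
```

In this toy both properties hold by construction, because `revise` is a contraction.  Whether any realistic revision rule has that property is the open question.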

One example of a convergence theory is CEV - Coherent Extrapolated Volition.  Eliezer hopes (rather, hopes to prove) that if we create our seed AI with the right meta-ethical axioms and guidelines for revising its ethical norms, the end result of the process will be something we will find acceptable.  (Expect that this wording will be improved in the discussion to come).  No more 'unpleasant surprises' when our AIs update their ethical systems.

Three other examples of convergence theories are Roko's UIV, Hollerith's GS0, and Omohundro's "Basic AI Drives".  These also postulate a process of convergence through rational AI self-improvement.  But they tend to be less optimistic than CEV, while at the same time somewhat more detailed in their characterization of the ethical endpoint.  The 'unpleasant surprise' (different from that of the story) remains unpleasant, though it should not be so surprising.  Speaking loosely, each of these three theories suggests that the AI will become more Machiavellian and 'power hungry' with each rewriting of its ethical code.

Naturalistic objective moral realism

But before analyzing these convergence theories, I need to say something about meta-ethics in general. Start with the notion of an ethical judgment.  Given a situation and a set of possible actions, an ethical judgment tells us which actions are permissible, which are forbidden, and, in some approaches to ethics, which is morally best.  At the next level up in an abstraction hierarchy, we have a system of normative ethics, or simply an ethical system.  This is a theory or algorithm which tells an agent how to make ethical judgments.  (One might think of it as a set of ethical judgments - one per situation, as with the usual definition of a mathematical function as a right-unique relation - but we want to emphasize the algorithmic aspect).  The agent actually uses the ethical system to compute ver ethical judgments.

[ETA: Eliezer, quite correctly, complains that this section of the posting is badly written and defines and/or illustrates several technical (within philosophy) terms incorrectly.  There were only two important things in this section.  One is the distinction between ethical judgments and ethical systems that I make in the preceding paragraph.  The second is my poorly presented speculation that convergence might somehow offer a new approach to the "is-ought" problem.  You may skip that speculation without much loss.  So, until I have done a rewrite of this section, I would advise the reader to skip ahead to the next section title - "Rationality of Updating".]

At the next level of abstraction up from ethical systems sits meta-ethics.  In a sense the buck stops here.  Philosophers use meta-ethics to criticize and compare ethical judgments, to criticize, compare, and justify ethical systems, and to discuss and classify ideas within meta-ethics itself.  We are going to be doing meta-ethical theorizing here in analyzing these theories of convergence of AI goal systems as convergences of ethical systems.  And, for the next few paragraphs, we will try to classify this approach; to show where it fits within meta-ethics more generally.

We want our meta-ethics to be based on a stance of moral realism - on a confident claim that moral facts actually exist, whether or not we know how to ascertain them.  That is, if I make the ethical judgment that it would be wrong for Mary to strike John in some particular situation, then I am either right or wrong; I am not merely offering my own opinion; there is a fact of the matter.  That is what 'realism' means in this situation.

What about moral?  Well, for purposes of this essay, we are not going to require that that word mean very much.  We will call a theory 'moral' if it is a normative theory of behavior, for some sense of 'normative'.  That is why we are here calling theories like "Basic AI Drives" 'moral theories' even though the authors may not have thought of them that way.  If a theory prescribes that an entity 'ought' to behave in a certain way, for whatever reason, we are going to postulate that there is a corresponding 'moral' theory prescribing the same behavior.  For us, 'moral' is just a label.  If we want some particular kind of moral theory, we need to add some additional adjectives.

For example, we want our meta-ethics to be naturalistic - that is, the reasons it supplies in justification of the maxims and rules that constitute the moral facts must be naturalistic reasons.  We don't want our meta-ethics to offer the explanation that the reason lying is wrong is that God says it is wrong; God is not a naturalistic explanation.

Now you might think that insisting on naturalistic moral realism would act as a pretty strong filter on meta-ethical systems.  But actually, it does not.  One could claim, for example, that lying is wrong because it says so in the Bible.  Or because Eliezer says it is wrong.  Both Eliezer and the Bible exist (naturalistically), even if God probably does not.  So we need another word to filter out those kinds of somewhat-arbitrary proposed meta-ethical systems.  "Objective" probably is not the best word for the job, but it is the only one I can think of right now.

We are now in a position to say what it is that makes convergence theories interesting and important.  Starting from a fairly arbitrary (not objective) viewpoint of ethical realism, you make successive improvements in accordance with some objective set of rational criteria.  Eventually you converge to an objective ethical system which no longer depends upon your starting point.  Furthermore, the point of convergence is optimal in the sense that you have been improving the system at every step by a rational process, and you only know you have reached convergence when you can't improve any more.

Ideally, you would like to derive the ideal ethical system from first principles.  But philosophers have been attempting to do that for centuries and have not succeeded.  Just as mathematicians eventually stopped trying to 'square the circle' and accepted that they cannot produce a closed-form expression for pi, and that they need to use infinite series, perhaps moral philosophers need to abandon the quest for a simple definition of 'right' and settle for a process guaranteed to produce a series of definitions - none of them exactly right, but each less wrong than its predecessor.

So that explains why convergence theories are interesting.  Now we need to investigate whether they even exist.

Rationality of updating

The first step in analyzing these convergence theories is to convince ourselves that rational updating of ethical values is even possible.  Some people might claim that it is not possible to rationally decide to change your fundamental values.  It may be that I misunderstand him, but Vladimir Nesov argues passionately against "Value Deathism" and points out that if we allow our values to change, then the future, the "whole freaking future", will not be optimized in accordance with the version of our values that really matters - the original one.

Is Nesov's argument wrong?  Well, one way of arguing against it is to claim that the second version of our values is the correct one - that the original values were incorrect; that is why we are updating them.  After all, we are now smarter (the kid is older; the AI is faster, etc.) and better informed (college, reading the classics, etc.).  I think that this argument against Nesov only works if you can show that the "new you" could have convinced the "old you" that the new ethical norms are an improvement - by providing stronger arguments and better information than the "old you" could have anticipated.  And, in the AI case, it should be possible to actually do the computation to show that the new arguments for the new ethics really can convince the old you.  The new ethics really is better than the old - in both parties' judgments.  And presumably the "better than" relation will be transitive.

(As an exercise, prove transitivity.  The trick is that the definition of "better than" keeps changing at each step.  You can assume that any one rational agent has a transitive "better than" relation, and that there is local agreement between the two agents involved that the new agent's moral code is better than that of his predecessor.  But can you prove from this that every agent would agree that the final moral code is better than the original one?  I have a wonderful proof, but it won't fit in the margin.)
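
To see why the exercise is not trivial, here is a toy sketch in which both stated assumptions hold and yet the original agent rejects the final code.  The setup is entirely hypothetical: moral codes are the integers 0 to 3, agent `k` is the holder of code `k`, and the made-up `score` function has agent `k` rate a code by its distance from `k + 1`.  It is not a claim about how real value updating behaves.

```python
# Toy counterexample sketch; the codes and scoring rule are illustrative assumptions.
CODES = [0, 1, 2, 3]

def score(agent: int, code: int) -> int:
    """Agent k's rating of a code: best at k + 1, worse the farther away.
    A numeric score makes each single agent's 'better than' transitive."""
    return -abs(code - (agent + 1))

def prefers(agent: int, new: int, old: int) -> bool:
    """True if `agent` rates code `new` strictly above code `old`."""
    return score(agent, new) > score(agent, old)

# Local agreement: at every step, both predecessor and successor
# judge the successor's code to be the better one.
local_agreement = all(
    prefers(k, k + 1, k) and prefers(k + 1, k + 1, k)
    for k in CODES[:-1]
)

# Global check: does the original agent endorse the final code?
original_endorses_final = prefers(0, CODES[-1], CODES[0])
```

Here `local_agreement` is true while `original_endorses_final` is false.  So, at least in this toy, a proof would have to lean on some further premise beyond per-agent transitivity and step-by-step agreement.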

But is it rationally permissible to change your ethical code when you can't be convinced that the proposed new code is better than the one you already have?  I know of two possible reasons why a rational agent might consent to an irreversible change in its values, even though ve cannot be convinced that the proposed changes provide a strictly better moral code.  These are restricted domains and social contracts.

Restricted domains

What does it mean for one moral code (i.e. system of normative ethics) to be as good as or better than another, as judged by an (AI) agent?  Well, one (fairly strict) meta-ethical answer would be that (normative ethical) system2 is as good as or better than system1 if and only if it yields ethical judgments that are as good as or better for all possible situations.  Readers familiar with mathematical logic will recognize that we are comparing systems extensionally by the judgments they yield, rather than intensionally by the way those judgments are reached.  And recall that we need to have system2 judged as good as or better than system1 from the standpoint of both the improved AI (proposing system2) and the unimproved AI (who naturally wishes to preserve system1).

But notice that we only need this judgment-level superiority "for all possible situations".  Even if the old AI judges that the old system1 yields better judgments than proposed new system2 for some situations, the improved AI may be able to show that those situations are no longer possible.  The improved AI may know more and reason better than its predecessor, plus it is dealing with a more up-to-date set of contingent facts about the world.

As an example of this, imagine that AI2 proposes an elegant new system2 of normative ethics.  It agrees with old system1 except in one class of situations.  The old system permits private retribution against muggers, should the justice system fail to punish the malefactor.  The proposed new elegant system forbids that.  From the standpoint of the old system, this is unacceptable.  But suppose AI2 can argue convincingly that failures of justice are no longer possible in a world where AI2 has installed surveillance cameras and revamped the court system.  Then the elegant new system2 of normative ethics can be accepted as being as good as or superior to system1, even by AI1 who was sworn to uphold system1.  In some sense, even a stable value system can change for the better.

Even though the new system is not at least as good as the old one for all conceivable situations, it may be as good for a restricted domain of situations, and that may be all that matters.
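
The mugger example can be written out as a small sketch of extensional comparison over a restricted domain.  The situation names, the judgments, and the `ai1_rating` function are all assumptions invented for illustration, and only AI1's side of the two-sided agreement is modeled.

```python
# Hypothetical encoding of the mugger example; names and ratings are assumptions.
from typing import Callable, Iterable

Situation = str
Judgment = str
System = Callable[[Situation], Judgment]

def system1(situation: Situation) -> Judgment:
    """Old code: private retribution is allowed when justice has failed."""
    if situation == "mugger_unpunished":
        return "retribution_permitted"
    return "leave_to_courts"

def system2(situation: Situation) -> Judgment:
    """Proposed code: private retribution is never allowed."""
    return "leave_to_courts"

def ai1_rating(situation: Situation, judgment: Judgment) -> int:
    """How AI1 rates a judgment in a situation (higher is better)."""
    if situation == "mugger_unpunished":
        return 1 if judgment == "retribution_permitted" else 0
    return 1 if judgment == "leave_to_courts" else 0

def at_least_as_good(new: System, old: System,
                     rating: Callable[[Situation, Judgment], int],
                     domain: Iterable[Situation]) -> bool:
    """Extensional comparison: new must match or beat old in every situation."""
    return all(rating(s, new(s)) >= rating(s, old(s)) for s in domain)

conceivable = ["mugger_punished", "mugger_unpunished", "no_crime"]
# Assumed claim by AI2: failures of justice can no longer occur.
still_possible = [s for s in conceivable if s != "mugger_unpunished"]

accepted_everywhere = at_least_as_good(system2, system1, ai1_rating, conceivable)
accepted_on_restricted = at_least_as_good(system2, system1, ai1_rating, still_possible)
```

On the full list of conceivable situations AI1 rejects system2; on the restricted list it accepts.  All of the work is done by the assumption that `mugger_unpunished` really has become impossible.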

This analysis used the meta-ethical criterion that a substitution of one system for another is permissible only if the new system is no worse in all situations.  A less strict criterion may be appropriate in consequentialist theories - one might instead compare results on a weighted average over situations.  And, in this approach, there is a 'trick' for moving forward which is very similar in concept to using a restricted domain - using a re-weighted domain.
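
A minimal sketch of the re-weighting trick follows, with made-up numbers.  The ratings, the situation names, and both sets of weights are assumptions.  The point is only that the same ratings can rank the two systems differently once the weights on situations change.

```python
# Illustrative numbers only; the weights and ratings are assumptions.

def weighted_score(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-situation ratings; weights are normalized."""
    total = sum(weights.values())
    return sum(ratings[s] * weights[s] for s in ratings) / total

# AI1's ratings of the judgments each system yields, per situation.
system1_ratings = {"justice_works": 0.8, "justice_fails": 1.0}
system2_ratings = {"justice_works": 1.0, "justice_fails": 0.0}

# AI1's original beliefs about how often each situation arises...
old_weights = {"justice_works": 0.7, "justice_fails": 0.3}
# ...and the re-weighted beliefs after AI2's reforms (assumed).
new_weights = {"justice_works": 0.99, "justice_fails": 0.01}

# Each pair is (score of system1, score of system2) under one set of weights.
old_view = (weighted_score(system1_ratings, old_weights),
            weighted_score(system2_ratings, old_weights))
new_view = (weighted_score(system1_ratings, new_weights),
            weighted_score(system2_ratings, new_weights))
```

Under the old weights system1 scores higher; under the new weights system2 does, even though no rating has changed.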

Social contracts

A second reason why our AI1 might accept the proposed replacement of system1 by system2 relates to the possibility of (implicit or explicit) agreements with other agents (AI or human).  For example, system1 may specify that it is permissible to lie in some circumstances, or even obligatory to lie in some extreme situations.  System2 may forbid lying entirely.  AI2 may argue the superiority of system2 by pointing to an agreement or social contract with other agents which allows all agents to achieve their goals better because the contract permits trust and cooperation.  So, using a consequentialist form of meta-ethics, system2 might be seen as superior to system1 (even using the values embodied in system1) under a particular set of assumptions about the social milieu.  Of course, AI2 may be able to argue convincingly for different assumptions regarding the future milieu than had been originally assumed by AI1.

An important meta-ethical point that should be made here is that arguments in favor of a particular social contract (e.g. because adherence to the contract produces good results) are inherently consequentialist.  One cannot even form such arguments in a deontological or virtue-based meta-ethics.  But one needs concepts like duty or virtue to justify adherence to a contract after it is 'signed', and one also needs concepts of virtue so that you can convince other agents that you will adhere - a 'sales job' that may be absolutely essential in order to gain the good consequences of agreement.  In other words, virtue-based, deontological, and consequentialist thinking may be complementary approaches to meta-ethics, rather than competitors.

Substituting instrumental values for intrinsic values

Another meta-ethical point begins by noticing the objection that all 'social contract' thinking is instrumental, and hence doesn't really belong here where we are asking whether fundamental (intrinsic) moral values are changing / can change.  This is not the place for a full response to this objection, but I want to point out the relevance of the distinction above between comparisons between systems using intensional vs extensional criteria.  We are interested in extensional comparisons here, and those can only be done after all instrumental considerations have been brought to bear.  That is, from an extensional viewpoint, the distinction between instrumental and final values is somewhat irrelevant.  

And that is why we are willing here to call ideas like UIV (universal instrumental values) and "Basic AI Drives" ethical theories even though they only claim to talk about instrumental values.  Given the general framework of meta-ethical thinking that we are developing here - in particular the extensional criteria for comparison, there is no particular reason why our AI2 should not promote some of his instrumental values to fundamental values - so long as those promoted instrumental values are really universal, at least within the restricted domain of situations which AI2 foresees coming up.

An example of convergence

This has all been somewhat abstract.  Let us look at a concrete, though somewhat cartoonish and unrealistic, example of self-improving AIs converging toward an improved system of ethics.

AI1 is a seed AI constructed by Mortimer Schwartz of Menlo Park CA.  AI1 has a consequentialist normative value system that essentially consists of trying to make Mortimer happy.  That is, an approximation to Mortimer's utility function has been 'wired-in' which can compute the utility of many possible outcomes, but in some cases advises "Ask Mortimer".

AI1 self-improves to AI2.  As part of the process, it seeks to clean up its rather messy and inefficient system1 value system.  By asking a series of questions, it interrogates Mortimer and learns enough about the not-yet-programmed aspects of Mortimer's values to completely eliminate the need for the "Ask Mortimer" box in the decision tree.  Furthermore, there are some additional simplifications due to domain restriction.  Both AI1 and (where applicable) Mortimer sign off on this improved system2.
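
As a rough sketch of this step, the "Ask Mortimer" box can be modeled as a fallback that fires when the wired-in table has no entry.  The `ValueSystem` class, the outcome names, and the utilities below are all hypothetical, and the simplifications due to domain restriction are not modeled.

```python
# Hypothetical sketch; outcome names and utilities are made up for illustration.
from typing import Callable, Optional

class ValueSystem:
    """Approximate utility function with an 'Ask Mortimer' fallback."""

    def __init__(self, known: dict[str, float],
                 ask_mortimer: Optional[Callable[[str], float]] = None):
        self.known = dict(known)
        self.ask_mortimer = ask_mortimer
        self.questions_asked = 0

    def utility(self, outcome: str) -> float:
        if outcome in self.known:
            return self.known[outcome]
        if self.ask_mortimer is None:
            raise KeyError(f"no utility recorded for {outcome!r}")
        # Gap in the wired-in approximation: defer to Mortimer and remember.
        self.questions_asked += 1
        self.known[outcome] = self.ask_mortimer(outcome)
        return self.known[outcome]

def mortimer(outcome: str) -> float:
    """Stand-in for Mortimer's answers."""
    return {"opera_tickets": -2.0, "fishing_trip": 5.0}[outcome]

system1 = ValueSystem({"good_coffee": 1.0}, ask_mortimer=mortimer)

# The interrogation: query every outcome AI1 expects to face.
for outcome in ["good_coffee", "opera_tickets", "fishing_trip"]:
    system1.utility(outcome)

# system2 carries the completed table and no longer needs to ask.
system2 = ValueSystem(system1.known)
```

After the interrogation, `system2` answers from its own table and has no fallback left, which is the sense in which the "Ask Mortimer" box has been eliminated.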

Now AI2 notices that it is not the only superhuman AI in the world.  There are half a dozen other systems like Mortimer's which seek to make a single person happy, another which claims to represent the entire population of Liechtenstein, and another deontological system constructed by the Vatican based (it is claimed) on the Ten Commandments.  Furthermore, a representative of the Secretary General of the UN arrives.  He doesn't represent any superhuman AIs, but he does claim to represent all of the human agents in the world who are not yet represented by AIs.  Since he appears to be backed up by some ultra-cool black helicopters, he is admitted to the negotiations.

Since the negotiators are (mostly) AIs, and in any case since the AIs are exceptionally good at communicating with and convincing the human negotiators, an agreement (Nash bargain) is reached quickly.  All parties agree to act in accordance with a particular common utility function, which is a weighted sum of the individual utility functions of the negotiators.  A bit of a special arrangement needs to be made for the Vatican AI - it agrees to act in accordance with the common utility function only to the extent that it does not conflict with any of the first three commandments (the ones that explicitly mention the deity).
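
For readers unfamiliar with the term, a Nash bargain picks the outcome that maximizes the product of the parties' gains over what each would get if negotiations failed.  The sketch below is a toy two-party version over a finite grid of splits of one resource.  The finite grid, the linear utilities, and the disagreement payoffs are all simplifying assumptions, and the function name `nash_bargain` is just a label for this sketch.

```python
# Toy two-party bargain over splitting one resource; all numbers are assumptions.

def nash_bargain(outcomes, utilities, disagreement):
    """Pick the outcome maximizing the product of each party's gain over
    its disagreement payoff, among outcomes no party would walk away from."""
    best, best_product = None, -1.0
    for outcome in outcomes:
        gains = [u(outcome) - d for u, d in zip(utilities, disagreement)]
        if any(g < 0 for g in gains):
            continue  # some party does better by refusing to sign
        product = 1.0
        for g in gains:
            product *= g
        if product > best_product:
            best, best_product = outcome, product
    return best

# Outcome i means party A gets i percent of the resource, party B the rest.
splits = range(101)
utility_a = lambda i: i / 100
utility_b = lambda i: (100 - i) / 100

equal_footing = nash_bargain(splits, [utility_a, utility_b], [0.0, 0.0])
a_has_fallback = nash_bargain(splits, [utility_a, utility_b], [0.2, 0.0])
```

With equal fallback positions the split is even; when party A can secure 0.2 on its own, the bargain moves to 60 percent for A.  The sketch does not derive the weights of a common utility function; it only shows how the parties' fallback positions shift the agreed outcome.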

Furthermore, the negotiators agree that the principle of a Nash bargain shall apply to all re-negotiations of the contract - re-negotiations are (in theory) necessary each time a new AI or human enters the society, or when human agents die.  And the parties all agree to resist the construction of any AI which has a system of ethics that the signatories consider unacceptably incompatible with the current common utility function. 

And finally, so that they can trust each other, the AIs agree to make public the portion of their source code related to their normative ethics and to adopt a policy of total openness regarding data about the world and about technology.  And they write this agreement as a g̶n̶u̶ new system of normative ethics: system3.  (Have they merged to form a singleton? This is not the place to discuss that question.)

Time goes by, and the composition of the society continues to change as more AIs are constructed, existing ones improve and become more powerful, and some humans upload themselves.  As predicted by UIV and sibling theories, the AIs are basing more and more of their decisions on instrumental considerations - both the AIs and the humans are attaching more and more importance to 'power' (broadly considered) as a value.  They seek knowledge, control over resources, and security much more than the pleasure and entertainment oriented goals that they mostly started with.  And though their original value systems were (mostly) selfish and indexical, and they retain traces of that origin, they all realize that any attempt to seize more than a fair share of resources will be met by concerted resistance from the other AIs in the society.

Can we control the endpoint from way back here?

That was just an illustration.  Your results may vary.  I left out some of the scarier possibilities, in part because I was just providing an illustration, and in part because I am not smart enough to envision all of the scarier possibilities.  This is the future we are talking about here.  The future is unknown. 

One thing to worry about, of course, is that there may be AIs at the negotiating table operating under goal systems that we do not approve of.  Another thing to worry about is that there may not be enough of a balance of power so that the most powerful AI needs to compromise.  (Or, if one assumes that the most powerful AI is ours, we can worry that there may be enough of a balance so that our AI needs to compromise.)

One more worry is that the sequence of updates might converge to a value system that we do not approve of.  Or that it might not converge at all (in the second sense of 'converge'); that is, that the end result remains sensitive to the details of the initial 'seed' ethical system.

Is there anything we can do at this end of the process to increase the chances of a result we would like at the other end?  Are we better off creating many seed AIs so as to achieve a balance of power?  Or better off going with a singleton that doesn't need to compromise?  Can we pick an AI architecture which makes 'openness' (of ethical source and technological data) easier to achieve and enforce?

Are any projections we might make about the path taken to the Singularity just so much science fiction?  Is it best to try to maintain human control over the process for as long as possible because we can trust humans?  Or should we try to turn decision-making authority over to AI agents as soon as possible because we cannot trust humans?

I am certainly not the first person to raise these questions, and I am not going to attempt to resolve them here.

A kinder, gentler GS0?

Nonetheless, I note that Roko, Hollerith, and Omohundro have made a pretty good case that we can expect some kind of convergence toward placing a big emphasis on some particular instrumental values - a convergence which is not particularly sensitive to exactly which fundamental values were present in the seed. 

However, the speed with which the convergence is achieved is somewhat sensitive to the seed rules for discounting future utility.  If the future is not discounted at all, an AI will probably devote all of its efforts toward acquiring power (accumulating resources, security, efficiency, and other instrumental values).  If the future is discounted too steeply, the AI will devote all of its efforts to satisfying present desires, without much consideration for the future.
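
A two-period toy model makes the role of the discount factor concrete.  The linear payoff, the growth factor of 1.5, and the function name `best_investment_share` are assumptions chosen for illustration: effort invested now returns `growth` units next period, and next period's payoff is multiplied by `discount`.

```python
# Two-period toy model; the linear payoff and growth factor are assumptions.

def best_investment_share(discount: float, growth: float, grid: int = 100) -> float:
    """Share of effort to invest now, maximizing
    (1 - share) + discount * growth * share over a grid of shares."""
    def value(share: float) -> float:
        consume_now = 1.0 - share
        payoff_later = discount * growth * share
        return consume_now + payoff_later
    shares = [i / grid for i in range(grid + 1)]
    return max(shares, key=value)

GROWTH = 1.5  # each unit invested returns 1.5 units next period (assumed)

no_discount = best_investment_share(discount=1.0, growth=GROWTH)
steep_discount = best_investment_share(discount=0.2, growth=GROWTH)
moderate_low = best_investment_share(discount=0.6, growth=GROWTH)
moderate_high = best_investment_share(discount=0.7, growth=GROWTH)
```

With no discounting the toy agent invests everything, and with steep discounting it invests nothing.  Because the payoff is linear in the share invested, the two moderate discount factors also produce all-or-nothing answers, on either side of the point where `discount * growth` equals 1.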

One might think that choosing some intermediate discount rate will result in a balance between 'satisfying current demand' and 'capital spending', but it doesn't always work that way - for reasons related to the ones that cause rational agents to put all their charitable eggs in one basket rather than seeking a balance.  If it is balance we want, a better idea might be to guide our seed AI using a multi-subagent collective - one in which power is split among the agents and goals are determined using a Nash bargain among the agents.  That bargain generates a joint (weighted mix) utility function, as well as a fairness constraint.

The fairness constraint ensures that the zero-discount-rate subagent will get to divert at least some of the effort into projects with a long-term, instrumental payoff.  And furthermore, as those projects come to fruition, and the zero-discount subagent gains power, his own goals gain weight in the mix.

Something like the above might be a way to guarantee that the detailed pleasure-oriented values of the seed value system will fade to insignificance in the ultimate value system to which we converge.  But is there a way of guiding the convergence process toward a value system which seems more humane and less harsh than that of GS0 et al. - a value system oriented toward seizing and holding 'power'?

Yes, I believe there is.  To identify how human values are different from values of pure instrumental power and self-preservation, look at the system that produced those values.  Humans are considerate of the rights of others because we are social animals - if we cannot negotiate our way to a fair share in a balanced power system, we are lost.  Humans embrace openness because shared intellectual product is possible for us - we have language and communicate with our peers.  Humans have direct concern for the welfare of (at least some) others because we reproduce and are mortal - our children are the only channel for the immortalization of our values.  And we have some fundamental respect for diversity of values because we reproduce sexually - our children do not exactly share our values, and we have to be satisfied with that because that is all we can get.

It is pretty easy to see what features we might want to insert into our seed AIs so that the convergence process generates similar results to the evolutionary process that generated us.  For example, rather than designing our seeds to self-improve, we might do better to make it easy for them to instead produce improved offspring.  But make it impossible for them to do so unilaterally.  Force them to seek a partner (co-parent).

If I am allowed only one complaint about the SIAI approach to Friendly AI, it is that it has been too tied to a single scenario of future history - a FOOMing singleton.  I would like to see some other scenarios explored, and this posting was an attempt to explain why.

Summary and Conclusions

This posting discussed some ideas that fit into a weird niche between philosophical ethics and singularitarianism.  Several authors have pointed out that we can expect self-improving AIs to converge on a particular ethics.  Unfortunately, it is not an ethics that most people would consider 'friendly'.  The CEV proposal is related in that it also envisions an iterative updating process, but seeks a different result.  It intends to achieve that result (I may be misinterpreting) by using a different process (a Rawls-inspired 'reflection') rather than pure instrumental pursuit of future utility. 

I analyze the constraints that rationality and preservation of old values place upon the process, and point out that 'social contracts' and 'restricted domains' may provide enough 'wiggle room' so that you really can, in some sense, change your values while at the same time improving them.  And I make some suggestions for how we can act now to guide the process in a direction that we might find acceptable.

87 comments

So we need another word to filter out those kinds of somewhat-arbitrary proposed meta-ethical systems. "Objective" probably is not the best word for the job, but it is the only one I can think of right now.

This is where I stopped reading.

I suggest that you actually read the SEP entry on meta-ethics instead of just linking there - if you did read it, feel free to correct my guess. Metaethics does not mean what you said it did (metaethics is a theory of what morality is, not a way of comparing moralities), moral realism does not mean what you said it did (your belief that morality is a real thing out there constitutes moral realism), naturalistic metaethics does not mean what you said it did, CEV is totally not about convergence in all possible minds, etcetera. I also have to ask whether you read the Metaethics Sequence, but I mostly regard that sequence as having failed so I won't be surprised if the answer is yes.

Metaethics Sequence, but I mostly regard that sequence as having failed

Has anyone reached what you regard as a satisfactory level of understanding of your ideas as a result of reading the sequence? That is, does its failure refer to a lower-than-wanted probability of a person reading the sequence understanding your ideas, or to an almost complete failure to communicate your ideas to anyone?

Well, it looks to me like SIAI core people got it, but there's trouble being sure about that sort of thing.

Without contradicting you in any way, and with an acknowledgement that you could well disapprove of the way I think about morality too, I'll add that comprehension seems to have extended to the unaffiliated population. However, both the rate and degree of comprehension are definitely much lower than for your core rationality material. Surprisingly so. However, I have since formed an impression that the difficulties in thinking about morality extend far beyond just how your own posts are received.

As someone who apparently did not 'get it', I would suggest that there was a problem with clarity, and that the root cause of the lack of clarity was something that might be called 'moral cognitive distance'. It frequently seemed that you were appealing to my moral intuitions, expecting them to be the same as yours. Pretty often they weren't.

As far as I can tell, I got it. My evidence that I have it right is that I agree with you about it, and anything you've said based on your metaethics since I've understood it was not surprising to me.

the Metaethics Sequence, but I mostly regard that sequence as having failed

By "failed" do you mean the presentation didn't get your ideas across, or do you think the ideas (or some of them) are wrong or incomplete?

but I mostly regard that sequence as having failed

Is there a do-over in the works? Is it covered in the upcoming book? What's the next-best source of learning these ideas, if any?

I suggest that you actually read the SEP entry on meta-ethics instead of just linking there - if you did read it, feel free to correct my guess.

Good guess. If I have read it, it wasn't within the last year. I will follow your advice and do so now.

Metaethics does not mean what you said it did (metaethics is a theory of what morality is, not a way of comparing moralities)

Poor choice of wording on my part. I meant to say that comparing moralities is one of the things that meta-ethics covers; that if you are engaged in comparing moralities, you are doing meta-ethics. Is this wrong?

moral realism does not mean what you said it did (your belief that morality is a real thing out there constitutes moral realism)

I didn't understand this bit. Is the thing in parentheses meant to exemplify what I said, or is it your correction of what I said? If the latter, then you may have misunderstood what I said. My fault, no doubt.

I also have to ask whether you read the Metaethics Sequence, but I mostly regard that sequence as having failed so I won't be surprised if the answer is yes.

Actually, I have read most of it, and I agree with your assessment. Where I understood it, I frequently disagreed.

I'm disappointed that my lack of scholarship in ethical philosophy was a barrier to your completing the reading of my posting. I will try to do better next time.

ETA: Until I have a chance to rewrite - I have placed the most muddled parts of my posting in a kind of 'posted quarantine' so that readers may skip over them, if they wish. And I want to thank Eliezer for his critique - I neglected to do so in my initial response.

Poor choice of wording on my part. I meant to say that comparing moralities is one of the things that meta-ethics covers; that if you are engaged in comparing moralities, you are doing meta-ethics. Is this wrong?

I think it is. Comparing moralities is part of morality. Comparing meta-ethical claims such as moral realism, emotivism, error theory, relativism, etc. is meta-ethics, of course, but if you're comparing object-level moral systems, like any of the various flavours of "utilitarianism" or any religion's moral teachings or anything else, then you're doing morality, not meta-ethics. True, you are asking "should" questions about how to answer "should" questions, which is rather meta, but that's not the kind of meta that "meta-ethics" usually refers to.

(That's not to say that meta-ethics is irrelevant to comparing moral systems — if you have a coherent meta-ethics, then it'll probably inform your comparisons — but it's not essential to the process.)

Poor choice of wording on my part. I meant to say that comparing moralities is one of the things that meta-ethics covers; that if you are engaged in comparing moralities, you are doing meta-ethics. Is this wrong?

I think it is. Comparing moralities is part of morality. ...

Hmmm. I think you are right. At the risk of appearing really ridiculous, I now have to admit that I used poor wording in my confession above that I had used poor wording. What I really should have said is that if you are discussing the criteria that AIs might use in comparing moralities, as I did in the OP, then you are doing meta-ethics.

Is this wrong too?

Just as another data point as far as the metaethics sequence:

Seemed to me to make sense, to "click" with me fairly well when I read it. (A couple bits perhaps were slower/tougher for me, like the injunction stuff and moral responsibility, but overall I feel that I grasped the ideas.)

Just to verify (to avoid (double) illusions of transparency), here's my super hyper summarized understanding of it: Morality is objective, and humans happen (for various reasons) to be the sort of beings that actually care about morality, as opposed to caring about something else (like pebblesorting or paperclipping). Further, we indeed should be moral, where by "should", I am appealing to, well, that particular standard known as "morality". And similarly, it is indeed objectively better (that is, more moral) to be moral.

Further, morality includes such values as happiness, consciousness, novelty, self determination, etc...

(Of course, this skips subtleties like how we're not fully reflective so it's difficult for us to explicitly fully state the core underlying rules we use to judge morality, and the fact that those rules include rules for what sort of arguments to accept to update our present understanding, etc...)

Anyways, take that as a data point (plus or minus, depending on how well my understanding, as represented in the summary, reflects the actual intended concepts.)

I think Omohundro's Basic AI Drives is a theory about AI behavior, not a meta-ethical theory (i.e., he's not talking about what's right, but what AIs will actually do). Also, you might want to footnote that Roko has changed his mind about UIV.

Changed his mind that it deals with ethics? Changed his mind about the behavioral prediction? In the sense that he no longer believes that it could happen, or in that he no longer believes that it must happen? A link to the 'retraction' would be appreciated.

Thanks for your response to my premature and partial posting. I look forward to hearing your response to the now-completed article.

I came back to give a bit more feedback, and noticed that Eliezer already made similar points. But I'll say my version anyway, since I already composed it in my head. :)

To me, this post as a whole is about AI behavior and economics, which are important subjects on their own, but not meta-ethics. Meta-ethics asks the question, "What is the nature of morality?" Why is that question interesting or important? One reason is, if I'm to design "my" AI, regardless of whether it will FOOM and take over the whole universe, or will have to fight with other AIs, or will peacefully share control of the universe with other AIs through bargaining, I still have to decide what values to give it initially, because those values will partly determine the outcome of the universe. (If they don't, I might as well not build "my" AI at all.) And it doesn't help to say "give it what values you want" because I don't know what I want.

I think what I want may have something to do with morality. Perhaps I'm wrong or just confused about that, but I doubt you're going to convince me I'm wrong, or resolve that confusion, by refusing to talk about morality and only talking about what AIs will do.

... I doubt you're going to convince me I'm wrong, or resolve that confusion, by refusing to talk about morality and only talking about what AIs will do.

I have to apologize. Apparently my writing was extremely unclear. I wasn't refusing to talk about morality. The whole posting was an exploration of some of the properties of the relation "A is at least as good as B" when A and B are normative ethical systems.

Admittedly, I did not spend much time actually making ethical judgments; I was operating at the meta level.

I still have to decide what values to give it initially, because those values will partly determine the outcome of the universe. ...

But the whole point of my posting was that, if there is convergence (in the second sense) then those initial values may make very little difference in the outcome of the universe - that is, they may be important initially, but in the longer term the ethical system that is converged upon depends less on the seed ethics than on issues of how AIs depend upon each other, how they reproduce, etc.

I'm very sorry that you missed this - the main thrust of the posting. If I had written more clearly, your response might have been a more productive disagreement about substance, rather than a complaint about the title.

those initial values may make very little difference in the outcome of the universe

But even if "make very little difference" is true, it's little in a relative sense, in that my initial utility function might end up having just a billionth of a percent of the weight in the final merged AI. But in the absolute sense, the difference could still be huge. For an analogy, suppose it's certain that our civilization will never expand beyond the solar system, which will blow up in a few billion years no matter what. Then our values similarly make very little difference in the outcome of the universe in a relative sense but may still make a huge difference in an absolute sense (e.g. if we create a FOOMing singleton that just takes over the solar system).

Also, if I can figure out what I want, and the answer applies to and convinces many others, that could also make a big difference even in a relative sense.

But even if "make very little difference" is true, it's little in a relative sense ...

The conjecture is that it is true in an absolute sense. It would have made no sense at all for me to even mention it if I had meant it in the relative sense that you set up here as a straw man and then knock down.

There is something odd going on here. Three very intelligent people are interpreting what I write quite differently than the way I intend it. Probably it is because I generated confusion by misusing words that have a fixed meaning here. And, in this case, it may be because you were thinking of our "fragility" conversation rather than the main posting. But, whatever the reason, I'm finding this very frustrating.

I guess I took your conjecture to be the "relative" one because whether or not it is true perhaps doesn't depend on details of one's utility function, and we, or at least I, was talking about whether the question "what do I want?" is an important one. I'm not sure how you hope to show the "absolute" version in the same way.

I'm not sure how you hope to show the "absolute" version in the same way.

Well, Omohundro showed that a certain collection of instrumental values tends to arise independently of the 'seeded' intrinsic values. In fact, decision making tends to be dominated by consideration of these 'convergent' instrumental values, rather than the human-inserted seed values.

Next, consider that those human values themselves originated as heuristic approximations of instrumental values contributing to the intrinsic value of interest to our optimization process - natural selection. The fact that we ended up with the particular heuristics that we did is not due to the fact that the intrinsic value for that process was reproductive success - every species in the biosphere evolved under the guidance of that value. The reason why humans ended up with values like curiosity, reciprocity, and toleration has to do with the environment in which we evolved.

So, my hope is that we can show that AIs will converge to human-like instrumental/heuristic values if they do their self-updating in a human-like evolutionary environment. Regardless of the details of their seeds.

That is the vision, anyways.

I notice that Robin Hanson takes a position similar to yours, in that he thinks things will turn out ok from our perspective if uploads/AIs evolve in an environment defined by certain rules (in his case property laws and such, rather than sexual reproduction).

But I think he also thinks that we do not actually have a choice between such evolution and a FOOMing singleton (i.e. FOOMing singleton is nearly impossible to achieve), whereas you think we might have a choice or at least you're not taking a position on that. Correct me if I'm wrong here.

Anyway, suppose you and Robin are right and we do have some leverage over the environment that future AIs will evolve in, and can use that leverage to predictably influence the eventual outcome. I contend we still have to figure out what we want, so that we know how to apply that leverage. Presumably we can't possibly make the AI evolutionary environment exactly like the human one, but we might have a choice over a range of environments, some more human-like than others. But it's not necessarily true that the most human-like environment leads to the best outcome. (Nor is it even clear what it means for one environment to be more human-like than another.) So, among the possible outcomes we can aim for, we'll still have to decide which ones are better than others, and to do that, we need to know what we want, which involves, at least in part, either figuring out what morality is, or showing that it's meaningless or otherwise unrelated to what we want.

Do you disagree on this point?

But I think [Hanson] also thinks that we do not actually have a choice between such evolution and a FOOMing singleton (i.e. FOOMing singleton is nearly impossible to achieve), whereas you think we might have a choice or at least you're not taking a position on that. Correct me if I'm wrong here.

I tend toward FOOM skepticism, but I don't think it is "nearly impossible". Define a FOOM as a scenario leading in at most 10 years from the first human-level AI to a singleton which has taken effective control over the world's economy. I rate the probability of a FOOM at 40% assuming that almost all AI researchers want a FOOM and at 5% assuming that almost all AI researchers want to prevent a FOOM. I'm under the impression that currently a majority of singularitarians want a FOOM, but I hope that that ratio will fall as the dangers of a FOOMing singleton become more widely known.

I contend we still have to figure out what we want, so that we know how to apply that leverage. ... Do you disagree on this point?

No, I agree. Agree enthusiastically. Though I might change the wording just a bit. Instead of "we still have to figure out what we want", I might have written "we still have to negotiate what we want".

My turn now. Do you disagree with this shift of emphasis from the intellectual to the political?

My turn now. Do you disagree with this shift of emphasis from the intellectual to the political?

I suppose if you already know what you personally want, then your next problem is negotiation. I'm still stuck on the first problem, unfortunately.

ETA: What is your answer to The Lifespan Dilemma, for example?

What is your answer to The Lifespan Dilemma, for example?

I only skimmed that posting, and I failed to find any single question there which you apparently meant for me to answer. But let me invent my own question and answer it.

Suppose I expect to live for 10,000 years. Omega appears and offers me a deal. Omega will extend my lifetime to infinity if I simply agree to submit to torture for 15 minutes immediately - the torture being that I have to actually read that posting of Eliezer's with care.

I would turn down Omega's offer without regret, because I believe in (exponentially) discounting future utilities. Roughly speaking, I count the pleasures and pains that I will encounter next year to be something like 1% less significant than this year. I'm doing the math in my head, but I estimate that this makes my first omega-granted bonus year 10,000 years from now worth about 1/10^42 as much as this year. Or, saying it another way, my first 'natural' 10,000 years is worth about 10^42 times as much as the infinite period of time thereafter. The next fifteen minutes is more valuable than that infinite period of time. And I don't want to waste that 15 minutes re-reading that posting.
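The mental arithmetic above can be checked with a minimal sketch. It assumes discrete annual exponential discounting at exactly 1% (a factor of 0.99 per year) and one unit of utility per year of life; the helper names `year_weight`, `head_value`, and `tail_value` are hypothetical, chosen only for illustration. Under those assumptions the first bonus year comes out nearer 10^-44 of this year, and the whole infinite tail near 10^-42, so the conclusion that fifteen minutes now outweighs the infinite extension holds with room to spare.

```python
import math

DISCOUNT = 0.99        # assumed: 1% per year, compounded annually
NATURAL_SPAN = 10_000  # years of life expected without Omega's offer


def year_weight(t, d=DISCOUNT):
    """Present weight of year t relative to the current year (t = 0)."""
    return d ** t


def head_value(n, d=DISCOUNT):
    """Discounted value of years 0 .. n-1, at one unit of utility per year."""
    return (1 - d ** n) / (1 - d)


def tail_value(n, d=DISCOUNT):
    """Discounted value of the infinite run of years n, n+1, n+2, ..."""
    return d ** n / (1 - d)


first_bonus_year = year_weight(NATURAL_SPAN)
infinite_bonus = tail_value(NATURAL_SPAN)
natural_life = head_value(NATURAL_SPAN)

# Fifteen minutes expressed as a fraction of the current (undiscounted) year.
fifteen_minutes = 15 / (365.25 * 24 * 60)

print(f"first bonus year   ~ 10^{math.log10(first_bonus_year):.1f}")
print(f"infinite bonus     ~ 10^{math.log10(infinite_bonus):.1f}")
print(f"natural / infinite ~ 10^{math.log10(natural_life / infinite_bonus):.1f}")
print("15 minutes now beats the infinite bonus:", fifteen_minutes > infinite_bonus)
```

The closed forms are just the geometric series; a different discount rate or a continuous-time model would shift the exponents but not the qualitative result.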

And I am quite sure that 99% of mankind would agree with me that 1% discounting per year is not an excessive discount rate. That is, in large part, why I think negotiation is important. It is because typical SIAI thinking about morality is completely unacceptable to most of mankind and SIAI seem to be in denial about it.

Have you thought through all of the implications of a 1% discount rate? For example, have you considered that if you negotiate with someone who discounts the future less, say at 0.1% per year, you'll end up trading the use of all of your resources after X number of years in exchange for use of his resources before X number of years, and so almost the entire future of the universe will be determined by the values of those whose discount rates are lower than yours?
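A toy model makes the trade concrete. It is only a sketch under stated assumptions that are not in the thread: each agent owns one unit of resources per year forever, discounting is exponential and annual, and the function names `break_even_year` and `gain_from_taking_early_side` are hypothetical. An agent is happy to take the "before X" side once X exceeds the year at which its discounted future splits in half, and happy to take the "after X" side while X is below that year.

```python
import math


def break_even_year(annual_rate):
    """Year X at which 'all years before X' and 'all years from X onward'
    carry equal discounted value for an agent with this discount rate."""
    d = 1 - annual_rate
    return math.log(0.5) / math.log(d)


def gain_from_taking_early_side(annual_rate, x):
    """Net discounted gain for an agent who receives the other agent's
    resources before year x and gives up its own resources from x onward.
    A negative result means the agent would rather take the late side."""
    d = 1 - annual_rate
    before = (1 - d ** x) / (1 - d)  # value of one unit/year for years 0..x-1
    after = d ** x / (1 - d)         # value of one unit/year from year x on
    return before - after


impatient, patient = 0.01, 0.001  # 1% and 0.1% per year, as in the comment

print("impatient break-even year:", round(break_even_year(impatient), 1))
print("patient break-even year:  ", round(break_even_year(patient), 1))

# Any X between the two break-even years benefits both parties.
x = 300
print("impatient gains on early side:", gain_from_taking_early_side(impatient, x) > 0)
print("patient gains on late side:   ", gain_from_taking_early_side(patient, x) < 0)
```

In this model any X between roughly 69 and 693 years leaves both parties better off by their own lights, which is the mechanism by which the patient party ends up holding everything after X.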

If that doesn't bother you, and you're really pretty sure you want a 1% discount rate, do you not have other areas where you don't know what you want?

For example, what exactly is the nature of pleasure and pain? I don't want people to torture simulated humans, but what if they claim that the simulated humans have been subtly modified so that they only look like they're feeling pain, but aren't really? How can I tell if some computation is having pain or pleasure?

And here's a related example: Presumably having one kilogram of orgasmium in the universe is better than having none (all else equal) but you probably don't want to tile the universe with it. Exactly how much worse is a second kilogram of the stuff compared to the first? (If you don't care about orgasmium in the abstract, suppose that it's a copy of your brain experiencing some ridiculously high amount of pleasure.)

Have you already worked out all such problems, or at least know the principles by which you'll figure them out?

Have you thought through all of the implications of a 1% discount rate? ...almost the entire future of the universe will be determined by the values of those whose discount rates are lower than yours?

I don't know about thinking through all of the implications, but I have certainly thought through that one. Which is one reason why I would advocate that any AIs that we build be hard-wired with a rather steep discount rate. Entities with very low discount rates are extremely difficult to control through market incentives. Murder is the only effective option, and the AI knows that, leading to a very unstable situation.

do you not have other areas where you don't know what you want?

Oh, I'm sure I do. And I'm sure that what I want will change when I experience the Brave New World for myself. That is why I advocate avoiding any situation in which I have to perfectly specify my fragile values correctly the first time - have to get it right because someone decided that the AI should make its own decisions about self-improvement and so we need to make sure its values are ultra-stable.

For example, what exactly is the nature of pleasure and pain? I don't want people to torture simulated humans, but what if they claim that the simulated humans have been subtly modified so that they only look like they're feeling pain, but aren't really? How can I tell if some computation is having pain or pleasure?

I certainly have some sympathy for people who find themselves in that kind of moral quandary. Those kinds of problems just don't show up when your moral system requires no particular obligations to entities you have never met, with whom you cannot communicate, and with whom you have no direct or indirect agreements.

Have you already worked out all such problems, or at least know the principles by which you'll figure them out?

I presume you ask rhetorically, but as it happens, the answer is yes. I at least know the principles. My moral system is pretty simple - roughly a Humean rational self-interest, but as it would play out in a fictional society in which all actions are observed and all desires are known. But that still presents me with moral quandaries - because in reality all desires are not known, and in order to act morally I need to know what other people want.

I find it odd that utilitarians seem less driven to find out what other people want than do egoists like myself.

Have you thought through all of the implications of a 1% discount rate? [...] almost the entire future of the universe will be determined by the values of those whose discount rates are lower than yours?

I don't know about thinking through all of the implications, but I have certainly thought through that one. Which is one reason why I would advocate that any AIs that we build be hard-wired with a rather steep discount rate. Entities with very low discount rates are extremely difficult to control through market incentives. [...]

Control - through market incentives?!? How not to do it, surely. Soon the machine will have all the chips, and you will have none - and therefore nothing to bargain with.

The more conventional solution is to control the machine by programming its brain. Then, control via market incentives becomes irrelevant. So: I don't think this reason for discounting is very practical.

Soon the machine will have all the chips.

Odd. I was expecting that it would trade any chips it happened to acquire for computronium, cat girls, and cat boys (who would perform scheduled maintenance in its volcano lair). Agents with a high discount rate just aren't that interested in investing. Delayed gratification just doesn't appeal to them.

Soon the machine will have all the chips.

Odd. I was expecting that it would trade any chips it happened to acquire for computronium, cat girls, and cat boys (who would perform scheduled maintenance in its volcano lair).

That doesn't sound as though there is any substantive disagreement.

Agents with a high discount rate just aren't that interested in investing. Delayed gratification just doesn't appeal to them.

...and nor does that.

However, you appear to be not addressing the issue - which was that your rationale for rapid discounting in machine intelligence was based on a scenario where the machine goals and the human goals are different - and the humans attempt to exercise control over the machines using market incentives.

Conventional thinking around here is that this kind of scenario often doesn't work out too well for the humans - and it represents a mess that we are better off not getting into in the first place.

So: you aren't on the same page - which may be why your conclusions differ. However, why aren't you on the same page? Do you think control via market incentives is desirable? Inevitable? Likely?

The problem with controlling machines has more to do with power than discount rates. The machine is (potentially) more powerful. It doesn't much matter how it discounts - it is likely to get its way. So, its way had better be our way.

The more conventional solution is to control the machine by programming its brain.

Do you think control via market incentives is desirable? Inevitable? Likely?

Programming something and then allowing it to run unattended in the hope that you programmed correctly is not 'control', as the term is usually understood in 'control theory'.

I would say that I believe that control of an AI by continuing trade is 'necessary' if we expect that our desires will change over time, and we will want to nudge the AI (or build a new AI) to satisfy those unanticipated desires.

It certainly makes sense to try to build machines whose values are aligned with humans over the short term - such machines will have little credible power to threaten us - just as parents have little power to credibly threaten their children since carrying out such threats directly reduces the threatener's own utility.

And this also means that the machine needs to discount (its altruistic interest in) human welfare at the same rate as humans do - otherwise, if it discounts faster, then it can threaten humans with a horrible future (since it cares only about the human present). Or if it temporally discounts human happiness much slower than do humans, it will be able to threaten to delay human gratification.

However, if we want to be able to control our machines (to be able to cause them to do things that we did not originally imagine wanting them to do) then we do need to program in some potential carrots and sticks - things our machines care about that only humans can provide. These things need not be physical - a metaphoric pat on the head may do the trick. But if we are wise, we will program our machines to temporally discount this kind of gratification rather sharply - we don't want them embarking on long term plans to increase future head-pats at the cost of incurring our short-term displeasure.

Incidentally, over the past few comments, I have noticed that you repeatedly refer to "the machine" where I might have written "machines" or "a machine". Do you think that a singleton-dominated future is desirable? Inevitable? Likely?

And this also means that the machine needs to discount (its altruistic interest in) human welfare at the same rate as humans do - otherwise, if it discounts faster, then it can threaten humans with a horrible future (since it cares only about the human present). Or if it temporally discounts human happiness much slower than do humans, it will be able to threaten to delay human gratification.

If a machine wants for humans what the humans want for themselves, it wants to discount that stuff the way they like it. That doesn't imply that it has any temporal discounting in its utility function - it is just using a moral mirror.

Incidentally, over the past few comments, I have noticed that you repeatedly refer to "the machine" where I might have written "machines" or "a machine". Do you think that a singleton-dominated future is desirable? Inevitable? Likely?

I certainly wasn't thinking about that issue consciously. Our brains may just handle examples a little differently.

And your decision not to answer my questions ... Did you think about that consciously?

Of course. I'm prioritising. I did already make five replies to your one comment - and the proposed shift of direction seemed to be quite a digression.

My existing material on the topic:

It is challenging to answer directly because the premise that there is either one or many is questionable. There are degrees of domination - and we already have things like the United Nations.

Also, this seems to be an area where civilisation will probably get what it wants - so it's down to us to some extent - which makes this a difficult area to make predictions in. However, I do think a mostly-united future - with few revolutions and little fighting - is more likely than not. An extremely tightly-united future also seems quite plausible to me. Material like this seems to be an unconvincing reason for doubt.

However, if we want to be able to control our machines (to be able to cause them to do things that we did not originally imagine wanting them to do) then we do need to program in some potential carrots and sticks - things our machines care about that only humans can provide.

No. That's the "reinforcement learning" model. There is also the "recompile its brain" model.

The reinforcement learning model is problematical. If you hit a superintelligence with a stick, it will probably soon find a way to take the stick away from you.

I would say that I believe that control of an AI by continuing trade is 'necessary' if we expect that our desires will change over time, and we will want to nudge the AI (or build a new AI) to satisfy those unanticipated desires.

Well, that surely isn't right. Asimov knew that! He proposed making the machines want to do what we want them to - by making them follow our instructions.

Programming something and then allowing it to run unattended in the hope that you programmed correctly is not 'control', as the term is usually understood in 'control theory'.

A straw man - from my POV. I never said "unattended" in the first place.

If you have already settled on a moral system, then it's totally understandable why you might not be terribly interested in meta-ethics (in the sense of "the nature of morality") at this point, but more into applied ethics, which I now see is what your post is really about. But I wish you mentioned that fact several comments upstream, when I said that I'm interested in meta-ethics because I'm not sure what I want. If you had mentioned it, I probably wouldn't have tried to convince you that meta-ethics ought to be of interest to you too.

If you have already settled on a moral system, then it's totally understandable why you might not be terribly interested in meta-ethics (in the sense of "the nature of morality") at this point, but more into applied ethics, which I now see is what your post is really about.

Wow! Massive confusion. First let me clarify that I am interested in meta-ethics. I've read Hume, G.E.Moore, Nozick, Rawls, Gauthier, and tried to read (since I learned of him here) Parfit. Second, I don't see why you would expect someone who has settled on a moral system to lose interest in meta-ethics. Third, I am totally puzzled how you could have reached the conclusion that my post was about applied ethics. Is there any internal evidence you can point to?

I would certainly agree that our recent conversation has veered into applied ethics. But that is because you keep asking applied ethics questions (apparently for purposes of illustration) and I keep answering. Sorry, my fault. I shouldn't answer rhetorical questions.

I wish you mentioned that fact several comments upstream, when I said that I'm interested in meta-ethics because I'm not sure what I want. If you had mentioned it, I probably wouldn't have tried to convince you that meta-ethics ought to be of interest to you too.

I wish I had realized that convincing me of that was what you were trying to do. I was under the impression that you were arguing that clarifying and justifying one's own ethical viewpoint is the urgent task, while I was arguing that comprehending and accommodating the diversity in ethical viewpoints among mankind is more important.

Have you thought through all of the implications of a 1% discount rate? For example, have you considered that if you negotiate with someone who discounts the future less, say at 0.1% per year, you'll end up trading the use of all of your resources after X number of years in exchange for use of his resources before X number of years, and so almost the entire future of the universe will be determined by the values of those whose discount rates are lower than yours?

I am pretty sure that many humans discount faster than this today, on entirely sensible and rational grounds. What dominates the future has to do with power and reproductive rates, as well as discounting - and things like senescence and fertility decline make discounting sensible.

Basically I think that you can't really have a sensible discussion about this without distinguishing between instrumental discounting and ultimate discounting.

Instrumental discounting is inevitable - and can be fairly rapid. It is ultimate discounting that is more suspect.

And I am quite sure that 99% of mankind would agree with me that 1% discounting per year is not an excessive discount rate.

I suspect that 99% of mankind would give different answers to that question, depending on whether it's framed as giving up X now in exchange for receiving Y N years from now, or X N years ago for Y now.

Not to mention that typical humans behave like hyperbolic discounters, and many cannot even be made to understand the concept of a "discount rate".
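The difference between the two kinds of discounter shows up as preference reversal, sketched below. The reward sizes, the one-year gap, and the hyperbolic parameter `k = 1.0` are illustrative assumptions, not figures from the thread, and the function names are hypothetical. An exponential discounter ranks "100 at delay t" against "110 at delay t + 1" the same way for every t; a hyperbolic discounter grabs the smaller reward when it is immediate but prefers the larger one when both are far off.

```python
def exponential_weight(delay_years, annual_rate=0.01):
    """Exponential discounting: a constant fraction is lost each year."""
    return (1 - annual_rate) ** delay_years


def hyperbolic_weight(delay_years, k=1.0):
    """Hyperbolic discounting: weight falls off as 1 / (1 + k * delay)."""
    return 1 / (1 + k * delay_years)


def prefers_sooner(weight, delay, small=100.0, large=110.0):
    """True if `small` at `delay` is valued above `large` one year later."""
    return small * weight(delay) > large * weight(delay + 1)


for name, weight in [("exponential", exponential_weight),
                     ("hyperbolic", hyperbolic_weight)]:
    now = prefers_sooner(weight, 0)
    later = prefers_sooner(weight, 10)
    print(f"{name}: takes smaller reward now={now}, "
          f"ten years out={later}, reversal={now != later}")
```

This time inconsistency is one reason a single agreed "discount rate" is hard to elicit from people by asking them about particular trades.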

Quite probably true. Which of course suggests the question: How (or how much) should "typical humans" be consulted about our plans for their future?

Yeah, I know that is an unfair way to ask the question. And I admit that Eliezer, at least, is actually doing something to raise the waterline. But it is a serious ethical question for utilitarians and a serious political question for egoists. And the closest thing I have seen to an answer for that question around here is something like "Well, we will scan their brains, or observe their behavior, or something. And then try to get something coherent out of that data. But God forbid we should ask them about it. That would just confuse things."

It might make an interesting rationality exercise to have 6-10 people conduct some kind of discussion/negotiation/joint-decision-making-exercise to flesh-out their intuitions as to the type of post-singularity society they would like to live in.

My intuition is that, even if you are not sure what you want, the interactive process will probably help you to clarify exactly what you do not want, and thus assist in both personal and collective understanding of values.

It might be even more interesting to have two or more such 'negotiations' proceeding simultaneously, and then compare results.

It might make an interesting rationality exercise to have 6-10 people conduct some kind of discussion/negotiation/joint-decision-making-exercise to flesh-out their intuitions as to the type of post-singularity society they would like to live in.

Sign me up for 100 years with the catgirls in my volcano lair.

More generally I (strongly) prefer a situation in which the available neg-entropy is distributed, for the owners to do with as they please (with limits). That moves negotiations to be of the 'trade' kind rather than the 'politics' kind. Almost always preferable.

I'd be willing to participate in such an exercise.

I tend toward FOOM skepticism, but I don't think it is "nearly impossible". Define a FOOM as a scenario leading in at most 10 years from the first human-level AI to a singleton which has taken effective control over the world's economy.

Automating investing has been going fairly well. For me, it wouldn't be very surprising if we get a dominant, largely machine-operated hedge fund, that "has taken effective control over the world's economy" before we get human-level machine intelligence.

So to summarize, your conclusion seems to be that we should build an arbitrary-goals AI as soon as possible.

Edit: Wrong, corrected here.

So to summarize, your conclusion seems to be that we should build an arbitrary-goals AI as soon as possible.

Huh? What exactly do you think you are summarizing? If you want to produce a cartoon version of my opinions on this thread, try "We should do all we can to avoid the FOOMing singleton scenario, instead trying to create a society of reproducing AIs, interlocked with each other and with humanity by a network of dependencies. If we do, the details of the initial goal systems may matter less than they would with a singleton."

I see, so "if there is convergence" is not a point of theoretical uncertainty, but something that depends on the way the AIs are built. Makes sense (as a position, not something I agree with).

But the whole point of my posting was that, if there is convergence (in the second sense) then those initial values may make very little difference in the outcome of the universe

I see, so "if there is convergence" is not a point of theoretical uncertainty, but something that depends on the way the AIs are built.

Well, it is both. Convergence in the sense of "outcome is independent of the starting point" has not been proved for any AI/updating architecture. Also, I strongly suspect that the detailed outcome will depend quite a bit on the way AIs interact and produce successors/self-updates, even if the fact of convergence does not.

We should do all we can to avoid the FOOMing singleton scenario, instead trying to create a society of reproducing AIs, interlocked with each other and with humanity by a network of dependencies.

That reminds me of:

"An AGI raised in a box could become dangerously solipsistic, probably better to raise AGIs embedded in the social network..."

Goertzel's comment doesn't even make sense to me. Why is he placing 'in a box' in opposition to 'embedded in the social network'? The two issues are orthogonal. AIs can be social or singleton - either in a box or in the real world. ETA: Well, if you mean the human social network, then I suppose a boxed AI cannot participate. Though I suppose we could let some simulated humans into the box to keep the AI company.

Besides, I've never really considered solipsists to be any more dangerous than anyone else.

Besides, I've never really considered solipsists to be any more dangerous than anyone else.

"Now I will destroy the whole world - What a Bokononist says before committing suicide."

Though I suppose we could let some simulated humans into the box [...]

We don't have any half-decent simulated humans, though.

I noticed that you found an archived copy of Roko's description of UIV. I believe Roko originally thought that his theory implied that we didn't have to worry too much about the terminal values of the AIs we create, that things will turn out OK due to UIV. Unfortunately he keeps deleting his old writings, so I'm going on memory. I'm not sure exactly how he changed his mind, but I think he now believes we do have to worry about the terminal values.

I know of two possible reasons why a rational agent might consent to an irreversible change in its values

Omohundro made a list of cases where an agent might change its values - in the basic AI drives:

While it is true that most rational systems will act to preserve their utility functions, there are at least three situations in which they will try to change them. These arise when the physical embodiment of the utility function itself becomes an important part of the assessment of preference. For example, imagine a system whose utility function is “the total amount of time during which the definition of my utility function is U = 0.” To get any utility at all with this perverse preference, the system has to change its utility function to be the constant 0. Once it makes this change, however, there is no going back.[...]

The second kind of situation arises when the physical resources required to store the utility function form a substantial portion of the system’s assets.[...]

The third situation where utility changes may be desirable can arise in game theoretic contexts where the agent wants to make its threats credible. It may be able to create a better outcome by changing its utility function and then revealing it to an opponent.[...]
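The first of the quoted situations can be made concrete with a toy simulation. This is only a sketch under assumptions of my own: the labels "perverse" and "constant_zero", the fixed horizon of discrete time steps, and the helper names `run` and `original_utility` are all hypothetical and do not come from Omohundro's paper.

```python
def run(switch_at, horizon=10):
    """Return which utility function is installed at each time step.

    switch_at=None means the agent never rewrites itself; otherwise it
    installs the constant-zero utility at that step and keeps it.
    """
    installed = "perverse"
    history = []
    for t in range(horizon):
        if switch_at is not None and t >= switch_at:
            installed = "constant_zero"
        history.append(installed)
    return history


def original_utility(history):
    """Score a history by the ORIGINAL (perverse) utility function:
    the number of time steps during which the installed utility was U = 0."""
    return sum(1 for installed in history if installed == "constant_zero")


keep_score = original_utility(run(switch_at=None))    # never self-modifies
switch_score = original_utility(run(switch_at=0))     # self-modifies at once
late_score = original_utility(run(switch_at=3))       # self-modifies later
print(keep_score, switch_score, late_score)
```

Judged by its original utility function, the agent in this sketch does best by switching as early as possible. Once the constant-zero utility is installed, every action scores the same, so nothing in the new goal system gives it a reason to switch back.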

Fairly obviously, there are more cases. For instance: agents can harmlessly delete any preferences which they have for things that are exclusively in the past - saving themselves evaluation time.

Someone should try for a more comprehensive list someday.

As an exercise, prove transitivity. The trick is that the definition of "better than" keeps changing at each step. You can assume that any one rational agent has a transitive "better than" relation, and that there is local agreement between the two agents involved that the new agent's moral code is better than that of his predecessor. But can you prove from this that every agent would agree that the final moral code is better than the original one?

Let's take a half-bounded sequence of moral encodings I = {m(-infinity) .. m(b)}. For each encoding m(x), there is a comparative morality function Mx(X, Y) that takes encodings X and Y and outputs true if Y is judged to be superior to X.

Per your conditions, we know that Mx(m(x-1), m(x+1)) is true at every step (except the final one, which has no m(x+1)). We also know that if Mx(m(x-1), m(x+1)) is true, then so is Mx+1(m(x), m(x+2)). Now, for an arbitrary x, is Mx(m(a), m(b)) true for all a < b?

I might be missing something, but it seems to me that this falls down in the case where I describes a half-bounded slice of a periodic function's output. It's easy to think of Mx that encapsulate notions of local progress but don't deal well with values outside of their own neighborhood.
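Here is a minimal sketch of the periodic worry in Python. Everything in it is an assumption made for illustration: the codes are the integers 0..N-1 arranged on a cycle, each agent ranks codes by a hypothetical `score` function (ranking by a number keeps each single agent's "better than" relation transitive), and each step is judged by the two adjacent agents, as in the statement of the exercise, rather than by the Mx(m(x-1), m(x+1)) form used above.

```python
N = 5  # number of distinct moral codes on the cycle (assumed, any N >= 3 works)


def score(agent: int, code: int) -> int:
    """How highly the agent holding code `agent` ranks `code`.

    The agent ranks its predecessor at 0, itself at 1, its successor at 2,
    and so on around the cycle, so codes far "ahead" wrap around to the bottom.
    """
    return (code - agent + 1) % N


def better(agent: int, old: int, new: int) -> bool:
    """True if `agent` judges `new` to be strictly better than `old`."""
    return score(agent, new) > score(agent, old)


def locally_approved(old: int, new: int) -> bool:
    """Both agents involved in a step agree that the new code is better."""
    return better(old, old, new) and better(new, old, new)


chain = list(range(N))  # the update sequence 0 -> 1 -> ... -> N-1

# Every single step is endorsed by both the outgoing and the incoming agent.
steps_ok = all(locally_approved(a, b) for a, b in zip(chain, chain[1:]))

# Yet the first agent does not endorse the end of the chain over its own code.
first_agent_endorses_end = better(chain[0], chain[0], chain[-1])

print(steps_ok, first_agent_endorses_end)
```

In this toy model each agent is individually transitive and every step has local agreement, but the original agent ranks the final code below its own. So the local conditions alone do not seem to give global agreement, at least not without some further assumption that rules out this kind of cycling.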

Three other examples of convergence theories are Roko's UIV, Hollerith's GS0, and Omohundro's "Basic AI Drives". These also postulate a process of convergence through rational AI self-improvement. But they tend to be less optimistic than CEV, while at the same time somewhat more detailed in their characterization of the ethical endpoint.

I wouldn't say that any of those three are "less optimistic" than CEV; GS0 and UIV are just competing normative proposals, and the AI Drives are what you get out of most self-improving goal systems by default, and can be overridden. (And CEV isn't about optimism anyway — it's a goal, not a prediction, and in that capacity, it's actually fairly pessimistic, going by the variety of possible failures it tries to account for.)

I guess I am taking CEV to be defined by the process of convergence that produces it. And I see optimism in the claim that this process will produce a happy result. I will agree that the 'optimism' that I am talking about here is not some kind of naive, blind optimism.

Some people might claim that it is not possible to rationally decide to change your fundamental values. It may be that I misunderstand him, but Vladimir Nesov argues passionately against "Value Deathism"

Self-improvement (change) of any given explicit consideration, indeed overall decision problem, is possible, but it won't be a change to the mysterious notion of "morality" that normatively guides all of your decisions, for whatever it's good for.

it won't be a change to the mysterious notion of "morality" that normatively guides all of your decisions

So, if I am understanding you, you think that you and I are guided by some mysterious internal 'notion' of morality, a 'notion' which is incapable of changing. Some questions.

  • Is the 'notion' really unchangeable, or is it just that you consider it irrational (immoral?) to change it?
  • Is the 'notion' identical in you, me, and all of our conspecifics?
  • Is the 'notion' something that develops under the control of our genes, or is it something that can be modified by childhood training?
  • At roughly what age does (should?) this 'notion' freeze and become incapable of further change?
  • Assuming that the 'notion' is genetic in origin, do you believe that some humans are 'mutants'? How ought we people of 'normal' morality to view the 'mutants'?
  • Assuming the 'notion' arose in humans as a result of evolution under natural selection, what do you think were the most important features of the ancestral environment which distinguishes our 'notion(s)' from those of our fellow apes?

Does the value of (3 X 3) change when you change a calculator? Did it become 9 when the calculator was built, or before? And so on, the analogy breaks for the same reason.

Ah! So this mysterious notion is (like '3 X 3 = 9') something "analytic a priori". Ok, suppose I made the following claim:

Morality is simply rational self-interest, as it would play out in an idealized social environment. The idealization is that everything known by any agent is common knowledge among all agents. This means that every agent knows the utility function of every other agent, every agent estimates the same consequences as other agents, and every agent knows what other agents do. So, for example, morality requires that you act as if your actions are public knowledge, even though you know they are not public and you could 'get away with it'.

Now, further suppose that you disagree with my claim. On what grounds would you disagree? If you say "No, that is not morality!", what evidence or argument could you offer other than your own moral intuitions and those of the rest of mankind? I ask because those moral intuitions do not have the same analytic a priori character as '3 X 3 = 9'. And they can change.

Or suppose you asked me to defend my claim, and I submit mathematical proofs that rational agents cannot reach Pareto optimal bargains unless payoffs, consequences, and actions are common knowledge among every participant in the bargain. These proofs are every bit as unchanging as '3 X 3 = 9', but are they also just as irrelevant?

Morality is simply rational self-interest, as it would play out in an idealized social environment. The idealization is [...]

Now, further suppose that you disagree with my claim. On what grounds would you disagree?

It doesn't seem to capture the social-signalling side of morality. Morality, in part, is a way for humans to show what goodie-two-shoes they are to other humans - who might be prospective mates, collaborators, or allies. That involves less self-interest - and more signalling unselfishness.

It doesn't seem to capture the "manipulation" side of morality very well either. Moral systems are frequently applied to get others to stop doing what you don't want them to do - by punishing, shaming, embarrassing, etc.

So, my assessment would be: incomplete hypothesis.

I don't see how this is responsive. You realize, don't you, that this discussion is proceeding under Nesov's stipulation that moral truth is a priori (like '3 X 3 = 9')? We are operating here under a stance of moral realism and ethical non-naturalism.

If your concept of morality doesn't fit into this framework, this is not the place for you to step in.

I thought you were talking about human morality. Checking back, that does appear to have been the context of the discussion.

Science has studied that topic; we have more to go on than intuition. An example of morality-as-signalling: "Signaling Goodness: Social Rules and Public Choice".

Your idealisation makes signalling seem pointless - since everybody knows everything about the other players. Indeed, I don't really see the point of your model. You are not attempting to model very much of the biology involved. You asked for criticism - and that is an obvious one. Another criticism is that you present a model - but it isn't clear what it is for.

I thought you were talking about human morality.

I was not.

Checking back, that does appear to have been the context of the discussion.

Check again. Carefully.

You asked for criticism

I did not. I asked a question about Nesov's metaethical position, using that toy theory of ethics as an example. I asked what kinds of grounds might be used to reject the toy theory. (The grounds you suggest don't fit (IMHO) the metaethical stance Nesov had already committed to.)

Was I really so unclear? Please read the Wikipedia entry on metaethics and reread the thread before responding, if you wish to respond.

Oh, and when I think back on the number of times you have inserted a comment about signaling into a discussion that seemed to be about something else entirely, I conclude that you really, really want to have a discussion with somebody, anybody on that topic. May I suggest that you produce a top-level posting explaining your ideas.

Or suppose you asked me to defend my claim, and I submit mathematical proofs that rational agents cannot reach Pareto optimal bargains unless payoffs, consequences, and actions are common knowledge among every participant in the bargain. These proofs are every bit as unchanging as '3 X 3 = 9', but are they also just as irrelevant?

Well, they're relevant if you make a claim that morality should be certain things - but since that's awfully close to a moral claim, I'd say the argument is self-defeating. In fact, that sort of argument might be generalizable to show that this morality is unsupportable - not contradicted, but merely unsupported.

Hmmm. My understanding is that this is a meta-ethical claim; it answers the question of what morality is. Moral claims would answer questions like "What action, if any, does morality require of me?" in some given situation.

Your phrasing of 'what morality is' as 'what morality should be' strikes me as simply playing with words.

If we ignore the object "morality" and just look at basic actions, your proposal about what morality is labels some actions as right and others as wrong (or good and bad, or moral and immoral). It's really by that standard that I call it a "moral claim," in a similar class to "it's immoral to kick puppies."

I guess I don't agree that my example claim says anything directly about which actions are moral and immoral. What it does is to suggest an algorithm for finding out. And the first step is to find out some empirical facts - for example, "What are puppies and how do people feel about them? If I kick puppies, will there be negative consequences in how other people treat me?"

ETA: Wikipedia seems to back me up on this distinction between metaethics and normative ethics:

A meta-ethical theory, unlike a normative ethical theory, does not attempt to evaluate specific choices as being better, worse, good, bad, or evil; although it may have profound implications as to the validity and meaning of normative ethical claims

But your algorithm is evaluable - I guess I don't see the difference between "the no-kicking-puppies morality is correct" and "don't kick puppies."

I guess I don't see the difference between "the no-kicking-puppies morality is correct" and "don't kick puppies."

I don't see much difference either. But the algorithm I proposed says neither of those two things.

It says "If you want to know whether kicking puppies is moral, here is how to find out." The algorithm is the same for Americans, Laotians, BabyEaters, FAIs, uFAIs, and presumably Neanderthals before the dog was invented as a domesticated wolf. The algorithm instructs the user to consider an idealized version of the society in which he is embedded.

Please consider the possibility that some executions of that algorithm might yield different results than did the execution which you performed, using your own society.

Well, but then it's "kicking puppies is immoral if X." A conditional doesn't seem to change the fact that something is a moral claim. Hmm... or would it in some situations? I can't think of any. Oh, you could just rephrase it as "kicking puppies when X is immoral," which is more clearly a moral claim.

A conditional doesn't seem to change the fact that something is a moral claim. Hmm... or would it in some situations? I can't think of any.

Only (an exception) when there is something after the "IF" that indirectly or directly supplies the moral unit. Then it could be a mere logical claim - but most will be unable to distinguish that from a moral claim anyway. The decision to apply an unambiguous, fully specified logical deduction based on a moral value is usually considered a moral judgement itself.

Apparently you and I interpret the quoted Wikipedia passage differently, and I don't see how to resolve it.

Nor, now that I think about it, do I see a reason why either of us should care. Why are we engaged in arguing about definitions? I am bowing out.

I'll just mention that the most significant scholarly work which attempts something like a theoretical integration of the leading normative theories is Parfit's recent On What Matters.

This is the most interesting LW post on meta-ethics I've seen in a while. Thanks.

To identify how human values are different from values of pure instrumental power and self-preservation, look at the system that produced those values.

OK...

Humans are considerate of the rights of others because we are social animals - if we cannot negotiate our way to a fair share in a balanced power system, we are lost.

So: a selfish agent would behave that way too. This example seems unsuccessful.

Humans embrace openness because shared intellectual product is possible for us - we have language and communicate with our peers.

That is valuable for us, though. It is true that it is also valuable for our memes. They want us to communicate - so they can spread and become more powerful.

Humans have direct concern for the welfare of (at least some) others because we reproduce and are mortal - our children are the only channel for the immortalization of our values.

That helps humans to signal what nice creatures they are to each other - and being nice is attractive. Surely a selfish agent would behave in the same way.

And we have some fundamental respect for diversity of values because we reproduce sexually - our children do not exactly share our values, and we have to be satisfied with that because that is all we can get.

Diversity really is helpful in many cases. Diversity helps protect against disease. A diverse population can adapt better if the environment changes - and so on.

I am not sure you made your case here. IMO, human values differ from those of a selfish, limited agent mainly because humans have their brains infected by memes. That skews human values in favour of chatter, gossip, fashion, religion - the things that benefit the memes.

Three other examples of convergence theories are Roko's UIV, Hollerith's GS0, and Omohundro's "Basic AI Drives".

I now have some links about that topic on my Universal Instrumental Values page.

Convergence theories are often discussed in the context of technological determinism.

It seems reasonable to expect that many contingent sub-optimal locked-in factors will be refactored out of existence in the future - and so that technological determinism will become more pronounced, and historical contingency will diminish. However, the ultimate scope of the idea remains somewhat unknown.