I've been working on metaethics/CEV research for a couple months now (publishing mostly prerequisite material) and figured I'd share some of the sources I've been using.


CEV sources.


Motivation. CEV extrapolates human motivations/desires/values/volition. As such, it will help to understand how human motivation works.

Extrapolation. Is it plausible to think that some kind of extrapolation of human motivations will converge on a single motivational set? How would extrapolation work, exactly?

  • Reflective equilibrium. Yudkowsky's proposed extrapolation works analogously to what philosophers call 'reflective equilibrium.' The most thorough work here is the 1996 book by Daniels, and there have been lots of papers, but this genre is only barely relevant for CEV. Basically, an entirely new literature on volition-extrapolation algorithms needs to be created.
  • Full-information accounts of value and ideal observer theories. This is what philosophers call theories of value framed in terms of 'what we would want if we were fully informed, etc.' or 'what a perfectly informed agent would want,' as CEV does. There's some literature on this, but it's only marginally relevant to CEV. Again, an entirely new literature needs to be written to solve this problem.

Metaethics. Should we use CEV, or something else? What does 'should' mean?

Building the utility function. How can a seed AI be built? How can it learn what to value?

Preserving the utility function. How can the motivations we put into a superintelligence be preserved over time and self-modification?

Reflective decision theory. Current decision theories tell us little about software agents that make decisions to modify their own decision-making mechanisms.

Additional suggestions welcome. I'll try to keep this page up-to-date.




Very basic question on CEV. Supposing humans have fundamentally disagreeing 'reflective equilibria,' does CEV attempt to find a game-theoretic equilibrium (presumably one which all humans would 'reflectively' agree to)?

Right, it depends on which extrapolation process is used. One of the open problems of CEV is the question of which extrapolation process to use, and why.

I guess you could call it that, but that doesn't necessarily mean it's a correct question, or that anyone is necessarily thinking about the problem of FAI with that implied conceptual framework.

Mostly unrelated idea: It'd be really cool if someone who'd thought a decent amount about FAI could moderate a single web page where people with some FAI/rationality experience could post (by emailing the moderator or whatever) somewhat cogent advice about how whoever's reading the site could perhaps make a small amount of progress towards FAI conceptual development. Restricting it to advice would keep each contributor's section from bloating to become a jargon-filled description of their personal approach/project. Being somewhat elitist/selective about allowed contributors/contributions would be important. Advice shouldn't just be LW-obvious applause lights. The contributors should assume (because a notice at the top says so) that their audience is, or can easily become without guidance, pretty damn familiar with FAI dreams and their patterns of failure and thus doesn't need those arguments repeated. Basically, the advice should be novel to the easily accessible web, though it's okay to emphasize e.g. specific ways of doing analysis found in LOGI. But basically such restrictions are just hypotheses about optimal tone. If the moderator is selective about contributors then it'd probably naturally self-optimize.

Such a site sounds pretty easy to set up. It's just an HTML document with a description and lots of external links and book suggestions at the top, and neat sections below. Potential hard parts: seducing people (e.g. Mitchell Porter, Wei Dai) to seed it, and choosing a moderator who's willing to be choosy about what gets published, and is willing to implement edits according to some sane policy for editing. (And maybe some other moderators with access too.)

I guess it's possible that the LW wiki is sort of almost okay, but really, I don't like it. It's not a URL I can just type into my address bar; it requires extra moderation, which is socially and technically awkward; the LW wiki is not about FAI; and in general it doesn't have the clean simplicity which is both attractive and expandable in many ways.

I'm not sure how to stably point people at it, but it'd be easy to link to when someone professes interest in learning more about FAI stuff. Also, it's probable that a fair bit of benefit would come from current FAI-interested folk getting a chance to learn from each other, and depending on the site structure (like whether or not a paragraph or another page just about current research interests of all contributors is a good idea) it could easily provide an affordance for people to actually bother constructively criticizing others' approaches and emphases. I suspect that lukeprog's future efforts could be sharpened by Vladimir Nesov's advice, as a completely speculative example. And I'd like to have a better idea of what Mitchell Porter thinks I might be missing, as a non-speculative example.

What do you think, Luke? Worth an experiment?

Do you think a LW subreddit devoted to FAI could work? If not, then we probably aren't ready for the site you suggest, and the default venue for such dialogues should continue to be LW Discussion.

Do you think a LW subreddit devoted to FAI could work?

Probably not. There are too many things that can't be said about FAI in an SIAI-affiliated blog for political reasons. It would be lame.

What if the subreddit was an actual reddit subreddit?

I think a LW subreddit devoted to FAI could potentially be very frustrating. The majority of FAI-related posts that I've seen on LW Discussion are pretty bad and get upvoted anyway (though not much). Do you think Discussion is an adequate forum for now?

I should use this opportunity to quit LW for a while.

A new forum devoted to FAI risks rapidly running out of quality material, if it just recruits a few people from LW. It needs outsiders from relevant fields, like AGI, non-SIAI machine ethics, and "decision neuroscience", to have a chance of sustainability, and these new recruits will be at risk of fleeing the project if it comes packaged with the standard LW eschatology of immortality and a utilitronium cosmos, which will sound simultaneously fanatical and frivolous to someone engaged in hard expert work. I don't think we're ready for this; it sounds like at least six months' work to develop a clear intention for the site, decide who to invite and how to invite them, and otherwise settle into the necessary sobriety of outlook.

Meanwhile, you could make a post like Luke has done, explaining your objective and the proposed ingredients.

Not a project that I have time for right now. But I certainly would like to collaborate with others working on CEV. My hope is to get through my metaethics sequence to get my own thoughts clear and communicate them to others, and also so that we all have a more up-to-date starting point than Eliezer's 2004 CEV paper.

Sounds good. I sort of feel obligated to point out that CEV is about policy, public relations, and abstract philosophy significantly more than it is about the real problem of FAI. Thus I'm a little worried about what "working on CEV" might look like if the optimization targets aren't very clear from the start.

Bringing CEV up-to-date, and ideally emphasizing that whatever line of reasoning you are using to object to some imagined CEV scenario, because that line of reasoning is contained within you, CEV will by its very nature also take into account that line of reasoning, sounds more straightforwardly good. (Actually, Steve had some analysis of why even smart people so consistently miss this point (besides the typical diagnosis of 'insufficient Hofstadter during adolescence syndrome'), which should really go into a future CEV doc. A huge part of the common confusion about CEV is due to people not really noticing or understanding the whole "if you can think of a failure mode, the AI can think of it" thing.)

whatever line of reasoning you are using to object to some imagined CEV scenario, because that line of reasoning is contained within you, CEV will by its very nature also take into account that line of reasoning

This assumes that CEV actually works as intended (and the intention was the right one), which would be exactly the question under discussion (hopefully), so in that context you aren't allowed to make that assumption.

The adequate response is not that it's "correct by definition" (because it isn't; it's a constructed artifact that could well be a wrong thing to construct), but an (abstract) explanation of why it will still make that correct decision under the given circumstances: an explanation of why exactly it's true that CEV will also take into account that line of reasoning, of why you believe that it is its nature to do so, for example. And it isn't that simple: say it won't take into account that line of reasoning if the reasoning is wrong; then it's again not clear how it decides what's wrong.

This assumes that CEV actually works as intended (and the intention was the right one), which would be exactly the question under discussion (hopefully), so in that context you aren't allowed to make that assumption.

Right, I am talking about the scenario not covered by your "(hopefully)" clause where people accept for the sake of argument that CEV would work as intended/written but still imagine failure modes. Or subtler cases where you think up something horrible that CEV might do but don't use your sense of horribleness as evidence against CEV actually doing it (e.g. Rokogate). It seems to me you are talking about people who are afraid CEV wouldn't be implemented correctly, which is a different group of people that includes basically everyone, no? (I should probably note again that I do not think of CEV as something you'd work on implementing so much as a piece of philosophy and public relations that you should take into account when thinking up FAI research plans. I am definitely not going around saying "CEV is right by definition!"...)

I'm not sure what you mean by the first paragraph. CEV is a plan for friendliness content. That is one of the real problems with FAI, along with the problem of reflective decision theory, the problem of goal stability over self-modification, and others.

Your bolded words do indeed need to be emphasized, but people can rightly worry that the particular line of reasoning that leads them to a failure scenario will not be taken into account if, for example, their brains are not accounted for by CEV either because nobody with that objection is scanned for their values, or because extrapolated values do not converge cleanly and the value that leads to the supposed failure scenario will not survive a required 'voting' process (or whatever) in the extrapolation process.

I'm not sure what you mean by the first paragraph. CEV is a plan for friendliness content.

More of a partial plan. I would call it a plan once an approximate mechanism for aggregation is specified. Without the aggregation method the outcome is basically undefined.

or because extrapolated values do not converge cleanly and the value that leads to the supposed failure scenario will not survive a required 'voting' process (or whatever) in the extrapolation process.

The 'people are assholes' failure mode. :)

My impression and my worry is that calling CEV a 'plan for Friendliness content', while true in a sense, is giving CEV-as-written too much credit as a stable conceptual framework. My default vision of someone working on CEV, from my intuitive knee-jerk interpretation of your phrasing, is of a person thinking hard for many hours about how to design a really clever meta-level extrapolation process. This would probably be useful work, compared to many other research methods. But I would be kind of surprised if such research was at all eventually useful before the development of a significantly more thorough notion of preference: preferences as bounded computations, approximately embodied computation, overlapping computations, et cetera. I may well be underestimating the amount of creative juices you can get from informal models of something like extrapolation. It could be that you don't have to get AI-precise to get an abstract theory whose implementation details aren't necessarily prohibitively arbitrary, complex, or model-breaking. But I don't think CEV is at the correct level of abstraction to start such reasoning, and I'm worried that the first step of research on it wouldn't involve an immediate and total conceptual reframing on a more precise/technical level. That said, there is assuredly less technical but still theoretical research to be done on existent systems of morality and moral reasoning, so I am not advocating against all research that isn't exploring the foundations of computer science or anything.

I should note that the above are my impressions and I intend them as evidence more than advice. Someone who has experience jumping between original research on condensed matter physics and macroscopic complex systems modeling (as an example of a huge set of people) would know a lot more about the right way to tackle such problems.

Your second paragraph is of course valid and worth noting, though it perhaps unfortunately doesn't describe the folk I'm talking about, who are normally thinking on the level of humanity and not the individual. I should have stated that specifically. I should note for posterity that I am incredibly tired and (legally) drugged, and also was in my previous message, so although I feel sane I may not think so upon reflection.

(Deleted this minor comment as no longer relevant, so instead: how do you add line breaks with iOS 4? 20 seconds of Google didn't help me.)

  1. Type a space.
  2. Type a letter (doesn't matter which).
  3. Erase the letter.
  4. Type another space.
  5. Press "return".

Steps 2 and 3 are to defeat the auto-complete rule that in certain cases turns two consecutive spaces into a period and one space. The other steps are the same as what you would do on a regular computer.

Note that you should only do this if you are typing a poem or something else where you would use the HTML <br> element. Normally you should use paragraph breaks, which you get by pressing "return" twice, so that a blank line is between the paragraphs (same as on a regular computer).

The problem is, I don't think I have a return button? Ah well, it's not a big deal at all. I might try HTML breaks next time.

What web browser are you using? I have a "return" button in Safari (on an iPhone 3G running iOS 4.2.1).

I might try HTML breaks next time.

Won't work; the LW Markdown implementation doesn't do raw HTML. (In other words, when I typed "<br>" in my previous comment and this one, I didn't need to do any escaping to get it to show up rather than turn into a line break.)

If you don't mind some hassle, it would probably work to write your comment in the "Notes" app, then copy and paste it.
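For what it's worth, the escaping behavior described above can be sketched in a couple of lines: a Markdown renderer without raw-HTML support typically HTML-escapes literal tags so they display as text instead of being interpreted. (This is not LW's actual renderer; it's just Python's standard html.escape used to illustrate the idea.)

```python
import html

# Sketch, not LW's actual implementation: a Markdown renderer without
# raw-HTML support escapes literal tags so they render as plain text.
source = 'I might try HTML breaks, e.g. typing "<br>" directly.'

escaped = html.escape(source)
print(escaped)
# The "<br>" is emitted as &lt;br&gt;, so the browser shows the literal
# characters instead of inserting a line break.
```

This is why no manual escaping was needed to make "<br>" show up in the comment above.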


Is 'Ontological Crises in Artificial Agents' Value Systems' available online somewhere?

It is not. I haven't heard back from Peter as to whether or not he wants it to be available online.

Discovered the Oxford Handbook of Human Action today, a great source for review articles on how motivation works in humans. I've added it to the original post, above.

And of course, there is no end of new work on action in the journals. For example, see this 2011 paper on conflicts between unconscious goals.

Do you think this research will ever lead to practical and significantly more effective methods of motivational system re-engineering?

Motivation research? It already has. Did you have something more specific in mind?

I'm mostly researching motivation for the purposes of metaethics, though.

That's the kind of thing I was talking about, I suppose; I think I might not have seen the generality of that post the first time I saw it, because it claims to be about procrastination and I don't normally think of 'akrasia' (gah, I hate that word) as procrastination.

Also, this recent book chapter by Bargh et al. summarizes many of the basic findings in conscious and unconscious goal pursuit and gives concrete practical advice throughout.

Thanks for this! This page will keep me busy for a while. Ethics is my favorite branch of philosophy, which is my favorite hobby (having abandoned the idea of philosophizing for money); and until this page, pondering the use of ethics in the development of Friendly AI was not on my mental radar.

What is meant by that question about "should"? If it's a general inquiry, I have always considered it like so: If it is said you "should" do action Y, then action Y is thought to cause some outcome X, and this outcome X is thought to be desirable.

In common usage, there's weak 'should' and strong 'should'.

Weak should is simply a suggestion.

A: "What should I do?" (Please suggest a course of action.)

B: "You should do X." (Have you considered X?)

Strong should argues for a new course of action instead of the one the listener originally proposed. It comes with the subtext that the listener may initially disagree with the suggested course of action, but suggests that the listener re-evaluate their disagreement in light of the speaker's conviction.

A: "I want to steal the cookies"

B: "No, you really should not"

There are many possible reasons why the listener might be asked to re-evaluate their disagreement, all of which are usually subtextual.

a) regret.

A: "I want to steal the cookies"

B: "No, you really should not" [If you were to follow the stated course of action instead of following my suggestion of an alternative course of action, you would regret taking your original course of action.]

b) that the listener would freely choose the alternative course of action if they were better advised

A: "I want to steal the cookies"

B: "No, you really should not, and here's why:"


A: "I want to steal the cookies"

B: "No, you really should not." [And I could convince you that not stealing the cookies is the better course of action, if I had the time or inclination to do so, but instead I will ask you to trust me that this would be the case.]

c) benefits or consequences that the listener was previously unaware of

A: "I want to steal the cookies"

B: (threatening) "No, you really should not." [You may want to reconsider your stated course of action given the new information that I have signalled that I will punish you for that course of action.]

This is clear and well-written, and makes sense to me. I don't think any of it conflicts with my statement, though (if you mean to correct rather than expand upon). My original statement is just a more general version of your more detailed divisions: in each case, "should" argues for a course of action, given an objective. The objective is often implicit, and sometimes you must infer or guess it.

"You shouldn't steal those cookies [...if you want to be moral]." More formally stated, perhaps something like: "not doing this will be morally correct; do not do it if you want to be a moral person."

"You should do X [...if you want to have fun]." More formally: "Doing X will be fun; do it if fun is desired."

I misinterpreted your comment as a question, that's all.