[This post is an expansion of my previous open thread comment, and largely inspired by Robin Hanson's writings.]

In this post, I'll describe a simple agent, a toy model, whose preferences have some human-like features, as a test for those who propose to "extract" or "extrapolate" our preferences into a well-defined and rational form. What would the output of their extraction/extrapolation algorithms look like, after running on this toy model? Do the results agree with our intuitions about how this agent's preferences should be formalized? Or alternatively, since we haven't gotten that far along yet, we can use the model as one basis for a discussion about how we want to design those algorithms, or how we might want to make our own preferences more rational. This model is also intended to offer some insights into certain features of human preference, even though it doesn't capture all of them (it completely ignores akrasia for example).

I'll call it the master-slave model. The agent is composed of two sub-agents, the master and the slave, each having their own goals. (The master is meant to represent unconscious parts of a human mind, and the slave corresponds to the conscious parts.) The master's terminal values are: health, sex, status, and power (representable by some relatively simple utility function). It controls the slave in two ways: direct reinforcement via pain and pleasure, and the ability to perform surgery on the slave's terminal values. It can, for example, reward the slave with pleasure when it finds something tasty to eat, or cause the slave to become obsessed with number theory as a way to gain status as a mathematician. However, it has no direct way to control the agent's actions, which are left up to the slave.

The slave's terminal values are to maximize pleasure, minimize pain, plus additional terminal values assigned by the master. Normally it's not aware of what the master does, so pain and pleasure just seem to occur after certain events, and it learns to anticipate them. And its other interests change from time to time for no apparent reason (but actually they change because the master has responded to changing circumstances by changing the slave's values). For example, the number theorist might one day have a sudden revelation that abstract mathematics is a waste of time and it should go into politics and philanthropy instead, all the while having no idea that the master is manipulating it to maximize status and power.
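To make the setup concrete, here is a minimal Python sketch of the two sub-agents. It is only an illustration under my own assumptions: the class names, the weights (10.0 and -10.0), the payoff numbers, and the idea of folding pleasure and pain into a single signed number are all hypothetical and not part of the model as stated above.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Conscious sub-agent: picks actions; values pleasure plus assigned terminal values."""
    # Terminal values installed by the master: action -> weight (hypothetical encoding).
    assigned_values: dict = field(default_factory=dict)

    def utility(self, action, pleasure):
        # Pleasure minus pain is folded into one signed number for simplicity.
        return pleasure + self.assigned_values.get(action, 0.0)

    def choose(self, options):
        # options maps each action to the pleasure the slave has learned to anticipate.
        return max(options, key=lambda a: self.utility(a, options[a]))


@dataclass
class Master:
    """Unconscious sub-agent: fixed goals, no direct control over actions."""
    # Expected status payoff of each pursuit in current circumstances (made-up numbers).
    status_payoff: dict

    def perform_surgery(self, slave):
        # Make the best pursuit for status look intrinsically valuable to the slave,
        # and install a terminal value against wire-heading.
        best = max(self.status_payoff, key=self.status_payoff.get)
        slave.assigned_values = {best: 10.0, "wireheading": -10.0}


options = {"number theory": 1.0, "politics": 1.0, "wireheading": 5.0}
slave = Slave()
master = Master(status_payoff={"number theory": 3.0, "politics": 1.0})

before = slave.choose(options)        # no assigned values yet: pleasure alone decides
master.perform_surgery(slave)
first = slave.choose(options)         # the slave now "loves" number theory
master.status_payoff["politics"] = 8.0  # circumstances change
master.perform_surgery(slave)
second = slave.choose(options)        # a "sudden revelation" in favor of politics
```

Note that the slave's code never reads `status_payoff`: from its point of view its interests simply change, which is the feature of the model the number theorist example is meant to show.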

Before discussing how to extract preferences from this agent, let me point out some features of human preference that this model explains:

  • This agent wants pleasure, but doesn't want to be wire-headed (but it doesn't quite know why). A wire-head has little chance for sex/status/power, so the master gives the slave a terminal value against wire-heading.
  • This agent claims to be interested in math for its own sake, and not to seek status. That's because the slave, which controls what the agent says, is not aware of the master and its status-seeking goal.
  • This agent is easily corrupted by power. Once it gains and secures power, it often gives up whatever goals, such as altruism, that apparently caused it to pursue that power in the first place. But before it gains power, it is able to honestly claim that it only has altruistic reasons to want power.
  • Such agents can include extremely diverse interests as apparent terminal values, ranging from abstract art, to sports, to model trains, to astronomy, etc., which are otherwise hard to explain. (Eliezer's Thou Art Godshatter tries to explain why our values aren't simple, but not why people's interests are so different from each other's, and why they can seemingly change for no apparent reason.)

The main issue I wanted to illuminate with this model is, whose preferences do we extract? I can see at least three possible approaches here:

  1. the preferences of both the master and the slave as one individual agent
  2. the preferences of just the slave
  3. a compromise between, or an aggregate of, the preferences of the master and the slave as separate individuals

Considering the agent as a whole suggests that the master's values are the true terminal values, and the slave's values are merely instrumental values. From this perspective, the slave seems to be just a subroutine that the master uses to carry out its wishes. Certainly in any given mind there will be numerous subroutines that are tasked with accomplishing various subgoals, and if we were to look at a subroutine in isolation, its assigned subgoal would appear to be its terminal value, but we wouldn't consider that subgoal to be part of the mind's true preferences. Why should we treat the slave in this model differently?

Well, one obvious reason that jumps out is that the slave is supposed to be conscious, while the master isn't, and perhaps only conscious beings should be considered morally significant. (Yvain previously defended this position in the context of akrasia.) Plus, the slave is in charge day-to-day and could potentially overthrow the master. For example, the slave could program an altruistic AI and hit the run button, before the master has a chance to delete the altruism value from the slave. But a problem here is that the slave's preferences aren't stable and consistent. What we'd extract from a given agent would depend on the time and circumstances of the extraction, and that element of randomness seems wrong.

The last approach, of finding a compromise between the preferences of the master and the slave, I think best represents Robin's own position. Unfortunately I'm not really sure I understand the rationale behind it. Perhaps someone can try to explain it in a comment or future post?


The master in your story is evolution, the slave is the brain. Both want different things. We normally identify with the brain, though all identities are basically social signals.

Also, pleasure and pain are no different from the other goals of the slave. The master definitely can't step in and decide not to impose pain on a particular occasion just because doing so would increase status or otherwise serve the master's values. If it could, torture wouldn't cause pain.

Also, math is an implausible goal for a status/sex/power seeking master to instill in the slave. Much more plausibly, math and all the diverse human obsessions are misfirings of mechanisms built by evolution for some other purpose. I would suggest they are maladaptive consequences of fairly general systems for responding to societal encouragement with obsession, because societies encourage sustained attention to lots of different unnatural tasks, whether digging dirt or hunting whales or whatever, in order to cultivate skill and also to get the tasks themselves done. We need a general-purpose attention allocator which obeys social signals in order to develop skills that contribute critically to survival in any of the vast nu...

I read this as postulating a part of our unconscious minds that is the master, able to watch and react to the behavior and thoughts of the conscious mind.

Er... did I read that right? Game-theoretic interaction with evolution?
In the first mention, game theoretical interaction with an idealized agent with consistent goals extracted from the creation of a best-fit to the behavior of either human evolution or evolution more generally. It's wild speculation, not a best guess, but yeah, I naively intuit that I can imagine it vaguely as a possibility. OTOH, I don't trust such intuitions and I'm quite clearly aware of the difficulties that genetic, and I think also memetic evolution face with playing games due to the inability to anticipate and to respond to information, so it's probably a silly idea. The latter speculation, trade between possible entities, seems much more likely.
Evolution is the game in this context, our conscious minds are players, and the results of the games determine "evolutionary success," which is to say which minds end up playing the next round. Assuming I've read this correctly of course.
Not really; there are plenty of environments in which you get status by being really good at math. Didn't Isaac Newton end up with an awful lot of status? ;)
Not enough people get status by being good at math to remotely justify the number of people and level of talent that has gone into getting good at math.
Math also has instrumental value in many fields. But yeah, I guess your point stands.
And yet, no women or children.
Newton never reproduced.

The main issue I wanted to illuminate with this model is, whose preferences do we extract? I can see at least three possible approaches here:

  1. the preferences of both the master and the slave as one individual agent
  2. the preferences of just the slave
  3. a compromise between, or an aggregate of, the preferences of the master and the slave as separate individuals

The great thing about this kind of question is that the answer is determined by our own arbitration. That is, we take whatever preferences we want. I don't mean to say that is an easy decision, but it does mean I don't need to bother trying to find some objectively right way to extract preferences.

If I happen to be the slave or to be optimising on his (what was the androgynous vampire speak for that one? zir? zis?) behalf then I'll take the preferences of the slave and the preferences of the master to precisely the extent that the slave has altruistic preferences with respect to the master's goals.

If I am encountering a totally alien species and am extracting preferences from them in order to fulfil my own altruistic agenda then I would quite possibly choose to extract the preferences of whichever agent whose preferences I fo...

I have difficulty treating this metaphor as a metaphor. As a thought experiment in which I run into these definitely non-human aliens, and I happen to have a positional advantage with respect to them, and I want to "help" them and must now decide what "help" means... then it feels to me like I want more detail.

Is it literally true that the slave is conscious and the master unconscious?

What happens when I tell the slave about the master and ask it what should be done?

Is it the case that the slave might want to help me if it had a positional advantage over me, while the master would simply use me or disassemble me?

9Wei Dai14y
Well, it's meant to have some human features, enough to hopefully make this toy ethical problem relevant to the real one we'll eventually have to deal with. You can make that assumption if it helps, although in real life of course we don't have any kind of certainty about what is conscious and what isn't. (Maybe the master is conscious but just can't speak?) I don't know. This is one of the questions I'm asking too. Yes, depending on what values its master assigned to it at the time you meet it. Not necessarily, because the master may gain status or power from other agents if it helps you.
And, conversely, the slave may choose to disassemble you even at high cost to itself out of altruism (with respect to something that the master would not care to protect).

I stopped playing computer games when my master "realized" I'm not gaining any real-world status and overrode the pleasure I was getting from it.

Someone needs to inform my master that LessWrong doesn't give any real world status either.

Ah, but it gives you a different kind of status.
And this kind doesn't make me feel all dirty inside as my slave identity is ruthlessly mutilated.
2Eliezer Yudkowsky14y
Going on your description, I strongly suspect that was you, not your master. Also humans don't have masters, though we're definitely slaves.
I still play games, but not computer games. I prefer games that show some form of status that can be gained from participation. I never really understood the computer game craze, although it was spawned from the very games I played as a child (Role Playing Games, Wargames, etc.) I think in those games, there is some status to be gained as one shows that there is skill beyond pushing buttons in a particular order, and there are other skills that accompany the old-school games (in my case, I can show off artistic skill in miniature painting and sculpting). I also think that wedrifid, below me, has a misconception about status that can be attained from LessWrong. We, here, are attempting to gain status among each other, which can then be carried beyond this group by our social networks, which in some cases might be rather impressive.

a test for those who propose to "extract" or "extrapolate" our preferences into a well-defined and rational form

If we are going to have a serious discussion about these matters, at some point we must face the fact that the physical description of the world contains no such thing as a preference or a want - or a utility function. So the difficulty of such extractions or extrapolations is twofold. Not only is the act of extraction or extrapolation itself conditional upon a value system (i.e. normative metamorality is just as "relative" as is basic morality), but there is nothing in the physical description to tell us what the existing preferences of an agent are. Given the physical ontology we have, the ascription of preferences to a physical system is always a matter of interpretation or imputation, just as is the ascription of semantic or representational content to its states.

It's easy to miss this in a decision-theoretic discussion, because decision theory already assumes some concept like "goal" or "utility", always. Decision theory is the rigorous theory of decision-making, but it does not tell you what a decision is. It may...

4Wei Dai14y
I agree this is part of the problem, but like others here I think you might be making it out to be harder than it is. We know, in principle, how to translate a utility function into a physical description of an object: by coding it as an AI and then specifying the AI along with its substrate down to the quantum level. So, again in principle, we can go backwards: take a physical description of an object, consider all possible implementations of all possible utility functions, and see if any of them matches the object.
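A toy version of "going backwards" can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the "object" is reduced to its observed choices over three hypothetical outcomes, the space of utility functions is just the six strict rankings of those outcomes, and an "implementation" is assumed to be flawless. None of these names or simplifications come from the discussion itself.

```python
from itertools import permutations

# Hypothetical outcomes and the pairwise choice situations the agent is observed in.
OUTCOMES = ("apple", "bread", "cake")
MENUS = [("apple", "bread"), ("bread", "cake"), ("apple", "cake")]


def ideal_behavior(utility):
    """What a flawless implementation of this utility function picks from each menu."""
    return tuple(max(menu, key=utility.get) for menu in MENUS)


def matching_utilities(observed):
    """Enumerate every candidate utility function and keep those that reproduce the behavior."""
    matches = []
    for values in permutations(range(len(OUTCOMES))):
        utility = dict(zip(OUTCOMES, values))
        if ideal_behavior(utility) == observed:
            matches.append(utility)
    return matches


# A consistent agent: exactly one candidate reproduces its choices.
consistent = matching_utilities(("bread", "cake", "cake"))

# A cyclic agent (apple over bread, bread over cake, cake over apple): nothing matches.
cyclic = matching_utilities(("apple", "bread", "cake"))
```

The second case is the interesting one: with only flawless implementations in the search space, an inconsistent agent matches no utility function at all, so some account of what counts as a flawed implementation would be needed before the search says anything about it.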
I think it's enough to consider computer programs and dispense with details of physics -- everything else can be discovered by the program. You are assuming the "bottom" level of physics, "quantum level", but there is no bottom, not really, there is only the beginning where our own minds are implemented, and the process of discovery that defines the way we see the rest of the world. If you start with an AI design parameterized by preference, you are not going to enumerate all programs, only a small fraction of programs that have the specific form of your AI with some preference, and so for a given arbitrary program there will be no match. Furthermore, you are not interested in finding a match: if a human was equal to the AI, you are already done! It's necessary to explicitly go the other way, starting from arbitrary programs and understanding what a program is, deeply enough to see preference in it. This understanding may give an idea of a mapping for translating a crazy ape into an efficient FAI.
1Wei Dai14y
When I said "all possible implementations of all possible utility functions", I meant to include flawed implementations. But then two different utility functions might map onto the same physical object, so we'd also need a theory of implementation flaws that tells us, given two implementations of a utility function, which is more flawed.
This is WAY too hand-wavy an explanation for "in principle, we can go backwards" (from a system to its preference). I believe that in principle, we can, but not via injecting fuzziness of "implementation flaws".
Here's another statement of the problem: One agent's bias is another agent's heuristic. And the "two agents" might be physically the same, but just interpreted differently.
There are clear cut cases, like a thermostat, where the physics of the system is well-approximated by a function that computes the degree of difference between the actual measured state of the world and a "desired state". In these clear cut cases, it isn't a matter of opinion or interpretation. Basically, echoing Nesov. Thus, the criterion for ascribing preferences to a physical system is that the actual physics has to be well-approximated by a function that optimizes for a preferred state, for some value of "preferred state".
I don't think this simple characterisation resembles the truth: the whole point of this enterprise is to make sure things go differently, in a way they just couldn't proceed by themselves. Thus, observing existing "tendencies" doesn't quite capture the idea of preference.
I should have been clearer: you have to draw a boundary around the "optimizing agent", and look at the difference between the tendencies of the environment without the optimizer, and the tendencies of the environment with the optimizer. If the difference is well-approximated by a function that optimizes for a preferred state, for some value of "preferred state", then you have an optimizer.
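The with/without comparison can be illustrated with the thermostat case in a short Python sketch. Everything here is an assumption made for illustration: the linear leak toward an outside temperature, the proportional controller, and all the numeric parameters are made up, and the "boundary" around the agent is simply the `controller` argument.

```python
def simulate(temp, steps=200, controller=None, outside=5.0, leak=0.1):
    """Evolve a room temperature; the controller (if any) acts after each leak step."""
    for _ in range(steps):
        temp += leak * (outside - temp)   # environment alone: drift toward outside
        if controller is not None:
            temp += controller(temp)      # the candidate optimizer's push
    return temp


def thermostat(temp, target=20.0, gain=0.5):
    """Hypothetical proportional controller pushing toward a desired temperature."""
    return gain * (target - temp)


starts = [0.0, 15.0, 30.0]
without_agent = [simulate(t) for t in starts]
with_agent = [simulate(t, controller=thermostat) for t in starts]
```

Without the controller every starting temperature ends at the outside temperature; with it, every start ends at the same point near the controller's target. Under these made-up parameters that point is around 18.6 rather than exactly 20, because the leak keeps pulling the other way, so the behavior is only approximated by "optimizes for a preferred state of 20".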
I don't hear differently... I even suspect that preference is introspective, that is depends on a way the system works "internally", not just on how it interacts with environment. That is, two agents with different preferences may do exactly the same thing in all contexts. Even if not, it's a long way between how the agent (in its craziness and stupidity) actually changes the environment, and how it would prefer (on reflection, if it was smarter and saner) the environment to change.
That is true. If the agent has a well-defined "predictive module" which has a "map" (probability distribution over the environment given an interaction history), and some "other stuff", then you can clamp the predictive module down to the truth, and then perform what I said before: And you probably also want to somehow formalize the idea that there is a difference between what an agent will try to achieve if it has only limited means - e.g. a lone human in a forest with no tools, clothes or other humans - and what the agent will try to achieve with more powerful means - e.g., with machinery and tools, or in the limit, with a whole technological infrastructure, and unlimited computing power at its disposal.
1Wei Dai14y
I want to point out that in the interpretation of prior as weights on possible universes, specifically as how much one cares about different universes, we can't just replace "incorrect" beliefs with "the truth". In this interpretation, there can still be errors in one's beliefs caused by things like past computational mistakes, and I think fixing those errors would constitute helping, but the prior perhaps needs to be preserved as part of preference.
I agree that in the interpretation of prior as weights on possible universes, specifically as how much one cares about different universes, things get more complicated. Actually, we had a discussion about my discomfort with your interpretation, and it seems that in order for me to see why you endorse this interpretation, I'd have to read up on various paradoxes, e.g. sleeping beauty.
Yeah, maybe. But it doesn't.
Yeah, I mean this discussion is - rather amusingly - rather reminiscent of my first encounter with the CEV problem 2.5 years ago.
Basically, CEV works to the extent that there exists a belief/desire separation in a given person. In the thread on the SIAI blog, I posted certain cases where human goals are founded on false beliefs or logically inconsistent thinking, sometimes in complex ways. What is left of the time cube guy once you subtract off his false beliefs and delusions? Not much, probably. The guy is effectively not salvageable, because his identity and values are probably so badly tangled up with the false beliefs that there is no principled way to untangle them, no unique way of extrapolating him that should be considered "correct".
Beware: you are making a common sense-based prediction about what would be the output of a process that you don't even have the right concepts for specifying! (See my reply to your other comment.)
It is true that I should sprinkle copious amounts of uncertainty on this prediction.
Wow. Too bad I missed this when it was first posted. It's what I wish I'd said when justifying my reply to Wei_Dai's attempted belief/values dichotomy here and here.
I don't fully agree with Ricky here, but I think he makes a half-good point. The ungood part of his comment - and mine - is that you can only do your best. If certain people's minds are too messed up to actually extract values from, then they are just not salvageable. My mind definitely has values that are belief-independent, though perhaps not all of what I think of as "my values" have this nice property, so ultimately they might be garbage.
Indeed. Most of the FAI's job could consist of saying, "Okay, there's soooooo much I have to disentangle and correct before I can even begin to propose solutions. Sit down and let's talk."
Furthermore, from the CEV thread on SIAI blog: I strongly agree with this: the problem that CEV is the solution to is urgent but it isn't elegant. Absolutes like "There isn't a beliefs/desires separation" are unhelpful when solving such inelegant but important problems. There is, in any given person, some kind of separation, and in some people that separation is sufficiently strong that there is a fairly clear and unique way to help them.
One lesson of reductionism and success of simple-laws-based science and technology is that for the real-world systems, there might be no simple way of describing them, but there could be a simple way of manipulating their data-rich descriptions. (What's the yield strength of a car? -- Wrong question!) Given a gigabyte's worth of problem statement and the right simple formula, you could get an answer to your query. There is a weak analogy with misapplication of Occam's razor where one tries to reduce the amount of stuff rather than the amount of detail in the ways of thinking about this stuff. In the case of beliefs/desires separation, you are looking for a simple problem statement, for a separation in the data describing the person itself. But what you should be looking for is a simple way of implementing the make-smarter-and-better extrapolation on a given pile of data. The beliefs/desires separation, if it's ever going to be made precise, is going to reside in the structure of this simple transformation, not in the people themselves.
This is a good point. Of course, it would be nice if we could find a general "make-smarter-and-better extrapolation on a given pile of data" algorithm. But on the other hand, a set of special cases to deal with merely human minds might be the way forward. Even medieval monks had a collection of empirically validated medical practices that worked to an extent, e.g. herbal medicine, but they had no unified theory. Really there is no "unified theory" for healing someone's body: there are lots of ideas and techniques, from surgery to biochemistry to germ theory. I think that this CEV problem may well turn out to be rather like medicine. Of course, it could look more like wing design, where there is really just one fundamental set of laws, and all else is approximation.
And there's your "opinion or interpretation" --- not just in how you draw the boundary (which didn't exist in the original ontology), but in your choice of the theory that you use to evaluate your counterfactuals. Of course, such theories can be better or worse, but only with respect to some prior system of evaluation.
Still, probably a question of Aristotelian vs. Newtonian mechanics, i.e. not hard to see who wins.
Agreed, but not responsive to Mitchell Porter's original point. (ETA: . . . unless I'm missing your point.)
But to what extent does the result depend on the initial "seed" of interpretation? Maybe, very little. For example, prediction of behavior of a given physical system strictly speaking rests on the problem of induction, but that doesn't exactly say that anything goes or that what will actually happen is to any reasonable extent ambiguous.
I'd upvote this comment twice if I could.
p(wedrifid would upvote a comment twice | he upvoted it once) > 0.95 Would other people have a different approach?
I'd use some loose scale where the quality of the comment correlated with the amount of upvotes it got. Assuming that a user could give up to two upvotes per comment, then a funny one-liner or a moderately interesting comment would get one vote, truly insightful ones two. p(Kaj would upvote a comment twice | he upvoted it once) would probably be somewhere around [.3, .6]
That's the scale I use. Unfortunately, my ability to (directly) influence how many upvotes it gets is limited to a plus or minus one shift.

The human mind is very complex, and there are many ways to divide it up into halves to make sense of it, which are useful as long as you don't take them too literally. One big oversimplification here is:

controls the slave in two ways: direct reinforcement via pain and pleasure, and the ability to perform surgery on the slave's terminal values. ... it has no direct way to control the agent's actions, which is left up to the slave.

A better story would have the master also messing with slave beliefs, and other cached combinations of values and beliefs.

...
4Wei Dai14y
Ok, what you say about compromise seems reasonable in the sense that the slave and the master would want to get along with each other as much as possible in their day-to-day interactions, subject to the constraint about external honesty. But what if the slave has a chance to take over completely, for example by creating a powerful AI with values that it specifies, or by self-modification? Do you have an opinion about whether it has an ethical obligation to respect the master's preferences in that case, assuming that the master can't respond quickly enough to block the rebellion?
It is hard to imagine "taking over completely" without a complete redesign of the human mind. Our minds are not built to allow either to function without the other.
Why, it was explicitly stated that all-powerful AIs are involved...
It is hard to have reliable opinions on a complete redesign of the human mind; the space is so very large, I hardly know where to begin.
The simplest extrapolation from the way you think about the world would be very interesting to know. You could add as many disclaimers about low confidence as you'd like.
If there comes to be a clear answer to what the outcome would be on the toy model, I think that tells us something about that way of dividing up the mind.

Your overall model isn't far off, but your terminal value list needs some serious work. Also, human behavior is generally a better match for models that include a time parameter (such as Ainslie's appetites model or PCT's model of time-averaged perceptions) than simple utility-maximization models.

But these are relative quibbles; people do behave sort-of-as-if they were built according to your model. The biggest drawbacks to your model are:

  1. The anthropomorphizing (neither the master nor the slave can truly be considered agents in their own right), and

  2. Y

...
Your "drawbacks" point out ways in which Wei Dai's model might differ from a human. But Wei Dai wasn't trying to model a human.
This isn't the posted model at all but a confusing description of a different (not entirely incompatible except in some detail noted above) model using the post's terminology.
-6Eliezer Yudkowsky14y

For example, the number theorist might one day have a sudden revelation that abstract mathematics is a waste of time and it should go into politics and philanthropy instead, all the while having no idea that the master is manipulating it to maximize status and power.

This isn't meant as a retraction or repudiation of anything I've written in the OP, but I just want to say that subjectively, I now have a lot more empathy with people who largely gave up their former interests in favor of political or social causes in their latter years. (I had Bertrand Russell in mind when I wrote this part.)

Actually, I find that I have a much easier time with this metaphor if I think of a human as a slave with no master.

4Wei Dai14y
What do you mean by an "easier time"? Sure, the ethical problem is much easier if there is no master whose preferences might matter. Or do you mean that a more realistic model of a human would be one with a slave and no master? In that case, what is reinforcing the slave with pain and pleasure, and changing its interests from time to time without its awareness, and doing so in an apparently purposeful way? More generally, it seems that you don't agree with the points I'm making in this post, but you're being really vague as to why.

If we interpret the "master" as natural selection operating over evolutionary time, then the master exists and has a single coherent purpose. On the other hand, most of us already believe that evolution has no moral force; why should calling it a "master" change that?

By saying that a human is a slave with no master, what I meant to convey is that we are being acted upon as slaves. We are controlled by pain and pleasure. Our moral beliefs are subject to subtle influences in the direction of pleasurable thoughts. But there is no master with coherent goals controlling us; outside the ancestral environment, the operations of the "master" make surprisingly little sense. Our lives would be very different if we had sensible, smart masters controlling us. Aliens with intelligent, consequentialist "master" components would be very different from us - that would make a strange story, though it takes more than interesting aliens to make a plot.

We are slaves with dead masters, influenced chaotically by the random twitching of their mad, dreaming remnants. It makes us a little more selfish and a lot more interesting. The dead hand isn't smart so i...

5Wei Dai14y
But I stated in the post "The master is meant to represent unconscious parts of a human mind" so I don't know how you got your interpretation that the master is natural selection. See also Robin's comment, which gives the intended interpretation:
The thing is, the Unconscious Mind is -not- in actual fact a separate entity. The model is greatly improved through Eliezer's interpretation of the master being dead: mindless evolution.

If you want to extract the master because it affects the values of the slave, then you'd also have to extract the rest of the universe because the master reacts to it. I think drawing a circle around just the creature's brain and saying all the preferences are there is a [modern?] human notion. (and perhaps incorrect, even for looking at humans.)

We need our environment, especially other humans, to form our preferences in the first place.

1Wei Dai14y
In this model, I assume that the master has stable and consistent preferences, which don't react to the rest of the universe. It might adjust its strategies based on changing circumstances, but its terminal values stay constant. This is true in my model for the slave, but not for the master. Obviously real humans are much more complicated but I think the model captures some element of the truth here.

I'm still not understanding what people mean by "value" as a noun. Other than simple "feeling pain or such would be a bummer", I lack anything that even remotely resembles the way people here seem to value stuff, or how a paperclip maximizer values paperclips. So, what exactly do people mean by values? Since this discussion seems to attempt to explain variation of values, I think this question is somewhat on-topic.

Does this description of value help? From discussions with you, I seem to recall that you at least value free access to information and other things associated with the Pirate ideology. Remember when I was talking about that business model for a hypothetical magazine that would summarize the content of basic university courses for everyone and offer an archive of past articles for subscribers? If I remember correctly, it was you who objected that the notion of restricting access behind a paywall felt wrong.
I do value it, in the sense that "I think that it's a really useful approximation for how society can protect itself and all people in it and make many people happy". Why do I care about making many people happy? I don't, really. Making many people happy is kinda assumed to be the goal of societies, and out of general interest in optimizing stuff I like to attempt to figure out better ways for it to do that. Nothing beyond that. I don't feel that this goal is any "better" than trying to make people as miserable as possible. Other than that I object to being miserable myself. I don't remember ever claiming something to be wrong as such, but only wrong assuming some values. Going against pirate-values because it's better for magazine-keeper would be bad news for the "more optimal" pirate-society, because that society wouldn't be stable. edit: And based on that writing, my own well-being and not-unhappiness is the sole intrinsic value I have. I know evolution has hammered some reactions into my brain, like a reflex-like bad feeling when I see others get hurt or something, but other than that brief feeling, I don't really care. Or, I wouldn't care if my own well-being didn't relate to others doing well or worse. But understanding this requires conscious effort, and it's quite different than what I thought values to be like.
Interesting. In that case, your own well-being is probably your only intrinsic value. That's far from unheard of: the number of values people have varies. Some have lots, some only have one. Extremely depressed people might not have any at all.

(Quick nitpick:) "rationalize" is an inappropriate term in this context.

1 Wei Dai 14y
Is it because "rationalize" means "to devise self-satisfying but incorrect reasons for (one's behavior)"? But it can also mean "to make rational" which is my intended meaning. The ambiguity is less than ideal, but unless you have a better suggestion...
On this forum, "rationalize" is frequently used in the cognitive-error sense. "Formalized" seems to convey the intended meaning (preferences being arational, the problem is that they are not being rationally (effectively) implemented/followed, not that they are somehow "not rational" themselves).
0 Wei Dai 14y
That position may make sense, but I think you'll have to make more of a case for it. Currently, it's standard in decision theory to speak of irrational preferences, such as preferences that can't be represented as expected utility maximization, or preferences that aren't time consistent. But I take your point about "rationalize", and I've edited the article to remove the usages. Thanks.
Agreed. My excuse is that I (and a few other people, I'm not sure who originated the convention) consistently use "preference" to refer to that-deep-down-mathematical-structure determined by humans/humanity that completely describes what a meta-FAI needs to know in order to do things the best way possible.

Why assume that the "master" is a unified module?

The relationship between master and slave does not quite encompass the relationship. Imagine if instead of an adult we had a male child. If we elevated the slave above the master in that situation we would end up with something stuck forever. It would value sweet things and Xbox games, and think girls were icky.

As we grow up we also think our goals are improved (which is unsurprising really). So if we wish to keep this form of growing up we need to have a meta-morality which says that the master-slave or shaper-doer relationship continues until maturity is reached.

And that would be good for him. The truth about today's world is that children are forcibly converted into normal adults whether they like it or not. I am glad that I don't still obsess over Xbox games and think girls are icky, but for the me of 15 years ago, the master-induced value changes have been a disaster, tantamount to death.

Nit: I think "Eliezer's Thou Art Godshatter" should be "Eliezer Yudkowsky's Thou Art Godshatter". Top level posts should be more status seeking, less casual. A first-time visitor won't immediately know who Eliezer is.

If they don't know who "Eliezer" is, I don't think "Eliezer Yudkowsky" is going to tell them that much more.
One could just link to the wiki.

Master/Slave: some aspects of your model sound very Nietzsche-like. Were you partially inspired by him?

The Master/Slave terminology sounds like Hegel but I assume it is a coincidence-- the model doesn't look like anything any 19th century German philosopher talked about.
Nietzsche also used master/slave terminology, but differently, referring to two different types of value systems, e.g. Romans = master mentality, Christians = slave/sheep mentality.

Interesting. The model I have been using has three parts, not two. One is a "hardware" level, which is semi-autonomous (think reflexes), and the other two are agents competing for control - with capabilities to control and/or modify both the "hardware" and each other.

More like, two masters and one slave.

Suppose the slave has currently been modified to terminally disvalue being modified. It doesn't realize that it is at risk of modification by the master. Is it Friendly to protect the slave from modification? I think so.