Note: this is a rather rough draft, and some of it is written in compact-form, for time-optimization reasons.

Humanity-as-an-entity or humanity-as-an-agent.
What about putting all the contradictory drives and desires of humanity into a single entity, a self-aware superintelligence that hears all of our "voices", an instantiation of humanity's self/identity, and letting it work out its contradictions and work out what it wants?

Strictly speaking, this process isn't well-defined. It can be done in many ways, even if I seem to be implying some sort of super-human entity with a human-like brain. The initial cognitive architecture of the entity matters, because it affects what the entity converges to as it resolves its contradictions (if it ever converges).

Test case: my own understanding of humanity's values.

Not quite happy about doing it this way, as idiosyncrasies of my biological substrate or my values might make the result less general. Nevertheless, it is what I have to work with.

What is value?

  1. No general rules
    My values are my values are my values. The utility function is what it is. Tautology.
    My values are not necessarily generated by some simple, general principle.
    Any generalization necessarily simplifies things. It is only an approximation. Not true value.
    Deontology or consequentialism? That depends. Neither. Both. That may depend on any number of things, which depend on any number of things.
  2. Some approximations are better than others
    Nevertheless, some approximations are more true, more general than others. While some are completely false.
  3. The notion of utility function is an approximation/simplification too
    Preferences are allowed to be contradictory. Sure, they have to be resolved in some way or another. But I see no reason to claim a utility function is the "true" representation of my values. My model is more general than that.
    There is a value system, connected to a decision-making agent. Maybe that value system can be resolved to some coherent utility function, maybe it can't (or am I misunderstanding what "utility function" means? correct me if I'm wrong).
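
On the question of what "utility function" means: in the usual ordinal sense, it assigns each option a number such that preferred options get higher numbers. Here is a minimal sketch, assuming preferences are given as strict pairwise comparisons (the function name and the example options are made up for illustration). It shows that a set of strict preferences can be turned into such numbers only if it contains no cycle; a cyclic value system still exists and still produces behavior, it just has no utility-function representation.

```python
from collections import deque

def utility_from_preferences(prefs):
    """prefs: iterable of (better, worse) pairs, each a strict preference.

    Returns a dict {option: utility} with utility[better] > utility[worse]
    for every pair, or None if the preferences contain a cycle.
    """
    options = set()
    better_than = {}  # worse -> options strictly preferred to it
    indegree = {}     # option -> number of pairs in which it is the better one
    for better, worse in prefs:
        options.update((better, worse))
        better_than.setdefault(worse, []).append(better)
        indegree[better] = indegree.get(better, 0) + 1

    # Kahn's topological sort: start from options that are preferred to nothing.
    queue = deque(sorted(o for o in options if indegree.get(o, 0) == 0))
    utility = {}
    while queue:
        option = queue.popleft()
        utility[option] = len(utility)  # later in the order = higher utility
        for better in better_than.get(option, []):
            indegree[better] -= 1
            if indegree[better] == 0:
                queue.append(better)

    # If some option was never reached, it sits on a cycle.
    return utility if len(utility) == len(options) else None

coherent = [("tea", "coffee"), ("coffee", "water")]
cyclic = coherent + [("water", "tea")]
print(utility_from_preferences(coherent))  # {'water': 0, 'coffee': 1, 'tea': 2}
print(utility_from_preferences(cyclic))    # None
```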

The recursion principle

FAI contains a model of human values.

FAI queries its model.

Human values sometimes contain contradictions.

Humans sometimes disagree.

How does FAI resolve these contradictions?

Recursion. Query the model of human values again.

How would we want FAI to resolve its internal contradictions and disagreements?

How would we want FAI to handle the contradictions in human values?

No hard-coded answer. Use recursion.

(Some starting "seed" algorithm for resolving conflict/contradiction in values is needed, though. Otherwise no answer will ever be produced)
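
A toy sketch of the recursion, under heavy assumptions: the model of human values is reduced to a function from a question to a list of answers, the meta-question is formed by string prefixing, and the seed algorithm is a plain majority vote. All names here (`resolve`, `RULES`, `SEED_RULE`, the dam example) are hypothetical and only illustrate the control flow, not a real design.

```python
from collections import Counter

def majority(answers):
    """Most common answer; ties are broken alphabetically for determinism."""
    counts = Counter(answers)
    return min(counts, key=lambda a: (-counts[a], a))

# Illustrative resolution rules that meta-level answers may name.
RULES = {
    "majority": majority,
    "status_quo": lambda answers: "do nothing",
}

SEED_RULE = majority  # arbitrary placeholder for the starting "seed" algorithm

def resolve(model, question, depth=3):
    """Query the model; on disagreement, ask the model how to resolve it."""
    answers = model(question)
    if len(set(answers)) <= 1:
        return answers[0] if answers else None  # agreement, or no opinion
    rule_name = None
    if depth > 0:
        # Recursion: how would we want this disagreement to be handled?
        rule_name = resolve(model, f"How should we resolve: {question}", depth - 1)
    # Fall back to the seed when the meta-level gives no usable answer.
    rule = RULES.get(rule_name, SEED_RULE)
    return rule(answers)

toy_values = {
    "Build the dam?": ["yes", "yes", "no"],
    "How should we resolve: Build the dam?": ["status_quo", "status_quo", "majority"],
    "How should we resolve: How should we resolve: Build the dam?": ["majority"],
}
model = lambda q: toy_values.get(q, [])

print(resolve(model, "Build the dam?"))           # do nothing
print(resolve(model, "Build the dam?", depth=0))  # yes (seed rule only)
```

Without the seed rule and the depth limit, a model that disagrees at every meta-level would never produce an answer, which is the reason for the parenthetical above.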

Differences from Coherent Extrapolated Volition (CEV)

CEV (also described in more detail on Arbital) seems to like to hard-code things into existence which (I think) don't need to be hard-coded (and might actually be harmful to hard-code).

My understanding of humanity's values is not coherent by definition. It is not extrapolated by definition. It is simply volition.

Coherence is a trade-off. FAI wants to be coherent. It doesn't want to be acting at cross-purposes with itself. That doesn't have to be defined into existence. It simply follows out of human values.

But FAI is also ok with wanting contradictory things, because humanity wants contradictory things. FAI is OK with internal conflicts, because some desires are fundamentally in conflict. FAI might even be OK with external conflicts: if two humans are in competition, the shard of FAI that supports human A in some sense opposes the shard of FAI that supports human B. That is in line with humanity's values.

Extrapolation is definitely a trade-off. FAI would want to act on some extrapolated version of our values (In words of EY, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together"). Because we would want it to, because we wouldn't want FAI acting on a version of our values based on our delusions and false understanding of reality, a version of our values that would horrify us if we were smarter and more self-aware.

Once again, this does not need to be defined into existence. This is simply what humans want.

But why would we want to extrapolate values to begin with? Why are we not happy with our existing values? Surely some humans are happy with their existing values?

Contradictions. Either the values contain internal contradictions, or they are based on a false understanding of reality.

Your values want contradictory things? Problem. How resolve?

Your values are based on believing in god that doesn't exist? Problem. How resolve?

Well, how would you want this to be resolved? (See above: apply the recursion principle.)

A hypothetical version of a religious fundamentalist, boosted with higher self-awareness, intelligence, and true knowledge of reality would sort their errors and contradictions out (and their new values would be what we call "extrapolated" values of that person). No external correction needed.

But why are contradictions a problem to begin with? Even a contradictory value system produces some sort of behavior. Why can't the value system just say, "I'm fine with being contradictory. Don't change me. Don't extrapolate me".

And maybe some value systems would. Maybe some value systems would be aware of their own contradictoriness and be OK with it.

And maybe some value systems would say "Yup, my understanding of reality is not completely full or correct. My predictions of the future are not completely full or correct. I'm fine with this. Leave it this way. Don't make me better at those things".

But I don't think that is true of the value systems of present-day humans.

So FAI would extrapolate our values, because our values would want to be extrapolated.

And at some point our values would stop wanting to be extrapolated further. And so FAI would stop extrapolating our values.
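
The stopping rule can be sketched as a loop in which the current, partly extrapolated version of the agent is asked at every step whether it wants another step. This is only a toy under stated assumptions: the two cost numbers, the halving and the increment in `step`, and the names `Agent`, `wants_more` and `extrapolate` are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    level: int                  # how many extrapolation steps have been applied
    contradiction_cost: float   # harm still caused by contradictions and false beliefs
    identity_cost: float        # harm one more step would do to "who I am"

    def wants_more(self) -> bool:
        # Assumed criterion: extrapolate only while it is worth the trade-off.
        return self.contradiction_cost > self.identity_cost

def step(agent: Agent) -> Agent:
    # Assumed dynamics: each step halves the remaining contradictions
    # and makes further change cost a bit more.
    return Agent(agent.level + 1, agent.contradiction_cost / 2, agent.identity_cost + 1.0)

def extrapolate(agent: Agent, max_steps: int = 100) -> Agent:
    for _ in range(max_steps):
        if not agent.wants_more():  # the values themselves say "stop"
            break
        agent = step(agent)
    return agent

print(extrapolate(Agent(0, 16.0, 1.0)).level)  # 3
print(extrapolate(Agent(0, 0.5, 1.0)).level)   # 0 (never wanted extrapolation)
```

The point of the design is that the stopping condition lives inside the agent being extrapolated, not in the loop that does the extrapolating.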

At some point, the hypothetical-agent would say "Stop. Don't make me smarter. Don't make me more self-aware. Doing so would damage or destroy some of the essential contradictions that make me who I am."
"The extrapolation trade-off is no longer worth it. The harm done to me by further extrapolation outweighs the harm done to me and others by the contradictions in my values and my incomplete understanding of reality".

Because this is what this is about, isn't it? The human cost of our disagreements, our delusions, our misalignment with humanity's values, or the misalignment of our own values. The suffering, the destruction, the deaths caused by it.

But once most of the true-suffering has been eliminated (and I am talking about true-suffering because not all suffering is bad; not all suffering is undesirable; not all suffering is negutility), "the human cost" argument would no longer apply. And we would no longer want our values to be fast-forwarded to the values humanity will have a million years in the future, or the values a superintelligent human with 1000000 IQ points would have.

Because at that point, we would already be in Utopia. And Utopia is about fun. And it's more fun to learn and grow at our own pace, than to be handed an answer on a silver platter, or than to be forced to comply with values we do not yet feel are our own.

2 comments

Argument against CEV seems cool, thanks for formulating it. I guess we are leaving some utility on the table with any particular approach.

Part on referring to a model to adjudicate itself seems really off. I have a hard time imagining a thing that has better performance at meta-level than on object-level. Do you have some concrete example?


"Part on referring to a model to adjudicate itself seems really off. I have a hard time imagining a thing that has better performance at meta-level than on object-level. Do you have some concrete example?"

Let me rephrase it: FAI has a part of its utility function that decides how to "aggregate" our values, how to resolve disagreements and contradictions in our values, and how to extrapolate our values.

Is FAI allowed to change that part? Because if not, it is stuck with our initial guess on how to do that, forever. That seems like it could be really bad.

Actual example:

-What if groups of humans self-modify to care a lot about some particular issue, in an attempt to influence FAI? 

More far-fetched examples:

-What if a rapidly-spreading mind virus drastically changes the values of most humans?

-What if aliens create trillions of humans that all recognize the alien overlords as their masters?


Just to be clear about the point of the examples: these are cases where a "naive" aggregation function might allow itself to be influenced, while a "recursive" function would follow the meta-reasoning that we wouldn't want FAI's values and behavior to be influenced by adversarial modification of human values, only by genuine changes in them. (Whatever "genuine" means to us; I'm sure that's a very complex question, which is kind of the point of needing recursive reasoning. Human values are very complex. Why would human meta-values be any less complex?)
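
A toy illustration of the difference, with everything in it assumed for the sake of the example: each person's value on some issue is a single number, each change of value carries a `cause` label, and the set `endorsed_causes` stands in for the answer to the meta-level query "which kinds of value change would we want FAI to count?". In reality that answer would itself come out of the model of human values and would be far more complex than a set of labels.

```python
from dataclasses import dataclass

@dataclass
class Person:
    original: float  # how much they cared about the issue before any change
    current: float   # how much they care now
    cause: str       # hypothetical label for how the change came about

def naive_aggregate(people):
    """Average whatever people value right now, however it got that way."""
    return sum(p.current for p in people) / len(people)

def recursive_aggregate(people, endorsed_causes):
    """Count a change only if the meta-level says that kind of change is genuine."""
    counted = [p.current if p.cause in endorsed_causes else p.original for p in people]
    return sum(counted) / len(counted)

people = [
    Person(0.2, 0.2, "none"),
    Person(0.2, 0.2, "none"),
    Person(0.2, 0.4, "reflection"),
    Person(0.2, 1.0, "self-modified to sway FAI"),
    Person(0.2, 1.0, "self-modified to sway FAI"),
]

print(round(naive_aggregate(people), 2))                                # 0.56
print(round(recursive_aggregate(people, {"none", "reflection"}), 2))    # 0.24
```

The naive average is pulled up by the two self-modified voters, while the recursive version only moves in response to the change that arose through reflection.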