An overall schema for the friendly AI problems: self-referential convergence criteria

A putative new idea for AI control; index here.

After working for some time on the Friendly AI problem, it's occurred to me that a lot of the issues seem related. Specifically, all the following seem to have commonalities:

Speaking very broadly, there are two features all of them share:

  1. The convergence criteria are self-referential.
  2. Errors in the setup are likely to cause false convergence.

What do I mean by that? Well, imagine you're trying to reach reflective equilibrium in your morality. You do this by using good meta-ethical rules, zooming up and down at various moral levels, making decisions on how to resolve inconsistencies, etc... But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup. In other words, the stopping point (and the convergence to the stopping point) is entirely self-referentially defined: the morality judges itself. It does not include any other moral considerations. You input your initial moral intuitions and values, and you hope this will cause the end result to be "nice", but the definition of the end result does not include your initial moral intuitions (note that some moral realists could see this process dependence as a positive - except for the fact that these processes have many convergent states, not just one or a small grouping).

So when the process goes nasty, you're pretty sure to have achieved something self-referentially stable, but not nice. Similarly, a nasty CEV will be coherent and have no desire to further extrapolate... but that's all we know about it.

The second feature is that any process has errors - computing errors, conceptual errors, errors due to the weakness of human brains, etc... If you visualise this as noise, you can see that noise in a convergent process is more likely to cause premature convergence, because if the process ever reaches a stable self-referential state, it will stay there (and if the process is a long one, then early noise will cause great divergence at the end). For instance, imagine you have to reconcile your belief in preserving human cultures with your beliefs in human individual freedom. A complex balancing act. But if, at any point along the way, you simply jettison one of the two values completely, things become much easier - and once jettisoned, the missing value is unlikely to ever come back.
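To make the noise argument concrete, here is a minimal toy sketch, not a model of any real extrapolation process. It assumes a single number `w` for the weight given to one of two values, a gentle pull back towards balance, Gaussian noise as a stand-in for errors, and absorbing states at 0 and 1 (one value jettisoned for good). All names and parameters (`run_process`, `fraction_collapsed`, `pull`, the noise levels) are hypothetical, chosen only for illustration.

```python
import random


def run_process(noise, steps=200, start=0.5, pull=0.05, rng=None):
    """Simulate one balancing process.

    Returns the final weight on value A. A result of exactly 0.0 or 1.0
    means one of the two values was dropped and never came back.
    """
    rng = rng or random.Random(0)
    w = start
    for _ in range(steps):
        # Gentle pull towards balance, plus an error term (assumed Gaussian).
        w += pull * (0.5 - w) + rng.gauss(0.0, noise)
        # Absorbing states: once a value is jettisoned, the process stays put.
        if w <= 0.0:
            return 0.0
        if w >= 1.0:
            return 1.0
    return w


def fraction_collapsed(noise, runs=500, seed=0):
    """Fraction of runs that end with one value jettisoned."""
    rng = random.Random(seed)
    collapsed = sum(run_process(noise, rng=rng) in (0.0, 1.0) for _ in range(runs))
    return collapsed / runs


if __name__ == "__main__":
    for noise in (0.0, 0.02, 0.2):
        print(noise, fraction_collapsed(noise))
```

In this toy setup the noise is symmetric, yet its effect is not: it can push the process into an absorbing state but can never pull it out again, which is the sense in which errors favour premature convergence.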

Or, more simply, the system could get hacked. When exploring a potential future world, you could become so enamoured of it that you overwrite any objections you had. It seems very easy for humans to fall into these traps - and again, once you lose something of value in your system, you don't tend to get it back.

 

Solutions

And again, very broadly speaking, there are several classes of solutions to deal with these problems:

  1. Reduce or prevent errors in the extrapolation (eg solving the agent tiling problem).
  2. Solve all or most of the problem ahead of time (eg traditional FAI approach by specifying the correct values).
  3. Make sure you don't get too far from the starting point (eg reduced impact AI, tool AI, models as definitions).
  4. Figure out the properties of a nasty convergence, and try to avoid them (eg some of the ideas I mentioned in "crude measures", general precautions that are done when defining the convergence process).

 

110 comments

As you mention, so far every attempt by humans to have a self-consistent value system (the process also known as decompartmentalization) results in less-than-desirable outcomes. What if the end goal of having a thriving long-lasting (super-)human(-like) society is self-contradictory, and there is no such thing as both "nice" and "self-referentially stable"? Maybe some effort should be put into figuring out how to live, and thrive, while managing the unstable self-reference and possibly avoid convergence altogether.

A thought I've been having lately, derived from a reinforcement learning view of values and somewhat inspired by Nate's recent post on resting in motion: value convergence seems to suggest a static endpoint, with some set of "ultimate values" we'll eventually reach and have ever after. But so far societies have never reached such a point, and if our values are an adaptation to our environment (including the society and culture we live in), then it would suggest that as long as we keep evolving and developing, our values will keep changing and evolving with us, without there being any meaningful endpoint.

There will always (given our current understanding of physics) be only a finite amount of resources available, and unless we either all merge into one enormous hivemind or get turned into paperclips, there will likely be various agents with differing preferences on what exactly to do with those resources. As the population keeps changing and evolving, the various agents will keep acquiring new kinds of values, and society will keep rearranging itself to a new compromise between all those different values. (See: the whole history of the human species so far.)

Possibly we shouldn't so much try to figure out what we'd prefer the final state to look like, but rather what we'd prefer the overall process to look like.

(The bias towards trying to figure out a convergent end-result for morality might have come from LW's historical tendency to talk and think in terms of utility functions, which implicitly assume a static and unchanging set of preferences, glossing over the fact that human preferences keep constantly changing.)

This sounds like Robin Hanson's idea of the future. Eliezer would probably agree that in theory this would happen, except that he expects one superintelligent AI to take over everything and impose its values on the entire future of everything. If Eliezer's future is definitely going to happen, then even if there is no truly ideal set of values, we would still have to make sure that the values that are going to be imposed on everything are at least somewhat acceptable.

Possibly we shouldn't so much try to figure out what we'd prefer the final state to look like, but rather what we'd prefer the overall process to look like.

Well, the general Good Idea in that model is that events or actions shouldn't be optimized to drift faster or more discontinuously than people's valuations of those events, so that the society existing at any given time is more-or-less getting what it wants while also evolving towards something else.

Of course, a compromise between the different "values" (scare-quotes because I don't think the moral-philosophy usage of the word points at anything real) of society's citizens is still a vast improvement on "a few people dominate everyone else and impose their own desires by force and indoctrination", which is what we still have to a great extent.

This. Values evolve, like everything else. Evolution will continue in the posthuman era.

Evolution requires selection pressure. The failures have to die out. What will provide the selection pressure in the posthuman era?

Economics. Posthumans still require mass/energy to store/compute their thoughts.

"Evolve" has (at least) two meanings. One is the Darwinian one where heritable variation and selection lead to (typically) ever-better-adapted entities. But "evolve" can also just mean "vary gradually". It could be that values aren't (or wouldn't be, in a posthuman era) subject to anything much like biological evolution; but they still might vary. (In biological terms, I suppose that would be neutral drift.)

Well, we are talking about the Darwinian meaning, aren't we? "Vary gradually", aka "drift" is not contentious at all.

I'm not sure we are talking specifically about the Darwinian meaning, actually. Well, I guess you are, given your comment above! But I don't think the rest of the discussion was so specific. Kaj_Sotala said:

if our values are an adaptation to our environment (including the society and culture we live in), then it would suggest that as long as we keep evolving and developing, our values will keep changing and evolving with us, without there being any meaningful endpoint.

which seems to me to describe a situation of gradual change in our values that doesn't need to be driven by anything much like biological evolution. (E.g., it could happen because each generation's people constantly make small more-or-less-deliberate adjustments in their values to suit the environment they find themselves in.)

(Kaj's comment does actually describe a resource-constrained situation, but the resource constraints aren't directly driving the evolution of values he describes.)

We're descending into nit-pickery. The question of whether values will change in the future is a silly one, as the answer "Yes" is obvious. The question of whether values will evolve in the Darwinian sense in the posthuman era (with its presumed lack of scarcity, etc.) is considerably more interesting.

I agree that it's more interesting. But I'm not sure it was the question actually under discussion.

The failures have to die out.

I'm not sure that's true. Imagine some glorious postbiological future in which people (or animals or ideas or whatever) can reproduce without limit. There are two competing replicators A and B, and the only difference is that A replicates slightly faster than B. After a while there will be vastly more of A around than of B, even if nothing dies. For many purposes, that might be enough.
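A quick sketch of the arithmetic behind this, under the assumption of simple compound growth with nothing ever dying. The function name `share_of_a` and the specific growth rates are made up for illustration.

```python
def share_of_a(rate_a, rate_b, generations, a0=1.0, b0=1.0):
    """Return A's share of the total population after some generations.

    Both populations only ever grow (no deaths); A just grows slightly faster.
    """
    a, b = a0, b0
    for _ in range(generations):
        a *= 1 + rate_a
        b *= 1 + rate_b
    return a / (a + b)


if __name__ == "__main__":
    for generations in (0, 100, 1000, 2000):
        print(generations, share_of_a(0.10, 0.09, generations))
```

B's absolute numbers keep rising throughout, but its share of the population shrinks towards zero, because the ratio of A to B is multiplied by the same factor greater than one every generation.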

After a while there will be vastly more of A around than of B

So, in this scenario, what evolved?

The distribution of A and B in the population.

I don't think this is an appropriate use of the word "evolution".

Why not? It's a standard one in the biological context. E.g.,

"In fact, evolution can be precisely defined as any change in the frequency of alleles within a gene pool from one generation to the next."

which according to a talk.origins FAQ is from this textbook: Helena Curtis and N. Sue Barnes, Biology, 5th ed. 1989 Worth Publishers, p.974

If there are mistakes made or the environment requires adaptation, a sufficiently flexible intelligence can mediate the selection pressure.

The end result still has to be for the failures to die or be castrated.

There is no problem with saying that values in the future will "change" or "drift", but "evolve" is more specific and I'm not sure how it will work.

I understand that. Memes can die or be castrated, too :-/

In your earlier comment you said "evolution requires selection pressure". There is of course selection pressure in memetic evolution. Completely eliminating memetic selection pressure is not even wrong - because memetic selection is closely connected to learning or knowledge creation. You can't get rid of it.

Godsdammit, people, "thrive" is the whole problem.

Yes, yes it is. Even once you can order all the central examples of thriving, the "mere addition" operation will tip them toward the noncentral repugnant ones. Hence why one might have to live with the lack of self-consistency.

You could just not be utilitarian, especially in the specific form of not maximizing a metaphysical quantity like "happy experience", thus leaving you with no moral obligations to counterfactual (ie: nonexistent) people, thus eliminating the Mere Addition Paradox.

Ok, I know that given the chemistry involved in "happy", it's not exactly a metaphysical or non-natural quantity, but it bugs me that utilitarianism says to "maximize Happy" even when, precisely as in the Mere Addition Paradox, no individual consciousness will actually experience the magnitude of Happy attained via utilitarian policies. How can a numerical measure of a subjective state of consciousness be valuable if nobody experiences the total numerical measure? It seems more sensible to restrict yourself to only moralizing about people who already exist, thus winding up closer to post-hoc consequentialism than traditional utilitarianism.

How can a numerical measure of a subjective state of consciousness be valuable if nobody experiences the total numerical measure?

The mere addition paradox also manifests for a single person. Imagine the state you are in. Now imagine if it can be (subjectively) improved by some means (e.g. fame, company, drugs, ...). Keep going. Odds are, you would not find a maximum, not even a local one. After a while, you might notice that, despite incremental improvements, the state you are in is actually inferior to the original, if you compare them directly. Mathematically, one might model this as the improvement drive being non-conservative and so no scalar map from states to scalar utility exists. Whether it is worth pushing this analogy any further, I am not sure.
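One way to read the "no scalar utility exists" remark, sketched as a toy example: if every step in a chain feels like an improvement but the chain loops back on itself, no assignment of numbers to states can make every step an increase. The state labels and the `find_utility` helper are hypothetical, and the brute-force search is only sensible for a handful of states.

```python
from itertools import permutations


def find_utility(states, improvements):
    """Look for a ranking in which every listed step goes strictly upwards.

    `improvements` is a list of (worse, better) pairs. Returns a dict mapping
    each state to a rank, or None if no consistent ranking exists.
    """
    for order in permutations(states):
        rank = {state: i for i, state in enumerate(order)}
        if all(rank[worse] < rank[better] for worse, better in improvements):
            return rank
    return None


if __name__ == "__main__":
    states = ["A", "B", "C"]
    # A chain of improvements: a consistent ranking exists.
    print(find_utility(states, [("A", "B"), ("B", "C")]))
    # Each step still feels like an improvement, but the chain loops back to A.
    print(find_utility(states, [("A", "B"), ("B", "C"), ("C", "A")]))
```

The looping case is the discrete analogue of a non-conservative field: how much "improvement" you accumulate depends on the path taken, so it cannot be the difference of a single function of the state.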

The mere addition paradox also manifests for a single person. Imagine the state you are in. Now imagine if it can be (subjectively) improved by some means (e.g. fame, company, drugs, ...). Keep going. Odds are, you would not find a maximum, not even a local one.

Hill climbing always finds a local maximum, but that might well look very disappointing, wasteful of effort, and downright stupid when compared to some smarter means of spending the effort on finding a way to live a better life.
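A minimal sketch of that point, assuming a finite one-dimensional landscape where each position has a left and a right neighbour. The `hill_climb` function and the example heights are invented for illustration.

```python
def hill_climb(heights, start):
    """Greedy hill climbing on a list of heights.

    Moves to the higher neighbour while that is a strict improvement and
    returns the index where it gets stuck (a local maximum).
    """
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(heights)]
        if not neighbours:
            return i
        best = max(neighbours, key=lambda j: heights[j])
        if heights[best] <= heights[i]:
            return i  # no neighbour is strictly better: local maximum
        i = best


if __name__ == "__main__":
    heights = [1, 3, 2, 0, 4, 9, 5]
    print(hill_climb(heights, 0), hill_climb(heights, 3))
```

Because every move is a strict improvement on a finite landscape, the climb has to stop, but where it stops depends entirely on the starting point: a start on the left gets stuck on the small hill and never sees the tall one.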

Impressive.

Couldn't another class of solutions be that resolutions of inconsistencies cannot reduce the complexity of the agent's morality? I.e. morality has to be (or tend to become) not only (more) consistent, but also (more) complex, sort of like an evolving body of law rather than like the Ten Commandments?

Actually, I have suggested something like that, I now recall... It's the line "Require them to be around the same expected complexity as human values." in Crude Measures.

That is a solution that springs to mind - once we've thought of it in these terms :-) To my knowledge, it hasn't been suggested before.

morality has to be (or tend to become) not only (more) consistent, but also (more) complex

It's not clear to me that one can usefully distinguish between "more consistent" and "less complex".

Suppose that someone felt that morality dictated one set of behaviors for people of one race, and another set of behaviors for people of another race. Eliminating that distinction to have just one set of morals that applied to everyone might be considered by some to increase consistency, while reducing complexity.

That said, it all depends on what formal definition one adopts for consistency in morality: this doesn't seem to me a well-defined concept, even though people talk about it as if it was. (Clearly it can't be the same as consistency in logic. An inconsistent logical system lets you derive any conclusion, but even if a human is inconsistent WRT some aspect of their morality, it doesn't mean they wouldn't be consistent in others. Inconsistency in morality doesn't make the whole system blow up the way logical inconsistency does.)

It seems that your research is coming around to some concepts that are at the basis of mine. Namely, that noise in an optimization process is a constraint on the process, and that the resulting constrained optimization process avoids the nasty properties you describe.

Feel free to contact me if you'd like to discuss this further.

I fear I will lack time for many months :-( Send me another message if you want to talk later.

What do I mean by that? Well, imagine you're trying to reach reflective equilibrium in your morality. You do this by using good meta-ethical rules, zooming up and down at various moral levels, making decisions on how to resolve inconsistencies, etc... But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup.

Wait... what? No.

You don't solve the value-alignment problem by trying to write down your confusions about the foundations of moral philosophy, because writing down confusion still leaves you fundamentally confused. No amount of intelligence can solve an ill-posed problem in some way other than pointing out that the problem is ill-posed.

You solve it by removing the need to do moral philosophy and instead specifying a computation that corresponds to your moral psychology and its real, actually-existing, specifiable properties.

And then telling metaphysics to take a running jump to boot, and crunching down on Strong Naturalism brand crackers, which come in neat little bullet shapes.

Near as I can tell, you're proposing some "good meta-ethical rules," though you may have skipped the difficult parts. And I think the claim, "you stop when your morality is perfectly self-consistent," was more a factual prediction than an imperative.

I didn't skip the difficult bits, because I didn't propose a full solution. I stated an approach to dissolving the problem.

And do you think that approach differs from the one you quoted?

It involves reasoning about facts rather than metaphysics.

And will that model have the right counterfactuals? Will it evolve under changing conditions the same way that the original would?

If you modelled the real thing correctly, then yes, of course it will.

Yes, of course, but then the question is: what is the difference between modelling it correctly and solving moral philosophy? A correct model has to get a bunch of counterfactuals correct, and not just match an empirical dataset.

Well, attempting to account for your grammar and figure out what you meant...

A correct model has to get a bunch of counterfactuals correct, and not just match an empirical dataset.

Yes, and? Causal modelling techniques get counterfactuals right-by-design, in the sense that a correct causal model by definition captures counterfactual behavior, as studied across controlled or intervened experiments.

I mean, I agree that most currently-in-use machine learning techniques don't bother to capture causal structure, but on the upside, that precise failure to capture and compress causal structure is why those techniques can't lead to AGI.

what is the difference between modelling it correctly and solving moral philosophy?

I think it's more accurate to say that we're trying to dissolve moral philosophy in favor of a scientific model of human evaluative cognition. Surely to a moral philosopher this will sound like a moot distinction, but the precise difference is that the latter thing creates and updates predictive models which capture counterfactual, causal knowledge, and which thus can be elaborated into an explicit theory of morality that doesn't rely on intuition or situational framing to work.

As far as I can tell, human intuition is the territory you would be modelling, here. In particular, when dealing with counterfactuals, since it would be unethical to actually set up trolley problems.

BTW, there is nothing to stop moral philosophy being predictive, etc.

As far as I can tell, human intuition is the territory you would be modelling, here.

No, we're trying to capture System 2's evaluative cognition, not System 1's fast-and-loose, bias-governed intuitions.

Wrong kind of intuition

If you have an external standard, as you do with probability theory and logic, system 2 can learn utilitarianism, and its performance can be checked against the external standard.

But we don't have an agreed standard to compare system 1 ethical reasoning against, because we haven't solved moral philosophy. What we have is system 1 coming up with speculative theories, which have to be checked against intuition, meaning an internal standard.

Again, the whole point of this task/project/thing is to come up with an explicit theory to act as an external standard for ethics. Ethical theories are maps of the evaluative-under-full-information-and-individual+social-rationality territory.

Again, the whole point of this task/project/thing is to come up with an explicit theory to act as an external standard for ethics.

And that is the whole point of moral philosophy..... so it's sounding like a moot distinction.

Ethical theories are maps of the evaluative-under-full-information-and-individual+social-rationality territory.

You don't like the word intuition, but the fact remains that while you are building your theory, you will have to check it against humans' ability to give answers without knowing how they arrived at them. Otherwise you end up with a clear, consistent theory that nobody finds persuasive.

Such a territory does not exist, therefore it's not territory.

You're going to have to explain how "thoughts and feelings that people will or would have in certain scenarios" fails to be territory.

But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup.

Or once you lose your meta-moral urge to reach a self-consistent morality. This may not be the wrong (heh) answer along a path that originally started toward reaching self-consistent morality.

Or, more simply, the system could get hacked. When exploring a potential future world, you could become so enamoured of it that you overwrite any objections you had. It seems very easy for humans to fall into these traps - and again, once you lose something of value in your system, you don't tend to get it back.

Is it a trap? If the cost of iterating the "find a more self-consistent morality" loop for the next N years is greater than the expected benefit of the next incremental change toward a more consistent morality for those same N years, then perhaps it's time to stop. Just as an example, if the universe can give us 10^20 years of computation, at some point near that 10^20 years we might as well spend all computation on directly fulfilling our morality instead of improving it. If at 10^20 - M years we discover that, hey, the universe will last another 10^50 years that tradeoff will change and it makes sense to compute even more self-consistent morality again.
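A toy version of this tradeoff, purely as a sketch under made-up assumptions: each year spent refining the morality halves the remaining shortfall in its quality, and every remaining year is spent fulfilling it at that quality. The functions `quality` and `best_refinement_years` and the halving rule are hypothetical, not anything claimed about real moral refinement.

```python
def quality(years_refining):
    """Assumed quality of the morality after some years of refinement.

    Starts at 0.5 and each further year halves the remaining shortfall.
    """
    return 1 - 0.5 ** (years_refining + 1)


def best_refinement_years(horizon):
    """Years of refinement that maximise total value over the horizon.

    Total value is (years left for fulfilment) * (quality reached).
    """
    return max(range(horizon + 1), key=lambda k: (horizon - k) * quality(k))


if __name__ == "__main__":
    for horizon in (10, 100, 1000):
        print(horizon, best_refinement_years(horizon))
```

In this sketch the best stopping point is not fixed: it moves later whenever the horizon is extended, which matches the idea that discovering extra time makes further refinement worthwhile again.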

Similarly, if we end up in a siren world it seems like it would be more useful to restart our search for moral complexity by the same criteria; it becomes worthwhile to change our morality again because the cost of continued existence in the current morality outweighs the cost of potentially improving it.

Additionally, I think that losing values is not a feature of reaching a more self-consistent morality. Removing a value from an existing moral system does not make the result consistent with the original morality; it is incompatible with reference to that value. Rather, self-consistent morality is approached by better carving reality at its joints in value space; defining existing values in terms of new values that are the best approximation to the old value in the situations where it was valued, while extending morality along the new dimensions into territory not covered by the original value. This should make it possible to escape from siren worlds by the same mechanism; entering a siren world is possible only if reality was improperly carved so that the siren world appeared to fulfill values along dimensions that it eventually did not, or that the siren world eventually contradicted some original value due to replacement values being an imperfect approximation. Once this disagreement is noticed it should be possible to more accurately carve reality and notice how the current values have become inconsistent with previous values and fix them.

Or once you lose your meta-moral urge to reach a self-consistent morality. This may not be the wrong (heh) answer along a path that originally started toward reaching self-consistent morality.

The problem is that un-self-consistent morality is unstable under general self improvement (and self-improvement is very general, see http://lesswrong.com/r/discussion/lw/mir/selfimprovement_without_selfmodification/ ).

The main problem with siren worlds is that humans are very vulnerable to certain types of seduction/trickery, and it's very possible AIs with certain structures and goals would be equally vulnerable to (different) tricks. Defining what is a legit change and what isn't is the challenge here.

The problem is that un-self-consistent morality is unstable under general self improvement

Even self-consistent morality is unstable if general self improvement allows for removal of values, even if removal is only a practical side effect of ignoring a value because it is more expensive to satisfy than other values. E.g. we (Westerners) generally no longer value honoring our ancestors (at least not many of them), even though it is a fairly independent value and roughly consistent with our other values. It is expensive to honor ancestors, and ancestors don't demand that we continue to maintain that value, so it receives less attention. We also put less value on the older definition of honor (as a thing to be defended and fought for and maintained at the expense of convenience) that earlier centuries had, despite its general consistency with other values for honesty, trustworthiness, social status, etc. I think this is probably for the same reason; it's expensive to maintain honor and most other values can be satisfied without it. In general, if U(more_satisfaction_of_value1) > U(more_satisfaction_of_value2) then maximization should tend to ignore value2 regardless of its consistency. If U(make_values_self_consistent_value) > U(satisfying_any_other_value) then the obvious solution is to drop the other values and be done.

A sort of opposite approach is "make reality consistent with these pre-existing values" which involves finding a domain in reality state space under which existing values are self-consistent, and then trying to mold reality into that domain. The risk (unless you're a negative utilitarian) is that the domain is null. Finding the largest domain consistent with all values would make life more complex and interesting, so that would probably be a safe value. If domains form disjoint sets of reality with no continuous physical transitions between them then one would have to choose one physically continuous sub-domain and stick with it forever (or figure out how to switch the entire universe from one set to another). One could also start with preexisting values and compute a possible world where the values are self-consistent, then simulate it.

It is expensive to honor ancestors, and ancestors don't demand that we continue to maintain that value, so it receives less attention.

That's something different - a human trait that makes us want to avoid expensive commitments while paying them lip service. A self consistent system would not have this trait, and would keep "honor ancestors" in it, and do so or not depending on the cost and the interaction with other moral values.

If you want to look at even self-consistent systems being unstable, I suggest looking at social situations, where other entities reward value-change. Or a no-free-lunch result of the type "This powerful being will not trade with agents having value V."

E.g. we (Westerners) generally no longer value honoring our ancestors (at least not many of them), even though it is a fairly independent value and roughly consistent with our other values. It is expensive to honor ancestors, and ancestors don't demand that we continue to maintain that value, so it receives less attention.

This sweeps the model-dependence of "values" under the rug. The reason we don't value honoring our ancestors is that we don't believe they continue to exist after death, and so we don't believe social relations of any kind can be carried on with them.

The reason we don't value honoring our ancestors is that we don't believe they continue to exist after death

This could be a case of typical mind fallacy. I can point to a number of statistical studies that show that a large number of Westerners claim that their ancestors do continue to exist after death.

Anyone who believes that some sort of heaven or hell exists.

And a lot of these people nonetheless don't accord their ancestors all that much in the way of honour...

I can point to a number of statistical studies that show that a large number of Westerners claim that their ancestors do continue to exist after death.

They may believe it, but they don't alieve it.

Because the things that people would do if they believed in and acted as though they believe in life after death are profoundly weird, and we don't see any of that around. Can you imagine the same people who say that the dead "went to a better place" being sad that someone has not died, for instance? (Unless they're suffering so much or causing so much suffering that death is preferable even without an afterlife.)

Because the things that people would do if they believed in and acted as though they believe in life after death are profoundly weird, and we don't see any of that around.

I don't see why they need to be "profoundly weird". Remember, this subthread started with "honoring ancestors". The Chinese culture is probably the most obvious one where honoring ancestors is a big thing. What "profoundly weird" things does it involve?

What "profoundly weird" things does it involve?

Given that this is the Chinese we're talking about, expecting one's ancestors to improve investment returns in return for a good sacrifice.

Sorry, I don't know enough about Chinese culture to answer. But I'd guess that either they do have weird beliefs (that I'm not familiar with so I can't name them), or they don't and honoring ancestors is an isolated thing they do as a ritual. (The answer may be different for different people, of course.)

Speaking of "profoundly weird" things, does the veneration of saints in Catholicism qualify? :-)

Insofar as anyone expects saints to perform the function of demigods and intervene causally with miracles on behalf of the person praying, yes, it is "profoundly weird" magical thinking.

Why do you ask a site full of atheists if they think religion is irrational?

"Irrational" and "weird" are quite different adjectives.

You are assuming that human beings are much more altruistic than they actually are. If your wife has the chance of leaving you and having a much better life where you will never hear from her again, you will not be sad if she does not take the chance.

Because the things that people would do if they believed in and acted as though they believe in life after death are profoundly weird

Okay, now I'm curious. What exactly do you think that people would do if they believed in life after death?

-- Be happy that people have died and sad that they remain alive (same qualifiers as before: person is not suffering so much that even nothingness is preferable, etc.) and the reverse for people who they don't like

-- Want to kill people to benefit them (certainly, we could improve a lot of third world suffering by nuking places, if they have a bad life but a good afterlife. Note that the objection "their culture would die out" would not be true if there is an afterlife.)

-- In the case of people who oppose abortions because fetuses are people (which I expect overlaps highly with belief in life after death), be in favor of abortions if the fetus gets a good afterlife

-- Be less willing to kill their enemies the worse the enemy is

-- Do extensive scientific research trying to figure out what life after death is like.

-- Genuinely think that having their child die is no worse than having their child move away to a place where the child cannot contact them

-- Drastically reduce how bad they think death is when making public policy decisions; there would be still some effect because death is separation and things that cause death also cause suffering, but we act as though causing death makes some policy uniquely bad and preventing it uniquely good

-- Not oppose suicide

Edit: Support the death penalty as more humane than life imprisonment.

(Some of these might not apply if they believe in life after death but also in Hell, but that has its own bizarre consequences.)

-- Be less willing to kill their enemies the worse the enemy is

Now might I do it pat. Now he is praying.
And now I’ll do ’t. And so he goes to heaven.
And so am I revenged.—That would be scanned.
A villain kills my father, and, for that,
I, his sole son, do this same villain send
To heaven.
Oh, this is hire and salary, not revenge.
He took my father grossly, full of bread,
With all his crimes broad blown, as flush as May.
And how his audit stands who knows save heaven?
But in our circumstance and course of thought
'Tis heavy with him. And am I then revenged
To take him in the purging of his soul
When he is fit and seasoned for his passage?
No.
Up, sword, and know thou a more horrid hent.
When he is drunk asleep, or in his rage,
Or in th' incestuous pleasure of his bed,
At game a-swearing, or about some act
That has no relish of salvation in ’t—
Then trip him, that his heels may kick at heaven,
And that his soul may be as damned and black
As hell, whereto it goes. My mother stays.
This physic but prolongs thy sickly days.

-- Hamlet, Act 3, scene 3.

In Christianity, we are as soldiers on duty who cannot desert their post. Suicide and murder are mortal sins, damning one to perdition hereafter. Christians differ on whether this is a causal connection: works -> fate, or predestined by grace: grace -> works and grace -> fate. Either way, the consequences of believing in the Christian conception of life after death add up to practicing Christian virtue in this life.

In Buddhism, you get reincarnated, but only if you have lived a virtuous life do you get a favorable rebirth. Killing, including of yourself, is one of the worst sins and guarantees you a good many aeons in the hell worlds. The consequences of believing in the Buddhist conception of life after death add up to practicing Buddhist virtue in this life.

In Islam, paradise awaits the virtuous and hell the wicked. The consequences of believing in the Islamic conception of life after death add up to practicing Islamic virtue in this life. We can observe these consequences in current affairs.

I don't think that helps. For instance, if they alieve in an afterlife but their religion says that suicide and murder are mortal sins, they won't actually commit murder or suicide, but they would still not think it was sad that someone died in the way we think it's sad, would not insist that public policies should reduce deaths, etc.

You would also expect a lot of people to start thinking of religious prohibitions on murder and suicide like many people think of religious prohibitions on homosexuality--If God really wants that, he's being a jerk and hurting people for no obvious reason. And you'd expect believers to simply rationalize away religious prohibitions on murder and suicide and say that they don't apply just like religious believers already do to lots of other religious teachings (of which I'm sure you can name your own examples).

If God really wants that, he's being a jerk and hurting people for no obvious reason.

Ask a Christian and they'll give you reasons. Ask a Jew and they'll give you reasons, except for those among the laws that are to be obeyed because God says so, despite there not being a reason known to Man. Ask a Buddhist, ask a Moslem.

There is no low-hanging fruit here, no instant knock-down arguments against any of these faiths that their educated practitioners do not know already and have answers to.

Ask ... and they'll give you reasons.

Yes, but in the real world, when a religious demand conflicts with something people really believe, it can go either way. Some people will find reasons that justify the demand. But some will find reasons that go in the other direction--instead of reasons why the religion demands some absurd thing, they'd give reasons as to why the religion's obvious demand really isn't a demand at all. Where are the people saying that the prohibition on murder is meant metaphorically, or only means "you shouldn't commit murders in this specific situation that only existed thousands of years ago"? For that matter, where are the people saying "sure, my religion says we shouldn't murder, but I have no right to impose that on nonbelievers by force of law", in the same way that they might say that about other mortal sins?

Be happy that people have died and sad that they remain alive (same qualifiers as before: person is not suffering so much that even nothingness is preferable, etc.) and the reverse for people who they don't like

Hmmm.

What is known is that people who go to the afterlife don't generally come back (or, at least, don't generally come back with their memories intact). Historical evidence strongly suggests that anyone who remains alive will eventually die... so remaining alive means you have more time to enjoy what is nice here before moving on.

So, I don't imagine this would be the case unless the afterlife is strongly known to be significantly better than here.

Want to kill people to benefit them (certainly, we could improve a lot of third world suffering by nuking places, if they have a bad life but a good afterlife. Note that the objection "their culture would die out" would not be true if there is an afterlife.)

Is it possible for people in the afterlife to have children? It may be that their culture will quickly run out of new members if they are all killed off. Again, though, this is only true if the afterlife is certain to be better than here.

In the case of people who oppose abortions because fetuses are people (which I expect overlaps highly with belief in life after death), be in favor of abortions if the fetus gets a good afterlife

Be less willing to kill their enemies the worse the enemy is

Both true if and only if the afterlife is known to be better.

Do extensive scientific research trying to figure out what life after death is like.

People have tried various experiments, like asking people who have undergone near-death experiences. However, there is very little data to work with and I know of no experiment that will actually give any sort of unambiguous result.

Genuinely think that having their child die is no worse than having their child move away to a place where the child cannot contact them

And where their child cannot contact anyone else who is still alive, either. Thrown into a strange and unfamiliar place with people who the parent knows nothing about. I can see that making parents nervous...

Drastically reduce how bad they think death is when making public policy decisions; there would be still some effect because death is separation and things that cause death also cause suffering, but we act as though causing death makes some policy uniquely bad and preventing it uniquely good

Exile is also generally considered uniquely bad; and since the dead have never been known to return, death is at the very least a form of exile that can never be revoked.

Not oppose suicide

...depends. Many people who believe in life after death also believe that suicide makes things very difficult for the victim there.

Support the death penalty as more humane than life imprisonment.

Again, this depends; if there is a Hell, then the death penalty kills a person without allowing him much of a chance to try to repent, and could therefore be seen as less humane than life imprisonment.

The worse the afterlife is, the more similar people's reactions will be to a world where there is no afterlife. In the limit, the afterlife is as bad as or worse than nonexistence and people would be as death-averse as they are now. Except that this is contrary to how people claim to think of the afterlife when they assert belief in it. The afterlife can't be good enough to be comforting and still bad enough not to lead to any of the conclusions I described. And this includes being bad for reasons such as being like exile, being irreversible, etc.

And I already said that if there is a Hell (a selectively bad afterlife), many of these won't apply, but the existence of Hell has its own problems.

The worse the afterlife is, the more similar people's reactions will be to a world where there is no afterlife.

I'd phrase it as "the scarier the afterlife is, the more similar people's reactions will be to a world where there is no afterlife." The word "scarier" is important, because something can look scary but be harmless, or even beneficial.

And people's reactions do not depend on what the afterlife is like; they depend on what people think about the afterlife.

And one of the scariest things to do is to jump into a complete unknown... even if you're pretty sure it'll be harmless, or even beneficial, jumping into a complete unknown from which there is no way back is still pretty scary...

But is jumping into a "complete unknown" which you think should be beneficial really going to get the same reaction as jumping into one that you believe to be harmful?

No, it should not.

The knowledge that there's no return would make people wary about it, but they'd be a lot more wary if they thought it would be harmful.

I can point to a number of statistical studies that show that a large number of Westerners claim that their ancestors do continue to exist after death.

No, they believe-in-the-belief that their ancestors continue to exist after death. They rarely, and doubtingly, if ever, generate the concrete expectation that anything they can do puts them in causal contact with the ghosts of their ancestors, such that they would expect to see something different from their ancestors being permanently gone.

The main problem with siren worlds

Actually, I'd argue the main problem with "Siren Worlds" is the assumption that you can "envision", or computationally simulate, an entire possible future country/planet/galaxy all at once, in detail, in such time that any features at all would jump out to a human observer.

That kind of computing power would require, well, something like the mass of a whole country/planet/galaxy and then some. Even if we generously assume a very low fidelity of simulation, comparable with mere weather simulations or even mere video games, we're still talking whole server/compute farms being turned towards nothing but the task of pretending to possess a magical crystal ball for no sensible reason.

tl;dr: human values are already quite fragile and vulnerable to human-generated siren worlds.

Simulation complexity has not stopped humans from implementing totalitarian dictatorships (based on divine right of kings, fundamentalism, communism, fascism, people's democracy, what-have-you) due to envisioning a siren world that is ultimately unrealistic.

It doesn't require detailed simulation of a physical world, it only requires sufficient simulation of human desires, biases, blind spots, etc. that can lead people to abandon previously held values because they believe the siren world values will be necessary and sufficient to achieve what the siren world shows them. It exploits a flaw in human reasoning, not a flaw in accurate physical simulation.

That's shifting the definition of "siren world" from "something which looks very nice when simulated in high-resolution but has things horrendously wrong on the inside" to a very standard "Human beings imagine things in low-resolution and don't always think them out clearly."

You don't need to pour extra Lovecraft Sauce on your existing irrationalities just for your enjoyment of Lovecraft Sauce.

It depends a lot on how the world is being shown. If the AI is your "guide", it can show you the seductive features of the world, or choose the fidelity of the simulation in just the right ways in the right places, etc... Without needing a full fledged simulation. You can have a siren world in text, just through the AI's (technically accurate) descriptions, given your questions.

You're missing my point, which is that proposing you've got "an AI" (with no dissolved understanding of how the thing actually works underneath what you'd get from a Greg Egan novel) which "simulates" possible "worlds" is already engaging in several layers of magical thinking, and you shouldn't be surprised to draw silly conclusions from magical thinking.

I think I'm not getting your point either. Isn't Stuart just assuming standard decision theory, where you choose actions by predicting their consequences and then evaluating your utility function over your predictions? Are you arguing that real AIs won't be making decisions like this?

Isn't Stuart just assuming standard decision theory, where you choose actions by predicting their consequences and then evaluating your utility function over your predictions? Are you arguing that real AIs won't be making decisions like this?

While I do think that real AIs won't make decisions in this fashion, that aside, as I had understood Stuart's article, the point was not to address decision theory, which is a mathematical subject, but instead that he hypothesized a scenario in which "the AI" was used to forecast possible future events, with humans in the loop doing the actual evaluation based on simulations realized in high detail, to the point that the future-world simulation would be as thorough as a film might be today, at which point it could appeal to people on a gut level and bypass their rational faculties, but also have a bunch of other extra-scary features above and beyond other scenarios of people being irrational, just because.

The "But also..." part is the bit I actually object to.

Let's focus on a simple version, without the metaphors. We're talking about an AI presenting humans with consequences of a particular decision, with humans then making the final decision to go along with it or not.

So what is happening is that various possible future worlds will be considered by the AI according to its desirability criteria, these worlds will be described to humans according to its description criteria, and humans will choose according to whatever criteria we use. So we have a combination of criteria that result in a final decision. A siren world is a world that ranks very high in these combined criteria but is actually nasty.
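To make the combined-criteria point concrete, here is a minimal toy sketch. It is not anyone's actual proposal: it assumes each candidate world has a hidden true value, that the description adds an independent Gaussian error on top, and that the human simply picks the most appealing description. The names `pick_most_appealing` and `mean_gap` are hypothetical. Under those assumptions, the wider the search, the more the winning world's appeal overstates its true value.

```python
import random

def pick_most_appealing(n_worlds, rng):
    """Return (appeal, true_value) of the world that scores highest on appeal."""
    best = None
    for _ in range(n_worlds):
        true_value = rng.gauss(0.0, 1.0)         # assumed: how good the world actually is
        description_error = rng.gauss(0.0, 1.0)  # assumed: how much the description flatters it
        appeal = true_value + description_error  # all the human evaluator gets to see
        if best is None or appeal > best[0]:
            best = (appeal, true_value)
    return best

def mean_gap(n_worlds, trials=200, seed=0):
    """Average of (appeal - true value) for the selected world over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        appeal, true_value = pick_most_appealing(n_worlds, rng)
        total += appeal - true_value
    return total / trials

gap_small_search = mean_gap(1)     # no selection pressure: gap averages near zero
gap_large_search = mean_gap(1000)  # heavy selection pressure: gap is clearly positive
print(gap_small_search, gap_large_search)
```

The sketch only models honest noise. An AI that actively shapes its descriptions would make the gap larger, not smaller.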

If we stick to that scenario and assume the AI is truthful, the main siren world generator is the ability of the AI to describe them in ways that sound very attractive to humans. Since human beliefs and preferences are not clearly distinct, this ranges from misleading (incorrect human beliefs) to actively seductive (influencing human preferences to favour these worlds).

The higher the bandwidth the AI has, the more chance it has of "seduction", or of exploiting known or unknown human irrationalities (again, there's often no clear distinction between exploiting irrationalities for beliefs or preferences).

One scenario - Paul Christiano's - is a bit different but has essentially unlimited bandwidth (or, more precisely, has an AI estimating the result of a setup that has essentially unlimited bandwidth).

but also have a bunch of other extra-scary features above and beyond other scenarios of people being irrational, just because.

This category can include irrationalities we don't yet know about, better exploitation of irrationalities we do know about, and a host of speculative scenarios about hacking the human brain, which I don't want to rule out completely at this stage.

We're talking about an AI presenting humans with consequences of a particular decision, with humans then making the final decision to go along with it or not.

No. We're not. That's dumb. Like, sorry to be spiteful, but that is already a bad move. You do not treat any scenario involving "an AI", without dissolving the concept, as desirable or realistic. You have "an AI", without having either removed its "an AI"-ness (in the LW sense of "an AI") entirely or guaranteed Friendliness? You're already dead.

Can we assume that, since I've been working all this time on AI safety, I'm not an idiot? When presenting a scenario ("assume AI contained, and truthful") I'm investigating whether we have safety within the terms of that scenario. Which here we don't, so we can reject attempts aimed at that scenario without looking further. If/when we find a safe way to do that within the scenario, then we can investigate whether that scenario is achievable in the first place.

Ah. Then here's the difference in assumptions: I don't believe a contained, truthful UFAI is safe in the first place. I just have an incredibly low prior on that. So low, in fact, that I didn't think anyone would take it seriously enough to imagine scenarios which prove it's unsafe, because it's just so bloody obvious that you do not build UFAI for any reason, because it will go wrong in some way you didn't plan for.

See the point on Paul Christiano's design. The problem I discussed applies not only to UFAIs but to other designs that seek to get round it, but use potentially unrestricted search.

I'm puzzled. Are you sure that's your main objection? Because,

  • you make a different objection (I think) in your response to the sibling, and

  • it seems to me that since any simulation of this kind will be incomplete, and I assume the AI will seek the most efficient way to achieve its programmed goals, the scenario you describe is in fact horribly dangerous; the AI has an incentive to deceive us. (And somewhat like Wei Dai, I thought we were really talking about an AI goal system that talks about extrapolating human responses to various futures.)

It would be completely unfair of me to focus on the line, "as thorough as a film might be today". But since it's funny, I give you Cracked.com on Independence Day.

To be honest, I was assuming we're not talking about a "contained" UFAI, since that's, you know, trivially unsafe.

as I had understood Stuart's article, the point was not to address decision theory, which is a mathematical subject, but instead that he hypothesized a scenario in which "the AI" was used to forecast possible future events, with humans in the loop doing the actual evaluation based on simulations realized in high detail, to the point that the future-world simulation would be as thorough as a film might be today, at which point it could appeal to people on a gut level and bypass their rational faculties

It's true that Stuart wrote about Oracle AI in his Siren worlds post, but I thought that was mostly just to explain the idea of what a Siren world is. Later on in the post he talks about how Paul Christiano's take on indirect normativity has a similar problem. Basically the problem can occur if an AI tries to model a human as accurately as possible, then uses the model directly as its utility function and tries to find a feasible future world that maximizes the utility function.

It seems plausible that even if the AI couldn't produce a high resolution simulation of a Siren world W, it could still infer (using various approximations and heuristics) that with high probability its utility function assigns a high score to W, and choose to realize W on that basis. It also seems plausible that an AI eventually would have enough computing power to produce high resolution simulations of Siren worlds, e.g., after it has colonized the galaxy, so the problem could happen at that point if not before.

but also have a bunch of other extra-scary features above and beyond other scenarios of people being irrational, just because.

What extra-scary features are you referring to? (Possibly I skipped over the parts you found objectionable since I was already familiar with the basic issue and didn't read Stuart's post super carefully.)

Are you arguing that real AIs won't be making decisions like this?

Yes. I think that probabilistic backwards chaining, aka "planning as inference", is the more realistic way to plan, and better represented in the current literature.

Actually, I'd argue the main problem with "Siren Worlds" is the assumption that you can "envision", or computationally simulate, an entire possible future country/planet/galaxy all at once, in detail, in such time that any features at all would jump out to a human observer.

That's not needed for a siren world. Putting human brains into vats and stimulating their pleasure centers doesn't require much computing power.

Wireheading isn't a siren world, though. The point of the concept is that it looks like what we want, when we look at it from the outside, but actually, on the inside, something is very wrong. Example: a world full of people who are always smiling and singing about happiness because they will be taken out and shot if they don't (Lily Weatherwax's Genua comes to mind). If the "siren world" fails to look appealing to (most) human sensibilities in the first place, as with wireheading, then it's simply failing at siren.

The point is that we're supposed to worry about what happens when we can let computers do our fantasizing for us in high resolution and real time, and then put those fantasies into action, as if we could ever actually do this, because there's a danger in letting ourselves get caught up in a badly un-thought-through fantasy's nice aspects without thinking about what it would really be like.

The problem being, no, we can't actually do that kind of "automated fantasizing" in any real sense, for the same reason that fantasies don't resemble reality: to fully simulate some fantasy in high resolution (i.e., such that choosing to put it into action would involve any substantial causal entanglement between the fantasy and the subsequent realized "utopia") involves degrees of computing power we just won't have, and which it wouldn't even be efficient to use that way.

Backwards chaining from "What if I had a Palantir?" does lead to thinking, "What if Sauron used it to overwhelm my will and enthrall me?", which sounds wise except that, "What if I had a Palantir?" really ought to lead to, "That's neither possible nor an efficient way to get what I want."

The common thread I am noticing is the assumption of singletonhood.

Technologically, if you have a process that could go wrong, you run several in parallel.

In human society, an ethical innovator can run an idea past the majority to see if it sounds like an improved version of what they believe already.

It's looking, again, like group rationality is better.

Groups converge as well. We can't assume AI groups will have the barriers to convergence that human groups currently do (just as we can't assume that AIs have the barriers to convergence that humans do).

I'm not doubting that groups converge, I am arguing that when a group achieves reflective equilibrium, that is much more meaningful than a singleton doing so, at least as long as there is variation within the group.

In absolute terms, maybe, but that doesn't stop it being relatively better.

What you are trying to do is import positive features from the convergence of human groups (eg the fact that more options are likely to have been considered, the fact that productive discussion is likely to have happened...) into the convergence of AI groups, without spelling them out precisely. Unless we have a clear handle on what, among humans, causes these positive features, we have no real reason to suspect they will happen in AI groups as well.

The two concrete examples you gave weren't what I had in mind. I was addressing the problem of an AI "losing" values during extrapolation, and it looks like a real reason to me. If you want to prevent an AI undergoing value drift during extrapolation, keep an extrapolated one as a reference. Two is a group minimally.

There may well be other advantages to doing rationality and ethics in groups, and yes, that needs research, and no, that isn't a show stopper.
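The reference-copy idea can be sketched in a few lines. This is only an illustration under strong assumptions: values are represented as named weights in a dict, and a value counts as "lost" when its weight falls to or below a small floor. The function name `lost_values` and the weights are hypothetical; the two example values are the ones from the original post.

```python
def lost_values(reference, candidate, floor=0.05):
    """Names of values weighted in `reference` that have (almost) vanished in `candidate`."""
    return sorted(
        name
        for name, weight in reference.items()
        if weight > floor and candidate.get(name, 0.0) <= floor
    )

# Reference copy, kept fixed while the other copy extrapolates.
reference = {"cultural_preservation": 0.5, "individual_freedom": 0.5}
# A copy that resolved the tension by jettisoning one value entirely.
extrapolated = {"cultural_preservation": 0.0, "individual_freedom": 1.0}

dropped = lost_values(reference, extrapolated)
print(dropped)
```

A check like this only notices that a value disappeared. It says nothing about whether dropping it was a mistake.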

I don't think anyone has proposed any self-referential criteria as being the point of Friendly AI? It's just that such self-referential criteria as reflective equilibrium are a necessary condition which lots of goal setups don't even meet. (And note that just because you're trying to find a fixpoint, doesn't necessarily mean you have to try to find it by iteration, if that process has problems!)
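A toy numerical analogy for the parenthetical, with everything in it chosen for illustration: the logistic map with r = 3.2 has a fixed point at 1 - 1/r = 0.6875, but naive iteration never settles on it and ends up in a 2-cycle. Solving f(x) = x directly by bisection finds it without trouble. The function names are hypothetical.

```python
def f(x, r=3.2):
    """Logistic map; its nonzero fixed point is 1 - 1/r."""
    return r * x * (1.0 - x)

def iterate(func, x0, steps=1000):
    """Naive fixpoint search: apply func repeatedly and hope it settles."""
    x = x0
    for _ in range(steps):
        x = func(x)
    return x

def bisect_fixpoint(func, lo, hi, tol=1e-12):
    """Solve func(x) - x = 0 on [lo, hi] by bisection (requires a sign change)."""
    g = lambda x: func(x) - x
    g_lo = g(lo)
    if g_lo * g(hi) > 0:
        raise ValueError("no sign change on the interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        g_mid = g(mid)
        if g_lo * g_mid <= 0:
            hi = mid               # root lies in the lower half
        else:
            lo, g_lo = mid, g_mid  # root lies in the upper half
    return (lo + hi) / 2.0

x_iter = iterate(f, 0.3)               # ends on the 2-cycle, not on a fixed point
x_solve = bisect_fixpoint(f, 0.5, 0.9)
print(abs(f(x_iter) - x_iter))         # large residual: iteration did not find a fixpoint
print(abs(f(x_solve) - x_solve))       # tiny residual: direct solve did
```

The analogy is loose, since a reflective equilibrium is not a one-dimensional root, but it shows that "has a fixpoint" and "iteration reaches it" are separate questions.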

It's just that such self-referential criteria as reflective equilibrium are a necessary condition

Why? The only example of adequately friendly intelligent systems that we have (i.e. us) doesn't meet this condition. Why should reflective equilibrium be a necessary condition for FAI?

Because FAIs can change themselves very effectively in ways that we can't.

It might be that a human brain running in computer software would have the same issues.

Because FAIs can change themselves very effectively in ways that we can't.

Doesn't mean the FAI couldn't remain genuinely uncertain about some value question, or consider it not worth solving at this time, or run into new value questions due to changed circumstances, etc.

All of those could prevent reflective equilibria, while still being compatible with the ability for extensive self-modification.

All of those could prevent reflective equilibria, while still being compatible with the ability for extensive self-modification.

It's possible. They feel very unstable, though.