
Thanks to Rebecca Gorman for help with this post.

Morally underdefined situations

I recently argued that full understanding of value extrapolation[1] was necessary and almost sufficient for solving the AI alignment problem.

In that post, I introduced situations beyond the human moral training distribution, whose moral value we aren't sure how to interpret. I gave a convoluted example of an AI CEO engineering the destruction of its company and the ambiguously-moral creation of personal assistants, all in order to boost value for its shareholders. In the past, I've also given examples of willing slave races and Croatian, communist, Yugoslav nationalists in the 1980s. We could also consider children developing moral instincts in a world they realise is more complicated than they thought, or ordinary humans encountering counter-intuitive thought-experiments for the first time. We also have some examples from history, when situations changed and new questions appeared[2].

I won't belabour the point any further. Let's call these situations "morally underdefined".

Morally underdefined situations can be terrible

Most of the examples I gave above are rather mild: yes, we're unsure what the right answer is, but it's probably not a huge disaster if the AI gets it wrong. But morally underdefined situations can be much worse than that. The easiest examples are ones that trade off a huge potential good against a huge potential bad, so that deciding the wrong way could be catastrophic; we need to carefully sort out the magnitudes of the positives and the negatives before making any decision.

The repugnant conclusion is a classic example of this; we wouldn't want to get to the "huge population with lives barely worth living, filled with muzak and potatoes", only to discover afterwards that there was an argument against total utilitarianism we had missed.

Another good example is if we developed whole brain emulations (ems), but they were imperfect. This would resolve the problem of death and might resolve most of the problem of suffering. But what price would we be willing to pay? What if our personalities were radically changed? What if our personalities were subtly manipulated? What if we lost our memories over a year of subjective time - we could back them up in classical computer storage and access them as images and videos, but our internal memories would be lost? What if we became entities that were radically different in seemingly positive ways, but we hadn't had time to think through the real consequences? How much suffering would we still allow - and of what sort?

Or what about democracy among ems? Suppose we had a new system that allowed some reproduction, as long as no more than some fixed fraction of the "parents'" values were present in the new entities? We'd probably want to think through what that meant before accepting.

Similarly, what would we be willing to risk to avoid possible negatives - would we accept increasing the risk of human extinction by 1%, in order to avoid human-brain-cell-teddies, willing slave races, the repugnant conclusion, and not-quite-human emulations?
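To make that kind of trade-off concrete, here's a toy expected-disvalue sketch in Python (all the numbers apart from the 1% are purely illustrative assumptions, not claims I'm making):

```python
# Toy sketch: should we accept a 1% increase in extinction risk to avoid a
# morally questionable outcome, when we're unsure how bad that outcome is?
# All disvalues are illustrative assumptions on a scale where extinction = 1.0.

def expected_disvalue(probability: float, disvalue: float) -> float:
    """Expected disvalue of an outcome that occurs with the given probability."""
    return probability * disvalue

EXTINCTION_DISVALUE = 1.0
EXTRA_EXTINCTION_RISK = 0.01  # the 1% gamble from the text

# Two hypothetical moral judgements of the questionable outcome
# (e.g. not-quite-human emulations), relative to extinction:
judgements = {
    "mildly bad": 0.005,
    "catastrophically bad": 0.5,
}

cost_of_gambling = expected_disvalue(EXTRA_EXTINCTION_RISK, EXTINCTION_DISVALUE)

for label, badness in judgements.items():
    # If we refuse the gamble, assume the questionable outcome happens for certain.
    cost_of_not_gambling = expected_disvalue(1.0, badness)
    decision = ("accept the 1% risk" if cost_of_gambling < cost_of_not_gambling
                else "refuse the gamble")
    print(f"If the outcome is judged {label}: {decision}")
```

The only point of the sketch is that the recommendation flips depending on how bad we judge the underdefined outcome to be - and that judgement is precisely what we haven't yet pinned down.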

So the problem of morally underdefined situations is not a small issue for AI safety; it can be almost arbitrarily huge.


  1. A more precise term than "model splintering". ↩︎

  2. My favourite example might be the behaviour of Abraham Lincoln in the early days of the US civil war. The US constitution seemed to rule out secession; but did it empower the president to actively prevent secession? The answer is a clear "it had never come up before and people hadn't decided", and there were various ways to extend precedents. Lincoln chose one route that was somewhat compatible with these precedents (claiming war powers to prevent secession). His predecessor had chosen another one (claiming secession was illegal but that the federal government couldn't prevent it). ↩︎

Comments

Are you talking about bounded consequentialism, where you are hit with unknown unknowns? Or about known consequences whose moral status evaluates to "undefined"?

The second one (though I think there is some overlap with the first).

Are these things under-defined, or "just" outside of our heuristic/intuition domain?  I'd argue that reality is perfectly defined - any obtained configuration of the universe is a very precise thing.  And this extends to obtainable/potential configurations (where there's uncertainty over which will actually occur, but we have no modeled data that contradicts the possibility).

So if the situation is defined, what part isn't? I think it's our moral intuition about whether a given situation is "better" than another. I don't have a solution to this, but I want to be clear that the root problem is trying to "align" to non-universal, often unknown, preferences.

It is indeed our moral intuitions that are underdefined, not the states of the universe.

Similarly, what would we be willing to risk to avoid possible negatives - would we accept increasing the risk of human extinction by 1%, in order to avoid

The things avoided seem like they increase, not decrease, risk.

The point is that it's not obvious whether we'd want an AI to gamble with human extinction in order to avoid morally questionable outcomes, and that this is an important question to get right.

That point is more easily made when it doesn't involve things that might risk extinction, like the human brain cell teddies, the differing ems, etc.

The things avoided seem like they increase, not decrease, risk.

Yes, but it might be that the means needed to avoid them - maybe heavy-handed AI interventions? - could be even more dangerous.