Values determined by "stopping" properties


The lotus-eaters are examples of humans who have followed hedonism all the way through to its logical conclusion. In contrast, the "mindless outsourcers" are a possible consequence of the urge to efficiency: competitive pressures making uploads choose to destroy their own identity.

In my "Mahatma Armstrong" version of Eliezer's CEV, a somewhat altruistic entity ends up destroying all life, after a series of perfectly rational self-improvements. And in many examples where AIs are supposed to serve human preferences, these preferences are defined by a procedure (say, a question-answer process) that the AI can easily manipulate.

Stability and stopping properties

Almost everyone agrees that human values are under-determined (we haven't thought deeply and rigorously about every situation) and changeable by life experience. Therefore, it makes no sense to use "current human values" as a goal; this concept doesn't even exist in any rigorous sense.

So we need some way of extrapolating true human values. All the previous examples could be considered examples of extrapolation, and they all share the same problem: they are defined by their "stopping criteria" more than by their initial conditions.

For example, the lotus eaters have reached a soporific hedonism they don't want to wake from. In the mindless outsourcers, there is no longer "anyone there" to change anything. CEV is explicitly assumed to be convergent: convergent to a point where the idealised entity no longer sees any need to change. The AI example is a bit different in flavour, but there the "stopping criteria" are whatever the human chooses to say, is tricked into saying, or is forced into saying. This means that the AI could be an optimisation process pushing the human to say whatever the AI wants them to say.

Importantly, all these stopping criteria are local: they explicitly care only about the situation when the stopping criteria are reached, not about the journey there, nor the initial conditions.

Processes can drift very far from their starting point, if they have local stopping criteria, even under very mild selection pressure. Consider the following game: each of two players is to name a number, $n_1$ and $n_2$, between 0 and 100. The player with the highest number gets that much in euros, and the one with the strictly lowest one gets that much plus two in euros. Each player starts at 100, and each in turn is allowed to adjust their number until they don't want to any more.

Then if both players are greedy and myopic, one player will start by dropping to 99, followed by the next player dropping theirs to 98, and so on, going back and forth between the players until one stands at 0 and the other at 1. Obviously if the $n_i$ could be chosen from a larger range, there's no limit to the amount of loss that such a process could generate.
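
A minimal sketch of this back-and-forth, under stated assumptions. The payoff rule is assumed to be traveler's-dilemma style: both players are paid the lower of the two numbers, and the strictly lower player gets two euros on top (if the higher player simply kept their own number, undercutting would not pay). The function names, the default range of 0 to 100, and the rule that a player only moves for a strict gain are illustrative choices, not part of the original setup.

```python
def payoff(mine, theirs, bonus=2):
    """Euros for the player naming `mine` (ASSUMED traveler's-dilemma-style rule)."""
    low = min(mine, theirs)
    # Both are paid the lower number; only the strictly lower player gets the bonus.
    return low + bonus if mine < theirs else low


def best_reply(mine, theirs, lo, hi, bonus=2):
    """Myopic greedy move: switch only to a number that pays strictly more right now."""
    best = mine
    for n in range(lo, hi + 1):
        if payoff(n, theirs, bonus) > payoff(best, theirs, bonus):
            best = n
    return best


def play(lo=0, hi=100, bonus=2):
    """Alternate turns from (hi, hi) until neither player wants to move."""
    numbers = [hi, hi]
    history = [tuple(numbers)]
    turn, stalled = 0, 0
    while stalled < 2:  # local stopping criterion: two passes in a row
        other = 1 - turn
        new = best_reply(numbers[turn], numbers[other], lo, hi, bonus)
        if new == numbers[turn]:
            stalled += 1
        else:
            numbers[turn] = new
            stalled = 0
            history.append(tuple(numbers))
        turn = other
    return history


if __name__ == "__main__":
    history = play()
    n1, n2 = history[-1]
    print("start:", history[0], "end:", history[-1])
    print("combined payout at the end:", payoff(n1, n2) + payoff(n2, n1))
```

Under these assumptions each move gains the mover exactly one euro, yet the process only halts at (1, 0), where the combined payout is 2 euros instead of the 200 available at the start. Calling `play(hi=1000)` ends at the same point from a much higher start, which is the "no limit to the loss" observation in miniature.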

Similarly, if our process of extrapolating human values has local stopping criteria, there's no limit to how bad the result could end up being, or how "far away" in the space of values it could go.

This, by the way, explains my intuitive dislike for some types of moral realism. If there are true objective moral facts that humans can access, then whatever process counts as "accessing them" becomes... a local stopping condition for defining value. So I don't tend to focus on arguments about how correct or intuitive that process is; instead, I want to know where it ends up.

Barriers to total drift

So, how can we prevent local stopping conditions from shooting far across the landscape of possible values? This is actually a problem I've been working on for a long time; you can see this in my old paper "Chaining God: A qualitative approach to AI, trust and moral systems". I would not recommend reading that paper - it's hopelessly amateurish, anthropomorphising, and confused - but it shows one of the obvious solutions: tie values to their point of origin.

There seem to be roughly three interventions that can be made to overcome the problem of local stopping criteria.

  • I. The first is to tie the process to the starting point, as above. Now, initial human values are not properly defined; nevertheless it seems possible to state that some values are further away from this undefined starting point than others (paperclippers are very far, money-maximisers quite far, situations where recognisably human beings do recognisably human stuff are much closer). Then the extrapolation process gets a penalty for wandering too far afield. The stopping conditions are no longer purely local.
  • II. If there is an agent-like piece in the extrapolation process, we can remove rigging (previously called bias) or influence, so that the agent can't manipulate the extrapolation process. This is a partial measure: it replaces a targeted extrapolation process with a random walk, which removes one major issue but doesn't solve the whole problem.
  • III. Finally, it is often suggested that constraints be added to the extrapolation process. For example, if the human values are determined by human feedback, then we can forbid the AI from coercing the human in any way, or restrict it to only using some methods (such as relaxed conversation). I am dubious about this kind of approach. It firstly assumes that concepts like "coercion" and "relaxed conversation" can be defined - but if that were the case, we'd be closer to solving the issue directly. And secondly, it assumes that restrictions that apply to humans also apply to AIs: we can't easily change the core values of fellow humans with conversation, but super-powered AIs may be able to do so.
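
As a rough illustration of intervention I, here is a toy sketch of a greedy process whose stopping rule includes a penalty for distance from the starting point. Everything in it is an assumption made for illustration: values are collapsed to a single integer, "distance" is the squared difference from the start, and `local_gain` and `penalty_weight` are hypothetical parameters. It is not a proposal for how to measure distance between real value systems.

```python
def extrapolate(start, local_gain, penalty_weight=0.0, max_steps=10_000):
    """Greedy drift along one axis: step away while the step looks worthwhile.

    ASSUMPTIONS (illustrative only): a 1-D integer 'value space', a constant
    gain per step, and a squared-distance penalty anchored at `start`.
    """
    x = start
    for _ in range(max_steps):
        # Score of standing still versus taking one more step away from `start`.
        stay = -penalty_weight * (x - start) ** 2
        step = local_gain - penalty_weight * (x + 1 - start) ** 2
        if step <= stay:
            break  # stopping criterion now depends on how far we have come
        x += 1
    return x


if __name__ == "__main__":
    # No penalty: the purely local rule never stops by itself (cut off by max_steps).
    print(extrapolate(0, local_gain=1.0, penalty_weight=0.0, max_steps=1000))
    # With a penalty, drift halts once the marginal penalty outweighs the gain.
    print(extrapolate(0, local_gain=1.0, penalty_weight=0.01))
```

With no penalty the process runs until the step cap, however small the per-step gain. With a penalty weight of 0.01 and a gain of 1 per step, it halts 50 steps from the start, because the stopping rule is no longer purely local.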

In my methods, I'll mostly be using interventions of type I and II.