Disagreement with Paul: alignment induction

[-]Gordon Seidoh Worley7yΩ240

I agree that power and capability concerns seem important here. Even if we accept A[n+1] is not going to fail on C due to being more powerful than A[n], it seems likely to me that A[n] will not be capable enough to sufficiently assess A[n+1] such that we can get good enough guarantees that C holds for A[n+1].

If we look at the probability of C holding over the whole chain of induction I think things look even worse. Let's say there is a 99.9% likelihood that C will hold on any one step of induction. Then, as we multiply the per-step probabilities over many iterations, the probability that C holds over the whole process falls, and it falls further the more iterations are needed.
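To put rough numbers on this, here is a minimal sketch. It assumes each step fails independently with the same probability, which is a simplification and not something the argument depends on; the function names are made up for illustration.

```python
import math


def chain_success_probability(per_step: float, steps: int) -> float:
    """Probability that C holds at every one of `steps` steps.

    Assumes (for illustration only) that steps are independent and
    share the same per-step probability.
    """
    return per_step ** steps


def steps_until_below(per_step: float, threshold: float) -> int:
    """Smallest number of steps at which the chain probability
    drops strictly below `threshold`."""
    if not (0.0 < per_step < 1.0 and 0.0 < threshold < 1.0):
        raise ValueError("per_step and threshold must lie strictly between 0 and 1")
    # per_step ** n < threshold  <=>  n > log(threshold) / log(per_step)
    return math.floor(math.log(threshold) / math.log(per_step)) + 1


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(chain_success_probability(0.999, n), 4))
    print(steps_until_below(0.999, 0.5))
```

Under these assumptions, 1000 steps at 99.9% each leave only about a 37% chance that C held throughout, and the chain drops below even odds a little before the 700th step.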

Taken together, this suggests a serious challenge. To minimize the risk from power differentials between successors, and so increase the chance of C holding, we would want many small iterations; but more iterations mean that the small probability of failure in each one compounds, so that it eventually becomes likely that C does not hold.

To me this harkens back to a lesson we learned long ago in engineering: the more moving parts in your system, the more likely it is to fail.

[-]DanielFilan7yΩ230

The C does induct in a non-trivial way, the result is friendly, but only one or two steps of the induction are actually needed.

I'm curious what you're imagining here - I don't really know why this would happen or what it would look like. Is it something like "this agent makes a successor that is fully friendly and powerful given resource constraints"?

[-]Stuart_Armstrong7yΩ350

I'm thinking something like "this utility function is friendly, once we have solved these n specific problems; let's create a few levels of higher intelligence, to solve these specific problems, using certain constraints (physical or motivational) to prevent things going wrong during this process".

[-]Rohin Shah7yΩ120

If I imagine each level A[n] as maximizing the expected value of some simple utility function, I agree that it would be surprising if the result was not one of your first three cases. Intuitively, either we already have all of the friendly utility function, and we didn't need induction, or we didn't and bad things happen, which corresponds to cases 1 and 3.

But it seems like one of the main points of iterated amplification is that at least the initial levels need not be maximizing the expected value of some simple utility. In that case, there seems to be a much wider space of possible designs.

For example, we could have a system that has the epistemic state of wanting to help humans but knowing that it doesn't know how best to do that, and so asking humans for feedback and deferring to them when appropriate. Such a system with amplification might eventually learn the friendly utility function and start maximizing that, but it seems like there could be many iterations before that point, during which it is corrigible in the sense of deferring to humans and not maximizing its current conception of what is best.

I don't have a strong sense at the moment what would happen, but it seems plausible that the induction will go through and will have "actually mattered".

[-]avturchin7y10

A possible example: an AI is not aligned about the amount of energy it will consume, and after each iteration of self-improvement it consumes 10 times more, starting from 1 watt. For the first 10 stages this is not a problem, but once its consumption reaches 10 gigawatts it clearly becomes a problem, and at 10 billion gigawatts it is a catastrophe.
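The arithmetic of this example can be checked with a small sketch. The starting point (1 watt), the growth factor (10 per iteration), and the two thresholds come from the comment above; the function names are hypothetical.

```python
GIGAWATT = 10**9  # watts


def consumption_watts(iteration: int, start_watts: int = 1, factor: int = 10) -> int:
    """Power drawn after `iteration` rounds of self-improvement."""
    return start_watts * factor ** iteration


def first_iteration_reaching(target_watts: int, start_watts: int = 1, factor: int = 10) -> int:
    """First iteration at which consumption is at least `target_watts`."""
    if factor <= 1 or start_watts <= 0:
        raise ValueError("need factor > 1 and start_watts > 0 for growth")
    iteration = 0
    while consumption_watts(iteration, start_watts, factor) < target_watts:
        iteration += 1
    return iteration


if __name__ == "__main__":
    # Thresholds named in the comment: 10 GW and 10 billion GW.
    print(first_iteration_reaching(10 * GIGAWATT))
    print(first_iteration_reaching(10 * 10**9 * GIGAWATT))
```

With these numbers, the 10 gigawatt mark is reached at iteration 10 and the 10 billion gigawatt mark at iteration 19, so only nine further steps separate "a problem" from "a catastrophe".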

[-]Dagon7y00

H is, in fact, A[-1], and their parents are A[-2]. There is no valid start to this induction.
