TL;DR

We don't know exactly how hard alignment is, so it seems good to have people work on different solutions, since one axis along which solutions differ is the level of difficulty at which they break. Some solutions would work in worlds where alignment is easy. Others would work in worlds where alignment is hard. One heuristic is to distribute the work people do according to our uncertainty over how hard the problem of alignment is.


How hard is alignment? I don't know for sure, but my model says that it is very, very hard.

A thing that Buck has recently been doing is thinking about concrete system setups that make it less likely that doom would happen if we ran a deceptive but not yet superintelligent model (I am heavily simplifying). This seems like the kind of thing that works in worlds where alignment is not super hard, but still hard enough that it isn't just solved by default. Even in worlds where alignment is hard, we want to do this kind of thing to delay doom. Maybe we would even get a warning shot.

Having multiple people work on agendas that would work out at different levels of "how difficult solving alignment is" seems generally good. Maybe we live in a world where we are saved by the approach Buck is suggesting. Then it would be very dumb if nobody were even trying that approach because everyone is working on solutions aimed at other levels of difficulty. They might think that Buck's approach is doomed, or that it is overkill. In either case, they would not take action to implement it.

My agenda, and many others, aim to work in a world where alignment is very hard. A good property of Buck's approach is that it is pretty straightforward, and something we can basically do right now, without major theoretical breakthroughs. Agendas that try to solve worst-case alignment are likely to take too long to have a meaningful impact. But again, it would be pretty dumb if we lived in a world where we need to solve worst-case alignment, but nobody is even attempting to do it. I am mainly considering the world where we would be able to solve worst-case alignment with a particular agenda, but only if we have people working on that agenda right now, up until the point when the agenda's results are needed.

So it seems that, in general, as long as we are at least somewhat uncertain about how hard alignment is, we should have people work on different approaches that would work out at different levels of difficulty. A good heuristic might be to distribute the work we do based on our probability distribution over the difficulty levels. E.g. if we assign 75% probability to alignment being hard, 20% to it being easy, and 5% to it being solved by default, we would want 75% of intellectual labor to go into solving worst-case alignment.
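To make the heuristic concrete, here is a minimal sketch. The scenario names, the researcher count, and the credences are just the illustrative figures from this post, not real estimates:

```python
# Toy sketch of the allocation heuristic: split available research effort
# in proportion to our credence in each alignment-difficulty scenario.
# The credences below are the illustrative numbers from the text.

difficulty_credence = {
    "solved_by_default": 0.05,
    "easy": 0.20,
    "hard_worst_case": 0.75,
}

total_researchers = 100  # hypothetical pool of full-time researchers

allocation = {
    scenario: credence * total_researchers
    for scenario, credence in difficulty_credence.items()
}

for scenario, researchers in allocation.items():
    print(f"{scenario}: {researchers:.0f} researchers")
# -> hard_worst_case gets 75 of the 100 researchers, matching the 75% credence.
```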

In practice, we want to combine this heuristic with other factors such as timelines, personal fit, etc. If timelines are very long, we might just go with solving worst-case alignment, such that we can be sure it will work out no matter how hard the problem actually is (what a nice fantasy world). Imagine there are people who are able to make progress on alignment under the assumption that alignment is easy, but who would not be able to make any progress on worst-case alignment. In that case, having these people work on worst-case alignment would be worse than having them work on the easy-world approach.

A general version of this applies to arbitrary properties of alignment plans. For any property P, we can form a probability distribution over which value v of P a solution that works in the real world has. We then want to distribute cognitive labor according to that distribution. The above talks about the property "this plan works to solve alignment at difficulty level x".
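Sketched as code, this generalized version might look like the following (a toy sketch under my own framing; the function name and the normalization step are not from the post):

```python
from typing import Dict, Hashable

def allocate_effort(credence_over_values: Dict[Hashable, float],
                    total_effort: float) -> Dict[Hashable, float]:
    """Split total_effort across the values v of some plan-property P,
    in proportion to our credence that a real-world-working solution
    has each value. Credences are normalized in case they don't sum to 1."""
    total_credence = sum(credence_over_values.values())
    return {
        value: total_effort * credence / total_credence
        for value, credence in credence_over_values.items()
    }

# Example with P = "difficulty level at which the plan still works":
print(allocate_effort({"easy": 0.20, "hard": 0.75, "solved_by_default": 0.05}, 100))
# -> {'easy': 20.0, 'hard': 75.0, 'solved_by_default': 5.0}
```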
