Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In the Romeo and Juliet example, the final summary gets a key fact disastrously wrong: 

Romeo buys poison to kill Juliet at her grave.

(In the original he buys it to kill himself).

It looks like a single snippet of the original got blown up until it was 10% of the final summary, and the surrounding context was not sufficient to fix it.

Come, cordial and not poison, go with me 

To Juliet’s grave; for there must I use thee.

I'd heard about this before, but not the alignment spin on it. This is more interesting to me from a capabilities standpoint than an alignment standpoint, so I had assumed that this was motivated by the normal incentives for capabilities research. I'd be interested if I'm in fact wrong, or if it seems more alignment-y to other people.

From an alignment perspective the main point is that the required human data does not scale with the length of the book (or maybe scales logarithmically). In general we want evaluation procedures that scale gracefully, so that we can continue to apply them even for tasks where humans can't afford to produce or evaluate any training examples.
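To make the "scales logarithmically" intuition concrete, here is a minimal sketch of fixed fan-in recursive summarization. It is not the paper's implementation: `recursive_summarize`, `toy_summarize`, and the `fan_in` parameter are hypothetical names, and the toy summarizer just truncates text where a learned model would go. The only point is that each individual summarization task stays bounded in size, while the number of levels in the tree grows roughly like the log of the number of passages.

```python
from typing import Callable, List, Tuple


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def toy_summarize(group: List[str]) -> str:
    """Placeholder for a learned summarizer (assumption): join and truncate."""
    return " ".join(group)[:40]


def recursive_summarize(
    passages: List[str],
    summarize: Callable[[List[str]], str] = toy_summarize,
    fan_in: int = 4,
) -> Tuple[str, int]:
    """Summarize groups of `fan_in` pieces repeatedly until one piece is left.

    Returns the final summary and the number of levels that were needed.
    """
    if not passages:
        raise ValueError("need at least one passage")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(passages)
    depth = 0
    while len(level) > 1:
        # Every call to `summarize` sees at most `fan_in` short pieces,
        # regardless of how long the whole book is.
        level = [summarize(group) for group in chunk(level, fan_in)]
        depth += 1
    return level[0], depth


if __name__ == "__main__":
    for n in (16, 256, 1000):
        book = [f"passage {i}" for i in range(n)]
        _, depth = recursive_summarize(book, fan_in=4)
        print(n, "passages ->", depth, "levels")
```

Under these assumptions, a human only ever has to judge a single node (a handful of short pieces and their summary), so the size of any one evaluation does not grow with the book; only the depth of the tree does.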

The approach in this paper will produce worse summaries than fine-tuning a model end-to-end. To produce good summaries, you will ultimately need more sophisticated decompositions. For example, if a character who is not mentioned in the first chapter's summary appears in the second chapter, you want to ask "what is this character's deal?" and have it answered by a model that has read the first chapter. And in the long run you need to handle a lot of issues that don't decompose so cleanly.

If you don't have more sophisticated decompositions, and you go past the scale where humans can provide oversight directly, then you will be forced to use proxies that are available end-to-end, resulting in traditional safety problems. For example, you have to evaluate "did this plan result in high reward when we ran it?" rather than "does this plan look good before we run it?" because the plan is too complex and vast for a human to understand.

So I think this is a good problem and first baseline for studying recursive decompositions and trying to make them competitive with end-to-end supervision. You could also hope to learn something about e.g. how errors propagate in recursive supervision schemes, and generally what happens when you use ML to approximate large collaborations of humans on tasks that are too large for humans to handle directly.

This in turn seems like an important problem to start working on, given that most concrete plans for overseeing powerful models appear to require this kind of recursive oversight (and most safety concerns can be cashed out as limitations of that kind of recursive oversight). So in general I think you should be excited about people pushing these methods as far as they can.

I had assumed that this was motivated by the normal incentives for capabilities research. I'd be interested if I'm in fact wrong, or if it seems more alignment-y to other people.

It was primarily motivated as a first experiment in doing recursive oversight. In retrospect I think we should probably have gone straight for more complicated decompositions rather than starting with the simple baseline.

The fact that they apply a model with a 2k-token context to books is more about capabilities than alignment, although it still raises many of the same questions about e.g. cascading errors and how to deal with decomposition in practice. I think from an alignment perspective you'd want to work on overseeing the longest context that the model can handle. If you want decompositions to be interesting, that likely means working with longer contexts than GPT-3's; this is an instance of the general phenomenon that it's hard to practice overseeing superhuman systems while all of our systems are subhuman on most axes.

Ah, yeah. I guess this connection makes perfect sense if we're imagining supervising black-box-esque AIs that are passing around natural language plans.

Although that supervision problem is more like... summarizing Alice in Wonderland if all the pages had gotten ripped out and put back in random order. Or something. But sure, baby steps.