One of the most common counterarguments I hear against a lot of alignment research goes something like this: “Making current AI do X with Y won’t help, because AGI will RSI to ASI and break everything.” And it... sounds convincing, but also like a universal counterargument that can be aimed at almost anything? I mean, I agree that aligning ASI takes much more than just RLHF or fine-tuning or data cleaning or whatever else, but does that mean all of it is pointless? I propose a (probably not novel at all) way of thinking about this.

Imagine that there is a class of books from the future, each of which is a complete “alignment handbook” containing all the schematics for building aligned AGI, with step-by-step explanations, tips and tricks, etc. Is there another class of books, “catalyst books”, that, if presented to us, would increase our chances of writing at least one “alignment handbook”? Because if there are none, then yeah, any research that won’t give us “all the right answers” is pointless (also, we are extra fucked). And btw, any “catalyst book” that somehow raises our chances of writing an “alignment handbook” to something like 99% is, practically speaking, an “alignment handbook” itself.

I think there is a whole spectrum of those, from x1.01 books to x100 books. And probably even x0.95 “inhibitor books” as well. Almost all of the books in the world are x1 “neutral books”. So maybe instead of saying “It won’t help at all, because ASI will break everything” (and I almost always agree that it will), it would be better to see every research direction as a potential way of producing a good, neutral, or bad “catalyst book”?

Nice frame!

This is still compatible with arguments of the form “Making current AI do X with Y won’t help, because AGI will RSI to ASI and break everything”.

As an analogy, consider which things were historically "catalyst books" on the way to a modern understanding of optics. The most important "catalyst" steps would have been things like:

  • The fact that changing electric fields induce magnetic fields, and vice versa, and mathematical descriptions of these phenomena. (These led to Maxwell's unification of electromagnetism and optics.)
  • Discrete lines in emission/absorption spectra of some gases. (These, and some related observations, led to the recognition that light is quantized into photons.)

Point is: historically, the sorts of things which were "catalyst" knowledge for the big steps in optics did not look like marginal progress in understanding optics itself, or the creation of marginal new optical tools/applications. They looked like progress in adjacent topics which gave a bunch of evidence about how optics generalizes/unifies with other phenomena.

Back to AGI: the core hypothesis behind arguments of the form “Making current AI do X with Y won’t help, because AGI will RSI to ASI and break everything” is that there's a huge distribution shift between current AI and AGI/ASI, such that our strong prior should be that techniques adapted to current AI will completely fail to generalize. Insofar as that's true, intermediate progress/"catalyst" knowledge for alignment of AGI/ASI won't look like making current AI do X with Y. Rather, it will look like improved understanding of other things which we do expect to be relevant somehow to AGI/ASI. Mathematical results about agents, or about the limits of agents, are a natural candidate.

Well, continuing your analogy: to see discrete lines anywhere at all, you need some sort of optical spectrometer, which requires at least some optical tools like lenses and prisms. Those tools have to be good enough to actually show sharp spectral lines, and probably easily available too, so that someone smart enough will eventually be able to use them to draw the right conclusions.

At least that's how it seems to have been done in the past. And I don't think we should do exactly this with AGI, i.e. open-source every single tool and damn model and hope that someone will figure something out while we build them as fast as we can. But overall, I think building small tools, getting marginal results, and aligning current dumb AIs could produce a non-zero cumulative impact. You can't produce fundamental breakthroughs completely out of thin air, after all.