Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We might be able to train AI alignment assistants that massively accelerate and improve the alignment research that gets done. These assistants need not have (strongly) superhuman capabilities or be highly agentic; they just need to be capable and aligned enough to allow us to offload most work on alignment, safety, and related problems to them.

This seems to be a key part of OpenAI's alignment strategy. The best explanation of this strategy that I've seen is probably "A minimal viable product for alignment", which has an excellent discussion in its comment section.

If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing (see, e.g., Beth Barnes's ideas here, or my upcoming post on "non-scalable oversight", i.e., pragmatic improvements to oversight that would help with training alignment assistants but that cannot directly scale to oversight of superhuman systems).

I haven't seen any deep treatment of the viability of this strategy and its implications. I think such an analysis would be pretty useful.

I could potentially provide funding for such an analysis or help with obtaining funding.

3 comments

Mostly agree. For some more starting points, see posts with the AI-assisted alignment tag. I recently did a rough categorization of strategies for AI-assisted alignment here.

"If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing."

Not totally sure about this; my impression (see chart here) is that much of the community already considers some form of AI-assisted alignment to be our best shot. But I'd still be excited for more in-depth categorization and prioritization of strategies (e.g., I'd be interested in "AI-assisted alignment" benchmarks that different strategies could be tested against). I might work on something like this myself.

Cool that you have an interest in this topic, and want to contribute towards making progress on it!

I myself am writing a sequence called AGI-assisted alignment, which also looks into this kind of thing (it's a work in progress).

If I were to recommend just one post from that sequence, it would be "Alignment with argument-networks and assessment-predictions".