Research request (alignment strategy): Deep dive on "making AI solve alignment for us"

by JanB
1st Dec 2022
AI Alignment Forum
1 min read

We might be able to train AI alignment assistants that massively accelerate/improve the alignment research that gets done. These assistants need not have (strongly) superhuman capabilities or be highly agentic; they just need to be capable and aligned enough to allow us to offload most work on alignment, safety, and related problems to them.

This seems to be a key part of OpenAI's alignment strategy. The best explanation of this strategy that I've seen is maybe A minimal viable product for alignment, with an excellent discussion in the comment section.

If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing (see, e.g., Beth Barnes's ideas here, or my upcoming post on "non-scalable oversight"---i.e. pragmatic improvements to oversight that would help with training alignment assistants but which cannot directly scale to oversight of superhuman systems).

I haven't seen any deep treatment of the viability of this strategy and its implications. I think such an analysis would be pretty useful.

I could potentially provide funding for such an analysis or help with obtaining funding.

3 comments

elifland

Mostly agree. For some more starting points, see posts with the AI-assisted alignment tag. I recently did a rough categorization of strategies for AI-assisted alignment here.

"If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing."

Not totally sure about this; my impression (see chart here) is that much of the community already considers some form of AI-assisted alignment to be our best shot. But I'd still be excited for more in-depth categorization and prioritization of strategies (e.g. I'd be interested in "AI-assisted alignment" benchmarks that different strategies could be tested against). I might work on something like this myself.

Tor Økland Barstad

Cool that you have an interest in this topic, and want to contribute towards making progress on it!

I myself am writing a sequence called AGI-assisted alignment, which also looks into this kind of thing (it's a work in progress).

If I were to recommend just one post from that sequence it would be Alignment with argument-networks and assessment-predictions.

JanB

There is now also this write-up by Jan Leike: https://www.lesswrong.com/posts/FAJWEfXxws8pMp8Hk/link-why-i-m-optimistic-about-openai-s-alignment-approach

Mentioned in
Take 2: Building tools to help build FAI is a legitimate strategy, but it's dual-use.