OpenAI Superalignment: Weak-to-strong generalization

by Dalmert
14th Dec 2023

This is a linkpost for https://openai.com/research/weak-to-strong-generalization

3 comments

Wei Dai

Crossposting from X

You assume that you don't need to solve hard philosophical problems. But the superhuman researcher model probably will need to, right? Seems like a very difficult instance of weak-to-strong generalization, and I'm not sure how you would know whether you've successfully solved it.

(I'm referring to G.3 ALIGNMENT PLAN ASSUMPTIONS which says "We assume we do not need to solve hard philosophical questions of human values and value aggregation before we can align a superhuman researcher model well enough that it avoids egregiously catastrophic outcomes.")

Here's a previous discussion between @janleike and me on the topic of philosophical problems in AI alignment, for anyone interested in more details on our perspectives: https://www.lesswrong.com/posts/FAJWEfXxws8pMp8Hk/link-why-i-m-optimistic-about-openai-s-alignment-approach?commentId=pu3SJfqAZDSskQiyo

Signer

But the superhuman researcher model probably will need to, right?

Maybe not, if the goal of the plan is not to achieve full singularity, but just to use a superhuman researcher for uncontroversial problems like life extension and making money.

keith_wynroe

I know they flag it in the paper, but seeing the performance curves for the strong model on zero- and few-shot attempts really makes me think the data leakage issue is doing a lot of the work here. If you get the majority(?) of the PGR from e.g. 5-shot prompting, it seems like the natural takeaway is that the strong model doesn't actually need to be fine-tuned on the task, and the weak supervisor is just eliciting knowledge that's already there.
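
For anyone unfamiliar with the metric, here is a minimal sketch of PGR (performance gap recovered) as I understand the paper to define it: the fraction of the gap between the weak supervisor's performance and the strong model's ceiling that the weak-to-strong model recovers. The numbers and names below are purely illustrative, not taken from the paper.

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """Fraction of the weak-supervisor-to-strong-ceiling gap recovered.

    0 means the weak-to-strong model only matches the weak supervisor;
    1 means it matches the strong ceiling (ground-truth fine-tuned model).
    All inputs are accuracies on the same evaluation set.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling should exceed the weak supervisor")
    return (weak_to_strong_acc - weak_acc) / gap

# Illustrative numbers only:
# weak 60%, weak-to-strong 75%, strong ceiling 80%  ->  PGR = 0.75
print(performance_gap_recovered(0.60, 0.75, 0.80))
```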
