Intuitively, it seems to me that a not-that-powerful AI could do a really good job of interpreting other neural nets, trained via some sort of human feedback on how "easy to understand" its explanations are. I would like to hear why this is right or wrong.
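To make the proposal concrete, here is a minimal toy sketch of the selection loop being described: an interpreter proposes candidate explanations of another net's behavior, and we keep whichever one humans rate as easiest to understand. Everything here (the candidate strings, the ratings) is made up purely for illustration; note that the objective as stated rewards clarity, not accuracy.

```python
def select_explanation(candidates, understandability_ratings):
    """Pick the candidate explanation with the highest human rating.

    understandability_ratings maps each candidate string to a (hypothetical)
    human score for how easy it is to understand.
    """
    return max(candidates, key=lambda c: understandability_ratings[c])

# Hypothetical explanations of some network's behavior. A clear-but-possibly-wrong
# story can outscore a true-but-opaque one, since only understandability is rated.
candidates = [
    "The net detects dogs by matching fur texture.",      # clear, maybe wrong
    "Unit 4071 fires on high-frequency Gabor patterns.",  # true, but opaque
]
ratings = {
    candidates[0]: 0.9,  # humans find this easy to follow
    candidates[1]: 0.3,  # humans find this hard to follow
}

print(select_explanation(candidates, ratings))
```

The gap this sketch exposes is exactly the worry: nothing in the feedback signal ties the winning explanation to what the interpreted net actually does.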


Can I interest you in reading the ~117-page ELK report? It really might answer your question.

Ultimately, how hard this is will depend on how high your standards are. If you want explanations to be good even in new and weird contexts, or even when humans can't figure out how to check the accuracy of the explanations it outputs, or even when the AI being interpreted is adversarially trying to hide its motivations, the problem can get pretty hard.