Here is a video I made two months ago. It gives a mediocre-at-best explanation of an important foundational argument:

It is normally possible to make progress using empirical methods, as long as you can measure how good a particular change was. That holds even when you don't really understand what you are doing or the systems you are building. This explains why researchers can advance capabilities in ML even though they basically do not understand the internals of current deep learning systems at all.
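To make the point concrete, here is a minimal sketch (not from the video; the names `hill_climb` and `opaque_score` are illustrative) of greedy black-box search: the optimizer never inspects the objective's internals, it only measures whether each proposed change scored better, yet it still makes steady progress.

```python
import random

def hill_climb(score, candidate, n_steps=500, step=0.5, seed=0):
    """Greedy empirical search: propose a random tweak, keep it only
    if the measured score improves. No model of *why* a change helps
    is needed -- only the ability to measure that it did."""
    rng = random.Random(seed)
    best = list(candidate)
    best_score = score(best)
    for _ in range(n_steps):
        trial = [x + rng.uniform(-step, step) for x in best]
        s = score(trial)
        if s > best_score:  # measure, compare, keep the improvement
            best, best_score = trial, s
    return best, best_score

# An opaque objective: the search treats this as a black box.
def opaque_score(v):
    return -sum((x - 3.0) ** 2 for x in v)

best, s = hill_climb(opaque_score, [0.0, 0.0])
```

The loop reliably climbs toward the optimum near `(3, 3)` despite having no access to the objective's structure, which is the same dynamic as tuning an ML system by ablations and benchmark scores alone.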

I also argue that any progress in understanding can be dangerous, because it often expands the frontier of what can be effectively explored through empirical methods. A corollary is that mechanistic interpretability can make it easier to advance capabilities.

The argument generalizes very far.
