Interpretability through two lenses: biology and physics
> Interpretability is the nascent science of making the vast complexity of billion-parameter AI models more comprehensible to the human mind. Currently, the mainstream approach is reductionist: dissecting a model into many smaller components, much like a biologist mapping cellular pathways. Here, I describe and advocate for the complementary perspective...
I agree! LLMs are in many ways self-organized complex systems, and they are remarkably easy to experiment on.
Which questions about LLM interpretability do you think theoretical physics can help investigate?