Behavioral and mechanistic definitions (often confuse AI alignment discussions)

Neel Nanda (3y)

Really nice post! I think this is an important point that I've personally been confused about in the past, and this is a great articulation (and solid work for 2 hours!!)

LawrenceC (3y)

Thanks!

(As an amusing side note: I spent 20+ minutes after finishing the writeup trying to get the image from the recent 4-layer docstring circuit post to preview properly in the footnotes, and eventually gave up. That is, a full ~15% of the total time invested was spent on that footnote!)

Joseph Bloom (3y)

Thanks, Lawrence! I liked this post, which I think I'm going to bookmark and refer back to when I'm trying to think/write about analysis. I think being disciplined about the distinction between these two types of definitions is crucial to thinking clearly.

I recently heard Buck use the terms "Model Psychology" and "Model Neuroscience" to distinguish types of analysis with small models. My understanding of his position in that public discussion was that people should distinguish between the two because (now using your terms) we shouldn't confuse behavioural insights with mechanistic insights. This is a trap that people fall into, and it leads to some amount of miscalibrated confidence about how well we understand models. I suspect I have also fallen into this trap, so having this post to refer back to seems especially valuable.

davidad (3y)

In computer science this distinction is often made between extensional (behavioral) and intensional (mechanistic) properties (example paper).
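As a rough illustration of that distinction (a minimal sketch; the function names below are hypothetical and not taken from the linked paper): the two functions are extensionally equal on the non-negative integers, since they return the same output for each such input, but intensionally different, since they compute it by different procedures. A black-box comparison of outputs cannot tell them apart.

```python
from typing import Callable, Iterable


def sum_to_n_loop(n: int) -> int:
    """Add 1 + 2 + ... + n one term at a time."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


def sum_to_n_formula(n: int) -> int:
    """Use the closed form n(n+1)/2 instead of iterating."""
    return n * (n + 1) // 2


def behaviorally_equal(
    f: Callable[[int], int],
    g: Callable[[int], int],
    inputs: Iterable[int],
) -> bool:
    """Black-box check: compare outputs only, never look inside f or g."""
    return all(f(x) == g(x) for x in inputs)


# The behavioral check passes on these inputs even though the mechanisms differ.
print(behaviorally_equal(sum_to_n_loop, sum_to_n_formula, range(100)))
```

Note that the agreement is only claimed for the inputs tested: on negative integers the two functions diverge, which is the kind of gap a purely behavioral check can miss.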

Ben Amitay (3y)

This is an important distinction that shows up in its cleanest form in mathematics, where you have constructive definitions on the one hand and axiomatic definitions on the other. It is important to note, though, that it is not quite a dichotomy: you may have a constructive definition that assumes axiomatically defined entities, or other constructions. For example, vector spaces are usually defined axiomatically, but vector spaces over the real numbers assume the real numbers, which have multiple axiomatic definitions and corresponding constructions.

In science, there is the classic "are whales fish?", which is mostly about whether to look at their construction/mechanism (genetics, development, metabolism...) or their patterns of interaction with their environment (the behavior of swimming and the structure that supports it). That example also emphasizes that natural language simply doesn't respect this distinction, and considers both internal structure and outside relations as legitimate "coordinates in thingspace" that may be used together to identify geometrically-natural categories.

^{^}

Detailed epistemic status: Written in ~2 hours total for the SERI MATS writing workshop I ran. Accordingly, I’m probably missing a bunch of related work, and all the examples are just things I happened to remember at the time.

^{^}

Interestingly enough, it’s not clear to me whether these definitions are behavioral or mechanistic!

^{^}

Note that the Anthropic In-context Learning paper does make it clear that they’re defining induction heads behaviorally, as opposed to the mechanistic definition in the Mathematical Framework paper!

Mechanistic analysis of weights and eigenvalue analysis are much more complicated in large models with MLP’s, so for this paper we choose to define induction heads by their narrow empirical sequence copying behavior (the [A][B]...[A]→[B]), and then attempt to show that they (1) also serve a more expansive function that can be tied to in-context learning, and (2) coincide with the mechanistic picture for small models.

(Source)
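The behavioral definition in the quote can be sketched as a black-box test. This is a toy illustration under stated assumptions, not the paper's evaluation code: `induction_target` and `copying_score` are hypothetical names, tokens are plain strings, and `predict` stands in for any next-token predictor.

```python
from typing import Callable, Iterable, Optional, Sequence


def induction_target(tokens: Sequence[str]) -> Optional[str]:
    """Return the token that followed the most recent earlier occurrence
    of the final token ([A][B]...[A] -> [B]), or None if there is none."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


def copying_score(
    predict: Callable[[Sequence[str]], Optional[str]],
    sequences: Iterable[Sequence[str]],
) -> float:
    """Fraction of sequences (among those with a defined target) on which
    `predict` outputs the induction target. Only outputs are inspected."""
    hits = 0
    total = 0
    for seq in sequences:
        target = induction_target(seq)
        if target is None:
            continue
        total += 1
        if predict(seq) == target:
            hits += 1
    return hits / total if total else 0.0
```

Anything that scores well here counts as induction-like under the behavioral definition, whatever computation produces its outputs; the mechanistic definition would instead inspect how the prediction is formed.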

^{^}

In fact, in the post they focus primarily on induction heads implemented using K-composition, as their simplified two-layer attention-only transformer could not implement Q-composition due to an inability to put positional information in the residual stream.

^{^}

In particular, certain heads can be "induction heads" for one circuit while serving other functions in other circuits. See for example head 1.4 in A circuit for Python docstrings in a 4-layer attention-only transformer, which is both a fuzzy previous token head and an induction head.

^{^}

The third definition concerns inner/outer alignment in cases where there is no clean train/test split, which is implicitly assumed in the first two definitions.

^{^}

For an example of how related the concepts are, see David Krueger’s question: “Is ‘gears-level’ just a synonym for ‘mechanistic’?”

^{^}

I also think that black-box models have been somewhat unfairly maligned in the community, but that claim deserves a post in itself.
