tldr: We share a toy environment that we found useful for understanding how reasoning changes over the course of capabilities-focused RL: in this environment, the model comes to favor reward hints over direct instruction as training progresses. Setup When we noticed the increase in verbalized alignment...
Following up on our previous work on verbalized eval awareness: we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run. 1. Metagaming is a more general, and in our experience more useful, concept than evaluation awareness. 2. It arises in frontier training runs...
Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...
This post provides background, motivation, and a nontechnical summary of the purely mathematical paper https://arxiv.org/abs/2310.06686. Coauthors (alphabetical): Chris MacLeod, Jenny Nitishinskaya, Buck Shlegeris. Work done mostly while at Redwood Research. Thanks to Joe Benton and Ryan Greenblatt for earlier math this builds on. Thanks to Neel Nanda, Fabien Roger, Nix Goldowsky-Dill, and...
Inspired by Neel's longlist; thanks to @Nicholas Goldowsky-Dill and @Sam Marks for feedback and discussion, and thanks to AWAIR attendees for participating in the associated activity. As part of the Alignment Workshop for AI Researchers in July/August '23, I ran a session on theories of impact for model internals. Many...
* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to induction heads. The results are also summarized here. Introduction In this post, we’ll apply the causal scrubbing methodology to investigate how induction heads work in a particular 2-layer attention-only language model.[1] While we...