Should we publish mechanistic interpretability research?

LawrenceC

Naively, there are so few people working on interp, and so many people working on capabilities, that publishing is very good for relative progress. So you need a pretty strong argument that interp in particular is good for capabilities, which isn't borne out empirically and also doesn't seem that strong.

In general, this post feels like it's listing a bunch of considerations that are pretty small, and the 1st order consideration is just like "do you want people to know about this interpretability work", which seems like a relatively straightforward "yes".

I also separately think that LW tends to reward people for being "capabilities cautious" more than is reasonable, and once you've made the decision to not specifically work towards advancing capabilities, then the capabilities externalities of your research probably don't matter ex ante.

Lucius Bushnaq (3y)

So you need a pretty strong argument that interp in particular is good for capabilities, which isn't borne out empirically and also doesn't seem that strong.

I think current interpretability has close to no capabilities externalities because it is not good yet, and delivers close to no insights into NN internals. If you had a good interpretability tool, which let you read off and understand e.g. how AlphaGo plays games to the extent that you could reimplement the algorithm by hand in C, and not need the NN anymore, then I would expect this to yield large capabilities externalities. This is the level of interpretability I aim for, and the level I think we need to make any serious progress on alignment.

If your interpretability tools cannot do things even remotely like this, I expect they are quite safe. But then I also don't think they help much at all with alignment. There's a roughly proportional relationship between your understanding of the network, and both your ability to align it and make it better, is what I'm saying. I doubt there are many deep insights to be had that further the former without also furthering the latter. Maybe some insights further one a bit more than the other, but I doubt you'd be able to figure out which ones those are in advance. Often, I expect you'd only know years after the insight has been published and the field has figured out all of what can be done with it.

I think it's all one tech tree, is what I'm saying. I don't think neural network theory neatly decomposes into a "make strong AGI architecture" branch and an "aim AGI optimisation at a specific target" branch. Just like quantum mechanics doesn't neatly decompose into a "make a nuclear bomb" branch and a "make a nuclear reactor" branch. In fact, in the case of NNs, I expect aiming strong optimisation is probably just straight up harder than creating strong optimisation.

By default, I think if anyone succeeds at solving alignment, they probably figured out most of what goes into making strong AGI along the way. Even just by accident. Because it's lower in the tech tree.

habryka (3y)

I also separately think that LW tends to reward people for being "capabilities cautious" more than is reasonable, and once you've made the decision to not specifically work towards advancing capabilities, then the capabilities externalities of your research probably don't matter ex ante.

But isn't most of the interpretability research being done by people who have not made this commitment? Anthropic, which is currently the biggest publisher of interp-research, clearly does not have a commitment to not work towards advancing capabilities, and it seems important to have thought about which of the things Anthropic works on might substantially increase capabilities (and which things they should hold off on).

I also separately don't buy that just because you aren't aiming to specifically work towards advancing capabilities that therefore publishing any of your work is fine. Gwern seems to not be aiming specifically towards advancing capabilities, but nevertheless seems to have had a pretty substantial effect on capability work, at least based on talking to a bunch of researchers in DL who cite Gwern as having been influential on them.

Rohin Shah (3y)

Why are you considering Anthropic as a unified whole here? Sure, Anthropic as a whole is probably doing some work that is directly aimed towards advancing capabilities, but this just doesn't seem true of the interp team. (I guess you could imagine that the only reason the interp team exists at Anthropic is that Anthropic believes interp is great for advancing capabilities, but this seems pretty unlikely to me.)

(Note that the criterion is "not specifically work towards advancing capabilities", as opposed to "try not to advance capabilities".)

habryka (3y)

I have found much more success modeling intentions and institutional incentives at the organization level than the team level.

My guess is the interpretability team is under a lot of pressure to produce insights that would help the rest of the org with capabilities work. In general I've found arguments of the type "this team in this org is working towards totally different goals than the rest of the org" to have a pretty bad track record, unless you are talking about very independent and mostly remote teams.

RobertM (3y)

My guess is the interpretability team is under a lot of pressure to produce insights that would help the rest of the org with capabilities work

I would be somewhat surprised if this was true, assuming you mean a strong form of this claim (i.e. operationalizing "help with capabilities work" as relying predominantly on 1st-order effects of technical insights, rather than something like "help with capabilities work by making it easier to recruit people", and "pressure" as something like top-down prioritization of research directions, or setting KPIs which rely on capabilities externalities, etc).

I think it's more likely that the interpretability team(s) operate with approximately full autonomy with respect to their research directions, and to the extent that there's any shaping of outputs, it's happening mostly at levels like "who are we hiring" and "org culture".

habryka (3y)

The pressure here looks more like "I want to produce work that the people around me are excited about, and the kind of thing they are most excited about is stuff that is pretty directly connected to improving capabilities", whereby I include "getting AIs to perform a wider range of economically useful tasks" as "improving capabilities".

I definitely don't think this is the only pressure the team is under! There are lots of pressures that are acting on them, and my current guess is that it's not the primary pressure, but I would be surprised if it isn't quite substantial.

James Payor (3y)

I don't think that the interp team is a part of Anthropic just because they might help with a capabilities edge; seems clear they'd love the agenda to succeed in a way that leaves neural nets no smarter but much better understood. But I'm sure that it's part of the calculus that this kind of fundamental research is also worth supporting because of potential capability edges. (Especially given the importance of stuff like figuring out the right scaling laws in the competition with OpenAI.)

(Fwiw I don't take issue with this sort of thing, provided the relationship isn't exploitative. Like if the people doing the interp work have some power/social capital, and reason to expect derived capabilities to be used responsibly.)

Mark Xu (3y)

I think it's probably reasonable to hold off on publishing interpretability if you strongly suspect that it also advances capabilities. But then that's just an instance of a general principle of "maybe don't advance capabilities", and the interpretability part was irrelevant. I don't really buy that interpretability is particularly likely to increase capabilities, such that you should have a sense of general caution around this. If you have a specific sense that e.g. working on nuclear fission could produce a bomb, then maybe you shouldn't publish (as has historically happened with e.g. research on graphite as a neutron moderator I think), but generically not publishing physics stuff because "it might be used to build a bomb, vaguely" seems like it basically won't matter.

I think Gwern is an interesting case, but also idk what Gwern was trying to do. I would also be surprised if Gwern's effect was "pretty substantial" by my lights (e.g. I don't think Gwern explained > 1% or even probably 0.1% variance in capabilities, and by the time you're calling 1000 things "pretty substantial effects on capabilities" idk what "pretty substantial" means).

habryka (3y)

I think Gwern is an interesting case, but also idk what Gwern was trying to do. I would also be surprised if Gwern's effect was "pretty substantial" by my lights (e.g. I don't think Gwern explained > 1% or even probably 0.1% variance in capabilities, and by the time you're calling 1000 things "pretty substantial effects on capabilities" idk what "pretty substantial" means).

This feels a bit weird. Almost no individual explains 0.1% of the variance in capabilities. In general it seems like norms and guidelines like the ones discussed in the OP could make a difference on the order of 10% in capability speeds, which depending on your beliefs about p(doom) can translate into a 0.1% to 1% increase or decrease in the chance of literally everyone going extinct. It also seems pretty reasonable for someone to think this kind of stuff doesn't really matter at all, though I don't currently think that.

I don't really buy that interpretability is particularly likely to increase capabilities, such that you should have a sense of general caution around this.

Hmm, I don't know why you don't buy this. If I was trying to make AGI happen as fast as possible I would totally do a good amount of interpretability research and would probably be interested in hiring Chris Olah and other top people in the field. Chris Olah's interpretability work is one of the most commonly used resources in graduate and undergraduate ML classes, so people clearly think it helps you get better at ML engineering, and honestly, if I was trying to develop AGI as fast as possible I would find it a lot more interesting and promising to engage with than 95%+ of academic ML research.

I also bet that if we were to run a survey on what blogposts and papers top ML people would recommend that others should read to become better ML engineers, you would find a decent number of Chris Olah's publications in the top 10 and top 100.

I don't understand why we should have a prior that interpretability research is inherently safer than other types of ML research?

Mark Xu (3y)

I don't really want to argue about language. I'll defend "almost no individual has a pretty substantial effect on capabilities." I think publishing norms could have a pretty substantial effect on capabilities, and also a pretty substantial effect on interpretability, and currently think the norms suggested have a tradeoff that's bad-on-net for x-risk.

Chris Olah's interpretability work is one of the most commonly used resources in graduate and undergraduate ML classes, so people clearly think it helps you get better at ML engineering

I think this is false, and that most ML classes are not about making people good at ML engineering. I think Olah's stuff is disproportionately represented because it's interesting and is presented well, and also that classes really love being like "rigorous" or something in ways that are random. Similarly, probably like proofs of the correctness of backprop are common in ML classes, but not that relevant to being a good ML engineer?

I also bet that if we were to run a survey on what blogposts and papers top ML people would recommend that others should read to become better ML engineers, you would find a decent number of Chris Olah's publications in the top 10 and top 100.

I would be surprised if lots of ML engineers thought that Olah's work was in the top 10 best things to read to become a better ML engineer. I have weaker beliefs about the top 100. I would take even odds (and believe something closer to 4:1 or whatever), that if you surveyed good ML engineers and asked for top 10 lists, not a single Olah interpretability piece would be in the top 10 most mentioned things. I think most of the stuff will be random things about e.g. debugging workflow, how to deal with computers, how to use libraries effectively, etc. If anyone is good at ML engineering and wants to chime in, that would be neat.

I don't understand why we should have a prior that interpretability research is inherently safer than other types of ML research?

Idk, I have the same prior about trying to e.g. prove various facts about ML stuff, or do statistical learning theory type things, or a bunch of other stuff. It's just like, if you're not trying to eke out more oomph from SGD, then probably the stuff you're doing isn't going to allow you to eke out more oomph from SGD, because it's kinda hard to do that and people are trying many things.

habryka (3y)

I don't really want to argue about language. I'll defend "almost no individual has a pretty substantial effect on capabilities." I think publishing norms could have a pretty substantial effect on capabilities, and also a pretty substantial effect on interpretability, and currently think the norms suggested have a tradeoff that's bad-on-net for x-risk.

Yep, makes sense. No need to argue about language. In that case I do think Gwern is a pretty interesting datapoint, and seems worth maybe digging more into.

I would be surprised if lots of ML engineers thought that Olah's work was in the top 10 best things to read to become a better ML engineer. I have weaker beliefs about the top 100. I would take even odds (and believe something closer to 4:1 or whatever), that if you surveyed good ML engineers and asked for top 10 lists, not a single Olah interpretability piece would be in the top 10 most mentioned things. I think most of the stuff will be random things about e.g. debugging workflow, how to deal with computers, how to use libraries effectively, etc. If anyone is good at ML engineering and wants to chime in, that would be neat.

I would take a bet at 2:1 in my favor for the top 10 thing. Top 10 is a pretty high bar, so I am not at even odds.

Idk, I have the same prior about trying to e.g. prove various facts about ML stuff, or do statistical learning theory type things, or a bunch of other stuff. It's just like, if you're not trying to eke out more oomph from SGD, then probably the stuff you're doing isn't going to allow you to eke out more oomph from SGD, because it's kinda hard to do that and people are trying many things.

Hmm, yeah, I do think I disagree with the generator here, but I don't feel super confident and this perspective seems at least plausible to me. I don't believe it with enough probability to make me think that there is negligible net risk, and I feel like I have a relatively easy time coming up with counterexamples from science and other industries (the nuclear scientists working on nuclear fission did indeed not work on making weapons, and many people were working on making weapons).

Not sure how much it's worth digging more into this here.

Arthur Conmy (3y)

Which sorts of works are you referring to on Chris Olah's blog? I see mostly vision interpretability work (which has not helped with vision capabilities), RNN stuff (which essentially does not help capabilities because of transformers) and one article on back-prop, which is more engineering-adjacent but probably replaceable (I've seen pretty similar explanations in at least one publicly available Stanford course).

habryka (3y)

I've seen a lot of the articles here used in various ML syllabi: https://distill.pub/

The basic things studied here transfer pretty well to other architectures. An understanding of the hierarchical nature of features transfers from vision to language, and indeed when I hear people talk about how features are structured in LLMs, they often use language borrowed from what we know about how they are structured in vision (i.e. having metaphorical edge-detectors/syntax-detectors that then feed up into higher level concepts, etc.)

JakubK (3y)

Anthropic, which is currently the biggest publisher of interp-research, clearly does not have a commitment to not work towards advancing capabilities

This statement seems false based on this comment from Chris Olah.

habryka (3y)

I am not sure what you mean. Anthropic clearly is aiming to make capability advances. The linked comment just says that they aren't seeking capability advances for the sake of capability advances, but want some benefit like better insight into safety, or better competitive positioning.

JakubK (3y)

Oh I see; I read too quickly. I interpreted your statement as "Anthropic clearly couldn't care less about shortening timelines," and I wanted to show that the interpretability team seems to care.

Especially since this post is about capabilities externalities from interpretability research, and your statement introduces Anthropic as "Anthropic, which is currently the biggest publisher of interp-research." Some readers might conclude corollaries like "Anthropic's interpretability team doesn't care about advancing capabilities."

habryka (3y)

Makes sense, sorry for the confusion.

Marius Hobbhahn (3y)

Just to get some intuitions.

Assume you had a tool that basically allows you to explain the entire network, every circuit and mechanism, etc. The tool spits out explanations that are easy to understand and easy to connect to specific parts of the network, e.g. attention head x is doing y. Would you publish this tool to the entire world or keep it private or semi-private for a while?

Mark Xu (3y)

I think this case is unclear, but also not central because I'm imagining the primary benefit of publishing interp research as being making interp research go faster, and this seems like you've basically "solved interp", so the benefits no longer really apply?

Mark Xu (3y)

Similarly, if you thought that you should publish capabilities research to accelerate to AGI, and you found out how to build AGI, then whether you should publish is not really relevant anymore.

the gears to ascension (3y)

agreed, but also, interpretability is unusually impactful capabilities work

Daniel Murfet (3y)

The set of motivated, intelligent people with the relevant skills to do technical alignment work in general, and mechanistic interpretability in particular, has a lot of overlap with the set of people who can do capabilities work. That includes many academics, and students in masters and PhD programs. One way or another they're going to publish; would you rather it be alignment/interpretability work or capabilities work?

It seems to me that speeding up alignment work by several orders of magnitude is unlikely to happen without co-opting a significant number of existing academics, labs and students in related fields (including mathematics and physics in addition to computer science). This is happening already, within ML groups but also physics (Max Tegmark's students) and mathematics (e.g. some of my students at the University of Melbourne).

I have colleagues in my department publishing stacks of papers in CVPR, NeurIPS etc., which this community might call capabilities work. If I succeeded in convincing them to do some alignment or mechanistic interpretability work, they would do it because it was intrinsically interesting or likely to be high status. They would gravitate towards the kinds of work that are dual-use. Relative to the status quo that seems like progress to me, but I'm genuinely interested in the opinion of people here. Real success in this recruitment would, among other things, dilute the power of LW norms to influence things like publishing.

On balance it seems to me beneficial to aggressively recruit academics and their students into alignment and interpretability.

James Payor (3y)

To throw in my two cents, I think it's clear that whole classes of "mechanistic interpretability" work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.

And I think this points strongly against publishing this stuff, especially if the goal is to "make this whole field more prestigious real quick". Insofar as the prestige is coming from folks who work on AI capabilities, that's drinking from a poisoned well (since they'll grant the most prestige to the work that helps them accelerate).

One relevant point I don't see discussed is that interpretability research is trying to buy us "slack", but capabilities research consumes available "slack" as fuel until none is left.

What do I mean by this? Sometimes we do some work and are left with more understanding and grounding about what our neural nets are doing. The repeated pattern then seems to be that this helps someone design a better architecture or scale things up, until we're left with a new more complicated network. Maybe because you helped them figure out a key detail about gradient flow in a deep network, or let them quantize the network better so they can run things faster, or whatnot.

Idk how to point at this thing properly, my examples aren't great. I think I did a better job talking about this over here on twitter recently, if anyone is interested.

But anyhow I support folks doing their research without broadcasting their ideas to people who are trying to do capabilities work. It seems nice to me if there was mostly research closure. And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.

Neel Nanda (3y)

And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.

I'm surprised by this claim, can you say more? My read is weakly that people in interp under-publish to wider audiences (e.g. getting papers into conferences), though maybe that people overpublish blog posts? (Or that I try too hard to make things go viral on Twitter lol)

Noosphere89 (3y)

I disagree with James Payor on people overestimating publishing interpretability work, and I think it's the opposite: People underestimate how good publishing interpretability work is, primarily because a lot of people on LW view interpretability work as being solved by a single clean insight, when this is usually not the case.

To quote 1a3orn:

One way that people think about the situation, which I think leads them to underestimate the costs of secrecy, is that they think about interpretability as a mostly theoretical research program. If you think of it that way, then I think it disguises the costs of secrecy.

But in addition to being a research program, interpretability is in part about producing useful technical artifacts for steering DL, i.e., standard interpretability tools. And technology becomes good because it is used.

It improves through tinkering, incremental change, and ten thousand slight changes in which each increase improves some positive quality by 10% individually. Look at what the first cars looked like and how many transformations they went through to get to where they are now. Or look at the history of the gun. Or, what is relevant for our causes, look at the continuing evolution of open source DL libraries from TF to PyTorch to PyTorch 2. This software became more powerful and more useful because thousands of people have contributed, complained, changed one line of documentation, added boolean flags, completely refactored, and so on and so forth.

If you think of interpretability being "solved" through the creation of one big insight -- I think it becomes more likely that interpretability could be closed without tremendous harm. But if you think of it being "solved" through the existence of an excellent torch-shard-interpret package used by everyone who uses PyTorch, together with corresponding libraries for Jax, then I think the costs of secrecy become much more obvious.

Would this increase capabilities? Probably! But I think a world 5 years hence, where capabilities has been moved up 6 months relative to zero interpretability artifacts, but where everyone can look relatively easily into the guts of their models and in fact does so look to improve them, seems preferable to a world 6 months delayed but without these libraries.

I could be wrong about this being the correct framing. And of course, these frames must mix somewhat. But the above article seems to assume the research-insight framing, which I think is not obviously correct.

In general, I think interpretability research is net positive because capabilities will probably differentially progress towards more understandable models, where we are in a huge bottleneck right now for alignment.

James Payor (3y)

I think the issue is that when you get more understandable base components, and someone builds an AGI out of those, you still don't understand the AGI.

That research is surely helpful though if it's being used to make better-understood things, rather than enabling folk to make worse-understood more-powerful things.

I think moving in the direction of "insights are shared with groups the researcher trusts" should broadly help with this.

James Payor (3y)

Hm I should also ask if you've seen the results of current work and think it's evidence that we get more understandable models, moreso than we get more capable models?

James Payor (3y)

I'm perhaps misusing "publish" here, to refer to "putting stuff on the internet" and "raising awareness of the work through company Twitter" and etc.

I mostly meant to say that, as I see it, too many things that shouldn't be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction).

The transformer circuits work strikes me this way, and so do a bunch of others.

Also, I'm grateful to know your read! I'm broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.

Neel Nanda (3y)

Interesting, thanks for the context. I buy that this could be bad, but I'm surprised that you see little upside - the obvious upside, esp. for great work like transformer circuits, is getting lots of researchers nerdsniped and producing excellent and alignment-relevant interp work. Which seems huge if it works.

James Payor (3y)

I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.

Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.

I think indiscriminate hype and trying to do work that will be broadly attention-grabbing falls on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and there's a lot more attention/people/incentives for capabilities.

I think there are more targeted things that would be better for getting more good work to happen. Like research workshops or unconferences, where you choose who to invite, or building community with more aligned folk who are looking for interesting and alignment-relevant research directions. This would come with way less potential harm imo as a recruitment strategy.

Owain_Evans (3y)

Can you describe how the "local cluster" thing would work outside of keeping it within a single organization? I'd also be very interested in some case studies where people tried this.

James Payor (3y)

I mostly do just mean "keeping it within a single research group" in the absence of better ideas. And I don't have a better answer, especially not for independent folk or small orgs.

I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some "I won't use this for capabilities work without the permission of the authors" legal docs as well.

This isn't something I can visualize working, but maybe it has components of an answer.

sudo (3y)

Strong upvoted.

I’m excited about people thinking carefully about publishing norms. I think this post existing is a sign of something healthy.

Re Neel: I think that telling junior mech interp researchers to not worry too much about this seems reasonable. As a (very) junior researcher, I appreciate people not forgetting about us in their posts :)

Arthur Conmy (3y)

Thank you for providing this valuable synthesis! In reference to:

4. It is a form of movement building.

Based on my personal interactions with more experienced academics, many seem to view the objectives of mechanistic interpretability as overly ambitious (see Footnote 3 in https://distill.pub/2020/circuits/zoom-in/ as an example). This perception may deter them from engaging in interpretability research. In general, it appears that advancements in capabilities are easier to achieve than alignment improvements. Together with the emphasis on researcher productivity in the ML field, as measured by factors like h-index, academics are incentivized to select more promising research areas.

By publishing mechanistic interpretability work, I think the perception of the field's difficulty can be changed for the better, thereby increasing the safety/capabilities ratio of the ML community's output. As the original post acknowledges, this approach could have negative consequences for various reasons.

Max H (3y)

Another consideration: mechanistic interpretability might resolve some longstanding disagreements about the nature and difficulty of alignment, but not make the actual problem any easier. So, in the world where alignment turns out to be hard, interpretability research might make AI governance more effective and more possible.

I think this is a consideration mostly in favor of doing and publishing interpretability research, at least for now. But if interpretability ever advances to the point where interpretability results on their own are enough to update previously-skeptical LWers towards higher p(doom), it's probably worth re-evaluating.

JakubK (3y)

Thus, we decided to ask multiple people in the alignment scene about their stance on this question.
Richard

Any reason you're not including people's last names? To a newcomer this would be confusing. "Who is Richard?"

Marius Hobbhahn (3y)

People could choose how they want to publish their opinion. In this case, Richard chose the first name basis. To be fair though, there aren't that many Richards in the alignment community and it probably won't be very hard for you to find out who Richard is.

Joshua Clancy (2y)

I have a mechanistic interpretability paper I am working on / about to publish. It may qualify. Difficult to say. Currently, I think it would be better to be in the open. I kind of think of it as if... we were building bigger and bigger engines in cars without having invented the steering wheel (or perhaps windows?). I intend to post it to LessWrong / Alignment Forum. If the author gives me a link to that google doc group, I will send it there first. (Very possible it's not all that, I might be wrong, humans naturally overestimate their own stuff, etc.)

RussellThor (3y)

"Most people who work on ML do not care about alignment" I am a bit surprised by this, if it was true, are you sure it still is?

This likely depends a lot on what the “solution to superposition” looks like. A sparse coding scheme is less likely to be capabilities advancing than a fundamental insight into transformers that allows us to decode superposed features everywhere in the network.

Note that this goes both ways – just because mech interp. has not been particularly useful for alignment right now, does not mean that future work won’t!
