EIS V: Blind Spots In AI Safety Interpretability Research

[-]xuan3yΩ5107

Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas. While I don't work on interpretability per se, I see similar things happening with value learning / inverse reinforcement learning approaches to alignment.

[-]xuan3y*Ω22-1

Regarding causal scrubbing in particular, it seems to me that there's a closely related line of research by Geiger, Icard and Potts that it doesn't seem like TAISIC is engaging with deeply? I haven't looked too closely, but it may be another example of duplicated effort / rediscovery:

The importance of interventions
Over a series of recent papers (Geiger et al. 2020, Geiger et al. 2021, Geiger et al. 2022, Wu et al. 2022a, Wu et al. 2022b), we have argued that the theory of causal abstraction (Chalupka et al. 2016, Rubinstein et al. 2017, Beckers and Halpern 2019, Beckers et al. 2019) provides a powerful toolkit for achieving the desired kinds of explanation in AI. In causal abstraction, we assess whether a particular high-level (possibly symbolic) model H is a faithful proxy for a lower-level (in our setting, usually neural) model N in the sense that the causal effects of components in H summarize the causal effects of components of N. In this scenario, N is the AI model that has been deployed to solve a particular task, and H is one’s probably partial, high-level characterization of how the task domain works (or should work). Where this relationship between N and H holds, we say that H is a causal abstraction of N. This means that we can use H to directly engage with high-level questions of robustness, fairness, and safety in deploying N for real-world tasks.

Source: https://ai.stanford.edu/blog/causal-abstraction/
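
To make the quoted definition concrete, here is a minimal sketch of an interchange intervention check on a toy pair of models. Everything in it is an assumption made for illustration: the task (a + b + c), the high-level variable S, the three hidden units, and the alignment between S and the units (h0, h1) are all hypothetical, and this is not code from the cited papers.

```python
# Toy illustration only: both models and the alignment between them are hypothetical.

def high_level(a, b, c, set_s=None):
    """High-level model H: S = a + b, output = S + c. `set_s` overrides S."""
    s = a + b if set_s is None else set_s
    return s + c

def low_level(a, b, c, set_h=None):
    """Low-level 'network' N with hidden units (h0, h1, h2).

    h0 and h1 jointly encode a + b (as 2*(a+b) and -(a+b)); h2 carries c.
    `set_h` overrides the units aligned with S, i.e. the pair (h0, h1).
    """
    h0, h1, h2 = 2 * (a + b), -(a + b), c
    if set_h is not None:
        h0, h1 = set_h
    return h0 + h1 + h2

def aligned_units(a, b, c):
    """Values that the units aligned with S take when N runs on (a, b, c)."""
    return (2 * (a + b), -(a + b))

def interchange_agrees(base, source):
    """Run H and N on `base`, with S (or its aligned units) taken from `source`."""
    h_out = high_level(*base, set_s=source[0] + source[1])
    n_out = low_level(*base, set_h=aligned_units(*source))
    return h_out == n_out

inputs = [(a, b, c) for a in range(3) for b in range(3) for c in range(3)]
# H counts as a causal abstraction of N (under this alignment, on these inputs)
# only if the two models agree under every aligned interchange intervention.
is_abstraction = all(interchange_agrees(x, y) for x in inputs for y in inputs)
print(is_abstraction)
```

The check is behavioral under interventions, not just on ordinary inputs: two models can agree on every unmodified input and still fail it if the hidden units do not play the causal role that H assigns to S.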

[-]LawrenceC3yΩ673

We were quite familiar with Geiger et al's work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops.

Hopefully this will be fixed with the forthcoming arXiv paper!

[-]xuan3yΩ220

Great to know, and good to hear!

[-]David Reber3yΩ012

Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas

Ditto. I've recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.

I see similar things happening with causality generally, where it seems to me that (as a 1st order heuristic) much of alignment forum's reference for causality is frozen at Pearl's 2008 textbook, missing what I consider to be most of the valuable recent contributions and expansions in the field.

Example: Finite Factored Sets seems to be reinventing causal representation learning [for a good intro, see Schölkopf 2021], where it seems to me that the broader field is outpacing FFS on its own goals. FFS promises some theoretical gains (apparently to infer causality where Pearl-esque frameworks can't) but I'm no longer as sure about the validity of this.
Counterexample(s): the Causal Incentives Working Group, and David Krueger's lab, for instance. Notably these are embedded in academia, where there's more culture (incentive) to thoroughly relate to previous work. (These aren't the only ones, just 2 that came to mind.)

[-]Alexander Gietelink Oldenziel3yΩ120

I was intrigued by your claim that FFS is already subsumed by work in academia. I clicked the link you provided, but from a quick skim it doesn't seem to do FFS or anything beyond the usual Pearl causality story, as far as I can tell. Maybe I am missing something. Could you point to a specific page where you think FFS is being subsumed?

[-]David Reber3yΩ010

Also, just to make sure we share a common understanding of Schölkopf 2021: Wouldn't you agree that asking "how do we do causality when we don't even know the level of abstraction at which to define causal variables?" is beyond the "usual Pearl causality story" as usually summarized in FFS posts? It certainly goes beyond Pearl's well-known works.

[-]David Reber3yΩ010

I don't think my claim is that "FFS is already subsumed by work in academia": as I acknowledge, FFS is a different theoretical framework than Pearl-based causality. I view them as two distinct approaches, but my claim is that they are motivated by the same question (that is, how to do causal representation learning).

It was intentional that the linked paper is an intro survey paper to the Pearl-ish approach to causal rep. learning: I mean to indicate that there are already lots of academic researchers studying the question "what does it mean to study causality if we don't have pre-defined variables?"

It may be that FFS ends up contributing novel insights above and beyond <Pearl-based causal rep. learning>, but a priori I expect this to occur only if FFS researchers are familiar with the existing literature, which I haven't seen mentioned in any FFS posts.

My line of thinking is: It's hard to improve on a field you aren't familiar with. If you're ignorant of the work of hundreds of other researchers who are trying to answer the same underlying question you are, odds are against your insights being novel / neglected.

[-]Alexander Gietelink Oldenziel3y20

Scott Garrabrant conceived of FFS as an extension and generalization of Pearlian causality that answers questions the Pearlian framework does not handle well. He is aware of Pearl's work and explicitly builds on it. It's not a distinct approach so much as an extension. The paper you mentioned discusses the problem of figuring out what the right variables are but poses no solution (as far as I can tell). That shouldn't be surprising, because the problem is very hard. Many people have thought about it, but there is only one Garrabrant.

I do agree with your overall perspective that people in alignment are quite insular, unaware of the literature and often reinventing the wheel.

[-][anonymous]3y11

Strong upvote here as well. The points about how even simple terminological differences can isolate research pursuits are especially pertinent, considering the tendency of people on and around LW to coin new phrases/ideas on a dime. Novel terminology is a valuable resource that we have been spending very frivolously.

[-]Richard_Ngo3yΩ440

Seems like there's a bunch of interesting stuff here, though some of it is phrased overly strongly.

E.g. "mechanistic interpretability requires program synthesis, program induction, and/or programming language translation" seems possible but far from obvious to me. In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways. Perhaps it's appropriate to advocate for MI researchers to pay more attention to these fields, but calling this an example of "reinventing", "reframing" or "renaming" seems far too strong.

Same for "we should not expect solving toy MI problems using humans to help with real world MI problems" - there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.

Ramanujan et al. (2020) showed that randomly initialized networks could be “trained” simply by pruning all of the weights that harmed performance on the task of interest. The resulting subnetwork may accomplish a task of interest, but only in a frivolous sense, and it should not be expected to generalize.

I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?

[-]scasper3yΩ221

Thanks for the comment.

In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.

This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that they would be differentially useful for safety.

there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.

No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine proposing that having humans write and study simple algorithms for search, modular addition, etc. should be part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work. I think that there has not been a clear trend toward this in the past 6 years with the circuits agenda.

I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?

Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. Ramanujan et al. weren't interpreting networks; they were just training them. So the underlying subnetworks may generalize well, but in this case, this is not interpretability work any more than gradient-based training of a sparse network is.
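
For readers who want the "training by pruning" idea in concrete form, here is a minimal sketch. It is not the algorithm from Ramanujan et al.: it is a simple greedy search over a single linear unit, and the task, sizes, and seed are all hypothetical choices made for illustration. The only point is that the weights are never updated; the mask alone is optimized.

```python
import random

def predict(weights, mask, x):
    """Linear unit whose active weights are selected by a 0/1 mask."""
    return sum(w * m * xi for w, m, xi in zip(weights, mask, x))

def loss(weights, mask, data):
    """Mean squared error over (input, target) pairs."""
    return sum((predict(weights, mask, x) - y) ** 2 for x, y in data) / len(data)

def prune_harmful(weights, data):
    """Greedily zero out any weight whose removal lowers the loss.

    The weights themselves are never changed; only the mask is.
    """
    mask = [1] * len(weights)
    improved = True
    while improved:
        improved = False
        for i in range(len(mask)):
            if mask[i] == 0:
                continue
            before = loss(weights, mask, data)
            mask[i] = 0
            if loss(weights, mask, data) < before:
                improved = True   # keep this weight pruned
            else:
                mask[i] = 1       # pruning did not help, so restore it
    return mask

rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(20)]   # random, never trained
data = []
for _ in range(50):
    x = [rng.uniform(-1, 1) for _ in range(20)]
    data.append((x, x[0] + x[1]))                    # hypothetical target task

mask = prune_harmful(weights, data)
dense_loss = loss(weights, [1] * 20, data)
pruned_loss = loss(weights, mask, data)
print(dense_loss, pruned_loss, sum(mask))
```

Whatever subnetwork the mask picks out was selected purely for task loss, which is why finding it is closer to training than to interpreting the original network.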

[-]carboniferous_umbraculum3y43

Re: e.g. superposition/entanglement:

I think people should try to understand the wider context into which they are writing, but I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names. In fact I'd say this happens all the time and generally people can just hold in their minds that another group has another name for it. Naturally, the two groups will have slightly different perspectives and this a) Is often good, i.e. the interference can be constructive and b) Can be a reason in favour of different terminology, i.e. even if something is "the same" when boiled down to a formal level, the different names can actually help delineate different interpretations.

In fact it's almost like a running joke in academia that there's always someone grumbling that you didn't cite the right things (their favourite work on this topic, their fellow countryman, them etc.) and because of the way academic literature works, some of the things that you are doing here can be done with almost any piece of work in the literature, i.e. you can comb over it with the benefit of hindsight and say 'hang on this isn't as original as it looked; basically the same idea that was written about here X years before' etc. Honestly, I don't usually think of this as a valuable exercise, but I may be missing something about your wider point or be more convinced once I've looked at more of your series.

Another point when it comes to 'originality' and 'progress' is that it's often unimportant if some idea was generally discussed, labelled, named, or thought about before if what matters is actual results and the lower-level content of these works. i.e. I may be wrong, but looking at what you are saying, I don't think you are literally pulling up an older paper on 'entanglement' that made the exact same points that the Anthropic papers were making and did very similar experiments (Or are you?) And even having said that, reproducing experiments exactly is of course very valuable.

Re: MI and program synthesis:

I understand that your take is that it is closer to program synthesis or program induction and that these aren't all the same thing, but in the first subsection of the "TAISIC has reinvented..." section, I'm a little confused why there's no mention of reverse engineering programs from compiled binaries. The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about and trying to understand (see e.g. Olah, and Nanda, in which he consults an expert).

[-][anonymous]3y22

The main problem on this site is that, despite people having widely varying levels of understanding of different subjects, nobody wants to look like an idiot on here. A lot of the comments and articles are basically nothing burgers. People often focus on insignificant points to argue about and spend their time on the social aspect of learning rather than actually learning about a subject themselves.

This made me wonder: do actual researchers who have value and substance to offer not participate in online discussions? The closest thing I've found is WordPress blogs by various people, where the posts have huge comment chains. The only other form of communication seems to be formal papers, which is pretty much as organized as it gets in terms of format.

I've learned that people who actually have deeper understanding and knowledge of value to offer don't waste their time on here. But I can't find any other platform that these people participate in. My guess is that they don't participate in any public discourse, only in private conversations with other people who have things of value to offer and discuss.

[-]scasper3y10

Thanks for the comment and pointing these things out.

---

I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names.

Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.

I don't know what we benefit from in this particular case with polysemanticity, superposition, and entanglement. Do you have a steelman for this more specific to these literatures?

---

In fact it's almost like a running joke in academia that there's always someone grumbling that you didn't cite the right things (their favourite work on this topic, their fellow countryman, them etc.)...

Good point. I would not say that the issue with the feature visualization and zoom in papers was merely a failure to cite related work. I would say that the issue is that they started a line of research that is causing confusion and redundant work. My stance here is based on how I see the isolation between the two types of work as needless.

---

I understand that your take is that it is closer to program synthesis or program induction and that these aren't all the same thing, but in the first subsection of the "TAISIC has reinvented..." section, I'm a little confused why there's no mention of reverse engineering programs from compiled binaries. The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about and trying to understand (see e.g. Olah, and Nanda, in which he consults an expert).

Thanks for pointing out these posts. They are examples of discussing an idea similar to MI's dependency on programmatic hypothesis generation, but they don't act on it: both serve to draw analogies rather than to provide methods. The thing in the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is the kind of thing I mentioned in the paragraph about existing work outside of TAISIC. I pasted it below for convenience :)

If MI work is to be more engineering-relevant, we need automated ways of generating candidate programs to explain how neural networks work. The good news is that we don’t have to start from scratch. The program synthesis, induction, and language translation literatures have been around long enough that we have textbooks on them (Gulwani et al., 2017, Qiu, 1999). And there are also notable bodies of work in deep learning that focus on extracting decision trees from neural networks (e.g. Zhang et al., 2019), distilling networks into programs in domain specific languages (e.g. Verma et al., 2018; Verma et al., 2019; Trivedi et al., 2021), and translating neural network architectures into symbolic graphs that are mechanistically faithful (e.g. Ren et al., 2021). These are all automated ways of doing the type of MI work that people in TAISIC want to do. Currently, some of these works (and others in the neurosymbolic literature) seem to be outpacing TAISIC on its own goals.

[-]carboniferous_umbraculum3y21

Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.

Ok it sounds to me like maybe there's at least two things being talked about here. One situation is

A) Where a community includes different groups working on the same topic, and where those groups might use different terminology and have different ways of thinking about the same phenomena etc. This seems completely normal to me. The other situation is

B) Where a group is isolated from the community at large and is using different terminology/thinking about things differently just as a result of their isolation and lack of communication. And where that behaviour then causes confusion and/or wasting of resources.

The latter doesn't sound good, but it looks to me like some or many of your points are consistent with the former being the case. So when you write e.g. that it's not "necessarily a good thing either" or ask for my steelmanned case, this doesn't seem to quite make sense to me. I feel like if something is not necessarily good or bad, and you want to raise it as a criticism, then the onus would be on you to bring the case against TAISIC with arguments that are not general ones that could easily apply to both A) and B) above. E.g. it'd be a more emphatic case if you were able to go into the details and say "X did this work here and claimed it was new but actually it exists in Y's paper here", or give a real example of needless confusion that was created and could have been avoided. Focussing just on what they did or didn't 'engage with' on the level of general concepts and citations/acknowledgements doesn't bring this case convincingly, in my opinion. Some more vague thoughts on why that is:

Bodies of literature like this are usually very complicated and messy and people genuinely can't be expected to engage with everything.
It's often hard or impossible to track dependencies between ideas, because of all the communication you cannot see and because you cannot see 'how' people are thinking of things, only what they wrote.
Someone publishing on the same idea or concept or topic as you is nowhere near the same as someone actually doing the exact same technical thing that you are doing. ime the former is happening all the time; and the latter is much rarer than people often think.
Reinvention, re-presentation and even outright renaming or 'starting from scratch' are all valuable elements of scholarship that help a field move along.

Idk maybe I'm just repeating myself at this point.

On the other point: It may turn out that MI's analogy with reverse software engineering does not produce methods and is just used as a high-level analogy, but it seems too early to say from my perspective - the two posts I linked are from last year. TAISIC is still pretty small, experienced researchers in TAISIC are fewer still, and this is potentially a large and difficult research agenda.

[-]Noosphere893yΩ34-3

I strongly downvoted this post, primarily because, contra you, I do actually think reframing/reinventing is valuable, and I think that the case for reframing/reinventing things is strawmanned here.

There is one valuable part of this post, which is the point that interpretability doesn't have good result-incentives. I agree with this criticism, but given the other points of the post, I strongly downvoted it.

[-]scasper3yΩ266

This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?

[-]Noosphere893y10

In this case, one steelmanned case for reframing/reinventing being productive is this post:

https://www.lesswrong.com/posts/ZZNM2JP6YFCYbNKWm/nothing-new-productive-reframing

The big reason reframing/reinventing is productive is that we are neither logically omniscient nor Bayesian optimal (that is, we don't update on all the data we receive), which makes reframing or reinventing things work like a shortcut.

Also, reinventing things can give you more bits by learning general processes for how to do something, unlike black boxes which only give you the output.

[-]scasper3y*10

I see the point of this post. No arguments with the existence of productive reframing. But I do not think this post makes a good case for reframing being robustly good. Obviously, it can be bad too. And for the specific cases discussed in the post, the post you linked doesn't make me think "Oh, these are reframed ideas, so good -- glad we are doing redundant work in isolation."

For example with polysemanticity/superposition I think that TAISIC's work has created generational confusion and insularity that are harmful. And I think TAISIC's failure to understand that MI means doing program synthesis/induction/language-translation has led to a lot of unproductive work on toy problems using methods that are unlikely to scale.

[-]Charlie Steiner3yΩ130

I think it's a big stretch to say that deception is basically just trojans. There are similarities, but the regularities that make deception a natural category of behavior that we might be able to detect are importantly fuzzier than the regularities that trojan-detecting strategies use. If "deception" just meant acting according to a wildly different distribution when certain cues were detected, trojan-detection would have us covered, but what counts as "deception" depends more heavily on our standards for the reasoning process, and doesn't reliably result in behavior that's way different than non-deceptive behavior.

[-]scasper3yΩ121

Thanks. See also EIS VIII.

Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.

[-]Charlie Steiner3yΩ230

I'm slowly making my way through these, so I'll leave you a more complete comment after I read post 8.

[-]4gate1y10

Not sure exactly how to frame this question, and I know the article is a bit old. Mainly curious about the program synthesis idea.

On some level, finding an explanatory model for literally any phenomenon can, it would seem, be claimed to be a "program synthesis problem". For example, historically, we have wanted to synthesize a set of mathematical equations to describe/predict (model) the movement of stars in the sky, or rates of chemical reactions in terms of certain measurements (and so on). Even in non-mathematical cases, we have wanted to find context-specific languages (not necessarily formal, but with some elements of formality such as constraints on what relations are allowed, etc.) that map onto things such as biology, psychology, etc.

I think it's fair to call these programs, since they are tools you use in a sort of causal way to say what will happen. Usually, you imagine certain objects that follow certain rules to do things, thereby changing the state of the world. They are things you could write as programs or instructions.

The art here is to be able to formalize a language that has the right parametrization to describe and predict the desired phenomena well, while being expressive enough to grow in a useful way, as we discover more.

But anyways, there are sort of two questions that naturally arise here:

Why is MI more closely related to program synthesis than any other field that wishes to explain a process that can be thought of as a program (i.e. it has causal components that happen over time)?
I was under the impression that MI is in the business of trying to establish the right language and concepts to use to describe the information processing done by deep learning models. The field has not really cracked the "art" here yet AFAIK. With that said, I'm guessing that the program synthesis literature and tooling has a slightly different goal and therefore carries certain baggage of how one goes about thinking about these problems (i.e. maybe more of a lean towards symbolic methods). But the program synthesis literature probably doesn't actually create the right language to have a 10x conceptual framework for the science of deep learning information processing because otherwise we would have a lot more solved problems than we do. So in this sense, a new start is not necessarily bad. You can think about this in some sense fuzzily analogous to how Galois invented (if you can say that) a new branch of math to solve the so-far-unsolved 5th degree polynomial root-finding problem. It was not by using the already-existent tools that this problem was solved. You can also think of this as a society-level de-biasing strategy. If DL is ever to have an explanatory framework on par with, say, Classical Mechanics, it appears that we need a conceptual 10x-ing. Do you agree with this framing? If so, what do you think is a healthy amount of rediscovery?
