I've lately been running into, with increasing frequency, the idea that interpretability research might be bad because it could speed up capabilities more than alignment.  I find this idea interesting because it points to a potentially productive mine of buried confusion.

Stepping back a little bit: much discussion about e.g. differential progress on alignment vs. capabilities collapses a lot of complexity into single-dimensional quantities to be maximized.  There is also a common argument of the form "there's a lot of overlap between alignment and capabilities work".[1]  That argument doesn't make the exact same reduction but buries most of the same complexity.

What are capabilities?

Actually, the question is something like "how should we model capabilities, such that we can usefully distinguish which capabilities advancements are more likely to be downstream of certain kinds of alignment research, and which are less likely?".

The abstraction I'm playing around with right now is within-paradigm improvements vs. new-paradigm discovery.

Within-paradigm improvements might look like interpretability tooling which enables one-off or manually discovered improvements to hyperparameter tuning or other slight tweaks to existing architectures.

New-paradigm discovery could be anything from "this tooling let us figure out [something generalizable as applied to existing architectures, e.g. scaling laws]" to "this tooling let us come up with a brand new and improved model architecture".

What is alignment interpretability?

"Alignment" is too wide a bucket to usefully reason about directly, and even the categories at the "next" level of abstraction are still too wide to usefully reason about with respect to their potential impact on capabilities.  Let's look at an individual category: interpretability.

The question starts to break down.  How likely are insights and tooling focused on transformer circuits to advance capabilities?[2]  Shapley values?  Superposition?  ELK?  (Please keep in mind that if you do come to have a model by which some specific interpretability technique can be used to make rapid capabilities advancements, it is not obvious to me that a public LessWrong post is the best way to warn other interpretability researchers of this fact.) 

What else?

It's difficult to confidently answer questions of the form "how likely is [interpretability technique] to lead to the discovery of a new paradigm in ML?".  I expect someone deeply embedded in the field might have intuitions that are more reliable than noise for where there might be low-hanging fruit, but that is not a super high bar.  The questions reveal themselves to be counterfactuals: "How much effort will go into exploiting my new technique for capabilities research, and what do the results look like compared to the world where I don't publish?"

My current intuition[3] is that most interpretability research isn't very well-suited for within-paradigm capabilities advancements, since the "inner loop" of such research is too fast for current tooling to be very useful.  habryka points out the counterexample that e.g. training visualization tooling often works to directly shorten that "inner loop".

I am less certain about the potential for paradigmatic changes.  A failure mode I can vividly imagine here is something like this:

  1. an "interpretability" team comes out with an interesting result that has implications[4] for improving model performance
  2. the rest of the field, which had been happily hill-climbing their way up the capabilities slope and mostly considered interpretability to be an academic curiosity, notices this, and descends upon "interpretability" like a pack of ravenous hyenas
  3. "interpretability" turns out to be a goldmine for "capabilities advancements", compared to the field's previous research methodology

An important thing to keep in mind is that it's very difficult to evaluate the counterfactual impact in step 3.  Some situations that look like the story above may actually represent less capabilities advancement than the naive counterfactual.

Anyways, I don't have much of a point here, except that we should try to (carefully) unpack claims like "interpretability research might be bad because it could speed up capabilities more than alignment".  They can mean many different things, with wildly different implications for e.g. publishing/secrecy norms, the value of open-sourcing interpretability tooling, attempts to raise awareness, etc.

  1. ^

    I agree with the literal claim more than I did a few months ago, though I'm not sure I have the same underlying model/generator as those that I've previously seen advancing arguments that sound like this.

  2. ^

    Remember that this is not a single question, but as many questions as "divisions" you want to make within the notion of "capabilities".  Each of those breaks down to multiple questions when you break down "circuits".  All abstractions are leaky.

  3. ^

I'm very far from a domain expert.  Take with a big grain of salt.

  4. ^

Which may or may not be obvious to the authors, but are obvious to enough of the audience that someone successfully operationalizes them.

7 comments

For a discussion of capabilities vs safety, I made a video about it here, and a longer discussion is available here.

The post was pretty off-the-cuff, so there wasn't a good way to fit it in, but this comment by Paul describes some of these concerns pretty well, as well as some others I didn't touch on (e.g. techniques can help unblock commercialization of current architectures).

My working model for unintended capabilities improvement focuses more on second- and third-order effects: people see the promise of more capabilities and invest more money (e.g. found startups, launch AI-based products), which increases the need for competitive advantage, which pushes people to search for more capabilities (loosely defined).  There is also the direct improvement to the inner research loop, but this is less immediate for most work.

Under my model, basically any technical work that someone could look at and think "Yes, this will help me (or my proxies) build more powerful AI in the next 10 years" is a risk, and it comes down to risk mitigation (because being paralyzed by fear of making the wrong move is a losing strategy).

I think that interpretability researchers concerned about the safety of their discoveries should consider the option of not publishing potentially dangerous results.

Yes, that is a concern I share:

(Please keep in mind that if you do come to have a model by which some specific interpretability technique can be used to make rapid capabilities advancements, it is not obvious to me that a public LessWrong post is the best way to warn other interpretability researchers of this fact.) 

To rephrase: I think the default mode for discoveries in interpretability should be "don't publish", with publishing happening only after a careful weighing of upsides and downsides.  Researchers need to train themselves in the unconditional mental motion of asking "do I really want everybody to know about this?"

Yep, that's a distinct claim from the one I was making.  It's not a crazy position to hold as an ideal to strive for, but I'm not confident that it ends up being net-positive in the current regime absent other concurrent changes.  Need to think about it more.