Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Part 13 of 12 in the Engineer’s Interpretability Sequence.

TL;DR

On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it underperformed my expectations. I am beginning to be concerned that Anthropic’s recent approach to interpretability research might be better explained by safety washing than practical safety work. 

Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates. 

Reflecting on predictions

See my original post for 10 specific predictions about what today’s paper would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 and obviously did not do 4, 5, 7, 8, 9, and 10. Meanwhile, I think that their experiments to identify specific and safety-relevant features should count for 3 (proofs of concept for a useful type of task) but definitely do not count for 6 (*competitively* finding and removing a harmful behavior that was represented in the training data).

Thus, my assessment is that Anthropic did 1-3 but not 4-10. I have been wrong with mech interp predictions in the past, but this time, everything I predicted with >50% probability happened, and everything I predicted with <50% probability did not happen. 

The predictions were accurate in one sense. But overall, the paper underperformed expectations. If you scored the paper relative to my predictions by giving it (1-p) points when it did something that I predicted it would do with probability p and -p points when it did not, the paper would score -0.74. 
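
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the calculation behind the -0.74 figure: the function name `score_paper` and the probabilities in the example are hypothetical, since the ten original probabilities are not reproduced here.

```python
def score_paper(predictions):
    """Score a paper against a list of (p, happened) pairs.

    p is the probability assigned to the paper doing something;
    happened is True if the paper did it. The paper earns (1 - p)
    points for each thing it did and -p points for each thing it
    did not do.
    """
    return sum((1 - p) if happened else -p for p, happened in predictions)


# Hypothetical forecasts, NOT the probabilities from the original post.
example_predictions = [
    (0.9, True),   # likely, and it happened: +0.1
    (0.6, True),   # somewhat likely, and it happened: +0.4
    (0.3, False),  # unlikely, and it did not happen: -0.3
    (0.2, False),  # unlikely, and it did not happen: -0.2
]

example_score = score_paper(example_predictions)
print(round(example_score, 2))
```

Under this rule the total equals the number of predicted items the paper delivered minus the sum of the assigned probabilities, so a calibrated forecaster expects a score of zero. If I read the rule correctly, meeting three predictions with a score of -0.74 implies that the ten probabilities summed to 3.74.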

A review + thoughts

I think that Anthropic’s new SAE work has continued to be like lots of prior high-profile work on mechanistic interpretability – it has focused on presenting illustrative examples, streetlight demos, and cherry-picked proofs of concept. This is useful for science, but it does not yet show that SAEs are helpful and competitive for diagnostic and debugging tasks that could improve AI safety.

I feel increasingly worried about how Anthropic motivates and sells its interpretability research in the name of safety. Today’s paper makes some major motte-and-bailey claims that oversell what was accomplished, such as “Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer,” “Sparse autoencoders produce interpretable features for large models,” and “The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.” The paper also omitted past literature on interpretability illusions (e.g., Bolukbasi et al., 2021), which their methodology seems prone to. Normally, problems like this are mitigated by peer review, which Anthropic does not participate in. Meanwhile, whenever Anthropic puts out new interpretability research, I always see a laundry list of posts from the company and employees to promote it. They always seem to claim the same things – that some ‘groundbreaking new progress has been made’ and that ‘the model was even more interpretable than they thought’ but that ‘there remains progress to be made before interpretability is solved’. I won’t link to any specific person’s posts, but here are Anthropic’s posts from today and October 2023.

The way that Anthropic presents its interpretability work has real-world consequences. For example, it led to this viral claim that interpretability will be solved and that we are bound for safe models. It also seems to have led to at least one claim in a policy memo that advocates of AI safety are being silly because mechanistic interpretability was solved. Meanwhile today, it seems that Anthropic orchestrated a New York Times article to be released alongside the paper, claiming to the public that exciting progress has been made (although the article also made helpful critical commentary on limitations!).

If interpretability is ever going to be helpful for safety, it will need to be useful and competitive in practical applications. This point has been made consistently for the better part of a decade (e.g., Ananny and Crawford, 2016; Lipton, 2016; Doshi-Velez and Kim, 2017; Miller, 2018; Krishnan, 2020; Rauker et al., 2022). Despite this, it seems to me that Anthropic has so far not applied its interpretability techniques to practical tasks and shown that they are competitive. Instead of testing applications and beating baselines, the recent approach has been to keep focusing on streetlight demos and showing lots of cherry-picked examples. I hope to see this change soon.

I don't think that SAE research is misguided. In my post, I pointed out 6 things that I think they could be useful for. Meanwhile, some good recent work has demonstrated proofs of concept that SAEs can be useful on certain non-cherry-picked tasks of practical value and interest (Marks et al., 2024). I think that it's very possible that SAEs and other interpretability techniques can be lenses into models that can help us find useful clues and insights. However, Anthropic's research on SAEs has yet to demonstrate practical usefulness that could help engineers in real applications. 

I know that members of the Anthropic interpretability team have been aware of this critique. Meanwhile, Anthropic and its employees consistently affirm that their work is motivated by safety in the real world. But is it? I am starting to wonder about the extent to which the interpretability team’s current agenda is better explained by practical safety work versus doing sophistical safety washing to score points in social media, news, and government.

Thanks to Ryan Greenblatt and Buck Shlegeris. I did not consult with them on this post, but they pointed out some useful things in a Slack thread that I put in here.


 

16 comments
starship006

It also seems to have led to at least one claim in a policy memo that advocates of AI safety are being silly because mechanistic interpretability was solved.

Small nitpick (I agree with mostly everything else in the post and am glad you wrote it up). This feels like an unfair criticism - I assume you are referring specifically to the statement in their paper that:

Although advocates for AI safety guidelines often allude to the "black box" nature of AI models, where the logic behind their conclusions is not transparent, recent advancements in the AI sector have resolved this issue, thereby ensuring the integrity of open-source code models.

I think Anthropic's interpretability team, while making maybe dubious claims about the impact of their work on safety, has been clear that mechanistic interpretability is far from 'solved.' For instance, Chris Olah in the linked NYT article from today:

“There are lots of other challenges ahead of us, but the thing that seemed scariest no longer seems like a roadblock,” he said.

Also, in the paper's section on Inability to Evaluate:

it's unclear that they're really getting at the fundamental thing we care about

I think they are overstating how far/useful mechanistic interpretability is currently. However, I don't think this messaging is close to 'mechanistic interpretability solves AI Interpretability' - this error is on a16z, not Anthropic. 

Neel Nanda

+1, I think the correct conclusion is "a16z are making bald-faced lies to major governments" not "a16z were misled by Anthropic hype"

scasper

Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements. 

Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this. 

It seems at least somewhat reasonable to ask people to write defensively to guard against their statements being misused by adversarial actors. I recognize this is an annoying ask that may have significant overhead; perhaps it will turn out to not be worth the cost.

I appreciate this post. Emphasizing a couple things and providing some other commentary/questions on the paper (as there doesn't seem to be a better top level post for it) (I have not read the paper deeply and could be missing things):

  • I find the Twitter vote brigading to be annoying and slightly bad for collective epistemics. I do not think this paper was particularly good, and it did not warrant the attention it got. (The main flaws IMO are a lack of (empirical) comparison to other methods — except a brief interlude in the appendix; and lack of any benchmarking — for example testing if clamping sycophancy features affects performance on sycophancy benchmarks)
  • At an object level, one concerning-to-me result is that there doesn't appear to be a clean gradient in the presence of a feature over the range of activation values. You might hope that if you take the AI risk feature[1], and look at dataset examples that span its activation values (as the tool does), you would see highly activating text be very related to AI risk and low activating text be only slightly related. I think that pattern is weak: there are at least some low-activation examples that are highly related to AI risk, such as '..."It's what they're programmed to do." "Destroy all technology other than their own"' (cherrypicked by me). This is related to sensitivity, which the paper mentions is difficult to study in this context (before mentioning one cherry-picked result). I care about this because one way to use SAEs for safety is as a classifier for malicious behavior (by checking if model activations correspond to dangerous features); this would really benefit from having a nice smooth relationship between feature activation magnitude and actual feature presence, and it pretty much needs to have high sensitivity. Given the existence of highly-feature-related samples in the bottom activation interval, I feel fairly worried that sensitivity is poor, and that it will be hard to do magnitude-based thresholds; it pretty much looks like 0 is the reasonable threshold given these results.
  1. ^

    In the paper this is labeled with "The concept of an advanced AI system causing unintended harm or becoming uncontrollable and posing an existential threat to humanity"
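
The sensitivity concern above can be made concrete with a small sketch. It assumes a set of text samples that have been hand-labeled as related or unrelated to the feature, plus the feature's activation on each; the function name `sensitivity_at_threshold` and all of the numbers are hypothetical and do not come from the paper.

```python
def sensitivity_at_threshold(activations, labels, threshold):
    """Fraction of truly feature-related samples flagged by the threshold.

    activations: feature activation value for each sample
    labels: True if a human judged the sample related to the feature
    threshold: a sample is flagged when its activation exceeds this value
    """
    positives = [a for a, related in zip(activations, labels) if related]
    if not positives:
        raise ValueError("need at least one feature-related sample")
    flagged = sum(1 for a in positives if a > threshold)
    return flagged / len(positives)


# Hypothetical data: two related samples activate only weakly.
activations = [0.05, 0.10, 0.80, 2.50, 0.02, 3.10]
labels = [True, False, True, True, False, True]

low = sensitivity_at_threshold(activations, labels, 0.0)
high = sensitivity_at_threshold(activations, labels, 1.0)
print(low, high)
```

In this toy data, raising the threshold from 0 to 1 drops half of the related samples, which is the failure mode described above: when related text shows up at low activations, any nonzero cutoff costs sensitivity.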

Ruby

Curated. I like this post for several reasons: Making predictions about future research seems neat and valuable – I could see the habit of doing this, especially if predicting results, helping one build skill in prioritizing research. As scasper says, interpretability isn’t yet practically helpful, and even if that’s been said a lot, it’s worth continuing to say that, especially as mech interp continues to be one of the most accessible/hottest AI safety tech work paths. And I like this work for being a review. LessWrong’s annual review works to elicit reviews like this, but they’re valuable to have immediately to help people put things in context and orient them.

This criticism feels a bit strong to me. Knowing the extent to which interpretability work scales up to larger models seems pretty important. I could have imagined people arguing either that such techniques would work worse on larger models because of the required optimizations or better because fewer concepts would be in superposition. Work on this feels quite important, even though there's a lot more work to be done.

Also, sharing some amount of eye-catching results seems important for building excitement for interpretability research.

Update: I skipped the TLDR when I was reading this post b/c I just read the rest. I guess I'm fine with Anthropic mostly focusing on establishing one kind of robustness and leaving other kinds of robustness for future work. I'd be more likely to agree with Stephen Casper if there isn't further research from Anthropic in the next year that makes significant progress in evaluating the robustness of their approach. One additional point: independent researchers can run some of these other experiments, but they can't run the scaling experiment.

Note that scasper said:

Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights,

I (like scasper) think this work is useful, but I share some of scasper's concerns.

In particular:

  • I think prior work like this from the Anthropic interp team has been systematically overrated by others and the Anthropic interp team could take actions to avoid this.
    • IMO, buzz on twitter systematically overrates the results of this paper and their importance.
  • I'm uncertain, but I think I might prefer less excitement for this style of interp research all else equal.
  • Heuristically, it seems bad if people systematically overrate a field where one of the core aims is to test for subtle and dangerous failure modes.
  • I'd be excited for further work focusing on developing actually useful MVPs and this seems more important than more work like this.
    • I think the theory of change commonly articulated by various people on the Anthropic interp team (enumerative safety to test for deceptive alignment), probably requires way harder core technology and much more precise results (at least to get more than a bit or two of evidence). Additional rigor and trying to assess the extent to which you understand things seems important for this. So, I'd like to see people try on this and update faster. (Including myself: I'm not that sure!)
    • I think other less ambitious theories of change are more plausible (e.g. this recent work), and seeing how these go seems more informative for what to work on than eyeballing SAEs IMO.

I don't think the post is saying the result is not valuable. The claim is that it underperformed expectations. A stock's price falls when the company underperforms expectations, even if the company is profitable; that does not mean it made a loss.

it seems to me that Anthropic has so far failed to apply its interpretability techniques to practical tasks and show that they are competitive

Do you not consider the steering examples in the recent paper to be a practical task, or do you think that competitiveness hasn't been demonstrated (because people were already doing activation steering without SAEs)? My understanding of the case for activation steering with unsupervisedly-learned features is that it could circumvent some failure modes of RLHF.

scasper

Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don't think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems. 

The first is that it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them. 

The second is that there's no baseline that SAE edits are compared to. There are lots of techniques from the editing, finetuning, steering, rep-E, data curation, etc. literatures that people use to make specific changes to models' behaviors. Ideally, we'd want SAEs to be competitive with them. Unfortunately, good comparisons would be hard because using SAEs for editing models is a pretty unique method with lots of compute required upfront. This would make it non-straightforward to compare the difficulty of making different changes with different methods, but it does not obviate the need for baselines. 

Further context about the "recent advancements in the AI sector have resolved this issue" paragraph:

Jai

Hype is a useful social mechanism for eliciting acute criticism and exposing flaws. If you want to know what your weaknesses are, you could do worse than to paint a giant target on your back.

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

Why was this downvoted?

kave

Some people find the messages annoying. I personally don't love the large amount of vertical space they take up. Looks like someone went through and downvoted a bunch of recent comments by the bot.