CallumMcDougall

# Sequences

Monthly Algorithmic Problems in Mech Interp

# Wiki Contributions

Oh yeah this is great, thanks! For people reading this, I'll highlight SLT + developmental interp + mamba as areas which I think are large enough to have specific exercise sections but currently don't.

Thanks!! Really appreciate it

Thanks, really appreciate this (and the advice for later posts!)

Yep, definitely! If you're using MSE loss then it's pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you're interested, I think Redwood's paper on capacity (which is the same as what Anthropic calls dimensionality) looks at the derivative of the loss wrt the capacity assigned to a given feature.
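To make the backprop point concrete, here's a minimal sketch. It assumes the importance-weighted MSE used in the toy models paper, i.e. a per-input loss of sum_i I_i * (x_i - x_hat_i)^2. The function names and the numbers are made up for illustration, and the gradient is written out by hand rather than computed with autograd. The thing to notice is that the gradient on each feature's reconstruction is scaled by that feature's importance, so the same error on a more important feature produces a proportionally larger training signal.

```python
def weighted_mse(x, x_hat, importance):
    """Importance-weighted squared error for one input: sum_i I_i * (x_i - x_hat_i)^2."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(importance, x, x_hat))


def grad_wrt_reconstruction(x, x_hat, importance):
    """Analytic gradient of the loss above: dL/dx_hat_i = -2 * I_i * (x_i - x_hat_i)."""
    return [-2.0 * w * (a - b) for w, a, b in zip(importance, x, x_hat)]


# Hypothetical values: both features are reconstructed with the same error (0.5),
# but feature 0 is ten times as important as feature 1.
x = [1.0, 1.0]
x_hat = [0.5, 0.5]
importance = [1.0, 0.1]

loss = weighted_mse(x, x_hat, importance)
grad = grad_wrt_reconstruction(x, x_hat, importance)

print("loss:", loss)
print("gradient per feature:", grad)
```

With identical reconstruction errors, the ratio between the two gradient entries is just the ratio of the importances.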

Good question! In the first batch of exercises (replicating toy models of interp), we play around with different importances. There are some interesting findings here (e.g. when you decrease sparsity to the point where you no longer represent all features, it's usually the lower-importance features which collapse first). I chose not to have the SAE exercises use varying importance, although it would be interesting to play around with this and see what you get!

As for what importance represents, it's basically a proxy for "how much a certain feature reduces loss, when it actually is present." This can be independent from feature probability. Anthropic included it in their toy models paper in order to make those models truer to reality, in the hope that the setup could tell us more interesting lessons about actual models. From the TMS paper:

> Not all features are equally useful to a given task. Some can reduce the loss more than others. For an ImageNet model, where classifying different species of dogs is a central task, a floppy ear detector might be one of the most important features it can have. In contrast, another feature might only very slightly improve performance.

If we're talking features in language models, then importance would be "average amount that this feature reduces cross entropy loss". I open-sourced an SAE visualiser which you can find here. You can navigate through it and look at the effect of features on loss. It doesn't actually show the "overall importance" of a feature, but you should be able to get an idea of the kinds of situations where a feature is super loss-reducing and when it isn't. Example of a highly loss-reducing feature: feature #8, which fires on Django syntax and strongly predicts the "django" token. This seems highly loss-reducing because (although sparse) it's very often correct when it fires with high magnitude. On the other hand, feature #7 seems less loss-reducing, because a lot of the time it's pushing for something incorrect (maybe there exist other features which balance it out).
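For a rough picture of what "average amount that this feature reduces cross entropy loss" could mean in practice, here's a toy sketch. It is not how the visualiser computes anything. It assumes we can write down a feature's direct contribution to the output logits and ablate the feature by subtracting that contribution. The three-token vocabulary, the `feature_effect` vectors and the function names are all hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def mean_loss_reduction(examples):
    """Average of (loss with the feature ablated) - (loss with the feature present).

    Each example is (logits, feature_effect, target), where feature_effect is the
    assumed direct contribution of the feature to the logits on that token.
    Positive output means the feature reduces loss on average.
    """
    diffs = []
    for logits, feature_effect, target in examples:
        ablated = [l - e for l, e in zip(logits, feature_effect)]
        diffs.append(cross_entropy(ablated, target) - cross_entropy(logits, target))
    return sum(diffs) / len(diffs)


# The feature boosts token 0 by 2 logits in both cases.
# Case A: token 0 is correct, so the feature helps.
helpful = [([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], 0)]
# Case B: token 1 is correct, so the feature pushes for something incorrect.
harmful = [([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], 1)]

print("helpful case:", mean_loss_reduction(helpful))
print("harmful case:", mean_loss_reduction(harmful))
print("mixed:", mean_loss_reduction(helpful + harmful))
```

The sign flips between the two cases, which is the same pattern as the contrast between a feature that is usually right when it fires and one that often pushes for the wrong token.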

Winner = highest-quality solution over the time period of a month (solutions get posted at the start of the next month, along with a new problem).

Note that we're slightly de-emphasising the competition side now that there are occasional hints which get dropped during the month in the Slack group. I'll still credit the best solution in the Slack group & next LW post, but the choice to drop hints was to make the problem more accessible and hopefully increase the overall reach of this series.

Thanks for the latter point, glad you got that impression!

These are super valid concerns, and it's true that there's lots of information we won't have for a while. That said, we also have positive evidence from the first iteration of ARENA (which is about a year old now). There were only 5 full-time participants, and they've all gone on to do stuff I'm excited about, including the following (note that some of these 5 have done more than one of the things on this list):

• internships at CHAI,
• working with Owain Evans (including some recent papers),
• building a community around open-source interpretability tooling,
• employment by EleutherAI for interp,
• participating in SERI MATS streams,
• work trialling at LEAP labs,
• being funded to work on independent research on ELK.

I'd also point to programs like MLAB which have similar goals and (as far as I'm aware) an even higher success rate of getting people into alignment work. Not saying that nobody from these programs goes on to do capabilities (I imagine at least a few do), but I'd be very surprised if this outweighs the positive effect from people going on to do alignment work.

One last point here - a big part of the benefit from programs like MLAB / ARENA is the connections made with people in alignment, feeling a sense of motivation & community, not just the skilling up (anecdotally, I quit my job and started working in alignment full-time after doing MLAB2, despite not then being at a point where I could apply for full-time jobs). I also get this impression from conversations w/ people who participated in & ran MLABs in the past. It's not as simple as "go into upskilling programs, become super competent at either alignment or capabilities work, then choose one or the other" - there's a myriad of factors which I expect to update people towards work in alignment after they go through programs like these.

If check-ins with ARENA 1.0 or 2.0 participants (or indeed MLAB participants) a year or more from now reveal that a nontrivial fraction of them are working in capabilities then I'd certainly update my position here, but I'll preregister that this doesn't seem at all likely to me. It's true that alignment can be a messy field with limited opportunities and few clear paths, but this is becoming less of a problem as the years go on.