Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.


Mech Interp Challenge: November - Deciphering the Cumulative Sum Model


CallumMcDougall


Winner = the highest-quality solution submitted over the month (solutions get posted at the start of the next month, along with a new problem).

Note that we're slightly de-emphasising the competition side now that occasional hints get dropped in the Slack group during the month. I'll still credit the best solution in the Slack group & the next LW post, but we chose to drop hints to make the problems more accessible and hopefully increase the overall reach of this series.

I'm writing this post to discuss solutions to the October challenge, and present the challenge for this November. If you've not read the first post in this sequence, I'd recommend starting there - it outlines the purpose behind these challenges, and recommended prerequisite material.

## November Problem

The problem for this month is interpreting a model which has been trained to classify the cumulative sum of a sequence.

The model is fed sequences of integers, and is trained to classify the cumulative sum at a given sequence position. There are 3 possible classifications: the cumulative sum is negative, zero, or positive.

For example, if the sequence is `[0, 5, -5, -2]`, then the cumulative sums are `[0, 5, 0, -2]`, and the classifications would be `[zero, positive, zero, negative]`.
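The labelling scheme can be sketched in a few lines of Python. This is my own illustrative reimplementation (not the challenge's actual code), just to make the task concrete:

```python
# Classify each position of a sequence by the sign of its running sum.
# Illustrative sketch only -- not the challenge's actual data-generation code.
from itertools import accumulate

def classify_cumsum(seq):
    """Return one label per position, based on the cumulative sum so far."""
    labels = []
    for total in accumulate(seq):
        if total < 0:
            labels.append("negative")
        elif total == 0:
            labels.append("zero")
        else:
            labels.append("positive")
    return labels

print(classify_cumsum([0, 5, -5, -2]))
# → ['zero', 'positive', 'zero', 'negative']
```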

The model is *not* attention-only. It has one attention layer with a single head, and one MLP layer. It does *not* have layernorm at the end of the model. It was trained with weight decay, and an Adam optimizer with a linearly decaying learning rate. I don't expect this problem to be as difficult as some of the others in this sequence, however the presence of MLPs does provide a different kind of challenge.
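As a sketch, an architecture like this could be written as a TransformerLens config along the following lines. The width, vocab size, and context length here are placeholders I've made up for illustration - the model's real hyperparameters are on the Streamlit page:

```python
from transformer_lens import HookedTransformer, HookedTransformerConfig

# Placeholder hyperparameters: d_model, d_mlp, d_vocab, n_ctx and act_fn
# are assumptions for illustration, not the challenge model's real values.
cfg = HookedTransformerConfig(
    n_layers=1,               # one transformer block...
    n_heads=1,                # ...with a single attention head
    d_model=48,
    d_head=48,
    d_mlp=192,                # the block also has an MLP (so not attn-only)
    d_vocab=40,
    n_ctx=20,
    act_fn="relu",
    normalization_type=None,  # no layernorm
)
model = HookedTransformer(cfg)
```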

You can find more details on the Streamlit page. Feel free to reach out if you have any questions!

## October Problem - Solutions

In the second half of the sequence, the attention heads perform the algorithm "attend back to (and copy) the first token which is larger than me". For example, in a sequence like `[5, 7, 3, SEP, 3, 5, 7]`, we would have the second 3 token attending back to the first 5 token (because it's the first one that's larger than itself), the second 5 attending back to 7, etc. The SEP token just attends to the smallest token.
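This algorithm can be paraphrased in plain Python. This is my own toy re-implementation of the described behaviour, not the model's actual weights:

```python
# Toy paraphrase of the head's algorithm: each query position in the sorted
# half attends back to (and copies) the first token in the unsorted half
# that is larger than it; SEP attends to the smallest token.
def predict_next(unsorted, query):
    """Predict the token that follows `query` in the sorted half."""
    if query == "SEP":
        return min(unsorted)  # SEP attends to the smallest token
    for tok in unsorted:
        if tok > query:
            return tok        # first token larger than the query gets copied
    return None               # no larger token: the sorted list is finished

unsorted = [5, 7, 3]
print(predict_next(unsorted, "SEP"))  # → 3 (smallest token)
print(predict_next(unsorted, 3))      # → 5 (first token larger than 3)
print(predict_next(unsorted, 5))      # → 7
```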

Some more refinements to this basic idea:

When we have three tokens `x < y < z` where the three numbers are close together, `x` will often attend to `z` rather than to `y`. So why isn't this an adversarial example, i.e. why does the model still correctly predict that `y` follows `x`?

The answer: when we attend to a token `s`, we also boost things slightly less than `s`, and suppress things slightly more than `s`. So in the case `x < y < z`, we have:

- Attending to `y` will boost `y` a lot, and suppress `z` a bit.
- Attending to `z` will boost `z` a lot, and boost `y` a bit.

So even if `z` gets slightly more attention than `y`, it might still be the case that `y` gets predicted with higher probability.
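A toy numerical illustration of why this works (the attention weights and boost magnitudes here are numbers I've invented, not values read off the model):

```python
# Suppose x mistakenly gives z slightly more attention than y.
attn = {"y": 0.45, "z": 0.55}

# Attending to y boosts y a lot and suppresses z a bit; attending to z
# boosts z a lot and also boosts y a bit. Signs follow the description
# above; magnitudes are made up for illustration.
boost = {
    "y": {"y": +10.0, "z": -2.0},   # effect of attending to y
    "z": {"y": +2.0,  "z": +10.0},  # effect of attending to z
}

logit_y = sum(attn[src] * boost[src]["y"] for src in attn)
logit_z = sum(attn[src] * boost[src]["z"] for src in attn)
print(logit_y, logit_z)  # → 5.6 4.6: y still ends up with the higher logit
```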

## Best Submissions

We received more submissions for this month's problem than any other in the history of the series, so thanks to everyone who attempted! The best solution to this problem was by Vlad K, who correctly identified the model's tendency to produce unexpected attention patterns when 3 numbers are close together, and figured out how the model manages to produce correct classifications anyway.

Best of luck for this and future challenges!