I completed this project for my bachelor's thesis and am writing it up now, 2-3 months later. I think I found some interesting results that are worth sharing here. This post might be especially interesting for people trying to reverse-engineer OthelloGPT in the future.
The other sections will build on top of this section.
Let 𝑅 denote the set of rules. Each rule 𝑟 ∈ 𝑅 is defined by:

- a target tile 𝑡,
- a direction, and
- a number 𝑛 of "mine" tiles to flip.

A rule 𝑟(𝑥) evaluates to true for a residual stream 𝑥 when 𝑛 "mine" tiles need to be flipped in the specified direction before reaching tile 𝑡, which is "yours".
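To make this concrete, here is a minimal sketch of how such a rule could be represented and checked against a board state decoded from the residual stream (e.g. via linear probes). The `Rule` dataclass, the `evaluate` function, and the board encoding are illustrative assumptions rather than the exact code from the thesis, and the convention that the flipped tiles sit directly before 𝑡 along the direction is my reconstruction of the definition above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative encoding (assumption): board[row][col] is one of
# "mine", "yours", or "empty", as decoded from the residual stream x.
Board = List[List[str]]

@dataclass
class Rule:
    tile: Tuple[int, int]       # target tile t = (row, col)
    direction: Tuple[int, int]  # step, e.g. (0, 1) = one tile to the right
    n: int                      # number of "mine" tiles that must be flipped

def evaluate(rule: Rule, board: Board) -> bool:
    """r(x) holds iff tile t is "yours" and the n tiles reached just before t
    (walking along the rule's direction) are all "mine"."""
    tr, tc = rule.tile
    dr, dc = rule.direction
    # tile t itself must be "yours"
    if not (0 <= tr < 8 and 0 <= tc < 8) or board[tr][tc] != "yours":
        return False
    # the n tiles that would be flipped must all be "mine"
    for k in range(1, rule.n + 1):
        rr, cc = tr - k * dr, tc - k * dc
        if not (0 <= rr < 8 and 0 <= cc < 8) or board[rr][cc] != "mine":
            return False
    return True
```

For example, `Rule(tile=(3, 5), direction=(0, 1), n=2)` would read: the two tiles directly before (3, 5) along the rightward direction are "mine", and (3, 5) itself is "yours".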
I edited the Flipped direction to be orthogonal to the Yours direction. The effect of the Yours/Mine direction is stronger on the rim of the board, but I don't have a visualization on hand.
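Concretely, this is just a single Gram-Schmidt step. A minimal sketch, assuming the two probe directions are available as 1-D tensors in the residual-stream basis (`flipped_dir` and `yours_dir` are placeholder names):

```python
import torch

def orthogonalize(flipped_dir: torch.Tensor, yours_dir: torch.Tensor) -> torch.Tensor:
    """Remove the Yours component from the Flipped probe direction
    (one Gram-Schmidt step), so the two directions are orthogonal."""
    yours_unit = yours_dir / yours_dir.norm()
    return flipped_dir - (flipped_dir @ yours_unit) * yours_unit
```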
−0.17 is roughly the minimum value of the GELU function. So a mean activation difference of roughly that magnitude or more suggests that the neuron has a positive activation when the rule is true and a negative activation otherwise.
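As a quick sanity check on that value, and to show how such a mean activation difference could be computed, here is a small sketch; `acts` and `rule_true` are placeholder tensors, not data from the project.

```python
import torch
import torch.nn.functional as F

# GELU's most negative output is about -0.17, reached near x ≈ -0.75.
x = torch.linspace(-3.0, 0.0, 10_000)
y = F.gelu(x)
print(y.min().item(), x[y.argmin()].item())  # ≈ -0.170, ≈ -0.75

# Illustrative: mean activation difference for one neuron, split by whether
# the rule holds on each sample (placeholder data).
acts = torch.randn(1000)            # neuron activations over a dataset
rule_true = torch.rand(1000) < 0.5  # boolean mask: does r(x) hold?
mean_diff = acts[rule_true].mean() - acts[~rule_true].mean()
```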