The king token

A while back I made a short foray into interpretability research, analyzing the king token of my chess-move-predicting transformer. At the time the result seemed simultaneously unsurprising and disappointing, and I went back to work on more substantial matters. Revisiting this little project now, I am not sure why.

The chess-move-predicting transformer outputs a probability distribution over actions conditioned on a state. This is different enough from large language models to be interesting in its own right: Do the activations of a token representing a piece encode affordances? Typical tactical motifs (checks, skewers, double attacks…)? Strategic structures (isolanis, passed pawns, Maroczy bind…)? Specific details of a class of positions (bishop pair, damaged king position, queenless middlegame…)? Higher-level abstractions (king safety, initiative, zugzwang)?

A priori this is very unclear to me. I started the analysis to find out whether typical human concepts are encoded in the transformer's representations, and I looked specifically at the king token to see whether king safety is a represented concept. In a way, it actually is: when I looked at the positions that led to the largest absolute values along the principal component, they all showed the king under attack and in check by heavy pieces.
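The kind of lookup described above can be sketched roughly as follows. This is a minimal illustration, not the original analysis code: the array `king_acts`, the function name `top_positions_on_first_pc`, and the synthetic data are all assumptions, and "the principal component" is taken here to mean the first one.

```python
import numpy as np

def top_positions_on_first_pc(king_acts, k=5):
    """Return the indices of the k positions with the largest absolute
    projection onto the first principal component, plus all projections."""
    centered = king_acts - king_acts.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]
    order = np.argsort(-np.abs(scores))
    return order[:k], scores

# Synthetic stand-in for real activations: 1000 positions, 128-dim king token.
rng = np.random.default_rng(0)
king_acts = rng.normal(size=(1000, 128))

# Hypothetical "king in check" positions: shifted strongly along one direction.
in_check = np.arange(10)
king_acts[in_check] += 30.0 * np.eye(128)[3]

top_idx, scores = top_positions_on_first_pc(king_acts, k=10)
print(sorted(top_idx.tolist()))
```

With real data, the indices in `top_idx` would be mapped back to board positions and inspected by eye, which is the step that revealed the heavy-piece checks.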

I realized that this was profoundly unsurprising: the king in these positions absolutely had to move, so it made sense for the activations of the king token to be large. This fact alone did not mean that any subtle or interesting concepts were encoded in the principal component of the king token.

Now, eight months later, I decided to dig a little deeper. All of the following analyses are done on the output tokens of the final transformer block, right before they are fed into the densely connected layer that generates the move predictions.

Steering away from king's moves

My earlier results implied a steering possibility, and indeed: in positions where the initial probability of a king’s move is above 10%, dividing the king token by 10 generally drops that probability to under 1%. Dividing by 100 drops it even more. Setting the token directly to zero didn’t change much beyond that.
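The intervention itself is simple to write down. Below is a toy sketch under loud assumptions: the real model's head, token layout, and move encoding are not given in this post, so the linear head `W`, `b`, the mask `king_move_mask`, and all sizes are hypothetical. The toy weights are built by hand so that the king token feeds only the king-move logits; the numbers it produces say nothing about the real model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def king_move_prob(tokens, W, b, king_move_mask):
    """Total probability of all king moves under a toy dense head."""
    logits = tokens.reshape(-1) @ W + b
    return float(softmax(logits)[king_move_mask].sum())

def scale_token(tokens, idx, factor):
    """Return a copy of the tokens with one token multiplied by factor."""
    out = tokens.copy()
    out[idx] = out[idx] * factor
    return out

# Toy setup (all hypothetical): 4 tokens of dim 3, 5 moves, moves 0 and 1 are king moves.
n_tokens, d, n_moves = 4, 3, 5
king_idx = 0
king_move_mask = np.array([True, True, False, False, False])
W = np.zeros((n_tokens * d, n_moves))
W[:d, :2] = 1.0  # the king token's features feed only the king-move logits
b = np.zeros(n_moves)
tokens = np.ones((n_tokens, d))
tokens[king_idx] = 2.0

p_initial = king_move_prob(tokens, W, b, king_move_mask)
p_div10 = king_move_prob(scale_token(tokens, king_idx, 0.1), W, b, king_move_mask)
p_div100 = king_move_prob(scale_token(tokens, king_idx, 0.01), W, b, king_move_mask)
p_zero = king_move_prob(scale_token(tokens, king_idx, 0.0), W, b, king_move_mask)
print(p_initial, p_div10, p_div100, p_zero)
```

In this toy, zeroing the king token leaves a uniform distribution over the five moves, so the king-move probability bottoms out at 2/5. Whatever probability remains after zeroing comes from everything other than the king token.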

However, there were three kinds of exceptions: positions where the probability of a king’s move didn’t drop as much as in most cases, and where there was generally a floor above 1% that it would not fall below:

Positions where the king was in check and had to move. (Well, there are no other legal moves, so this makes some sense.)
Positions where only the king was left. (In that case, there aren’t even any other pieces to move.)
Positions where castling was really a very natural move. (Wait, what?)

Here are some typical examples. The probabilities below each position are, in order: the initial probability of a king's move, the probability after dividing the king token by 10, the probability after dividing it by 100, and the probability after setting it to zero.

In many of the castling positions the probability didn’t drop by much more than half, often staying above 20% or even 30%. Initially I thought this meant that the information about which piece to move must be more spread out over the context of the position than I had assumed. But then I realized that castling is a double move, so half of the probability mass must be provided by the rook token, which we don’t modify.

What do the different dimensions encode?

I then proceeded to analyze the different dimensions of the king token by looking at positions that led to the largest absolute values in these dimensions.
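A per-dimension lookup of this kind might look like the sketch below. The function name `extreme_positions` and the tiny example array are hypothetical. It returns the most positive and the most negative positions separately, since the two ends of a dimension can be inspected independently.

```python
import numpy as np

def extreme_positions(king_acts, dim, k=5):
    """Indices of the k most positive and the k most negative positions
    for one dimension of the king token."""
    order = np.argsort(king_acts[:, dim])  # ascending by activation
    most_positive = order[::-1][:k]
    most_negative = order[:k]
    return most_positive, most_negative

# Tiny made-up example: 4 positions, 2 dimensions.
king_acts = np.array([
    [0.1, 2.0],
    [-3.0, 0.5],
    [4.0, -1.0],
    [0.0, 0.0],
])
pos_idx, neg_idx = extreme_positions(king_acts, dim=0, k=1)
print(pos_idx, neg_idx)  # [2] [1]
```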

Single dimensions of the king token seem to encode specific details of the position. Often these are different kinds of checks; the very first dimension, for example, gets its highest activations for diagonal queen checks from the upper left side of the board. Sometimes different patterns are encoded: an enemy knight that might check with its next move, black to move and retake a piece, the king being tucked in at h2, the king blocking a protected black pawn, etc. These patterns tend to occur in most of the positions with the highest activations, but by no means in all of them.

The positions that elicit the highest and the lowest values do not seem to stand in any discernible relationship to each other. This might be a difference from language models: very specific details do not have opposites, so the detail encoded by negative values is something relatively unrelated to the detail encoded by positive values.

I encountered one suggestive pair of details encoded by dimension 10: the positive values encoded a knight check, while the negative values encoded a knight that might check with its next move. Both can be true at the same time, but that would be rather unusual. Other pairs were “queen check” vs “knight check” or “rook check” vs “queen check”. These can also happen in the same position via a discovered attack, but this co-occurrence should also be rare. So it seems that the details encoded by the same dimension are at least very unlikely to co-occur.

The 128 dimensions of the king token are not independent of each other: there are strong correlations between many of them. This makes sense if they encode specific details, which often co-occur systematically.
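One way to check for such correlations is a plain correlation matrix over the dimensions. The sketch below is an assumption about how this could be done, not the code actually used. The helper `strongly_correlated_pairs`, the threshold of 0.8, and the synthetic data are all hypothetical.

```python
import numpy as np

def strongly_correlated_pairs(king_acts, threshold=0.8):
    """List (i, j, r) for all dimension pairs i < j with |Pearson r| >= threshold."""
    corr = np.corrcoef(king_acts, rowvar=False)  # columns are the variables
    i, j = np.triu_indices(corr.shape[0], k=1)   # upper triangle, no diagonal
    keep = np.abs(corr[i, j]) >= threshold
    return [(int(a), int(b), float(corr[a, b])) for a, b in zip(i[keep], j[keep])]

# Synthetic stand-in: 500 positions, 6 dimensions, with dim 4 nearly copying dim 1.
rng = np.random.default_rng(1)
king_acts = rng.normal(size=(500, 6))
king_acts[:, 4] = king_acts[:, 1] + 0.1 * rng.normal(size=500)

pairs = strongly_correlated_pairs(king_acts)
print(pairs)
```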

The strongest activations also happened in positions that were sparser, while the weakest activations (those closest to zero) happened in the more typical dense middlegame positions. If sparser positions have fewer prominent details and the overall token norm is independent of the position’s density, that checks out.
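The reasoning in the last sentence can be made concrete with a deliberately idealized toy. Assume, purely for illustration, that a position with n pieces activates n dimensions equally and that the token always has unit norm. Then each active dimension has magnitude 1/sqrt(n), so sparser positions show larger peaks. Nothing here is measured from the real model, and `peak_and_norm` is a hypothetical helper.

```python
import numpy as np

def peak_and_norm(king_acts):
    """Per-position largest absolute activation and L2 norm."""
    peak = np.abs(king_acts).max(axis=1)
    norm = np.linalg.norm(king_acts, axis=1)
    return peak, norm

# Idealized toy: one synthetic position per piece count from 2 to 32.
d = 128
piece_counts = np.arange(2, 33)
king_acts = np.zeros((len(piece_counts), d))
for row, n in enumerate(piece_counts):
    # n equally active "details", scaled so that the token has unit norm.
    king_acts[row, :n] = 1.0 / np.sqrt(n)

peak, norm = peak_and_norm(king_acts)
r = np.corrcoef(piece_counts, peak)[0, 1]
print(r)  # negative: more pieces, smaller peak activation
```

On real activations the same two quantities could be compared against the piece count to test both halves of the claim: that the norm does not depend on density, and that the peak does.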

I haven't yet looked at all dimensions, nor at any other pieces (I'm wondering what information the empty squares might integrate from the context). Right now, though, it seems that tokens collect specific details from the context that are relevant to whether the piece they represent is going to move and, if so, from which square to which square.

Some of these details are concrete and forcing, like a diagonal check; some are more abstract, like the threat of checkmate or the closeness of a knight; some convey the king's square, like positions after long castling, king-on-h2 positions, or positions with the king stranded in the middle.

In hindsight all of this makes sense, but it was still interesting to see.
