NB: I'm surprised I couldn't find anything on this, but I also didn't spend very long looking. If there is a result from linear algebra that solves this problem, then I'd love to hear it. Purpose of the post: getting feedback + the ability to reference this in the future. I'm most interested in these forms of feedback:
While not strictly necessary as a prerequisite, I'd recommend reading at least the introduction to transformer circuits.
When investigating attention-only transformer models, a natural question to ask is "how much do attention heads in different layers 'communicate' with each other via the residual stream?" If there were no communication at all, then an N-layer transformer with H heads per layer would simply be equivalent to a single-layer transformer with N⋅H heads, which seems empirically not to be the case. (By equivalent I mean something like "learns as efficiently and effectively".) So there has to be some communication between layers.
Specifically, we are usually interested in how the output of a head in a previous layer affects the queries, keys, and values of a head in a later layer. This is called Q-, K-, and V-composition, respectively.
In this post, I want to play around with how to formalize and measure composition. It's not going to be super novel work, but I haven't seen a write-up of this so far.
My first attempt at formalizing composition turns out to be not fine-grained enough for real systems, but I think it was still useful in forming my thoughts about this topic.
For the rest of the post, let B ∈ R^(m×n) and A ∈ R^(n×m), and let us be interested in the product BA and in how strongly A and B compose via this product. For instance, B could be the query-key matrix of a head in layer 2 and A the output-value matrix of a head in layer 1.
A reads from a subspace of R^m and writes to a subspace of R^n. If rank(A) = n, then A writes to the full space; similarly, if rank(A) = m, then A reads from the full space. B reads from a subspace of R^n and writes to a subspace of R^m. If the read-subspace of B overlaps with the write-subspace of A, then they can 'communicate' with each other via this overlap, since information passes from A to B.
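This overlap condition is easy to see numerically. Below is a minimal numpy sketch (toy dimensions and random matrices of my own choosing, not from the post) in which A's write-subspace either misses or hits B's read-subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 4  # toy sizes; as in the post, A is n x m and B is m x n

# A writes only into span{e1, e2} of R^n: its image lies in that subspace.
A = np.zeros((n, m))
A[:2, :] = rng.normal(size=(2, m))

# Case 1: B reads only from span{e3, e4} -- orthogonal to A's write-subspace.
B_orth = np.zeros((m, n))
B_orth[:, 2:] = rng.normal(size=(m, 2))

# Case 2: B reads from span{e1, e2} -- fully overlapping with A's image.
B_over = np.zeros((m, n))
B_over[:, :2] = rng.normal(size=(m, 2))

# No overlap of read/write subspaces => no communication: BA = 0.
assert np.allclose(B_orth @ A, 0)
# Overlap => information flows through, and BA is (generically) nonzero.
assert np.linalg.norm(B_over @ A) > 1e-6
```

The same check works for any pair of subspaces: only the component of A's image that lands in the row space of B survives the product.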
Let's suppose we have a measure of composition, call it C, that maps two matrices of fitting shapes to a composition score C(A,B). A first axiom:

Axiom 1: C(A,B) = 1 if and only if ker(B) ∩ im(A) = {0}.

Why does this statement make sense? It says that the composition score should be 1 if and only if the only vector output by A that gets mapped to 0 by B is the zero vector itself, i.e. B does something non-trivial with every vector that it can potentially get from A.
Axiom 2: C(A,B) = 0 if and only if im(A) ⊆ ker(B), i.e. B maps everything that A can output to zero. The reasoning is very similar to that for Axiom 1.
Axiom 3: C should vary 'smoothly' and 'monotonically' in some sense, i.e. if there is 'more' overlap then C should be larger, and vice versa.
One possible way to formalize this in the idealized case is to use 
This formula clearly satisfies Axioms 1 and 2 and is at least monotonically increasing in the overlap. Its main issue is that it is not continuous (ergo not differentiable) and not fine-grained enough to be applicable to real-world systems.
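The original formula did not survive in this copy of the post. One candidate with exactly these properties (my own reconstruction, not necessarily the author's choice) is the rank ratio

```latex
C(A,B) \;=\; \frac{\operatorname{rank}(BA)}{\operatorname{rank}(A)}
```

It is 1 iff rank(BA) = rank(A), which holds iff ker(B) ∩ im(A) = {0} (Axiom 1); it is 0 iff BA = 0, i.e. iff im(A) ⊆ ker(B) (Axiom 2); and since rank is integer-valued, it jumps discontinuously, matching the continuity complaint above.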
Unfortunately, in the real world, we are working with matrices that are learned by some messy process like SGD and represented by finite-precision floating-point numbers.
As a corollary, it is very likely that rk(A) = rk(B) = n. This means that Axiom 1 would imply a composition score of 1 regardless of the matrices, as long as they have full rank. However, we would still like to be able to differentiate between grades of composition in this real-world case.
This suggests that our three axioms as stated above are not the right tool to think about composition in real systems.
For a different approach, let's start by decomposing A and B into their respective singular value decompositions, A = U_A Σ_A V_A^T, and likewise for B. U and V are square, real, and orthogonal. Σ is diagonal (but not necessarily square, since it is m×n or n×m). The diagonal entries of Σ are non-negative and are called the singular values of the matrix. The great thing about the SVD is that every matrix has one, unlike other decompositions such as the eigendecomposition or the Cholesky decomposition.
The SVD allows for a different perspective on a matrix. In the language of the SVD, the matrix 'reads' the input by 'measuring' it (dot-product-ing) against the right singular vectors (the columns of V), weights each dot product by the corresponding singular value, and then uses the results as the coefficients of a linear combination of the left singular vectors (the columns of U).
You can make this even more explicit by writing the matrix in bra-ket notation or tensor product notation: A = ∑_i σ_i |u_i⟩⟨v_i| = ∑_i σ_i (u_i ⊗ v_i).
(This statement feels somehow more elegant than the original SVD formulation)
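As a quick sanity check of this rank-one-sum reading of the SVD, here is a numpy sketch (toy sizes and a random matrix of my own choosing) reconstructing A from its singular triples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5
A = rng.normal(size=(n, m))

# Full SVD: U is n x n, Vt is m x m, s holds the min(n, m) singular values.
U, s, Vt = np.linalg.svd(A)

# A as a sum of rank-one terms sigma_i |u_i><v_i| = sigma_i * outer(u_i, v_i).
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(A, A_rebuilt)
```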
We could now write any vector x in the basis of the v_i, i.e. x = ∑_j x_j v_j, and thus simplify the notation of the matrix-vector multiplication: Ax = ∑_i σ_i x_i u_i. Note that the x_j are the coordinates of x in the basis of the v_i, which is in general not the canonical basis that we are used to.
If we now write the product in this decomposition, we get BA = U_B Σ_B V_B^T U_A Σ_A V_A^T.
We can see that the interaction magic happens in the middle four terms on the right-hand side, Σ_B V_B^T U_A Σ_A. We can write the elements of this middle part as (Σ_B V_B^T U_A Σ_A)_ij = σ^B_i σ^A_j cos(θ_ij), where θ_ij is the angle between the vectors v^B_i and u^A_j. That means that the middle part is the matrix of singular value products, weighted by the similarity of the corresponding singular vectors. We can also write the whole product as BA = U_B (Σ_B V_B^T U_A Σ_A) V_A^T, showcasing why only the middle part is relevant for measuring composition: U_B and V_A^T are orthogonal and merely rotate the output and input.
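This decomposition of the middle part can be checked numerically. A small numpy sketch (random toy matrices, my own construction) verifies both the cosine formula for its entries and that sandwiching it between U_B and V_A^T recovers BA:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 5
A = rng.normal(size=(n, m))   # plays the role of the post's A (n x m)
B = rng.normal(size=(m, n))   # plays the role of the post's B (m x n)

# Compact SVDs (full_matrices=False keeps only the n relevant directions).
UA, sA, VAt = np.linalg.svd(A, full_matrices=False)
UB, sB, VBt = np.linalg.svd(B, full_matrices=False)

# The middle 'interaction' part Sigma_B V_B^T U_A Sigma_A of
# BA = U_B (Sigma_B V_B^T U_A Sigma_A) V_A^T.
middle = np.diag(sB) @ VBt @ UA @ np.diag(sA)

# Entry (i, j) is sigma^B_i * sigma^A_j * cos(theta_ij), where theta_ij is
# the angle between the unit vectors v^B_i and u^A_j.
cos_thetas = VBt @ UA
assert np.allclose(middle, sB[:, None] * cos_thetas * sA[None, :])

# Sandwiching the middle part between U_B and V_A^T recovers BA.
assert np.allclose(UB @ middle @ VAt, B @ A)
```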
Initially I thought that we would like VB and UA to contain the same set of vectors, i.e. that A and B should use the same basis. However, that doesn't quite work, since for any singular value with multiplicity > 1, there are infinitely many orthonormal bases that you could choose. A better formulation could be:
There is a one-to-one correspondence ϕ of left SV-spaces of A to right SV-spaces of B, where an SV-space is the subspace spanned by all left/right singular vectors which correspond to the same singular value. Furthermore, ϕ(U) = U (as subspaces) for all left SV-spaces U of A.
In particular, if all singular values have multiplicity 1, then this reduces to UA and VB containing the same set of basis vectors.
Since they each form an orthonormal basis of R^n, the SV-spaces are either identical or orthogonal. This means that, following our derivation from above, the 'interaction' part would have a block-diagonal structure (in the special case of multiplicity 1 for all SVs, we'd get a diagonal structure). Also, each block is symmetric, as the vectors share the same SV and the dot product is symmetric.
Any metric that we come up with should be invariant to the basis we choose for any particular SV-space (or block).
Anthropic proposes to use the following composition measure: C(A,B) = ∥BA∥F / (∥A∥F ∥B∥F), where ∥⋅∥F denotes the Frobenius norm.
Note that 0≤C(A,B)≤1, because the Frobenius norm is submultiplicative, i.e. ∥AB∥F≤∥A∥F∥B∥F.
The Frobenius norm can be characterized either via the matrix entries or via the singular values: ∥A∥F² equals both the sum of the squared entries of A and the sum of its squared singular values.
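Both characterizations, and the fact that the score lands in [0, 1], are easy to verify numerically (random toy matrices; a sketch of my own, not the post's code):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 5
A = rng.normal(size=(n, m))
B = rng.normal(size=(m, n))

# The two characterizations of the Frobenius norm agree:
fro_entries = np.sqrt((A ** 2).sum())  # sqrt of the sum of squared entries
fro_svals = np.sqrt((np.linalg.svd(A, compute_uv=False) ** 2).sum())
assert np.isclose(fro_entries, fro_svals)

# Anthropic's composition score ||BA||_F / (||A||_F ||B||_F).
# (np.linalg.norm on a matrix defaults to the Frobenius norm.)
C = np.linalg.norm(B @ A) / (np.linalg.norm(A) * np.linalg.norm(B))
assert 0.0 <= C <= 1.0  # by submultiplicativity of the Frobenius norm
```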
Let's use our derivation above to write out the Frobenius norm of BA. First note that the Frobenius norm is invariant under rotations (orthogonal transformations), which means that ∥BA∥F = ∥U_B Σ_B V_B^T U_A Σ_A V_A^T∥F = ∥Σ_B V_B^T U_A Σ_A∥F.
In the special case of identical SV-spaces, we can use our knowledge about the block-diagonal structure of this interaction part by letting S_k be the k-th block, resulting in ∥BA∥F² = ∑_k ∥S_k∥F².
In the even more special case where all singular values have multiplicity 1, we get ∥BA∥F² = ∑_i (σ^B_i σ^A_i)².
Remember this, as we will get back to it later!
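Here is a numpy construction (a toy example of my own) of this strong-composition case: B's right singular vectors are chosen to equal A's left singular vectors, with distinct singular values, and ∥BA∥F² collapses to the sum of squared singular-value products:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 5

# Random orthonormal factors via QR decompositions.
UA = np.linalg.qr(rng.normal(size=(n, n)))[0]          # left SVs of A
VA = np.linalg.qr(rng.normal(size=(m, m)))[0][:, :n]   # right SVs of A
W = np.linalg.qr(rng.normal(size=(m, m)))[0][:, :n]    # left SVs of B

sA = np.array([3.0, 2.0, 1.0])  # distinct values -> multiplicity 1
sB = np.array([5.0, 4.0, 0.5])

A = UA @ np.diag(sA) @ VA.T   # n x m
B = W @ np.diag(sB) @ UA.T    # m x n; right SVs of B = left SVs of A

# With shared (and aligned) bases, the interaction part is diagonal, so
# ||BA||_F^2 collapses to the sum of squared singular-value products.
assert np.isclose(np.linalg.norm(B @ A) ** 2, np.sum((sB * sA) ** 2))
```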
Assuming a multiplicity of 1 (which is very likely in the finite-precision case), we intuitively get weak composition when directions which get disproportionately magnified by A get disproportionately squashed by B, and vice versa (meaning that singular value pairs with high singular vector similarity have very different relative magnitudes).
Nick Turner and Jack Strand used the fact that there is a tighter bound on the sums of powers of the singular values of BA (from Horn & Johnson's Topics in Matrix Analysis): for p > 0 and q the number of singular values of the smaller matrix, ∑_{i=1..q} σ_i(BA)^p ≤ ∑_{i=1..q} σ_i(A)^p σ_i(B)^p, with singular values sorted in decreasing order. Setting p = 2 gives a bound on the (squared) Frobenius norm that is tighter than the original submultiplicativity bound.
The only downside is that it is more expensive to compute, since we actually have to compute the singular values of A and B rather than summing the squares of their entries.
ETA: Note however, that it is more computationally efficient to compute ∥BA∥F via the SVDs of A,B, in which case the tighter bound is not more costly to compute.
Putting this together, they propose C(A,B) = ∥BA∥F / √(∑_{i=1..q} σ_i(A)² σ_i(B)²), where the element-wise product of singular values is taken only over the first q elements (otherwise it would not be defined).
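A numpy sketch (toy random matrices, my own construction) comparing the two denominators and computing the resulting score; σ_i here are singular values in numpy's default descending order:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 3, 5
A = rng.normal(size=(n, m))
B = rng.normal(size=(m, n))

sA = np.linalg.svd(A, compute_uv=False)  # sorted in descending order
sB = np.linalg.svd(B, compute_uv=False)
q = min(len(sA), len(sB))

fro_BA_sq = np.linalg.norm(B @ A) ** 2
tight = np.sum((sB[:q] * sA[:q]) ** 2)     # sum_i sigma_i(B)^2 sigma_i(A)^2
loose = np.sum(sB ** 2) * np.sum(sA ** 2)  # (||B||_F ||A||_F)^2

# The p = 2 singular-value bound is tighter than plain submultiplicativity.
assert fro_BA_sq <= tight + 1e-9 <= loose + 1e-9

# The composition score with the tighter denominator still lies in [0, 1].
C_tight = np.sqrt(fro_BA_sq / tight)
assert 0.0 <= C_tight <= 1.0 + 1e-9
```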
Coming back to our analysis of the strong composition case, we see that this bound is realized in the case of multiplicity 1 and identical bases!
But wait, there is one more assumption we need. In our previous derivation, we simply re-ordered the bases at will to make them match, but to realize this bound not only do the matrices need to have the same bases but also the order of the singular values needs to be the same!
This makes a lot of sense from the POV of wanting to measure compositionality! It's maximal if both matrices agree on a basis and they also agree on the order of importance of these directions.
The trivial way to realize the bound is to set B = I. However, in the real world we will usually not have multiplicity > 1 for non-zero singular values due to the finite-precision issue. If we also assume full rank for A and B, then there is no singular value equal to 0.
Note that any distinct singular value > 0 has a unique singular vector (up to sign), meaning that in this case the singular vectors in U_A and V_B are each uniquely determined.
At this point I don't know how to show that, in this case, the bound can only be realized when the SV-spaces match. Suggestions welcome! As a reminder, here's the formula: C(A,B) = ∥BA∥F / √(∑_{i=1..q} σ_i(A)² σ_i(B)²).
You might think that the determinant could be a useful tool, since it also measures how volume is distorted by a transformation, similarly to how the Frobenius norm measures the distortion along each singular vector. Unfortunately, the determinant is only defined for square matrices, and since usually n ≪ m and rk(BA) ≤ n, the determinant of the m×m matrix BA will always be 0.
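A quick numpy check (toy sizes of my own choosing) of why the determinant is useless here:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 3, 5
A = rng.normal(size=(n, m))
B = rng.normal(size=(m, n))

# BA is m x m, but rank(BA) <= n < m, so its determinant vanishes.
assert np.linalg.matrix_rank(B @ A) == n
assert np.isclose(np.linalg.det(B @ A), 0.0)
```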
I'm currently thinking about a way to formalize composition via information theoretic approaches and might post a follow-up at some point.
Thanks to Leon Lang and Neel Nanda for helpful comments and suggestions for improvements! Thanks to Nick Turner for discussions on the tighter denominator bound.
Note that this argument is not valid for 'normal' transformers which include MLPs, since the MLPs can move information between dimensions of the residual stream.
In real transformers, n≪m and thus reading means projecting from a large space into a small space and writing means projecting from a small space to a large space.
Thanks to Leon Lang for suggesting this.
I'd like to see a formal analysis of this, e.g. by looking at the matrices in GPT-2. Thanks to Neel Nanda for suggesting this. ETA: There is some preliminary work by Nick Turner and Jack Strand that verifies this for GPT-Neo and GPT-J.
More rigorously, V^T x is x in the basis given by the columns of V (they form a basis of the full space since they are orthonormal and there are n / m of them, respectively). Σ then re-scales x in this new basis, and U projects the result into the output space.
If you want, try to prove that this bound is actually tighter. It's pretty straightforward.