Some ML-Related Math I Now Understand Better

Here are some simple math facts that are rarely taught in ML and math lectures:

SVD is decomposing a matrix into a sum of simple read-and-write operations
There is exponentially much room for almost orthogonal vectors in high dimensional space
Layer Normalization is a projection

SVD Is Decomposing a Matrix Into a Sum of Simple Read and Write Operations

Thanks to zfurman, on EleutherAI, who told me about the core idea.

Let's say I want to understand what a weight matrix does. A large table of numbers isn't really helpful; I want something better.

Here is a linear transformation which is much easier to understand than a table of numbers: $f (x) = λ ⟨ v, x ⟩ u$ (where $u$ and $v$ are unit vectors, $λ$ is a non-negative scalar, and $⟨ v, x ⟩$ is the dot product between $v$ and $x$). $f$ reads in the $v$ direction and writes in the $u$ direction, with a magnitude scaled by $λ$. I can understand this much more clearly and do nice interpretability with it: if I know things about my embedding space (for example, using the logit lens), I can tell what $f$ is doing.

Sadly, not all linear transformations can be expressed in this way: it's just a rank-1 transformation, so there is no hope of capturing something as complex as a usual linear transformation.
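As a minimal sketch of such a read-and-write operation (the vectors and the scale below are made-up illustration values, not taken from any model), the map can be written either as a function or as the rank-1 matrix $λ u v^{T}$:

```python
import numpy as np

# Hypothetical example values: read along e1, write along e2 with scale 3.
v = np.array([1.0, 0.0, 0.0])   # unit "read" direction
u = np.array([0.0, 1.0, 0.0])   # unit "write" direction
lam = 3.0                       # non-negative scale

def f(x):
    """Rank-1 map: read x along v, write the result along u, scaled by lam."""
    return lam * np.dot(v, x) * u

# The same map as a matrix is the outer product lam * u v^T.
F = lam * np.outer(u, v)

x = np.array([2.0, 5.0, -1.0])
y = f(x)  # reads <v, x> = 2, then writes 3 * 2 = 6 along u
```

Everything in $x$ that is not along $v$ is ignored, and the output only ever lives along $u$, which is why a single such map cannot represent a general matrix.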

But what if I allow myself to sum many simple transformations? Then I could just look at each operation independently. More precisely, here is a wishlist to understand the big matrix $M$ of the transformation $g$ :

$g (x) = \sum_{i = 1}^{r} λ_{i} ⟨ v_{i}, x ⟩ u_{i}$ where the $u_{i}$ and $v_{i}$ are unit vectors, and the $λ_{i}$ are positive scalars - it's a sum of simple transformations;
$⟨ v_{i}, v_{j} ⟩ = 0$ for all $i \neq j$ - no double read, always read in orthogonal directions;
$⟨ u_{i}, u_{j} ⟩ = 0$ for all $i \neq j$ - no double write, always write in orthogonal directions;
$r = rank (g)$ - the number $r$ of simple transformations summed should be as small as possible, because that would mean fewer individual transformations to inspect (and I can't hope for fewer than $rank (g)$ operations, since the rank of the sum of $r$ rank-1 linear transformations is at most $r$ ).

In terms of matrices, this is $M = U diag (λ) V^{T}$ where $U$ and $V$ are semi-unitary matrices (orthogonal matrices when they are square matrices). This is exactly SVD.
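The wishlist can be checked numerically. This is only a sketch, assuming NumPy's reduced SVD (`np.linalg.svd` with `full_matrices=False`) and an arbitrary random matrix as a stand-in for a weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))          # arbitrary example matrix, not a real weight matrix

# Reduced SVD: columns of U are write directions, rows of Vt are read directions.
U, lam, Vt = np.linalg.svd(M, full_matrices=False)

def g(x):
    """Apply M as a sum of rank-1 read-and-write operations."""
    return sum(lam[i] * np.dot(Vt[i], x) * U[:, i] for i in range(len(lam)))

x = rng.normal(size=3)
reads_orthonormal = np.allclose(Vt @ Vt.T, np.eye(len(lam)))   # no double read
writes_orthonormal = np.allclose(U.T @ U, np.eye(len(lam)))    # no double write
same_as_matrix = np.allclose(g(x), M @ x)                      # the sum reproduces M
```

Note that NumPy returns `min(M.shape)` singular values; when the matrix is rank-deficient, the trailing ones are (numerically) zero and the corresponding terms can be dropped from the sum.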

Understanding matrices & SVD this way helped me find a geometric intuition behind some basic properties of matrices. The questions below can also be quickly answered by "doing the math", but I think it's interesting to have a geometric understanding of why these are true.

What is a geometric interpretation of a symmetric matrix? Why is the inverse of an invertible symmetric matrix always symmetric?
What is the geometric interpretation of transposing a linear transformation?
Why are $λ_{i}^{2}$ the eigenvalues of $M M^{T}$ and why are its eigenvectors $u_{i}$ ? Why are $λ_{i}^{2}$ the eigenvalues of $M^{T} M$ and why are its eigenvectors $v_{i}$ ?

Feel free to share yours in the comments!

There Is Exponentially Much Room for Almost Orthogonal Vectors in High Dimensional Space

According to this paper, the number of almost orthogonal vectors you can put in $n$ dimensional space is at least $\exp (n \log (\frac{1}{\sin θ}) + o (n))$ where all pairs of vectors have an angle of at least $θ$ between them (and $\frac{o (n)}{n} \to 0$ as $n \to + \infty$).^[1]

Ignoring the $o (n)$ , and using $n = 1600$ (the number of dimensions of the embedding space of GPT-2 XL) it means you can fit at least 3000 vectors with pairwise cosine similarities smaller than 0.1 and $10^{14}$ vectors with cosine similarities smaller than 0.2. Using $n = 12288$ (the number of dimensions of the embedding space of GPT-3), you can fit at least $10^{6}$ vectors with pairwise cosine similarities smaller than 0.05 and $10^{26}$ vectors with cosine similarities smaller than 0.1.

This means you can get a lot of vectors in your embedding space if you allow for relatively small cosine similarities. But this does not hold for tiny cosine similarities (e.g. 0.01 for $n = 12288$ , which gives a lower bound of 2 using the formula above). I'm not aware of a lower bound better than $n$ for tiny angles. Therefore, for a neural network to use this space to fit independent concepts, it needs to be resilient to many small-but-not-that-small conflicts.
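The orders of magnitude quoted above can be reproduced with a few lines. This sketch drops the $o (n)$ term and assumes the angle is obtained from the cosine similarity via $\sin θ = \sqrt{1 - \cos^{2} θ}$; the function name is made up for illustration:

```python
import math

def log10_lower_bound(n, max_cos_sim):
    """log10 of exp(n * log(1 / sin(theta))), ignoring the o(n) term.

    theta is the angle whose cosine is max_cos_sim (assumption:
    sin(theta) = sqrt(1 - cos(theta)^2)).
    """
    sin_theta = math.sqrt(1.0 - max_cos_sim ** 2)
    return n * math.log(1.0 / sin_theta) / math.log(10)

cases = [(1600, 0.1), (1600, 0.2), (12288, 0.05), (12288, 0.1), (12288, 0.01)]
for n, c in cases:
    print(f"n={n}, cos<{c}: about 10^{log10_lower_bound(n, c):.1f} vectors")
```

The last case shows how quickly the bound degrades: for a cosine similarity of 0.01 the exponent is well below 1, so the formula guarantees only a handful of vectors.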

Layer Normalization Is a Projection

Thanks to Lawrence Chan for telling me about this.

Layer Normalization is a normalization layer often found in LLMs. People usually write it as $y = \frac{x - E [x]}{\sqrt{\mathrm{Var} [x]}}$ (followed by a scaling and a bias term^[2]), which is confusing because we are taking the expected value and the variance of a vector, which is not a common geometric operation.

But it can easily be expressed in terms of two successive projections: $x - E [x]$ is simply $p_{\vec{1}} (x)$ , the projection on the hyperplane orthogonal to the vector $\vec{1} = (1 \dots 1)$ : $p_{\vec{1}} (x) = x - ⟨ x, \vec{1} ⟩ \vec{1} / | | \vec{1} | |^{2} = x - (\frac{1}{d} \sum_{i} x_{i}) \vec{1} = (x_{j} - \frac{1}{d} \sum_{i} x_{i})_{j}$ .

Then $\frac{x - E [x]}{\sqrt{\mathrm{Var} [x]}} = \frac{x - E [x]}{\sqrt{E [(x - E [x])^{2}]}} = \frac{p_{\vec{1}} (x)}{\sqrt{\frac{1}{d} \sum_{i} p_{\vec{1}} (x)_{i}^{2}}} = \sqrt{d} \frac{p_{\vec{1}} (x)}{| | p_{\vec{1}} (x) | |} = \sqrt{d} p_{S} (p_{\vec{1}} (x))$ , where $p_{S}$ is the projection on the unit sphere.

This gives us $\frac{x - E [x]}{\sqrt{\mathrm{Var} [x]}} = \sqrt{d} p_{S} (p_{\vec{1}} (x))$ .

Therefore, LayerNorm is just projecting on the hyperplane orthogonal to $\vec{1}$ , and then projecting again on the unit sphere (of the hyperplane orthogonal to $\vec{1}$ ), times $\sqrt{d}$ . These two projections are the same as projecting a point to its closest point on the unit sphere of the hyperplane orthogonal to $\vec{1}$ ^[3].
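Here is a small numerical check of this identity. It is a sketch under simplifying assumptions: no $ϵ$ in the denominator, no learned scale or bias, and function names that are made up for illustration:

```python
import numpy as np

def layernorm_formula(x):
    """Usual formula, without epsilon, scale, or bias."""
    return (x - x.mean()) / np.sqrt(x.var())   # x.var() divides by d

def layernorm_projections(x):
    """Project on the hyperplane orthogonal to the all-ones vector,
    then on the unit sphere, then scale by sqrt(d)."""
    d = x.shape[0]
    ones = np.ones(d)
    p_h = x - (np.dot(x, ones) / np.dot(ones, ones)) * ones   # hyperplane projection
    p_s = p_h / np.linalg.norm(p_h)                           # unit sphere projection
    return np.sqrt(d) * p_s

rng = np.random.default_rng(0)
x = rng.normal(size=16)              # arbitrary example vector
a, b = layernorm_formula(x), layernorm_projections(x)
```

Both functions agree on any input with non-zero variance; the output always sums to zero and has norm $\sqrt{d}$.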

For example, in dimension 3, LayerNorm is the same as projecting a point to the point of the black ring below closest to it.

This helps me see geometrically what $LayerNorm (\mathbb{R}^{d})$ is: it's the sphere of radius $\sqrt{d}$ of the hyperplane orthogonal to $\vec{1}$ (the unit sphere of that hyperplane, scaled by $\sqrt{d}$ ).

This also gives me a geometrical intuition of why it might help, though it is quite speculative (and the usual formula works just as well). First, projecting on the hypersphere clearly solves the problem of exploding or vanishing activations. Second, projecting on the hyperplane orthogonal to $\vec{1}$ pushes you away from the quadrants where ReLU is boring. Indeed, the $\vec{1}$ vector points towards the quadrant where all coordinates are positive, where ReLU = id, and points away from the quadrant where all coordinates are negative, where ReLU = 0. In all other quadrants, ReLU does something a bit more interesting: it zeros out some coordinates and leaves others intact.

What is missing from this geometrical point of view: It's hard to understand high dimensions. In particular, it's not easy to see that the projection on the hyperplane does almost nothing to high dimensional vectors: in a suitable basis, it only changes one coordinate among thousands. Therefore, the cosine similarity between a vector $x$ and $LayerNorm (x)$ is almost always very close to 1. Indeed, if $\hat{x}_{i}$ are the coordinates of $x$ in an orthonormal basis which has the (normalized) $\vec{1}$ vector as first vector, $\frac{⟨ x, LayerNorm (x) ⟩}{| | x | | | | LayerNorm (x) | |} = \frac{\sum_{i = 2}^{d} \hat{x}_{i}^{2}}{\sqrt{(\sum_{i = 2}^{d} \hat{x}_{i}^{2}) (\sum_{i = 1}^{d} \hat{x}_{i}^{2})}} = \sqrt{1 - \frac{\hat{x}_{1}^{2}}{\sum_{i = 1}^{d} \hat{x}_{i}^{2}}} \approx 1$
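To get a feel for the size of this effect, here is a sketch that compares the directly computed cosine similarity with the closed form above. The input is an i.i.d. Gaussian vector, which is an assumption made for illustration; real activations with a large mean relative to their norm would give a smaller value.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12288
x = rng.normal(size=d)               # assumption: i.i.d. Gaussian coordinates

centered = x - x.mean()              # LayerNorm(x) is a positive multiple of this
cos_sim = np.dot(x, centered) / (np.linalg.norm(x) * np.linalg.norm(centered))

# Closed form: sqrt(1 - x1_hat^2 / ||x||^2), where x1_hat = <x, 1> / sqrt(d)
# is the coordinate of x along the normalized all-ones vector.
x1_hat = x.sum() / np.sqrt(d)
closed_form = np.sqrt(1.0 - x1_hat ** 2 / np.dot(x, x))
```

Since the scaling by $\sqrt{d}$ and the division by the norm do not change the direction, the cosine similarity only depends on the centering step.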

Appendix: Solutions to the Simple Questions Using the Geometric Interpretation of SVD

What is a geometric interpretation of a symmetric matrix? Why is the inverse of an invertible symmetric matrix always symmetric?

A symmetric matrix is, according to the spectral theorem, a matrix of the form $U diag (λ) U^{T}$ where $U$ is an orthogonal matrix. Therefore, a symmetric matrix is a matrix which can be decomposed as a sum of transformations of the form $x \mapsto λ ⟨ x, u ⟩ u$ . A symmetric matrix is the matrix of a linear transformation which writes where it reads! Therefore, inverting a symmetric matrix is easy: if the $λ_{i}$ are all non-zero, just read where you just wrote (in the direction of $u_{i}$ ), scale by $1 / λ_{i}$ , and then write back along $u_{i}$ . The transformation I just described to invert the action of a symmetric matrix also writes where it reads, so the inverse of a symmetric matrix is also symmetric. (You can check that, because there is no double read nor double write, what I described is legit, though the formal proof with usual linear algebra is much faster than the formal proof using geometry.)
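A quick numerical illustration of this inversion recipe, as a sketch only: the matrix below is an arbitrary symmetric example, built to be positive definite so that it is guaranteed to be invertible, and `np.linalg.eigh` is used to get the $u_{i}$ and $λ_{i}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
S = A @ A.T + np.eye(4)              # arbitrary symmetric, invertible example matrix

lam, U = np.linalg.eigh(S)           # S = U diag(lam) U^T with U orthogonal

# Inverse: read along u_i, scale by 1 / lam_i, write back along u_i.
S_inv = sum((1.0 / lam[i]) * np.outer(U[:, i], U[:, i]) for i in range(len(lam)))

is_inverse = np.allclose(S_inv @ S, np.eye(4))
is_symmetric = np.allclose(S_inv, S_inv.T)
```

Each term of `S_inv` is an outer product of a vector with itself, which is symmetric, so the sum is symmetric as well.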

What is the geometric interpretation of transposing a linear transformation?

$(U diag (λ) V^{T})^{T} = V diag (λ) U^{T}$ , therefore $M^{T}$ is the same as $M$ , except it reads where $M$ writes, and writes where $M$ reads (the simple rank-1 operations are "swapped").

Why are $λ_{i}^{2}$ the eigenvalues of $M M^{T}$ and why are its eigenvectors $u_{i}$ ? Why are $λ_{i}^{2}$ the eigenvalues of $M^{T} M$ and why are its eigenvectors $v_{i}$ ?

$M M^{T}$ is the composition of two linear transformations: a first one (the transpose of $M$ , see above) which reads along $u_{i}$ , scales by $λ_{i}$ , and writes along $v_{i}$ , and a second one which reads along $v_{i}$ , scales by $λ_{i}$ , and writes along $u_{i}$ . Therefore, $M M^{T}$ reads along $u_{i}$ , scales by $λ_{i}^{2}$ , and writes along $u_{i}$ . Therefore, it is symmetric (it reads where it writes), and the $u_{i}$ are eigenvectors with eigenvalues $λ_{i}^{2}$ because if it reads $u_{i}$ , it will write $λ_{i}^{2} u_{i}$ (there is no double read nor double write). The same reasoning works for $M^{T} M$ .
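These last two answers can be verified on a small example. This is a sketch with an arbitrary random matrix, using NumPy's reduced SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))          # arbitrary example matrix
U, lam, Vt = np.linalg.svd(M, full_matrices=False)
r = len(lam)

# M M^T reads along u_i and writes lam_i^2 * u_i.
left_ok = all(np.allclose(M @ M.T @ U[:, i], lam[i] ** 2 * U[:, i]) for i in range(r))

# M^T M reads along v_i and writes lam_i^2 * v_i.
right_ok = all(np.allclose(M.T @ M @ Vt[i], lam[i] ** 2 * Vt[i]) for i in range(r))

# Transposing swaps the read and write directions.
transpose_ok = np.allclose(M.T, Vt.T @ np.diag(lam) @ U.T)
```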

^[1] Toy Models of Superposition states something similar: "It's possible to have $\exp (n)$ many "almost orthogonal" ( $< ϵ$ cosine similarity) vectors in high-dimensional spaces. See the Johnson–Lindenstrauss lemma." But this lemma does not directly state this property, and rather tells us that there is a linear map to a space with $\log n$ dimensions which almost preserves L2-distances between points. I prefer the theorem I present in this post which is directly about how many almost orthogonal vectors can fit in high-dimensional spaces.
^[2] For numerical stability reasons, people also usually add a small $ϵ$ to the denominator.
^[3] Using the L2 distance, for a given point $x$ , if $y$ is the closest point to $x$ on the unit sphere of the hyperplane $H$ orthogonal to the $\vec{1}$ vector, and if $p_{H} (x)$ is the orthogonal projection of $x$ on $H$ , then by the Pythagorean theorem, $d (x, y)^{2} = d (x, p_{H} (x))^{2} + d (p_{H} (x), y)^{2}$ , which is minimal when $y$ is the projection of $p_{H} (x)$ on the unit sphere. This proves that projecting on the unit sphere of $H$ is the same as first projecting on $H$ and then projecting on the unit sphere.

ojorgensen (2y)

But this does not hold for tiny cosine similarities (e.g. 0.01 for $n = 12288$ , which gives a lower bound of 2 using the formula above). I'm not aware of a lower bound better than $n$ for tiny angles.

Unless I'm misunderstanding, a better lower bound for almost orthogonal vectors when cosine similarity is approximately $0$ is just $n$ , by taking an orthogonal basis for the space.

My guess for why the formula doesn't give this is because it is derived by covering a sphere with non-intersecting spherical caps, which is sufficient for almost orthogonality but not necessary. This is also why the lower bound of $2$ vectors makes sense when we require cosine similarity to be approximately $0$ , since then the only way you can fit two spherical caps onto the surface of a sphere is by dividing it into $2$ hemispheres.

This doesn't change the headline result (still exponentially much room for almost orthogonal vectors), but the actual numbers might be substantially larger thanks to almost orthogonal vectors being a weaker condition than spherical cap packing.

Fabien Roger (2y)

You made me curious, so I ran a small experiment. Using the sum of absolute cosine similarities as the loss, initializing randomly on the unit sphere, and optimizing until convergence with L-BFGS (with strong Wolfe line search), here are the maximum cosine similarities I get (averages and stds over 5 runs since there was a bit of variation between runs):

It seems consistent with the exponential trend, but it also looks like you would need dim>>1000 to have any significant boost of number of vectors you can fit with cosine similarity < 0.01, so I don't think this happens in practice.

My optimization might have failed to converge to the global optimum though: this is not a nicely convex optimization problem (but the fact that there is little variation between runs is reassuring).

kyscg (8mo)

I have a question about the formula. While finding the number of almost orthogonal vectors, did you use $\cos θ = 0.1$ etc. to then find $\sin θ = \sqrt{1 - \cos^{2} θ}$ or was it something else?

Fabien Roger (8mo)

If I remember correctly that's what I did.

TheManxLoiner (2y)

Huge thanks for writing this! Particularly liked the SVD intuition and how it can be used to understand properties of . One small correction I think. You wrote:

$x - E [x]$ is simply $p_{1} (x)$ the projection along the vector $(1 \dots 1)$

I think $E [x]$ is projection along the vector $(1, \dots, 1)$ , so $x - E [x]$ is projection on hyperplane perpendicular to $(1, \dots, 1)$

Fabien Roger (2y)

Oops, that's what I meant, I'll make it more clear.
