In this post, I shall first describe a new word embedding algorithm that I came up with called a matrix product optimized (MPO) word embedding, and I will prove a theorem that completely interprets this word embedding in the simplest case. While it is probably infeasible to completely characterize a word embedding with a mathematical proof when the corpus is one that arises in practice, this theorem should be a signal that such a word embedding (or a similar word embedding) is interpretable and mathematical in other ways as well. This theorem also illustrates how MPO word embeddings should behave.

Unlike most word embedding algorithms, MPO word embeddings are matrix-valued: they map tokens to matrices instead of simply mapping tokens to vectors. In our case, the matrices are not necessarily real; they may be complex or even quaternionic matrices. MPO word embeddings also differ from other word embedding algorithms in that they are not constructed using neural networks, though we still use gradient ascent.

Why MPO word embeddings?

Since tokens often have many meanings depending on context, it seems better to represent a token in a form that makes it easier to separate the token's individual meanings. While vectors may be good for representing individual meanings of tokens, it is better to represent a polysemantic token as a matrix instead of a vector. If someone gave me the task of interpreting a word embedding, I would be much happier if the word embedding were matrix-valued and neatly organized each of the meanings of a polysemantic token into a matrix than if it were vector-valued with the individual meanings of the token awkwardly smushed together in a single vector.

Spaces of matrices have additional structure that is lacking in vector spaces, and one can use this additional structure to analyze or interpret our word embedding. This additional structure also means that matrix-valued word embeddings should behave more mathematically than vector-valued word embeddings.

MPO word embeddings also satisfy some interesting properties. Let $X_{D}$ denote the fitness level of a random MPO word embedding that we obtain from the data $D$, where $D$ consists of the corpus that we are training our word embedding on along with a couple of hyperparameters. Then the random variable $X_{D}$ will often have very low or even zero entropy. Since $X_{D}$ often has zero or low entropy, trained MPO word embeddings will not contain much (or any) random information that was not already present in the training data. It is easier to interpret a machine learning model when the trained model depends only on the training data and hyperparameters and not on random choices such as the initialization. Quaternionic and complex MPO word embeddings will often become real word embeddings after a change of basis. This means that MPO word embeddings behave quite mathematically, and machine learning models that behave in ways that mathematicians like should be more interpretable and understandable than other machine learning models.

Quaternionic matrices:

In this section, we shall go over the basics of quaternionic matrices. I assume that the reader is already familiar with ideas in linear algebra up through the singular value decomposition. Much of the basic theory of quaternionic matrices is a straightforward generalization of the theory of complex matrices. I hope you are also familiar with the quaternions, but if you are not, I will recall the definition and basic facts about them below. We refer the reader to [1] for facts about quaternions. You may skip this section if you only care about the real and complex cases.

If $A$ is a square real, complex, or quaternionic matrix, then the spectral radius of $A$ is

$$\rho(A)=\lim_{m\rightarrow\infty}\|A^{m}\|^{1/m},$$

where $\|\cdot\|$ is any matrix norm (since we are taking a limit, it does not matter which matrix norm we choose). If $A$ is a real or complex matrix, then

$$\rho(A)=\max\{|\lambda|:\lambda\text{ is an eigenvalue of }A\}.$$

The same fact holds for quaternionic matrices, but we have to be careful about how we define the eigenvalues of quaternionic matrices due to non-commutativity.

The division ring $\mathbb{H}$ of quaternions is the 4-dimensional associative but non-commutative algebra over the field of real numbers spanned by the elements $1,i,j,k$ (so $\mathbb{H}=\{a+bi+cj+dk:a,b,c,d\in\mathbb{R}\}$) where quaternionic multiplication is the unique associative (but non-commutative) bilinear operation such that $i^{2}=j^{2}=k^{2}=ijk=-1$. We observe that if $x,y\in\{i,j,k\}$ are distinct, then $xy=-yx$. It is easy to show that $ij=k$, $jk=i$, and $ki=j$.

Recall that the conjugate and the absolute value of a quaternion are defined by

$$\overline{a+bi+cj+dk}=a-bi-cj-dk\quad\text{and}\quad|a+bi+cj+dk|=\sqrt{a^{2}+b^{2}+c^{2}+d^{2}}$$

whenever $a,b,c,d\in\mathbb{R}$. If $x,y\in\mathbb{H}$, then $\overline{xy}=\overline{y}\cdot\overline{x}$ and $|xy|=|x|\cdot|y|$. Observe that the field of complex numbers $\mathbb{C}=\{a+bi:a,b\in\mathbb{R}\}$ is a subring of $\mathbb{H}$.
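To make the quaternion arithmetic above concrete, here is a minimal sketch (my own helper functions, not code from the post) that represents a quaternion $a+bi+cj+dk$ as a length-4 array and numerically checks the identities $\overline{xy}=\overline{y}\cdot\overline{x}$ and $|xy|=|x|\cdot|y|$.

```python
import numpy as np

def qmul(x, y):
    """Hamilton product of quaternions given as arrays [a, b, c, d] for a + bi + cj + dk."""
    a1, b1, c1, d1 = x
    a2, b2, c2, d2 = y
    return np.array([
        a1*a2 - b1*b2 - c1*c2 - d1*d2,
        a1*b2 + b1*a2 + c1*d2 - d1*c2,
        a1*c2 - b1*d2 + c1*a2 + d1*b2,
        a1*d2 + b1*c2 - c1*b2 + d1*a2,
    ])

def qconj(x):
    """Conjugate: a + bi + cj + dk -> a - bi - cj - dk."""
    return np.array([x[0], -x[1], -x[2], -x[3]])

def qabs(x):
    """Absolute value sqrt(a^2 + b^2 + c^2 + d^2)."""
    return float(np.linalg.norm(x))

x, y = np.random.randn(4), np.random.randn(4)
assert np.allclose(qconj(qmul(x, y)), qmul(qconj(y), qconj(x)))  # conjugate of xy equals conj(y) conj(x)
assert np.isclose(qabs(qmul(x, y)), qabs(x) * qabs(y))           # |xy| = |x| |y|
```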

If $A=(a_{r,s})_{r,s}$ is a quaternionic matrix, then define the adjoint of $A$ to be $A^{*}=(\overline{a_{s,r}})_{r,s}$. While the adjoint of a quaternionic matrix is well-behaved, the transpose of a quaternionic matrix is not very well behaved since we typically have $(AB)^{*}=B^{*}A^{*}$ but $(AB)^{T}\neq B^{T}A^{T}$ for quaternionic matrices $A,B$.

We shall now associate $n\times n$ quaternionic matrices with $2n\times 2n$ complex matrices.

Suppose that $A,B$ are $n\times n$ complex matrices. Then the associated complex matrix $\Gamma(A+Bj)$ of the quaternionic matrix $A+Bj$ is the complex matrix

$$\Gamma(A+Bj)=\begin{pmatrix}A & B\\ -\overline{B} & \overline{A}\end{pmatrix}.$$

We observe that $\Gamma(X+Y)=\Gamma(X)+\Gamma(Y)$, $\Gamma(XY)=\Gamma(X)\Gamma(Y)$, and $\Gamma(X^{*})=\Gamma(X)^{*}$ whenever $X,Y$ are quaternionic matrices and the operations are defined.
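The following sketch (helper names are mine) stores a quaternionic matrix $A=A_{1}+A_{2}j$ as the pair of complex matrices $(A_{1},A_{2})$, builds the associated complex matrix $\Gamma(A)$ described above, and numerically checks that $\Gamma(XY)=\Gamma(X)\Gamma(Y)$ and $\Gamma(X^{*})=\Gamma(X)^{*}$.

```python
import numpy as np

def gamma(A1, A2):
    """Associated complex matrix of the quaternionic matrix A = A1 + A2*j."""
    return np.block([[A1, A2], [-A2.conj(), A1.conj()]])

def qmatmul(A, B):
    """Product of quaternionic matrices given as pairs (A1, A2) and (B1, B2)."""
    A1, A2 = A
    B1, B2 = B
    return (A1 @ B1 - A2 @ B2.conj(), A1 @ B2 + A2 @ B1.conj())

def qadjoint(A):
    """Adjoint of A = A1 + A2*j, namely A1^* + (-A2^T)*j."""
    A1, A2 = A
    return (A1.conj().T, -A2.T)

def rand_qmat(n):
    return (np.random.randn(n, n) + 1j * np.random.randn(n, n),
            np.random.randn(n, n) + 1j * np.random.randn(n, n))

X, Y = rand_qmat(3), rand_qmat(3)
assert np.allclose(gamma(*qmatmul(X, Y)), gamma(*X) @ gamma(*Y))  # Gamma(XY) = Gamma(X) Gamma(Y)
assert np.allclose(gamma(*qadjoint(X)), gamma(*X).conj().T)       # Gamma(X^*) = Gamma(X)^*
```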

An eigenvalue of a quaternionic matrix $A$ is a quaternion $\lambda$ such that $Av=v\lambda$ for some non-zero quaternionic column vector $v$ (since $\mathbb{H}$ is non-commutative, we put the scalar on the right).

Observation: If $\lambda$ is an eigenvalue of a quaternionic matrix $A$ corresponding to the eigenvector $v$ and $\mu$ is an eigenvalue of a quaternionic matrix $B$ corresponding to the same eigenvector $v$, then $ABv=A(v\mu)=(Av)\mu=v\lambda\mu$, so $v$ is an eigenvector of the quaternionic matrix $AB$ with eigenvalue $\lambda\mu$.

Observation: If $\lambda$ is an eigenvalue of a product of quaternionic matrices $AB$ with eigenvector $v$, then $BA(Bv)=B(ABv)=(Bv)\lambda$, so if $\lambda\neq 0$ (so that $Bv\neq 0$), then $\lambda$ is also an eigenvalue of $BA$. In particular, $AB$ and $BA$ have the same non-zero eigenvalues.

Observation: If $A$ is a quaternionic matrix and $\lambda$ is an eigenvalue of $A$, then whenever $q$ is a non-zero quaternion, the value $q^{-1}\lambda q$ is also an eigenvalue of $A$.

Proof: If $\lambda$ is an eigenvalue of $A$, then there is some non-zero quaternionic vector $v$ with $Av=v\lambda$. Therefore, set $w=vq$. Then $Aw=Avq=v\lambda q=vq(q^{-1}\lambda q)=w(q^{-1}\lambda q)$, so $q^{-1}\lambda q$ is an eigenvalue of the quaternionic matrix $A$.

Let $S^{3}$ denote the group of all quaternions $q$ with $|q|=1$. Let $SU(2)$ denote the group of all $2\times 2$ complex unitary matrices with determinant $1$. Let $SO(3)$ denote the group of all $3\times 3$ real orthogonal matrices with determinant $1$. Then the mapping $q\mapsto\Gamma(q)$ is an isomorphism from $S^{3}$ to $SU(2)$. Let $V$ be the vector subspace of $\mathbb{H}$ consisting of all elements of the form $bi+cj+dk$ where $b,c,d$ are real numbers. Then $V$ is an inner product space where the inner product is the standard dot product operation on $V\cong\mathbb{R}^{3}$ (where $\langle x,y\rangle=\operatorname{Re}(x\overline{y})$ for $x,y\in V$). Then we can identify $SO(3)$ with the group of orientation-preserving linear transformations of $V$ that preserve this inner product. We may now define a group homomorphism $\phi:S^{3}\rightarrow SO(3)$ by letting $\phi(q)(x)=qxq^{-1}$ for $x\in V$. Then the homomorphism $\phi$ is a surjective homomorphism with kernel $\{1,-1\}$. In fact, $SO(3)$ is a connected Lie group with $\pi_{1}(SO(3))\cong\mathbb{Z}/2\mathbb{Z}$, and the group $S^{3}\cong SU(2)$ is simply connected; $S^{3}$ is the universal cover of the Lie group $SO(3)$. From these facts about quaternions, we may make a few observations:

Observation: If $x,y$ are quaternions, then there is some quaternion $q$ with $|q|=1$ and $y=qxq^{-1}$ if and only if $\operatorname{Re}(x)=\operatorname{Re}(y)$ and $|x|=|y|$.

Observation: Suppose that $A$ is a square quaternionic matrix and $\lambda,\mu$ are quaternions with $\operatorname{Re}(\lambda)=\operatorname{Re}(\mu)$ and $|\lambda|=|\mu|$. Then $\lambda$ is an eigenvalue of $A$ if and only if $\mu$ is an eigenvalue of $A$.

Observation: Let $A$ be a square quaternionic matrix. Suppose that $v$ is a quaternionic vector, and $v_{1},v_{2}$ are complex vectors with $v=v_{1}+v_{2}j$. Suppose furthermore that $\lambda$ is a complex number. Then $Av=v\lambda$ if and only if $\Gamma(A)\begin{pmatrix}v_{1}\\ -\overline{v_{2}}\end{pmatrix}=\begin{pmatrix}v_{1}\\ -\overline{v_{2}}\end{pmatrix}\lambda$. In particular, a complex number $\lambda$ is an eigenvalue of $A$ if and only if $\lambda$ is an eigenvalue of $\Gamma(A)$.

Observation: Let $A$ be a square quaternionic matrix. Then the following quantities are equal:

  1. $\lim_{m\rightarrow\infty}\|A^{m}\|^{1/m}$ (for any matrix norm $\|\cdot\|$).
  2. $\max\{|\lambda|:\lambda\text{ is an eigenvalue of }A\}$.
  3. The spectral radius of the complex matrix $\Gamma(A)$.

We shall therefore define the spectral radius of a square quaternionic matrix $A$ to be $\rho(A)=\max\{|\lambda|:\lambda\text{ is an eigenvalue of }A\}$ (or any of the quantities in the above observation).
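As a sanity check on this definition, here is a small sketch (the pair representation and helper names follow the earlier sketch and are mine; they are redefined so the block is self-contained) comparing the eigenvalue-based spectral radius of a random quaternionic matrix, computed through $\Gamma(A)$, with the limit $\lim_{m\to\infty}\|A^{m}\|^{1/m}$.

```python
import numpy as np

def gamma(A1, A2):
    return np.block([[A1, A2], [-A2.conj(), A1.conj()]])

def qmatmul(A, B):
    A1, A2 = A
    B1, B2 = B
    return (A1 @ B1 - A2 @ B2.conj(), A1 @ B2 + A2 @ B1.conj())

def qfrob(A):
    """Frobenius norm of A = A1 + A2*j."""
    return np.sqrt(np.linalg.norm(A[0])**2 + np.linalg.norm(A[1])**2)

n = 3
A = (np.random.randn(n, n) + 1j * np.random.randn(n, n),
     np.random.randn(n, n) + 1j * np.random.randn(n, n))

rho_eig = np.max(np.abs(np.linalg.eigvals(gamma(*A))))  # max |lambda| over eigenvalues of Gamma(A)

# Estimate lim ||A^m||^(1/m) by taking a large power of A, rescaling at each step
# to avoid overflow and tracking the accumulated norm on a log scale.
P, m, log_norm = A, 1, 0.0
for _ in range(300):
    P = qmatmul(P, A)
    m += 1
    s = qfrob(P)
    P = (P[0] / s, P[1] / s)
    log_norm += np.log(s)
rho_lim = np.exp(log_norm / m)

print(rho_eig, rho_lim)  # the two estimates should be close
```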

If $A$ is a quaternionic matrix, then we say that $A$ is Hermitian, normal, or unitary respectively if $\Gamma(A)$ is Hermitian, normal, or unitary. If $A$ is an $n\times n$ quaternionic matrix, then $A$ is Hermitian iff $A=A^{*}$, $A$ is normal iff $AA^{*}=A^{*}A$, and $A$ is unitary iff $AA^{*}=A^{*}A=I_{n}$.

If $u=(u_{1},\dots,u_{n})^{T},v=(v_{1},\dots,v_{n})^{T}$ are quaternionic column vectors, then define the quaternionic inner product $\langle u,v\rangle$ of $u,v$ by $\langle u,v\rangle=u^{*}v=\sum_{r=1}^{n}\overline{u_{r}}v_{r}$, and define the norm $\|v\|=\sqrt{\langle v,v\rangle}=\sqrt{\sum_{r=1}^{n}|v_{r}|^{2}}$. We observe that the quaternionic inner product is real bilinear and conjugate symmetric ($\langle u,v\rangle=\overline{\langle v,u\rangle}$). The quaternionic inner product preserves right scalar multiplication ($\langle u,v\lambda\rangle=\langle u,v\rangle\lambda$ and $\langle u\lambda,v\rangle=\overline{\lambda}\langle u,v\rangle$ for $\lambda\in\mathbb{H}$). We define the real inner product of two quaternionic vectors by setting $\langle u,v\rangle_{\mathbb{R}}=\operatorname{Re}\langle u,v\rangle$. We may recover the quaternionic inner product from the real-valued inner product since $\langle u,v\rangle=\langle u,v\rangle_{\mathbb{R}}-\langle u,vi\rangle_{\mathbb{R}}\,i-\langle u,vj\rangle_{\mathbb{R}}\,j-\langle u,vk\rangle_{\mathbb{R}}\,k$ for quaternionic vectors $u,v$.

Observation: If $u,v$ are quaternionic vectors, then $\langle u\lambda,v\lambda\rangle=\overline{\lambda}\langle u,v\rangle\lambda$ and $\|v\lambda\|=\|v\|\cdot|\lambda|$ for $\lambda\in\mathbb{H}$.

Observation: Suppose $u,v$ are quaternionic vectors. Then $|\langle u,v\rangle|\leq\|u\|\cdot\|v\|$.

Observation: Suppose $u,v$ are non-zero quaternionic vectors. If $|\langle u,v\rangle|=\|u\|\cdot\|v\|$, then $v=u\lambda$ for some quaternion $\lambda$.

Counterexample: In general, if $\lambda$ is a non-zero quaternion, and $v$ is a non-zero quaternionic vector, then $\langle v,\lambda v\rangle\neq\lambda\langle v,v\rangle$; the quaternionic inner product does not interact well with left scalar multiplication.

Observation: Let $A$ be an $m\times n$ quaternionic matrix, and let $B$ be an $n\times m$ quaternionic matrix. Then the following are equivalent:

  1. $\langle Au,v\rangle=\langle u,Bv\rangle$ for all quaternionic vectors $u\in\mathbb{H}^{n},v\in\mathbb{H}^{m}$.
  2. $\langle Au,v\rangle_{\mathbb{R}}=\langle u,Bv\rangle_{\mathbb{R}}$ for all quaternionic vectors $u\in\mathbb{H}^{n},v\in\mathbb{H}^{m}$.

Proposition: Let $A$ be a quaternionic matrix. Then the following are equivalent:

  1. $\langle Au,Av\rangle=\langle u,v\rangle$ for all quaternionic vectors $u,v$.
  2. $\langle Au,Av\rangle_{\mathbb{R}}=\langle u,v\rangle_{\mathbb{R}}$ for all quaternionic vectors $u,v$.
  3. $\|Au\|=\|u\|$ for all quaternionic vectors $u$.
  4. The mapping $u\mapsto Au$ is an isometry.

If $A$ is a square quaternionic matrix, then the above statements are all equivalent to the following:

  5. $A$ is unitary.

Theorem: (quaternionic singular value decomposition) If $A$ is an $m\times n$ quaternionic matrix, then there exist quaternionic unitary matrices $U,V$ where $A=U\Sigma V^{*}$ and $\Sigma$ is an $m\times n$ diagonal matrix with non-negative, non-increasing real diagonal entries.

In the above theorem, the diagonal entries of $\Sigma$ are called the singular values of $A$, and they are denoted by $\sigma_{1}(A)\geq\sigma_{2}(A)\geq\dots\geq\sigma_{\min(m,n)}(A)$.

If $A$ is a real, complex, or quaternionic matrix, and $1\leq p\leq\infty$, then define the Schatten $p$-norm of $A$ to be $\|A\|_{p}=\big(\sum_{k}\sigma_{k}(A)^{p}\big)^{1/p}$ for $1\leq p<\infty$, and define $\|A\|_{\infty}=\max_{k}\sigma_{k}(A)=\sigma_{1}(A)$ for $p=\infty$.

Observation: If $A$ is a quaternionic matrix with singular values $\sigma_{1}\geq\dots\geq\sigma_{r}$, then $\Gamma(A)$ has singular values $\sigma_{1},\sigma_{1},\sigma_{2},\sigma_{2},\dots,\sigma_{r},\sigma_{r}$. In particular, $\|\Gamma(A)\|_{\infty}=\|A\|_{\infty}$ and $\|\Gamma(A)\|_{2}=\sqrt{2}\,\|A\|_{2}$, so $\|\Gamma(A)\|_{p}=2^{1/p}\|A\|_{p}$ whenever $1\leq p<\infty$.

Observation: If $A$ is a square quaternionic matrix, then $\rho(A)\leq\|A\|_{\infty}\leq\|A\|_{p}$ for all $1\leq p\leq\infty$.
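Here is a small numerical sketch of the two observations above (with my own helper names): the singular values of a quaternionic matrix can be read off from the SVD of $\Gamma(A)$, where each singular value appears twice, and the resulting Schatten norms dominate the spectral radius.

```python
import numpy as np

def gamma(A1, A2):
    return np.block([[A1, A2], [-A2.conj(), A1.conj()]])

n = 4
A = (np.random.randn(n, n) + 1j * np.random.randn(n, n),
     np.random.randn(n, n) + 1j * np.random.randn(n, n))

sv_gamma = np.linalg.svd(gamma(*A), compute_uv=False)   # 2n singular values, in equal pairs
sigma = sv_gamma[::2]                                    # keep one copy from each pair

def schatten(sigma, p):
    """Schatten p-norm computed from the singular values."""
    return float(np.max(sigma)) if np.isinf(p) else float(np.sum(sigma**p) ** (1.0 / p))

rho = np.max(np.abs(np.linalg.eigvals(gamma(*A))))       # spectral radius of A
for p in [1, 2, 4, np.inf]:
    assert rho <= schatten(sigma, np.inf) + 1e-9                 # rho(A) <= ||A||_infty
    assert schatten(sigma, np.inf) <= schatten(sigma, p) + 1e-9  # ||A||_infty <= ||A||_p
```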

If $A=(a_{r,s})_{r,s}$ is an $n\times n$ quaternionic matrix, then define the quaternionic trace of $A$ as

$$\operatorname{Tr}(A)=\sum_{r=1}^{n}a_{r,r}.$$

In general, $\operatorname{Tr}(AB)\neq\operatorname{Tr}(BA)$, so the notion of the quaternionic trace is deficient.

If $A$ is an $n\times n$ quaternionic matrix, then define the real trace of $A$ as $\operatorname{Tr}_{\mathbb{R}}(A)=\operatorname{Re}(\operatorname{Tr}(A))$. We observe that $\operatorname{Tr}_{\mathbb{R}}(AB)=\operatorname{Tr}_{\mathbb{R}}(BA)$ and $\operatorname{Tr}_{\mathbb{R}}(A^{*})=\operatorname{Tr}_{\mathbb{R}}(A)$ whenever $A,B$ are quaternionic matrices for which the operations are defined.

Observe that we can recover $\operatorname{Tr}$ from $\operatorname{Tr}_{\mathbb{R}}$ by using the formula $\operatorname{Tr}(A)=\operatorname{Tr}_{\mathbb{R}}(A)-\operatorname{Tr}_{\mathbb{R}}(Ai)\,i-\operatorname{Tr}_{\mathbb{R}}(Aj)\,j-\operatorname{Tr}_{\mathbb{R}}(Ak)\,k$.
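The following sketch (helper names are mine) checks numerically that the real trace is cyclic, $\operatorname{Tr}_{\mathbb{R}}(AB)=\operatorname{Tr}_{\mathbb{R}}(BA)$, while the naive quaternionic trace generally is not.

```python
import numpy as np

def qmatmul(A, B):
    A1, A2 = A
    B1, B2 = B
    return (A1 @ B1 - A2 @ B2.conj(), A1 @ B2 + A2 @ B1.conj())

def qtrace(A):
    """Quaternionic trace of A = A1 + A2*j, returned as the pair (complex part, j part)."""
    return (np.trace(A[0]), np.trace(A[1]))

def real_trace(A):
    """Real trace Tr_R(A) = Re(Tr(A))."""
    return np.trace(A[0]).real

def rand_qmat(n):
    return (np.random.randn(n, n) + 1j * np.random.randn(n, n),
            np.random.randn(n, n) + 1j * np.random.randn(n, n))

A, B = rand_qmat(3), rand_qmat(3)
assert np.isclose(real_trace(qmatmul(A, B)), real_trace(qmatmul(B, A)))  # Tr_R is cyclic
print(qtrace(qmatmul(A, B)))  # the full quaternionic traces of AB and BA ...
print(qtrace(qmatmul(B, A)))  # ... share a real part but generally differ otherwise
```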

If $A,B$ are $m\times n$ quaternionic matrices, then define the real-valued Frobenius inner product $\langle A,B\rangle_{\mathbb{R}}$ as $\langle A,B\rangle_{\mathbb{R}}=\operatorname{Tr}_{\mathbb{R}}(A^{*}B)$.

If $A=A_{1}+A_{2}i+A_{3}j+A_{4}k$ and $B=B_{1}+B_{2}i+B_{3}j+B_{4}k$ where every $A_{t},B_{t}$ is real, then

$$\langle A,B\rangle_{\mathbb{R}}=\sum_{t=1}^{4}\operatorname{Tr}(A_{t}^{T}B_{t}).$$

Define the Frobenius norm $\|A\|_{F}=\sqrt{\langle A,A\rangle_{\mathbb{R}}}$.

We observe that if $U,V$ are unitary, then

$$\langle UAV,UBV\rangle_{\mathbb{R}}=\operatorname{Tr}_{\mathbb{R}}(V^{*}A^{*}U^{*}UBV)=\operatorname{Tr}_{\mathbb{R}}(V^{*}A^{*}BV)=\operatorname{Tr}_{\mathbb{R}}(A^{*}BVV^{*})=\langle A,B\rangle_{\mathbb{R}}.$$

In particular, if $A=U\Sigma V^{*}$ where $U,V$ are unitary and $\Sigma$ is diagonal with diagonal entries $\sigma_{1},\dots,\sigma_{r}$, then $\|A\|_{F}^{2}=\langle A,A\rangle_{\mathbb{R}}=\langle\Sigma,\Sigma\rangle_{\mathbb{R}}=\sigma_{1}^{2}+\dots+\sigma_{r}^{2}$, so the Frobenius norm $\|A\|_{F}$ coincides with the Schatten $2$-norm $\|A\|_{2}$.
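A quick numerical check of the formulas above (the helper names are mine): the real Frobenius inner product $\langle A,B\rangle_{\mathbb{R}}=\operatorname{Tr}_{\mathbb{R}}(A^{*}B)$ agrees with the componentwise formula, and $\|A\|_{F}^{2}$ equals the sum of the squared singular values.

```python
import numpy as np

def gamma(A1, A2):
    return np.block([[A1, A2], [-A2.conj(), A1.conj()]])

def qmatmul(A, B):
    A1, A2 = A
    B1, B2 = B
    return (A1 @ B1 - A2 @ B2.conj(), A1 @ B2 + A2 @ B1.conj())

def qadjoint(A):
    A1, A2 = A
    return (A1.conj().T, -A2.T)

def frob_inner(A, B):
    """Real Frobenius inner product Tr_R(A^* B)."""
    return np.trace(qmatmul(qadjoint(A), B)[0]).real

def components(A):
    """Real matrices X1, X2, X3, X4 with A = X1 + X2 i + X3 j + X4 k."""
    A1, A2 = A
    return [A1.real, A1.imag, A2.real, A2.imag]

def rand_qmat(n):
    return (np.random.randn(n, n) + 1j * np.random.randn(n, n),
            np.random.randn(n, n) + 1j * np.random.randn(n, n))

A, B = rand_qmat(3), rand_qmat(3)
componentwise = sum(np.sum(X * Y) for X, Y in zip(components(A), components(B)))
assert np.isclose(frob_inner(A, B), componentwise)

sigma = np.linalg.svd(gamma(*A), compute_uv=False)[::2]   # singular values of A
assert np.isclose(frob_inner(A, A), np.sum(sigma**2))     # ||A||_F^2 = sum of sigma_k^2
```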

MPO word embeddings

Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $A$ be a finite set (in our case, $A$ will denote the set of all tokens), and let $d\geq 1$. Let $D_{A,d,K}$ be the set of all functions $f:A\rightarrow M_{d}(K)$ such that $\sum_{a\in A}f(a)^{*}f(a)=\frac{|A|}{d}I_{d}$. Observe that $D_{A,d,K}$ is compact. Given a string of tokens $a_{1}\dots a_{n}$ (the corpus), define a function $F:D_{A,d,K}\rightarrow[0,\infty)$ by letting $F(f)=\rho(f(a_{1})f(a_{2})\cdots f(a_{n}))$. We say that $f$ is an MPO word pre-embedding for the string of tokens $a_{1}\dots a_{n}$ if the quantity $F(f)$ is locally maximized on $D_{A,d,K}$. To maximize $F(f)$, the function $f$ must simultaneously satisfy two properties. Since $\sum_{a\in A}f(a)^{*}f(a)=\frac{|A|}{d}I_{d}$, the matrices $f(a)$ must be spread out throughout all $d$ dimensions. But we also need the matrices $f(a_{i})$ to be compatible with $f(a_{i-1})$ and $f(a_{i+1})$ and, more generally, with $f(a_{j})$ for $j$ near $i$.
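To make the definition concrete, here is a minimal, illustrative sketch in the real case $K=\mathbb{R}$. It assumes the normalization $\sum_{a\in A}f(a)^{T}f(a)=\frac{|A|}{d}I_{d}$ stated above, maximizes $\rho(f(a_{1})\cdots f(a_{n}))$ by naive finite-difference gradient ascent, and projects back onto the constraint set after each step via a polar decomposition. The tokens, corpus, step size, and the optimization procedure itself are hypothetical choices of mine, not the author's training procedure.

```python
import numpy as np

d = 2
tokens = ["the", "cat", "sat", "down"]                    # the finite set A
corpus = ["the", "cat", "sat", "down", "the", "cat"]      # the string of tokens a_1 ... a_n
m = len(tokens)

def project(f):
    """Project onto {f : sum_a f(a)^T f(a) = (|A|/d) I_d} using a polar decomposition."""
    F = np.vstack([f[t] for t in tokens])                 # stack the f(a) into an (|A| d) x d matrix
    U, _, Vt = np.linalg.svd(F, full_matrices=False)
    F = np.sqrt(m / d) * U @ Vt
    return {t: F[i*d:(i+1)*d] for i, t in enumerate(tokens)}

def fitness(f):
    """rho(f(a_1) f(a_2) ... f(a_n))."""
    P = np.eye(d)
    for t in corpus:
        P = P @ f[t]
    return np.max(np.abs(np.linalg.eigvals(P)))

f = project({t: np.random.randn(d, d) for t in tokens})
eps, lr = 1e-6, 0.1
for _ in range(2000):                                     # crude finite-difference ascent
    base = fitness(f)
    grad = {}
    for t in tokens:
        g = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                fp = {s: f[s].copy() for s in tokens}
                fp[t][i, j] += eps
                g[i, j] = (fitness(fp) - base) / eps
        grad[t] = g
    f = project({t: f[t] + lr * grad[t] for t in tokens})

print("fitness:", fitness(f))
```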

We say that a collection of vectors $u_{1},\dots,u_{n}\in K^{d}$ is a tight frame if $\sum_{s=1}^{n}u_{s}u_{s}^{*}=cI_{d}$ for some (necessarily positive) constant $c$. We say that a tight frame $u_{1},\dots,u_{n}$ is an equal norm tight frame if $\|u_{1}\|=\dots=\|u_{n}\|$.

Theorem: Suppose that $n\geq d$. Let $K$ denote either the field of real numbers or complex numbers. Let $S$ denote the unit sphere in $K^{d}$. Then the local minimizers of the frame potential defined by $\operatorname{FP}(u_{1},\dots,u_{n})=\sum_{s=1}^{n}\sum_{t=1}^{n}|\langle u_{s},u_{t}\rangle|^{2}$ for $u_{1},\dots,u_{n}\in S$ are global minimizers which are tight frames for $K^{d}$. In particular, there exists an equal norm tight frame $u_{1},\dots,u_{n}$ with $\|u_{1}\|=\dots=\|u_{n}\|=1$ whenever $n\geq d$.

See [2, Thm 6.9] for a proof.
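Here is a small sketch of the result just cited in the real case: projected gradient descent on the frame potential over unit vectors, followed by a check that the frame operator $\sum_{s}u_{s}u_{s}^{T}$ is close to $\frac{n}{d}I_{d}$. The dimensions, step size, and iteration count are arbitrary choices of mine.

```python
import numpy as np

d, n = 3, 7
U = np.random.randn(n, d)
U /= np.linalg.norm(U, axis=1, keepdims=True)      # n points on the unit sphere in R^d

lr = 0.005
for _ in range(20000):
    G = U @ U.T                                    # Gram matrix of the <u_s, u_t>
    grad = 4 * G @ U                               # Euclidean gradient of sum_{s,t} <u_s, u_t>^2
    U -= lr * grad
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # project back onto the unit sphere

frame_operator = U.T @ U                           # sum_s u_s u_s^T
print(np.round(frame_operator, 3))                 # should be close to (n/d) * I_d
```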

Theorem: Suppose that $n\geq d$ and that the tokens $a_{1},\dots,a_{n}$ are distinct, and let $A=\{a_{1},\dots,a_{n}\}$. Suppose that $K$ is the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $f\in D_{A,d,K}$. Then $\rho(f(a_{1})\cdots f(a_{n}))\leq 1$, and $\rho(f(a_{1})\cdots f(a_{n}))=1$ if and only if there is an equal norm tight frame $u_{1},\dots,u_{n}\in K^{d}$ with $\|u_{1}\|=\dots=\|u_{n}\|=1$ and $\lambda_{1},\dots,\lambda_{n}\in K$ with $|\lambda_{1}|=\dots=|\lambda_{n}|=1$ where $f(a_{s})=u_{s}\lambda_{s}u_{s+1}^{*}$ for all $s$ (where addition in the subscripts is taken modulo $n$).

Proof: In order to not repeat ourselves, our proof shall apply to all three cases $K=\mathbb{R},\mathbb{C},\mathbb{H}$, but for accessibility purposes, the proof shall be understandable to those who are only familiar with real and/or complex matrices.

Suppose that $u_{1},\dots,u_{n}$ is an equal norm tight frame with $\|u_{1}\|=\dots=\|u_{n}\|=1$, and $\lambda_{1},\dots,\lambda_{n}$ are elements of $K$ with $|\lambda_{s}|=1$ and $f(a_{s})=u_{s}\lambda_{s}u_{s+1}^{*}$ for all $s$. Since $\sum_{s=1}^{n}u_{s}u_{s}^{*}=cI_{d}$ for some constant $c$ and $\sum_{s=1}^{n}\|u_{s}\|^{2}=n$, taking real traces gives $c=\frac{n}{d}$. Therefore,

$$\sum_{s=1}^{n}f(a_{s})^{*}f(a_{s})=\sum_{s=1}^{n}u_{s+1}\overline{\lambda_{s}}u_{s}^{*}u_{s}\lambda_{s}u_{s+1}^{*}=\sum_{s=1}^{n}u_{s+1}u_{s+1}^{*}=\sum_{s=1}^{n}u_{s}u_{s}^{*}=\frac{n}{d}I_{d}.$$

Therefore, $f\in D_{A,d,K}$.

Now,

$$f(a_{1})f(a_{2})\cdots f(a_{n})=u_{1}\lambda_{1}u_{2}^{*}u_{2}\lambda_{2}u_{3}^{*}\cdots u_{n}\lambda_{n}u_{1}^{*}=u_{1}\lambda_{1}\lambda_{2}\cdots\lambda_{n}u_{1}^{*},$$

which is a rank-one matrix with $(f(a_{1})\cdots f(a_{n}))u_{1}=u_{1}(\lambda_{1}\cdots\lambda_{n})$. Therefore, we have

$$\rho(f(a_{1})\cdots f(a_{n}))=|\lambda_{1}\cdots\lambda_{n}|\cdot\|u_{1}\|^{2}=1.$$

Suppose now that $f\in D_{A,d,K}$. By the arithmetic-geometric mean inequality, we have

$$\rho(f(a_{1})\cdots f(a_{n}))\leq\|f(a_{1})\cdots f(a_{n})\|_{\infty}\leq\prod_{s=1}^{n}\|f(a_{s})\|_{\infty}\leq\prod_{s=1}^{n}\|f(a_{s})\|_{2}\leq\Big(\frac{1}{n}\sum_{s=1}^{n}\|f(a_{s})\|_{2}^{2}\Big)^{n/2}=1,$$

since the Schatten $\infty$-norm is submultiplicative and $\sum_{s=1}^{n}\|f(a_{s})\|_{2}^{2}=\operatorname{Tr}_{\mathbb{R}}\big(\sum_{s=1}^{n}f(a_{s})^{*}f(a_{s})\big)=\operatorname{Tr}_{\mathbb{R}}\big(\frac{n}{d}I_{d}\big)=n$.

Suppose now that $\rho(f(a_{1})\cdots f(a_{n}))=1$, so that every inequality in the above chain is an equality. We observe that $\|f(a_{s})\|_{\infty}=\|f(a_{s})\|_{2}$ precisely when $f(a_{s})$ has rank at most $1$. From the arithmetic-geometric mean inequality, we have $\prod_{s=1}^{n}\|f(a_{s})\|_{2}=\big(\frac{1}{n}\sum_{s=1}^{n}\|f(a_{s})\|_{2}^{2}\big)^{n/2}$ precisely when $\|f(a_{1})\|_{2}=\dots=\|f(a_{n})\|_{2}=1$. Therefore, each $f(a_{s})$ has rank at most $1$ and $\sigma_{1}(f(a_{s}))=\|f(a_{s})\|_{2}=1$ for all $s$, so we can set $f(a_{s})=u_{s}v_{s}^{*}$ where $\|u_{s}\|=\|v_{s}\|=1$ for all $s$.

Observe that $f(a_{1})\cdots f(a_{n})=u_{1}(v_{1}^{*}u_{2})(v_{2}^{*}u_{3})\cdots(v_{n-1}^{*}u_{n})v_{n}^{*}$. Therefore, since the spectral radius of a rank-one matrix $xy^{*}$ is $|y^{*}x|$ (as $(xy^{*})x=x(y^{*}x)$), we have

$$1=\rho(f(a_{1})\cdots f(a_{n}))=|v_{n}^{*}u_{1}|\cdot|v_{1}^{*}u_{2}|\cdots|v_{n-1}^{*}u_{n}|=\prod_{s=1}^{n}|\langle v_{s},u_{s+1}\rangle|,$$

so, since $|\langle v_{s},u_{s+1}\rangle|\leq\|v_{s}\|\cdot\|u_{s+1}\|=1$ by the Cauchy-Schwarz inequality,

$$|\langle v_{s},u_{s+1}\rangle|=\|v_{s}\|\cdot\|u_{s+1}\|$$

for all $s$ (where addition in the subscripts is taken modulo $n$).

Therefore, there are elements $\lambda_{1},\dots,\lambda_{n}\in K$ with $|\lambda_{s}|=1$ and where for all $s$ we have $u_{s+1}=v_{s}\lambda_{s}$ and hence $f(a_{s})=u_{s}v_{s}^{*}=u_{s}\lambda_{s}u_{s+1}^{*}$.

We observe that

$$\frac{n}{d}I_{d}=\sum_{s=1}^{n}f(a_{s})^{*}f(a_{s})=\sum_{s=1}^{n}u_{s+1}\overline{\lambda_{s}}u_{s}^{*}u_{s}\lambda_{s}u_{s+1}^{*}=\sum_{s=1}^{n}u_{s+1}u_{s+1}^{*}=\sum_{s=1}^{n}u_{s}u_{s}^{*}.$$

Therefore, $u_{1},\dots,u_{n}$ is an equal norm tight frame. $\square$
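The equality case of the theorem can be checked numerically. The sketch below (complex case, with a particular unit-norm tight frame of my choosing) builds $f(a_{s})=u_{s}\lambda_{s}u_{s+1}^{*}$ from equally spaced unit vectors in $\mathbb{C}^{2}$ and verifies both the constraint $\sum_{s}f(a_{s})^{*}f(a_{s})=\frac{n}{d}I_{d}$ and the maximal fitness $\rho(f(a_{1})\cdots f(a_{n}))=1$, under the normalization used above.

```python
import numpy as np

d, n = 2, 5
angles = np.pi * np.arange(n) / n                         # equally spaced directions
U = np.stack([np.cos(angles), np.sin(angles)], axis=1).astype(complex)  # rows u_s form a unit-norm tight frame for C^2

lam = np.exp(2j * np.pi * np.random.rand(n))              # unit scalars lambda_s
f = [lam[s] * np.outer(U[s], U[(s + 1) % n].conj()) for s in range(n)]  # f(a_s) = u_s lambda_s u_{s+1}^*

constraint = sum(M.conj().T @ M for M in f)
assert np.allclose(constraint, (n / d) * np.eye(d))       # f satisfies the constraint

P = np.eye(d, dtype=complex)
for M in f:
    P = P @ M
print(np.max(np.abs(np.linalg.eigvals(P))))               # spectral radius of the product: 1.0
```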

Some thoughts:

It seems easier to prove theorems about MPO word pre-embeddings than it is to prove theorems about other machine learning models simply because MPO word pre-embeddings are more mathematical in nature. Of course, neural networks with ReLU activation are mathematical too, so we can prove theorems about them, but MPO word pre-embeddings are more like the objects that mathematicians like to investigate. And it seems easier to mathematically prove theorems that interpret MPO word pre-embeddings than it is to prove theorems that interpret other machine learning models. On the other hand, we are making a tradeoff here. MPO word pre-embeddings behave more mathematically, but word embeddings are simply the first layer in natural language processing.

Why use the spectral radius?

The matrix $f(a_{1})\cdots f(a_{n})$ is in general approximately a rank-$1$ matrix. This means that $f(a_{1})\cdots f(a_{n})\approx\sigma uv^{*}$ where $\sigma=\|f(a_{1})\cdots f(a_{n})\|_{\infty}$ and $u,v$ are suitably chosen unit vectors. One may be tempted to define the fitness of $f$ as $\|f(a_{1})\cdots f(a_{n})\|_{2}$ or $\|f(a_{1})\cdots f(a_{n})\|_{\infty}$. But mathematical objects are better behaved when taking limits. Instead of considering the string $a_{1}\dots a_{n}$ as our corpus, we may consider the similar corpus consisting of $N$ copies of $a_{1}\dots a_{n}$ concatenated together as $N\rightarrow\infty$, and in this case, the fitness of $f$ would be $\|(f(a_{1})\cdots f(a_{n}))^{N}\|_{2}^{1/N}$ or $\|(f(a_{1})\cdots f(a_{n}))^{N}\|_{\infty}^{1/N}$, which both converge to $\rho(f(a_{1})\cdots f(a_{n}))$ as $N\rightarrow\infty$. The use of the spectral radius simplifies and improves the behavior of the machine learning model for the same reason that integrals such as $\int_{0}^{\infty}f(x)\,dx$ behave better than sums such as $\sum_{k=0}^{m}f(k)$.
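A tiny numerical illustration of the limit used in this argument: for a random real matrix $M$, both $\|M^{N}\|_{2}^{1/N}$ and $\|M^{N}\|_{\infty}^{1/N}$ approach the spectral radius $\rho(M)$ as $N$ grows.

```python
import numpy as np

M = np.random.randn(4, 4)
rho = np.max(np.abs(np.linalg.eigvals(M)))

for N in [1, 5, 20, 80]:
    P = np.linalg.matrix_power(M, N)
    frobenius = np.linalg.norm(P, 'fro') ** (1.0 / N)   # ||M^N||_2^(1/N)
    operator = np.linalg.norm(P, 2) ** (1.0 / N)        # ||M^N||_infty^(1/N)
    print(N, frobenius, operator, rho)
```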

Why complex numbers and quaternions?

While it takes more time to compute MPO word pre-embeddings for the complex numbers and for the quaternions, the complex and quaternionic MPO word pre-embeddings have some advantages over real MPO word pre-embeddings. It is currently unclear as to whether the advantages of complex and quaternionic MPO word pre-embeddings outweigh the cost of the increased computational complexity of training and using complex or quaternionic word pre-embeddings. Further research is needed on this topic.

Complex and quaternionic matrices provide new ways of testing MPO word pre-embeddings which are not available if we simply used real matrices. For example, if $f$ is a complex or quaternionic MPO word pre-embedding, then the existence of a unitary matrix $U$ where $Uf(a)U^{*}$ is a real matrix for every token $a$ should be considered evidence that the MPO word pre-embedding $f$ is a high quality machine learning model, while the non-existence of such a matrix $U$ should be considered evidence against this.

A complex or quaternionic MPO word pre-embedding  will typically have a higher fitness level than a corresponding real MPO word pre-embedding.

Sometimes, when one trains an MPO word pre-embedding twice to obtain two MPO word pre-embeddings $f,g$, the fitness levels are equal ($F(f)=F(g)$), and it is desirable to have equal fitness levels when training the word pre-embedding multiple times. But the probability of obtaining the equality $F(f)=F(g)$ will depend on the choice of $K$. In many cases, we would have

$$P_{\mathbb{R}}(F(f)=F(g))<P_{\mathbb{C}}(F(f)=F(g))<P_{\mathbb{H}}(F(f)=F(g))$$

for the three choices of $K$. Of course, this probability depends on other factors besides the choice of division ring $K$, such as the initialization. For example, if the matrices of the form $f(a)$ are initialized to have random positive entries, then the probability $P(F(f)=F(g))$ would be much greater than if the matrices of the form $f(a)$ are initialized to have random real entries; if each $f(a)$ initially has random positive entries, then I would currently consider the real-valued MPO pre-embeddings to be about as good as the complex-valued MPO pre-embeddings but easier to compute.

The fitness function $F$ is not differentiable everywhere, and the gradient of $F$ has singularities. The singularities of the gradient of $F$ have real codimension $\dim_{\mathbb{R}}(K)$. In the case when $K=\mathbb{R}$, the singularities of the gradient of $F$ disconnect the domain $D_{A,d,K}$, and gradient ascent has difficulty crossing these singularities to reach a good local maximum.

Disadvantages of MPO word pre-embeddings:

  1. Projectivity: If $R$ is a ring, then the center of $R$ is the collection $Z(R)$ of all elements $c\in R$ where $cr=rc$ for all $r\in R$. It is easy to show that $Z(R)$ is always a subring of $R$. We observe that $Z(M_{d}(K))=\{cI_{d}:c\in Z(K)\}$. If $c_{a}\in Z(K)$ with $|c_{a}|=1$ for each $a\in A$, and $g(a)=c_{a}f(a)$ for each $a\in A$, then $f$ is an MPO word pre-embedding if and only if $g$ is an MPO word pre-embedding. This means that after one trains an MPO word pre-embedding $f$, one needs to find the constants $c_{a}$ where the function $g$ defined by $g(a)=c_{a}f(a)$ behaves optimally. I may talk about how we can find these constants in another post.
  2. Locality: An MPO word pre-embedding is good at relating tokens to their immediate surroundings in the following sense. Suppose $f$ is an MPO word pre-embedding for the string $a_{1}\dots a_{n}$. Then the value of $f(a)$ will be determined mostly by all the other values $f(b)$ for $b\in A\setminus\{a\}$ together with the multiset of all short substrings $a_{i-r}\dots a_{i+r}$ of the corpus such that $a_{i}=a$. In other words, $f$ can see the immediate surroundings of each occurrence of the token $a$, but $f$ will not be able to see much more than this.
  3. Difficulty utilizing all dimensions without noise: Sometimes MPO word pre-embeddings behave poorly because it will be difficult to maximize the spectral radius while we still have $\sum_{a\in A}f(a)^{*}f(a)=\frac{|A|}{d}I_{d}$. To ameliorate this problem, we can relax this requirement on $f$ to the weaker condition $\sum_{a\in A}\|f(a)\|_{2}^{2}=c$ for some constant $c>0$. But in this case, the MPO word pre-embedding will not use all of the dimensions in $K^{d}$ evenly.

Conclusion:

Spectral methods have already been quite valuable in machine learning for a good reason. I will continue to make more posts about using the spectral radius (and similar objects) to construct machine learning models. 

Another possible advantage of quaternions (added 9/8/2023):

It seems like an increase in the value of $d$ provides diminishing returns in the performance of MPO word embeddings, since when $d$ is large, MPO word embeddings have difficulty utilizing all $d$ dimensions equally. MPO word embeddings therefore can only give us some of the information about a token. However, it seems like increasing the index $\dim_{\mathbb{R}}(K)$ (which is $1,2,4$ for $K=\mathbb{R},\mathbb{C},\mathbb{H}$ respectively) increases the number of parameters of an MPO word embedding without the word embedding encountering substantially more difficulty in utilizing all $d$ dimensions. Therefore, MPO word embeddings should perform better simply by increasing the index $\dim_{\mathbb{R}}(K)$. This means that complex MPO word embeddings should perform better than real MPO word embeddings, and quaternionic MPO word embeddings should perform better than complex MPO word embeddings. On the other hand, we still need to perform experiments to determine whether complex MPO word embeddings really are that much better than real MPO word embeddings and whether quaternionic MPO word embeddings are better than both.

References:

  1. Fuzhen Zhang. Quaternions and matrices of quaternions. Linear Algebra and its Applications, 251 (1997), 21-57.
  2. Shayne F. D. Waldron. An Introduction to Finite Tight Frames. July 26, 2017.


 

Comments:

this does look like it might be interesting, but I think you might need to show - possibly visually - why this works. what does an embedding of a 5-word dataset look like on a chart? how does one interpret it on that chart? why do these expressions map to that chart, etc? that would allow introducing the math you're using to anyone who doesn't know it or is rusty, while reducing effort to follow it for those who don't know it. If you're only intending to communicate to those who already get the prereqs, which is a thing people do, then, well, post more posts like this and I'm sure someone who has the particular math background (quaternions?) will run into them eventually. it looks like the math is all here, just too dense for my current level of motivation, so maybe you just need the right eyes.

I personally can't evaluate your idea within the time I would allot to reading this post because you use a lot of expressions I'm not immediately familiar with and I don't see a way to shortcut through them in order to draw the conclusions about improved interpretability you're implying. but it does seem conceivable that it could be pretty cool. I can imagine why having explicitly disambiguated word senses would be useful, if you could get your embedding to be sturdy about them.

I appreciate your input.  I plan on making more posts like this one with a similar level of technical depth. Since I included a proof with this post, this post contained a bit more mathematics than usual. With that being said, others have stated that I should be aware of the mathematical prerequisites for posts like this, so I will keep the mathematical prerequisites in mind.

Here are some more technical thoughts about this.

  1. We would all agree that machine learning interpretability is quite a difficult problem; I believe that solving it requires not only better interpretability tools but also machine learning models that are themselves more inherently interpretable. MPO word embeddings and similar constructions carry a little bit (but not too much) of up-front difficulty since one needs to get used to different notions. For example, if we use neural networks with ReLU activation (or something like that), then one has less difficulty up front, but when it comes time to interpret such a network, the difficulty increases, since neural networks with ReLU activation do not seem to have the right interpretability properties, so I hesitate to interpret neural networks. And even if we do decide to interpret neural networks, the interpretability tools that we use may have a more complicated design than the networks themselves.
  2. There are some good reasons why complex numbers and quaternions have relatively little importance in machine learning. And these reasons do not apply to constructions like MPO word embeddings.
  3. Since equal norm tight frames are local minimizers of the frame potential, it would help to have a good understanding of the frame potential. For simplicity, it is a good idea to only look at the real case. The frame potential is a potential for a force between a collection of particles on the sphere $S^{d-1}$ where particles are repelled from each other (and from each other's antipodal point) and where the force tries to make all the particles orthogonal to each other. If $n\leq d$, then it is possible to make all of the particles orthogonal to each other, and in this case, when we minimize this potential, the equal norm tight frames will simply be orthonormal bases. In the case when $n>d$, we cannot make all of the particles orthogonal to each other, but we can try to get as close as possible. Observe that unlike the Newtonian and logarithmic potentials, the frame potential does not have a singularity when two particles overlap. I will leave it to you to take the gradient (at least in the real case) of the frame potential to see exactly what this force does to the particles.
  4. Training an MPO word embedding with the complex numbers or quaternions is actually easier in the sense that for real MPO word embeddings, one needs to use a proper initialization, but with complex and quaternionic MPO word embeddings, an improper initialization will only result in minor deficiencies in the MPO word embedding. This means that the quaternions and complex numbers are easier to work with for MPO word embeddings than the real numbers. In hindsight, the solution to the problem of real MPO word embeddings is obvious, but at the time, I thought that I had to use complex or quaternionic matrices.
  5. I like the idea of making animations, but even in the real case where things are easy to visualize, the equal norm tight frames are non-unique and they may involve many dimensions. The non-uniqueness will make it impossible to interpret the equal norm tight frames; for the same reason, it is hard to interpret what is happening with neural networks, since if you retrain a neural network with a different initialization or learning rate, you will end up with a different trained network, but MPO word embeddings have much stronger uniqueness properties that make them easier to interpret. I have made plenty of machine learning training animations and have posted these animations on YouTube and TikTok, but it seems like in most cases, the animation still needs to be accompanied by technical details; with just an animation, the viewers can see that something is happening with the machine learning model, but they need both the animation and the technical details to interpret what exactly is happening. I am afraid that most viewers just stick with the animations without going into so many technical details. I therefore try to make the animations more satisfying than informative most of the time.

Is the Latex compiling here?

I made the Latex compile by adding a space. Let me know if there are any problems.
