What does "matrix multiplication" usually mean? (I know, context.)

I had previously often assumed that it means a matrix to matrix operation, but I now think that it almost never does, but instead it usually means matrix to vector multiplication.

In scientific computing it has to do with solving systems of linear equations.

In AI inference it has to do with weight to activation multiplication.

When I read journalistic texts about special hardware used for Graphics, AI, Simulation, HPC they usually just write that the HW does "matrix multiplications".

Then the term tensor is used too. As an EE I know vector fields where a 4D input vector (place and time) is transformed into a 3D vector of complex numbers (phasor, Poynting vector) output. I suppose that is a tensor. But I do not understand what a tensor is in AI.

So, in what context do they mean what and does the HW actually have to deal with two matrices or one or is it actually just loads of dot products?

New Answer
Ask Related Question
New Comment

3 Answers sorted by

This question has a bunch of different angles:

  • What are the typical problems bundled under "matrix multiplication"?
  • What are the core algorithms for those problems, and where do they overlap?
  • How does special hardware help?
  • Why "tensor" rather than "matrix"?

The first two angles - problems and algorithms - are tightly coupled. Typical matrix problems we want to solve as an end-goal:

  • dot a matrix with a vector (e.g. going from one layer to another in a neural net)
  • solve Ax = b (e.g. regressions, minimization)
  • dot a sparse/structured matrix with a vector (e.g. convolution)
  • solve Ax = b for sparse/structured A (e.g. finite element models)

Both types of dot products are algorithmically trivial. Solving Ax = b is where the interesting stuff happens. Most of the heavy lifting in matrix solve algorithms is done by breaking A into blocks, and then doing (square) matrix multiplications on those blocks. Strassen's algorithm is often used at the block level to improve big-O runtime for large matrix multiplies. By recursively breaking A into blocks, it also gives a big-O speedup for Ax = b.

I'm not a hardware expert, but my understanding is that hardware is usually optimized to make multiplication of square matrices of a certain size very fast. This is largely about caching - you want caches to be large enough to hold the matrices in question. Fancy libraries know how large the cache is, and choose block sizes accordingly. (Other fancy libraries use algorithms which perform close-to-optimally for any cache size, by accessing things in a pattern which plays well with any cache.) The other factor, of course, is just having lots of hardware multipliers.

Solving Ax=b with sparse/structured matrices is a whole different ballgame. The main goal there is to avoid using the generic algorithms at all, and instead use something fast adapted to the structure of the matrix - examples include FFT, (block-)tridiagonal, and (block-)arrowhead/diagonal-plus-low-rank matrix algorithms. In practice, these often have block structure, so we still need to fall back on normal matrix-multiplication-based algorithms for the blocks themselves.

As for "matrix" vs "tensor"... "tensor" in AI/ML usually just means a multidimensional array. It does not imply the sort of transformation properties we associate with "tensors" in e.g. physics, although Einstein's implicit-sum index notation is very helpful for ML equations/code.

Matrix multiplication means multiplying matrices.

A vector can be viewed as a particular sort of a matrix, with one dimension equal to 1. So matrix-vector multiplications are a special case of matrix-matrix multiplications.

A tensor is a possibly-higher-dimensional generalization of a matrix. A scalar is a rank-0 tensor, a vector is a rank-1 tensor, a matrix is a rank-2 tensor, and then there are higher ranks as well.

In actual mathematics, vectors and tensors are not mere arrays of numbers; they are objects that live in "vector spaces" or "tensor products of vector spaces", and the numbers are their coordinates; you can change coordinate system and the numbers will change in certain well-defined ways. But when e.g. Nvidia sell you a GPU with "tensor cores" they just mean something that can do certain kinds of matrix arithmetic quickly.

In e.g. one version of Google's TPUs, there's a big systolic array of multiply-accumulate units, which is good for dot-product-like operations, and you program it with instructions that do things like an Nx256-by-256x256 matrix multiplication, for whatever value of N you choose. If you need to handle arrays of different sizes, you'd build the calculations out of those units.

An example of a systolic algorithm might be designed for matrix multiplication. One matrix is fed in a row at a time from the top of the array and is passed down the array, the other matrix is fed in a column at a time from the left hand side of the array and passes from left to right. Dummy values are then passed in until each processor has seen one whole row and one whole column. At this point, the result of the multiplication is stored in the array and can now be output a row or a column at a time, flowing down or across the array.

If you do a matrix multiplication the obvious way, this results in dot products of rows and columns (one for each element of the resulting matrix). So it seems to me that improving matrix to matrix multiplication performance comes from improving the performance of dot products.

This seems like a decent explanation of Hardware Matrix Multiplication, even if it lacks concrete sources.

As for a tensor, I think these references explain it better that I can at my current level. But the intuition is that it's a generalization of a matrix to high-dimensions, with additional properties when transformed.

To me there is a difference between the hardware for 1xN by Nx1 and MxN by NxM (with N > M > 1). Although any matrix operation is many 1xN by Nx1 dot products, doing them independently would be inefficient.
"If you do a matrix multiplication the obvious way, this results in dot products of rows and columns (one for each element of the resulting matrix). So it seems to me that improving matrix to matrix multiplication performance comes from improving the performance of dot products."
True, but not individual dot products, but the collective of very many dot products. Obviously you do not do it the obvious way as you would have to load the same data over and over again.

2 comments, sorted by Click to highlight new comments since: Today at 4:40 AM

The entire SIMD vector approach is good for many dot products but it is not the same as a systolic array for rank two on rank two multiplication.

If the job would be to multiply two 1024x1024 matrices then a systolic array of 256x256 MACs would be a good choice. It would work four times on 256x1024 by 1024x256 matrices for 1024+256 steps.

Because the SIMD approach is bad for 2D on 2D matrix multiplication NVIDIA has introduced:

Tensor Cores in the Volta architecture.

Article about it: