This question has a bunch of different angles:

- What are the typical problems bundled under "matrix multiplication"?
- What are the core algorithms for those problems, and where do they overlap?
- How does special hardware help?
- Why "tensor" rather than "matrix"?

The first two angles - problems and algorithms - are tightly coupled. Typical matrix problems we want to solve as an end-goal:

- dot a matrix with a vector (e.g. going from one layer to another in a neural net)
- solve Ax = b (e.g. regressions, minimization)
- dot a sparse/structured matrix with a vector (e.g. convolution)
- solve Ax = b for sparse/structured A (e.g. finite element models)

Both types of dot products are algorithmically trivial. Solving Ax = b is where the interesting stuff happens. Most of the heavy lifting in matrix solve algorithms is done by breaking A into blocks, and then doing (square) matrix multiplications on those blocks. Strassen's algorithm is often used at the block level to improve big-O runtime for large matrix multiplies. By recursively breaking A into blocks, it also gives a big-O speedup for Ax = b.

I'm not a hardware expert, but my understanding is that hardware is usually optimized to make multiplication of square matrices of a certain size very fast. This is largely about caching - you want caches to be large enough to hold the matrices in question. Fancy libraries know how large the cache is, and choose block sizes accordingly. (Other fancy libraries use algorithms which perform close-to-optimally for any cache size, by accessing things in a pattern which plays well with any cache.) The other factor, of course, is just having lots of hardware multipliers.

Solving Ax=b with sparse/structured matrices is a whole different ballgame. The main goal there is to avoid using the generic algorithms at all, and instead use something fast adapted to the structure of the matrix - examples include FFT, (block-)tridiagonal, and (block-)arrowhead/diagonal-plus-low-rank matrix algorithms. In practice, these often have block structure, so we still need to fall back on normal matrix-multiplication-based algorithms for the blocks themselves.

As for "matrix" vs "tensor"... "tensor" in AI/ML usually just means a multidimensional array. It does not imply the sort of transformation properties we associate with "tensors" in e.g. physics, although Einstein's implicit-sum index notation is very helpful for ML equations/code.

An example of a systolic algorithm might be designed for matrix multiplication. One matrix is fed in a row at a time from the top of the array and is passed down the array,

the other matrixis fed in a column at a time from the left hand side of the array and passes from left to right. Dummy values are then passed in until each processor has seen one whole row and one whole column. At this point, the result of the multiplication is stored in the array and can now be output a row or a column at a time, flowing down or across the array.https://en.wikipedia.org/wiki/Systolic_array