## Sunday, October 09, 2022

### Will Strassen's Matrix Mult Alg ever be practical?

All time bounds are asymptotic and really O-of.

Recall that Strassen found a clever way to multiply  two 2x2 matrices with 7 mults (and lots of adds)  leading to a matrix mult alg in n^{\log_2 7} = n^{2.87...}

Recently (see here) a deep-mind-AI found a way to multiply  two 4x4 matrices with 47 mults (and lots of adds) leading to a matrix mult alg in n^{\log_4 47} = n^{2.777...}. NOTE ADDED: The new algorithm only works over GF for 4x4 matrices.

Much better is known, see our blog posts here and here.

The more advanced algorithms are complicated and have large constants so will never be practical. But Strassen's result, and now the new algorithm, SEEM to me they could be practical.

(ADDED LATER- many of the comments inform me that Strassen IS practical and IS being used. Great! Now we know!)

Thoughts about Strassen that also apply to the  new algorithm.

1) n has to be large for Strassen to given an improvement. But as we deal with larger data sets the value of n is getting larger.

2) People are mostly interested in sparse matrices for which there are better methods. I've heard that for a while- but is it still true? I thought ML used dense matrices.

3) Strassen is hard to code up. Actually it doesn't look that hard to code up. However, I have never tried to code it up, so maybe there are subtle points there.

4) Strassen only works on matrices of size 2^n x 2^n. You can pad matrices out but that might kill whatever time advantage you get. (The new alg only works on  4^n x 4^n).

5) Strassen uses recursion and there is the hidden cost of recursion. I think that is a think of the past and our younger readers do not know what I am talking about.

6) (This is obvious) the recursion would only go down to a certain level and THEN you would use ordinary Matrix Mult. This may also add time.

I suspect that 2 and 4 are the most important reasons Strassen (or the new algorithm) is not practical BUT I would like to hear your thoughts?

Does any package NOW use Strassen's Algorithm?

Side Note: I like to ask students if they think there is a better-than-cubic algorithm for Matrix Mult. They do not. Then I show it to them and tell them THIS is why LOWER BOUNDS are hard. You have to show that NO, nobody clever will find a trick you hadn't thought of.

1. About #2: Сonvolutional layers in CNN models requires multiple multiplications of matrices of fixed size, this size if defined by kernal size of convolutional layer, typically 3x3, 5x5 or 7x7. And there is no sparsity in it. So the algorithm like Strassen (or the new one) could give a noticible speed-up for CNN inference engines.

2. Strassen's algorithm is routinely used in Computer Algebra; see for instance . There is quite a bit of work on floating-point matrix multiplication using Strassen's algorithm as well (e.g., ), but I don't know if it has made its way into standard packages.

3. Does your "Much better is known" comment and reference mean that the New Scientist article is technically incorrect???

1. The article said that Strassen is the most efficient on most matrix sizes''- I think that can be interpreted correctly as that Strassen works well on matrices of sizes actually encounters. The algorhtms that are much better' that I refer to would only be better for matrices of galactic size.

2. Sorry: What I was trying to ask without grinding my usual axe is: what's the story with the algorithm from the Google AI group? Is it really new? Is it really faster?

3. The fact that you can multiply two 4x4 matrices with 47 mults is NEW. And it was discovered by a program which is interesting.

This leads to a Strassen-like Matrix Mult algorithm algorithm that runs in time n^{2.777...} steps. This does NOT beat the current best-known bounds; however (informally) it is the best SIPLE algorithm known for time. Is it faster in practice? From the comments that tell be Strassen IS used and IS faster than the standard algorithm, its POSSIBLE that the new algorithm is better than Strassen. But since it was just discovered recently I doubt its been tried yet.

4. 4. Your initial assumption is plain wrong! Strassen algorithm IS used, for years, in software libraries. There are two world though: In exact computation (over the integers, finite fields, etc.) Strassen is routinely used, cf Flint, FFLAS-FFPACK, M4RI, etc. In floatint-point computation, Strassen's algorithm may suffer from numerical instability so it is probably less used, but some libraries use it.

This misconception about Strassen's algorithm is very common, I do not know why...

5. For years in my algorithms class, I have given a homework assignment to find the cut-off point in recursion when one should switch from Strassen's algorithm on powers of 2. In languages like C, students would find the cross-over point at 32 or 64 depending on their coding efficiency. More recently in Java with object-oriented programming where matrices are represented as a pointer to a vector of pointers to column or row vectors, the scale and variation of student answers is much larger (128 to 1024 for the cross-over). Strassen's 2008 Knuth Prize citation (https://www.acm.org/media-center/2008/october/acm-sigact-2008-knuth-prize-recognizes-strassen-for-contributions-to-efficient-algorithm-design) you will see the sentence: "It is an intricate yet simple algorithm that remains the method of choice for multiplying dense matrices of size 30 by 30 or more on machines today."

6. (This is a response to all of the comments that pointed out that YES Strassen IS practical and IS used.) THANKS for the info- I should have phrased the post as QUESTIONS is Strassen practical? Is Strassen used?' rather than presuppose that it is not practical or used yet.

B above asked the question of WHY is the misconception about Strassen so widely held? Some speculation and question: WHEN did Strassen begin being practical? Used? I suspect there was a long time when it was NOT practical or used and some people (I am one of them) do not update their `everyone knows X' part of their brain often enough. And the DIVIDE between theory and practice and the lack of communication between the two is also a factor.

Two more questions
1) I thought that Strassen only working on 2^n x 2^n would be a big problem. Why isn't it?

2) Is the new algorihtm enough better than Strassen to replace it i systems? To be used in new systems?

7. Strassen wasn't practical on computers when it was discovered/invented. That's because the computers of the late 60s and early 70s had very limited support for recursion, and it often incurred a major performance hit as well. To borrow a useful term, highly recursive, memory intensive algorithms were "Galactic" on such hardware.

Basically, heavily recursive code would require more RAM at peak use that those early machines had, because the resulting execution stack for the program would grow too deep. Remember, each stack frame of the program stack has a minimum (usually fixed) size, each recursive invocation grows the stack by 1 frame, and these machines had RAMs measured in Kilobytes. Even modestly stateful recursive code could easily "blow the stack".

On many machines there were various tactics for getting your program to work despite the highly limited RAM. But most of these involved refactoring to avoid recursion, and breaking up your program so that only portions of it ever had to be in memory at any given time. Strassen was highly nontrivial to implement at all in such environments, and even if you did (or could), the resulting code wouldn't actually be faster for the inputs the machines could handle.

To borrow a useful term, Strassen's algorithm was "Galactic" when he invented it. And that initial impression apparently persists even decades after it became widely used in production software. It's a great example of how an algorithm's practicality isn't just determined by its inherent space or time complexity, but also the strengths & limitations of the available computers that might run it.

8. My understanding is that 4-by-4 result holds only for fields of characteristic 2. This is an interesting result, but not as general as Strassen or other well-known matrix multiplication algorithms.

9. 1) I thought that Strassen only working on 2^n x 2^n would be a big problem. Why isn't it?

→ I do not really know, but since Strassen's alg really provides some asymptotic improvement, the factor 2 lost by padding with zeroes is maybe not too expensive. Another possibilities is to track down these zeroes to avoid useless computations at the leaves of the recursion tree. (Note that in practice, computation are reverted to classical mult. for small matrices.)

2) Is the new algorihtm enough better than Strassen to replace it i systems? To be used in new systems?

→ The first big caveat is that the new alg. is for GF(2) only. And nobody's really interested in that case, or use different techniques that are not Strassen-like. So I suppose it won't be used. Yet I know for sure that some people working in this area have a look at the new alg to check whether standard techniques may extend it to other field (for instance GF(2^k)) where it may be interesting.

- For the misconception, you may be right that it used to be unpractical (but I was too young!) and that people did not update their knowledge on this. Also, Strassen alg is complicated for numerical computations, this may explain also: If the alg is mostly used in exact computations, many people won't be aware of it. Finally, I've heard that Knuth said (or wrote?) at some point that Strassen alg. is and will remain unpractical. And it is hard to argue against him ;-).
- You write in an answer to DJL that "[the new alg] was discovered by a program which is interesting." I agree that it is interesting. Though it is a pity that the authors suggest that it be the first (or one of the first) time this happens. In this area precisely (small matrix multiplications) and in other, this is routinely done! For instance, this catalogue has a lot of computer-invented algorithms for matrix multiplication: https://fmm.univ-lille.fr/.

(A general impression on the "new" result: There are some quite interesting finding, but not that impressive (cf. the improvement in two days by humans). The technique of deep learning for this task is new, but other computerized methods already existed and the new technique is not really better. It may be even worse: Older methods with the same computational power as DeepMind has may provide better results! I cannot understand why this deserved a paper in Nature...)

1. Fawzi et al.'s new algorithm does work in GF(2^k). In fact it works in any ring of characteristic 2 as it just relies on the property 1+1=0.

10. (I was the previous anonymous.) After more thoughts and digging in the literature and code, a better answer for the problem of 2^n×2^n, which is actually not a problem at all! There are several ways to deal with it:
- standard padding, as mentioned before, and this sort of works;
- virtual padding, introduced in the 1990's, where the padding is not really done, but some data structure is responsible for it → this also sort of works;
- dynamic peeling: each time you have to split your matrix in four, if n is odd you ignore the last rows and columns, and compute the missing part "by hand";
- static peeling (which seems to be the best): for an instance of size n×n, you remove the last columns and rows to get a power of two, and then you compute the missing part (note that you can again do the same thing on this missing part).

11. I believe that Strassen's algorithm is of interest, including in practice, due to its cache-friendly properties, that is, it has better locality of memory accesses. A co-worker building a system used Strassen's algorithm for this reason, to speed up his code. There was (independently) a theoretical analysis, in the framework of cache-oblivious algorithms. This may be the right reference.

Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999, October). Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039) (pp. 285-297). IEEE.

12. For the definitive recent paper on the practical implementation of Strassen's algorithm, see https://scholar.google.com/citations?view_op=view_citation&hl=en&user=62GM5XcAAAAJ&cstart=20&pagesize=80&citation_for_view=62GM5XcAAAAJ:dYRx7efp7U0C

As noted in the abstract:
'We dispel with “street wisdom” regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on …'