## Wednesday, May 13, 2009

### Shaving Logs with Unit Cost

A paper in POPL 08 claims to have broken the long-standing n3 barrier on a problem in programming languages, bringing it down to n3/log n. The algorithm has a lot of details, but the main pillar it stands on is the so-called 'fast set' data structure.

This data structure is used to represent sets as bitmaps of its elements (so a set {1,2} over the domain 0-7 is represented as 01100000), and supports three operations, where one is important for my question: set difference. Set difference can be performed by going through the bits of each set. Here's the claim I have a question about: The author says, assuming an architecture with word size Θ(log n) bits, we can split the bitmaps of each set into (n/log n) chunks of size (log n) bits each, and since the word size is Θ(log n), these chunks will be stored in a word each, and operating on each word is a constant. Hence, set difference can be performed in Θ(n/log n) time. This is the whole gist of the algorithm to break the previous known barrier.

I was appalled by this argument, since my primitive techniques would easily 'prove' that set difference has a lower bound Ω(n), since you have to 'touch' each element of the set. I contacted the author, and he said assuming this word size is standard in algorithms literature, such as saying that depth-first search takes O(e+v) time ignores the log v time needed to spend to retrieve each vertex.

I talked to a few people in my department with varied responses, some said it is cheating since algorithms such as DFS do not depend on the logarithmic word size, it is just ignored; and some said it is probably a good argument.

The simple question is, what do you think as a complexity researcher? Is the complexity given a valid one? I am worried that you will say no, because a lot of people are citing this paper and everybody thinks the n3 barrier for that problem has been broken. I am also worried that you will say yes, as a person not particularly in the complexity field, but thinks he has a decent knowledge and sense of algorithms.

I remember complaining as a student that sorting actually takes O(n log2 n) time since one needs O(log n) time to do a comparison but my complaints fell on deaf ears.

In algorithms you use the unit-cost RAM model where basic register operations over O(log n) bit registers count as a single computation step. There are some good arguments for this: As technology improves for us to handle larger input sizes, the size of the registers tend to increase as well. For example, registers have grown from 8 to 64 bits on microprocessors over the past few decades.

You can do set difference of two register bitmaps A and B by the bitwise AND of A with the bitwise complement of B, all standard register operations. So the author has a legitimate O(n3/log n) time algorithm in the standard algorithmic unit-cost model.

I understand the student's frustration. The algorithm seems like a cheat and would not be in DTIME(O(n3/log n)) on a multi-tape Turing machine as I teach it in an introductory complexity course. But as an algorithmic result on the standard algorithmic model, the author stands on solid ground.

1. "I remember complaining as a student that sorting actually takes O(n log2 n) time since one needs O(log n) time to do a comparison but my complaints fell on deaf ears."

I had the same problem. Suppose you have n words of size k over an alphabet of size 2, how many bit comparisons do you need to sort the n words ?
Do someone have a reference ?

2. An other question that popped into my mind is whether the (n^3)/log n really breaks the n^3 barrier. Isn't it asymptotically n^3 anyway? It is definitely not the same as breaking the n^3 barrier in the same sense as a n^{3-e} would. Please correct me if I am wrong.

3. The problem is that (esp. for parallel) there is no one good machine model. So you have all of this "venue shopping" where you get a slightly better run time by assuming (and then exploiting) a slightly more powerful machine model.

I thought Pointer Machines were one attempt to get away from this (constant size arithmetic but ability to access arbitrary far memory locations). But they are just one more venue to choose from.

4. The assumption that doing operations on words is constant allows us to improve on bounds on lots of algorithms with similar tricks. (Also, making the assumption that these operations do count against your running time slows nearly all algorithms). I believe this is a deeply philosophical argument. It comes down to how we want to model the world. We could also assume that all operations, along with, the word operations are not constant. In that world we'd be right back at at least an n^3 algorithm, if not much worse. It's right along with the assumptions that doing basic integer operations, addition, subtraction, multiplication, and division, are also constant. Of course they aren't really constant! The 'mathematical modeling' is assuming they these operations are constant, and as such it allows you to get away with things like this. Tricks like this are a limitation of the 'time complexity model'. I think technically speaking, as long as he is clear about this, it is fine. However, I am not that impressed, and I think it would be more apt to call things like this 'engineering hacks' than anything else.

5. When measuring the complexity of algorithms using asymptotics, it doesn't make sense to me to have the size of the hardware be a growing function in the input length. We should first fix the hardware, and then look at the complexity for increasing input sizes.

If you do insist on the size of the hardware growing, then it seems arbitrary to assume that it grows as a logarithmic function of the input length. Why not assume that it grows as sqrt(n) ? That makes for an even better result. Perhaps the growth of the hardware should be parameterized... so in this case the algorithm has a n^3/H running time, where H is the hardware growth rate in terms of n.

6. We don't write actual programs on machines with single-bit operations. So, if you have a choice between a purer theory (single-bit operations) and a theory that more closely matches the performance of actual programs (unit cost for logarithmic or larger word size), which do you choose?

I can tell Lance's choice is the purer theory from the tone of his post, but my choice is the theory that actually models something useful. And while I respect his choice for his own research, I am appalled that Lance finds this style of theory appalling in the research of others.

Here's an analogy. If one is a mathematical purist, one should only do theory using finite state machines, because all of our actual machines really do have finite state. But the size of actual computers' memory and cache and external memory is large enough that for the purposes of practical programming it's more accurate to model them as a RAM: many important programming techniques are just not visible at the finite state level of modeling things. Similarly, I feel that, although the wordsize of any particular machine is obviously a constant, it works better to model it as logarithmically growing: this leads one to the study of bit-parallel programming techniques that really do lead to faster programs.

7. There are really two questions here:
(1) What is the "right" model for analyzing complexity of algorithms.
(2) Did the POPL paper "cheat" by using the model it did.

It should be clear that there is no single answer to the first question: the model to use depends on the problem as well as one's goal(s). The main thing is to be up-front about the model being used.

As for the second question, it all depends on what models previous work was using. If all previous work assumed constant-cost word operations, and still was not able to break the n^3 barrier, than the paper is an improvement. If previous work measured complexity in bit operations, and this paper beats the bound by changing models, that is a bit of a cheat. (The result could still be interesting, but I wouldn't call it "breaking a barrier".)

8. I guess on more careful reading that it was Lance's student, not Lance himself, who was appalled by the unit cost model. So sorry, Lance, for misattributing that quote.

Jonathan makes a good point: whether the unit cost model is the right way to analyze an algorithm is a very different question from whether shaving a log by changing models makes for an interesting result.

9. This is an excellent example of how trivial details in our models make trivial differences in the running time ;-) This is exactly why complexity theorists should use a "software-level" approach and reject this kind of "improvement" as uninteresting.

-Cris

10. Perhaps the growth of the hardware should be parameterized... so in this case the algorithm has a n^3/H running time, where H is the hardware growth rate in terms of n.This already happens often, and usually "w" (for "word size") is used.

11. I don't have a strong opinion here, but one way for a 'purist' complexity theorist to come to grips with the unit-cost RAM model might be to make a finer-grained analysis of the operations that're being assumed constant-time.

At a first glance, it seems as though free compares, adds and perhaps multiplies are just stealing a log factor in the running time. But these aren't arbitrary operations taking polylog(n) time, they're somewhat nice ones. In particular, they're all in TC0, and I recall that at least some arithmetic operations are complete for TC0 (see Allender's survey article 'The Division Breakthroughs').

So, is unit-cost RAM complexity equivalent to multitape complexity with a TC0 function oracle for log-sized inputs? I'd have to think more carefully about it. But at least, this perspective suggests fast RAM-model algorithms are potentially useful in that they lead to greater parallelizability.

12. So, if you have a choice between a purer theory (single-bit operations) and a theory that more closely matches the performance of actual programs (unit cost for logarithmic or larger word size), which do you choose?More closely matches the performance of actual programs? Turing machines are a terrible model (neither realistic nor beautiful), what on earth leads to the logarithmic model for word size? Maybe there's some actual argument for it than I'm missing, but it sounds like the worst kind of nonsense.

(1) Word size has increased over time, but it changes with chronological time rather than problem size. It's not like having to solve a big problem magically makes your computer faster.

(2) For small problems, maybe you do have actual choice. (Do you want your embedded processor to be 8-bit or 16-bit?) However, this choice is determined by a cost-benefit analysis, taking into account hardware cost, power consumption, etc., not by taking the logarithm of your problem size.

(3) The use of the logarithm is ridiculous. As far as I can tell, it's just based on a vague reason like "well, it grows, but obviously not very quickly, so let's call it a log." There's nothing wrong with imprecise arguments, but you can't deduce precise conclusions from them. If you actually care about log factors in your running time, then all the log factors in your setup should be well justified (rather than ad hoc choices made arbitrarily).

Similarly, I feel that, although the wordsize of any particular machine is obviously a constant, it works better to model it as logarithmically growing: this leads one to the study of bit-parallel programming techniques that really do lead to faster programs.
I'd be shocked if this had anything to do with the actual use of a logarithm. At best, the relevance of the logarithm is because it happens to match real-world data: the base-2 logarithms of big problem sizes nowadays are between 32 and 64, so they fit nicely with 32 and 64-bit words. However, one data point does not determine a scaling law.

Anup's suggested to use paramatrized complexity should address all these issues. It avoids hard-coding arbitrary assumptions into the setup, while preserving (and even increasing) the ability to model the real world.

Does anyone know the answer to Jonathan's question about whether two different models are being compared here? I don't know, but based on a quick skim of the paper, I would guess yes. For example, the abstract says "All that is necessary for the new bound is a simple adaptation of a known technique." It appears that the known technique is indeed the use of unit-cost word operations of logarithmic size to shave a log factor off set operations. If this is the only contribution the POPL paper is making, then it's in no sense a genuine improvement.

13. Re "the use of the logarithm is ridiculous": it's justified by the fact that a machine that can't address all of its memory is ridiculous. So if you're going to do non-streaming algorithms, and assume that your problem fits into memory, you need a model with at least logarithmic word size.

You might as well say that assuming that we have an amount of memory that depends on the problem size is ridiculous; and yet, people do space complexity analysis all the time.

As for why logarithm rather than something even more quickly growing: it's the minimal necessary assumption. You could (and some people do) assume that the word size is a parameter rather than something that depends on the problem size; that makes more sense mathematically but it also complicates the comparisons of algorithms because the time bounds have more parameters in them.

14. "I remember complaining as a student that sorting actually takes O(n log2 n) time since one needs O(log n) time to do a comparison but my complaints fell on deaf ears. "

But if you say a comparison takes O(log n) steps, shouldn't you also be honest about your input size, and say that the input size is O(n log n) instead of O(n)? Then the algorithm still runs in time O(m log m), where m is the input size.

1. m should be the number of bits for largest number(absolute).

15. If you say a comparison takes O(log n) steps, shouldn't you also be honest about your input size, and say that the input size is O(n log n) instead of O(n)? Then the algorithm still runs in time O(m log m), where m is the input size.This is exactly why the word-RAM is such a good model.

16. Re "the use of the logarithm is ridiculous": it's justified by the fact that a machine that can't address all of its memory is ridiculous. So if you're going to do non-streaming algorithms, and assume that your problem fits into memory, you need a model with at least logarithmic word size.Thanks, I was forgetting about that, so it's at least better justified than I was thinking. On the other hand, in practice lots of big data sets don't all fit into RAM at once. So the assumption that of course the computer will have plenty of RAM to store everything and will have a processor architecture that can address all that RAM with a single word is itself dubious.

I think what upsets me about this sort of thing is that I don't really believe the minor increase in realism of the model improves the real-world applications. Instead, I think it's a function of the sort of researchers who do the work. Practically-minded researchers will come up with practical algorithms in just about any model. If the model has unrealistic aspects, they won't take advantage of them, because they always have in mind real computers (and the model is just a convenient formalization for describing their results, rather than their ultimate goal). By contrast, pure theorists can exploit just about any model to produce unrealistic results, since it's practically impossible to eliminate all loopholes.

Normally, this doesn't cause any problems. Some results are impractical but beautiful illustrations of fundamental principles, while others have immediate applications, and both types of papers are worthwhile. The problem occurs when someone mixes the two, for example by taking advantage of technical details in a model to claim a real-world improvement. That's often a sign of problems.

17. I think what upsets me about this sort of thing is that I don't really believe the minor increase in realism of the model improves the real-world applications.
The fastest general puropse sort algorithm in use today is based on such a model, so you might have to revisit your assumptions about how realistic this model is.

18. Which algorithm are you talking about. If it is general purpose, then it should normally be comparison based.

19. If one is a mathematical purist, one should only do theory using finite state machines, because all of our actual machines really do have finite state.This is nonsense. Mathematical purity has nothing to do with what exists or does not exist as machines. Any mathematical theorem is a statement about a consequence derived from a certain hypothesis. Which hypothesis you prefer is a matter of choice and has nothing much to do with mathematics. Mathematicians by training tend to prefer hypotheses which are simpler and elegant. In the context of complexity theory, a natural choice would be to consider computation over different structures -- finite fields, real numbers, complex numbers etc. -- and prove theorems in each case -- this is the crux of the Blum-Shub-Smale philosophy. What actually models real world computation is another matter of course -- but an extra-mathematical one.

20. Which algorithm are you talking about. If it is general purpose, then it should normally be comparison based.
Non-numerical/non-alphanumerical keys are the most common sorting operation. Sacrificing performance for those cases is the theoretical assumption of little relevance, not the other way around.

21. My comments was not about theoretical considerations. If you take the c++ standard library, the general purpose algorithm is introsort which is a hybrid between quicksort(best average performance) and heapsort (avoid n^2 worst case performance of quicksort). The standard c library implementations generally use quicksort (GNU implementation uses mergesort to avoid worst case scenario of quicksort).

22. To remove any confusion, i am not Anon 16, my first comment was 18.

23. Logarithms often come and go when you switch machine models. This often leads to people wasting time removing logarithms when the problem would be completely different in a more realistic model.

The PRAM model assumes that memory can be accessed in constant time but no other operations can. In practice prefix scan operations can be done as efficiently as memory access and algorithms using them as primitives are often simpler and more efficient. For example see Guy Blelloch's 1989 paper "Scans as Primitive Parallel Operations" available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=42122
.

For example, there was a long series of papers in the 80s in good conferences on finding connected components of undirected graphs in parallel. They gradually eliminated a log factor from the runtime and switched from CRCW (concurrent) PRAM to EREW (exclusive) PRAM. If I recall correctly all of these papers turned out to be pretty useless since the same runtime can be achieved very simply using scans.

I would argue that the problem was not the PRAM model per se, but our obsession with tightness. Striking a logarithm from runtime, or improving the constant in a constant factor approximation algorithm, is valuable but should not be considered a major improvement sufficient for a slot in a top conferences such as STOC/FOCS. I especially dislike replacing a simple and elegant algorithm with a much more complicated algorithm that improves a log factor.

Let's concentrate our limited resources on determining the proper degree of the polynomial, not the exponentially less important task of removing logarithms.

24. More concretely, I would consider the interesting barrier (for the problem discussed in the post) to be Oh-tilde(n^3), not O(n^3), so the barrier was not broken.

25. Let's concentrate our limited resources on determining the proper degree of the polynomial, not the exponentially less important task of removing logarithms.
This depends on the problem. For instance, removing the log from the complexity of FFT will certainly be considered very interesting.

In a certain sense this whole thread is reflective of the tendency in TCS to overhype and oversell results. If the authors just claim that we made an improvement over existing results in this model for whatever it is worth, then there is no room for arguments (as long the proof is deemed correct).
The problem arises when one uses language such as "we broke this barrier" -- as if they broke the sound or light barrier or something. Language such as this should be used with utmost care (if at all) in serious scientific communication (especially of mathematical nature), but seems to be commonplace in TCS today. It is probably the effect of the "conference culture" -- a la Koblitz again.

26. If I recall correctly all of these papers turned out to be pretty useless since the same runtime can be achieved very simply using scans.Some of the ideas from the EREW undirected connectivity algorithms were quite elegant and inspired Trifonov's O(log n loglog n)space deterministic connectivity algorithm.

27. A time complexity estimate is only well-founded to the extent to which different computational models are equivalent to each other. Clearly, a wide variety of models, from a Turing machine with one semi-infinite tape at one end to other models that are much more efficient than that, are polynomial-time equivalent. This inspires the Church-Turing thesis, although ironically a quantum Turing machine is probably not equivalent and leads to the class BQP.

A smaller class of models are equivalent at the fast end, up to logarithmic time factors. The most common model at the fast end is a RAM machine, which I think is the same as a pointer machine. Another model which is so equivalent but much less widely cited is a tree-tape Turing machine. I think that it's a clever definition which is simpler than that of a RAM machine. And this is just for serial computation. For parallel computation things may get even more complicated, although I suspect that you can still obtain equivalence up to log factors.

Resolving the logarithmic factors is then model-specific. It is not really cheating to describe the model precisely, and then say that you can save a log factor in that model. It can be cheating to switch to a different model, to then save a logarithmic factor with an apples-vs-oranges comparison.

28. Ryan (#4) wrote: It's right along with the assumptions that doing basic integer operations, addition, subtraction, multiplication, and division, are also constant. Of course they aren't really constant! The 'mathematical modeling' is assuming they these operations are constant, and as such it allows you to get away with things like this.I suspect the right way to think about this is that there is some bit-level parallelism in the physical hardware, so yes, the set-difference operation can really be constant for inputs of size O(word length). And it's reasonable to model the word length as more than a constant, by 11011110's arguments. As for whether it's just a trick of them model: in principle you don't automatically get an improvement by changing the model ... it could well be that you need a smart algorithm in order to exploit this parallelism. Of course, that might not be the case in this paper, about which I know nothing.

Anon #16 wrote: On the other hand, in practice lots of big data sets don't all fit into RAM at once. So the assumption that of course the computer will have plenty of RAM to store everything and will have a processor architecture that can address all that RAM with a single word is itself dubious.I think 11011110's point still holds. Regardless of what sort of memory you're storing the input in (registers, cache, RAM, local disk, disks across the Internet...), you still need to be able to address it in a conveniently small way, since all the memory that anyone uses these days is random access. Maybe you store the address in two words instead of one, but that's still O(1) word operations.

29. Re Greg's "It can be cheating to switch to a different model, to then save a logarithmic factor with an apples-vs-oranges comparison." I don't see it as cheating to point out that another model is more appropriate and to find tighter time bounds in that model. If (as in the supposed example that started this post) the savings is just a matter of plugging in a trivial bit-parallel set intersection algorithm, then it's not a very interesting result, and the authors' hyperbole about breaking through barriers is unjustified, but it's still not cheating. And in other problems, saving the log can be much less a matter of plugging in known techniques to known algorithms from different models and much more a matter of novel algorithm design.

As for the people who still think Turing machines are a better model than RAMs: this is as baffling to me as the log-shavings must be to the complexity theorists. There are a million and one models of computation that are polynomially equivalent to Turing machines or RAMs or whatever. Why must you stick to a model that's so difficult to program, and that gives you the wrong exponent in your polynomials due to the model-specific work needed to move the head around? If mathematical cleanliness is what you're after, why not use a cellular automaton?

30. Arnold Schonhage of integer multiplication fame has an interesting argument that the log-cost RAM model is not the right model because there is a super-linear time lower bound for storing n bits in the model. See his July 1988 J.ACM paper: "A Nonlinear Lower Bound for Random-Access Machines under Logarithmic Cost".

31. I don't see it as cheating to point out that another model is more appropriate and to find tighter time bounds in that model.What I meant was, it's "cheating" if you obtain a logarithmic acceleration simply by switching to a model in which the acceleration essentially comes for free. Although what I mean by "cheating" is more just that such a result is lame.

It is true that it's fine to switch to a more appropriate model. But what does "more appropriate" even mean? I am happy to recognize any interesting-looking theorem as a work of mathematics, whether the model is appropriate or not. However, real computers are much more complicated than any idealized architecture which is suitable for a theorem.

A real computer architecture is vaguely equivalent, up to log factors, to a RAM machine. It's impossible to avoid overhead factors from considerations such as the address bus, the addressing wires in each memory chip, the distance from the memory chips to the CPU, etc. For a while these overhead factors are sort-of logarithmic. But obviously a very large computer would be more similar to a 3D cellular automaton, which is not polylog equivalent to a RAM machine.

You could even argue on the basis of heat exchange that large computers can't really have 3-dimensional connectivity; rather the effective connectivity has to be closer to 2-dimensional. Certainly that is a problem at the length scale of a CPU.

32. Dear Complexity Blog,

For the record, I did not make the above comment and I do not believe everything in it. Let me request that future Ryans put a last initial or something.

I would be remiss if I did not take the opportunity to also say something now about log factors. Please, shave them. (Within reason.)

Cheers,
ryan williams

33. As the author of the paper in question, let me clarify that the
"offensive" assumption can be avoided in this case, as in case of Four Russians and many related algorithms, using preprocessing/table lookup. The set difference operation can be split into operations on O(log n) sized sets, and in a preprocessing phase, the answers to these "small operations" can be computed for all possible arguments (all of this can be done in time less than the final runtime) and stored in a table in a sparse format (lists in this case). Now set difference over big sets just involves table lookup and list concatenation, and it all works out. Being standard, this implementation was omitted from the paper.

There are assumptions about table lookup here, but without these, the previous dynamic programming algorithms for the problem would run in O(n^3 log n) time. When I say the result is a "simple adaptation of a known technique," I do not mean it is just about changing models, but that a similar technique was used by a different researcher in a related problem. I'd have to be a much bigger moron than I actually am to write this paper otherwise.

Love and peace,
Swarat

34. In answer to the very first comment, O(nk) using radix sort, and this is optimal for the worst case.

35. Building a machine with register size that is log(number of particles in the universe) wouldn't be too hard. These register algorithms are very useful in practice  shaving an order of magnitude off of run times.

However, yes it always has bugged me that people drop log factors. I guess if you had them describe their algorithm on both an 8bit machine, and on a log(number of particles in the universe) machine there wouldn't be much confusion.

 http://blogofbrew.blogspot.com/2009/04/faster-k-mer-packing.html

36. However, yes it always has bugged me that people drop log factors.Does it bother you that they drop constant factors?

37. Yes, if it is a scientific code and not just a theoretical result.

Many times you run up against one algorithm which is O(n log n) and one that is O(n) with constants over one hundred times larger.

A colleague of mine has stressed for a long time that Log(particles in the universe) might as well be treated as a constant for real world applications.