Wednesday, July 07, 2010

Drowning in Data

At a CCC Council meeting last week, a common theme emerged. The Army is "swimming with sensors and drowning in data". Biologists are considering trashing some of the DNA sequences they have already acquired because it will likely be cheaper to resequence in the future maintain the current data. We've been collecting tons of information, what should we do with it and if we don't have the tools yet to deal with all this data, should we bother keep it?

This reminds me of one of my favorite homework problems in complexity: Given two log-space computable functions f and g (the read-only input tape and write-only output tape don't count for space), show that the composition f(g(x)) is also log-space computable. The direct approach doesn't work for you don't have the space to store g(x).

The solution is to recompute the bits of g(x) as you need them. The lesson: When storage is expensive, it is cheaper to recompute what you've already computed. And that's the world we now live in: Storage is pretty cheap but data acquisition and computation are even cheaper.

So who says complexity has nothing to say about the real world?


  1. I'm interested in learning more about the specific examples you gave about the army's sensors and biologists' DNA data... any chance you know a reference for either of these?


  2. I've been experiencing the opposite: storage is cheap, but computation is expensive. When you are dealing with massive data, the size of the data set is very often determined by the amount of computing power available for a certain price.

    With such data, a linear-time algorithm takes O(1) seconds to finish, while a quadratic-time algorithm requires O(n) seconds. But as computing power increases exponentially over time, the quadratic algorithm gets exponentially slower.

    These days, a quadratic algorithm is no longer efficient in any practical sense in massive data applications. An n log n time algorithm still is, as its running time increases linearly by time, and because log n is a moderate-sized constant in practice.

  3. But beware: if you need to recompute in order to figure out what else you wanted to recompute, etc. you can can only do O(1) levels of recursion before you run out of space in your logspace world! :-)

  4. As far as I understand, you don't have the issue right. The problem is not the storage (which is indeed cheap), the problem is deciding what to do with all that data, both on the computer-level (how to analyze it all?) and the human level (what to do with all the analyses being done?). (This is in addition to the time necessary to analyze it all, as pointed out by Jouni.) So I don't see how logspace algorithms help here.

  5. Jonathan Katz' point is spot-on. A lexical search of the Army/Marine counterinsurgency manual (titled FM 3-24: Counterinsurgency) discovers:

    • "narrative": 35 usages
    • "victory": 26 usages
    • "compute": 6 usages

    Hmmm ... it appears that to be useful, military data *must* be embedded in a winning narrative. Conversely, modern military strategy is all about the construction of winning narratives.

    Obviously, constructing good narratives is NP-hard ... `cuz heck, can't mathematical proofs be regarded as a special class of narratives, having particularly rigid constraints, and in consequence an exceptionally compelling logic?

    The point is that narratives don't just happen, rather they are any culture's most prized and carefully elaborated constructs. Heck, haven't many Lance's and GASARCH's recent posts mainly been about narrative too?

    By the way, this post is not written tongue-in-cheek ... FM 3-24 is an exceedingly serious document that is well-worth careful study.

  6. -the military issues I've heard of are more about bandwidth coming out of conflict areas than storage/compute.

  7. There are scenarios where recomputing is preferable to leveraging a cache, but in most cases there are more moving parts (usually more software subsystems involved) and, thus, lower reliability. Unless I had to, I would not recompute.

  8. What is the size of the full checks game tree? Around 10^20 nodes? Is there a zettabyte of hard drive space on the planet?

  9. I'm running into a similar situation right now. Interesting thing is that if you say storage is cheap you have your original data, the data that gets loaded up and the data that get generated and you need to back all of that up. It's not necessarily the data storage itself that's expensive (it can be) it's shuffling it around to back it up that is.

    If conversely you say computation is cheap and you aim to keep your original data and reload. You cut a whole lot of complexity and redundant storage of the same byte.