Friday, May 28, 2010

The Structure (or lack thereof) of Data

Many of the various "proofs" of P≠NP follow a common theme: Define an NP problem with a certain structure. Argue that any algorithm that solves that problem must work in a certain way and any algorithm that works that way must examine an exponential number of possibilities.

When you look at most common algorithmic problems, we state them as a structured object, say as a graph or a formula. But really every input is just a series of bits. Usually when we write an algorithm to solve a problem, the algorithm relies on the structure of the problem. But it doesn't have to, it could manipulate the bits in any way, shape or form. That's what makes separating complexity classes so tough: We can't assume anything about how an algorithm works. Algorithms can ignore the underlying semantic meaning of the input and focus on the syntactic part, the bits themselves. 

Over the past 10-15 years the machine learning community has had some significant success with statistical modeling where little focus on the actual structure of the data. Google recently released its Prediction API that does supervised learning (giving labelled examples) from unstructured data. Google gives the example of learning which language a text comes without telling the API that the data represents language samples. 

There is a constant back and forth in the machine learning community on how much to consider structure when developing learning algorithms. Nevertheless there are many examples of strong learning algorithms that make little or no assumptions on the structure of the data. In the theoretical CS community we haven't seen as many algorithms that work that way (its hard to do worst-cast analysis on such algorithms) but the possibility of such algorithms makes proving general lower bounds all that more challenging.


  1. A partial step would be to considered reasonable restricted models. This would alone will be interesting, e.g. if we have a good model capturing the intuitive notion of a dynamic algorithm and prove that it can not solve SAT in polytime, I would consider it an interesting partial step. These kind of results would show that some ingenuity is required in designing an algorithm (if there is one) for SAT, and would make it much easier to reject the claims from people with an algorithm showing P=NP.

    Would not these ML algorithms to be related to BPP rather than P?

  2. Anonymous: there are all sorts of such restricted models. For example, the model of communication complexity provides such a model. Assuming only oracle access to the input, where the oracle captures the intuitively interesting questions about the input, one can often show unconditional lowerbounds. For example, the problem of maximizing a submodular set function subject to a matroid constraint, using only value oracle access, can be shown to be unconditionally hard. There are many other examples, I believe.