I have been asked to expand upon the spam example from yesterday's
Foundations of Complexity lesson. To do this let me go into a
little background on computational learning theory.
The basic model for learning theory has examples given as strings and
labelled positive or negative. From these labelled examples, one comes
up with a hypothesis that hopefully will classify future examples
well.
In PAC (Probably Approximately Correct) learning you want an
algorithm that takes labelled examples generated by some distribution
and will output some hypothesis that will high confidence will
usually classify future examples drawn from the same distribution.
One simple approach is the Occam algorithm based on the Occam's
Razor principle "when you have two competing theories which make
exactly the same predictions, the one that is simpler is the better."
The Occam algorithm works by finding the smallest representation of a
hypothesis consistent with the labelled examples and using that
hypothesis to predict future examples. There is a theorem in learning
theory that says this algorithm works well with the number of samples
roughly the size of the smallest representation.
Let us focus on the problem of identifying spam using general circuits,
a collection of AND, OR and NOT gates that capture computation. We'll
talk more about circuits in a future lesson but just think of the
running time of an algorithm as roughly the size of the equivalent
circuit.
Given a collection of emails each labelled either as SPAM or NOT SPAM,
the Occam algorithm requires us to find the smallest circuit that
correctly labels all of these emails. In general finding such a
circuit could be hard but under the assumption that P=NP this is an
easy task. By the result mentioned above this circuit will likely
classify SPAM correctly most of the time in the future.
Some caveats about this method.
Why should there be a small circuit that characterizes spam? Well,
I can look at an email and determine whether it is spam and my brain
is just not that complex.
The theorem only holds if the distribution we learn on is the same
as the distribution we apply our circuit to. Spammers might change
their spam to fool our new circuit. The P=NP assumption will make it
easier for them as well.
The P=NP assumption is probably not true.
For more information and details on computational learning theory check out the
web site learningtheory.org
and the resources mentioned on that site.