Friday, December 13, 2002

Learning via Occam's Razor

I have been asked to expand upon the spam example from yesterday's Foundations of Complexity lesson. To do this let me go into a little background on computational learning theory.

The basic model for learning theory has examples given as strings and labelled positive or negative. From these labelled examples, one comes up with a hypothesis that hopefully will classify future examples well.

In PAC (Probably Approximately Correct) learning you want an algorithm that takes labelled examples generated by some distribution and will output some hypothesis that will high confidence will usually classify future examples drawn from the same distribution.

One simple approach is the Occam algorithm based on the Occam's Razor principle "when you have two competing theories which make exactly the same predictions, the one that is simpler is the better."

The Occam algorithm works by finding the smallest representation of a hypothesis consistent with the labelled examples and using that hypothesis to predict future examples. There is a theorem in learning theory that says this algorithm works well with the number of samples roughly the size of the smallest representation.

Let us focus on the problem of identifying spam using general circuits, a collection of AND, OR and NOT gates that capture computation. We'll talk more about circuits in a future lesson but just think of the running time of an algorithm as roughly the size of the equivalent circuit.

Given a collection of emails each labelled either as SPAM or NOT SPAM, the Occam algorithm requires us to find the smallest circuit that correctly labels all of these emails. In general finding such a circuit could be hard but under the assumption that P=NP this is an easy task. By the result mentioned above this circuit will likely classify SPAM correctly most of the time in the future.

Some caveats about this method.

  • Why should there be a small circuit that characterizes spam? Well, I can look at an email and determine whether it is spam and my brain is just not that complex.
  • The theorem only holds if the distribution we learn on is the same as the distribution we apply our circuit to. Spammers might change their spam to fool our new circuit. The P=NP assumption will make it easier for them as well.
  • The P=NP assumption is probably not true.
For more information and details on computational learning theory check out the web site learningtheory.org and the resources mentioned on that site.

No comments:

Post a Comment