Vitaly Feldman gave a talk at Georgia Tech earlier this week on his recent paper Preserving Statistical Validity in Adaptive Data Analysis with Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold and Aaron Roth. This work looks at the problem of reuse of the cross-validation data in statistical inference/machine learning using tools from differential privacy.
Many machine learning algorithms have a parameter that specifies the generality of the model, for example the number of clusters in a clustering algorithm. If the model is too simple it cannot capture the full complexity of what it is learning. If the model is too general it may overfit, fitting the vagaries of this particular data set too closely.
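To make that concrete, here's a quick sketch of my own (with polynomial degree standing in for the complexity parameter, rather than the number of clusters): training error alone always rewards more generality, which is exactly why it can't be used to tune the parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.3, 30)   # true signal plus noise

# Training error keeps shrinking as the model gets more general,
# even once the extra degrees of freedom are just fitting noise.
for degree in [1, 3, 9, 15]:
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(degree, mse)
```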
One way to tune the parameters is by cross-validation, running the algorithm on fresh data to see how well it performs. However, if you always cross-validate with the same data, you may end up overfitting the cross-validation data.
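Here is a minimal holdout sketch (my illustration, not the authors' mechanism): split off fresh data once and pick the parameter that does best on it. The catch is that if you keep returning to this same holdout set over many adaptive rounds, you start overfitting it too.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.3, 200)

# Split once: train on the first half, validate on the second.
x_tr, y_tr = x[:100], y[:100]
x_va, y_va = x[100:], y[100:]

def holdout_mse(degree):
    """Fit on the training half, score on the held-out half."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    return np.mean((y_va - np.polyval(coeffs, x_va)) ** 2)

best = min(range(1, 16), key=holdout_mse)   # tune on the holdout
print("chosen degree:", best)
```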
The paper shows how to reuse the cross-validation data safely. The authors show how to get an exponential (in the dimension of the data) number of adaptive uses of the same data without significant degradation. Unfortunately that algorithm takes exponential time, but sometimes time is much cheaper than data. They also give an efficient algorithm that allows a quadratic amount of reuse.
The intuition and proof ideas come from differential privacy, where one wants to make it hard to infer individual information from multiple database queries. A standard approach is to add some noise to the responses, and the authors use the same idea in this paper.
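To give the flavor of that noise-addition idea, here's a simplified sketch of the basic Laplace mechanism applied to statistical queries over a holdout set. This is in the spirit of the paper but is not its exact algorithm or parameter tuning; the noise scale below is just the standard differential-privacy calibration.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_mean(query, data, epsilon=0.5):
    """Answer a statistical query (a [0,1]-valued function averaged over
    the data) with Laplace noise. A [0,1]-valued mean over n points has
    sensitivity 1/n, so scale 1/(epsilon * n) gives epsilon-differential
    privacy; the paper's actual calibration differs."""
    n = len(data)
    true_answer = np.mean([query(x) for x in data])
    return true_answer + rng.laplace(scale=1.0 / (epsilon * n))

holdout = rng.normal(size=1000)
# An analyst can ask many adaptively chosen queries against the same
# holdout; the noise limits how much any one answer leaks about it.
print(noisy_mean(lambda x: float(x > 0), holdout))
```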
All of the above is pretty simplified and you should read the paper for the details. This is one of my favorite kinds of papers, where ideas developed for one domain (differential privacy) have surprising applications in a seemingly different one (cross-validation).