Vitaly Feldman gave a talk at Georgia Tech earlier this week on his recent paper Preserving Statistical Validity in Adaptive Data Analysis with Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold and Aaron Roth. This work looks at the problem of reusing cross-validation data in statistical inference/machine learning, using tools from differential privacy.
Many machine learning algorithms have a parameter that specifies the generality of the model, for example the number of clusters in a clustering algorithm. If the model is too simple it cannot capture the full complexity of what it is learning. If the model is too general it may overfit, fitting the vagaries of this particular data set too closely.
One way to tune the parameters is by cross-validation, running the algorithm on fresh data to see how well it performs. However, if you always cross-validate with the same data you may end up overfitting the cross-validation data.
Feldman's paper shows how to reuse the cross-validation data safely. The authors show how to get an exponential (in the dimension of the data) number of adaptive uses of the same data without significant degradation. Unfortunately their algorithm takes exponential time, but sometimes time is much cheaper than data. They also have an efficient algorithm that allows a quadratic amount of reuse.
The intuition and proof ideas come from differential privacy, where one wants to make it hard to infer individual information from multiple database queries. A standard approach is to add some noise to the responses, and the same idea is used by the authors in this paper.
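The algorithms in the paper are more refined than this, but the flavor of the noise idea is easy to sketch. Below is a minimal example of the standard Laplace mechanism for a single bounded statistical query; it is not the paper's algorithm, and the names (noisy_mean, lower, upper, epsilon) are just illustrative.

```python
import numpy as np

def noisy_mean(samples, lower, upper, epsilon, rng=None):
    """Answer one statistical query (a bounded mean) with Laplace noise.

    The mean of n values clipped to [lower, upper] changes by at most
    (upper - lower) / n when one sample changes, so adding Laplace noise
    with scale sensitivity / epsilon gives epsilon-differential privacy
    for this single answer."""
    rng = rng or np.random.default_rng()
    clipped = np.clip(np.asarray(samples, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    return clipped.mean() + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
```

Roughly speaking, the paper's results bound how many adaptively chosen queries can be answered this way (or with fancier mechanisms) before the noisy answers stop generalizing.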
All of the above is pretty simplified and you should read the paper for details. This is one of my favorite kinds of paper where ideas developed for one domain (differential privacy) have surprising applications in a seemingly different one (cross-validation).
Monday, January 26, 2015
A nice problem from a Romanian Math Problem Book
(All of the math for this problem is here)
My Discrete Math Honors TA Ioana showed me a Romanian math problem book (she is Romanian) and told me the following problem:
(All a_i in this post are assumed to be natural numbers.)
Show that for all n ≥ 6 there exists (a_1,...,a_n) such that 1/a_1^2 + ... + 1/a_n^2 = 1.
(sum of reciprocals squared)
Normally my curiosity exceeds my ego and I would look up the answer. But it was in Romanian! Normally I would ask her to read the answer to me. But I was going out of town! Normally I would look up the answer on the web. But this is not the kind of thing the web is good at!
So I did the obvious thing: worked on it while watching the first four episodes of Homeland Season 2. And I solved it! Either try to solve it yourself OR go to the link.
Some possibly open questions come out of this:
1) I also prove that for all k there is an n_0 = n_0(k) such that for all n ≥ n_0 there exists (a_1,...,a_n) such that 1/a_1^k + ... + 1/a_n^k = 1.
(sum of reciprocal kth powers)
We showed above that n_0(2) ≤ 6; it's easy to show n_0(2) ≥ 6, so n_0(2) = 6.
Obtain upper and lower bounds on n_0(k).
2) What is the complexity of the following problem: Given k and n, determine whether there exists (a_1,...,a_n) such that 1/a_1^k + ... + 1/a_n^k = 1, and if so find the values (a_1,...,a_n).
(We know of an example where the greedy method does not work; see the brute-force sketch after this list.)
3) What is the complexity of the following problem: just as above, but now we want to know HOW MANY solutions there are.
4) Meta question: How hard are these questions? The original one was on the level of a high school or college math competition. The rest might be easy or hard. I suspect that getting an exact formula for n_0(k) is hard. I also suspect that proving that this is hard will be hard.
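As promised under question 2, here is a brute-force sketch using exact rational arithmetic. The function name find_representation and the cutoff max_a are my own choices; the cutoff only keeps the search finite, so a None answer rules out solutions just within that bound.

```python
from fractions import Fraction

def find_representation(n, k, max_a=12):
    """Look for natural numbers 2 <= a_1 <= ... <= a_n <= max_a with
    1/a_1^k + ... + 1/a_n^k = 1.  (Using a_i = 1 forces n = 1, so we
    start at 2.)  Exact Fractions avoid any floating-point error."""
    def search(remaining, terms_left, smallest, chosen):
        if terms_left == 0:
            return chosen if remaining == 0 else None
        for a in range(smallest, max_a + 1):
            term = Fraction(1, a ** k)
            if term * terms_left < remaining:   # even repeating this term falls short,
                break                           # and larger a only makes it worse
            if term > remaining:
                continue
            found = search(remaining - term, terms_left - 1, a, chosen + [a])
            if found is not None:
                return found
        return None
    return search(Fraction(1), n, 2, [])

# find_representation(6, 2) returns [2, 2, 2, 3, 3, 6];
# find_representation(5, 2) returns None, consistent with n_0(2) >= 6.
```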
Thursday, January 22, 2015
There Should be an Algorithm
My high school daughter Molly was reading her Kindle and said "You know how you can choose a word and the Kindle will give you a definition. There should be an algorithm that chooses the right definition to display depending on the context". She was reading a book that took place in the 60's that referred to a meter man. This was not, as the first definition of "meter" would indicate, a 39 inch tall male. A meter man is the person who records the electric or gas meter at your house. Today we would use a gender neutral term like "meter reader" if technology hadn't made them obsolete.
Molly hit upon a very difficult natural language processing challenge known as word-sense disambiguation with the most successful approaches using supervised machine learning techniques. If anyone from Amazon is reading this post, the Kindle dictionary would make an excellent application of word-sense disambiguation. You don't need perfection, anything better than choosing the first definition would be welcome. Small tweaks to the user interface where the reader can indicate the appropriate definition would give more labelled data to produce better algorithms.
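The serious approaches are supervised, as noted above, but even a toy unsupervised baseline beats always showing the first definition. Here is a sketch of the classic Lesk overlap idea; the senses dictionary and its glosses are made up for illustration, not taken from any real Kindle dictionary.

```python
def pick_sense(sentence, senses):
    """Toy Lesk disambiguation: choose the sense whose dictionary gloss
    shares the most words with the sentence the word appears in."""
    context = set(sentence.lower().split())
    return max(senses, key=lambda label: len(context & set(senses[label].lower().split())))

senses = {
    "unit of length": "a metric unit of length equal to 100 centimeters",
    "measuring device": "a device that measures and records the quantity of gas or electricity used",
}
print(pick_sense("the meter man came to read the gas meter at the house", senses))
# prints "measuring device": its gloss shares "the" and "gas" with the sentence
```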
And to Molly: Keep saying "There should be an algorithm". Someday there might be. Someday you might be the one to discover it.
Monday, January 19, 2015
The two definitions of Chomsky Normal Form
I have eight textbooks on Formal Lang Theory on my desk. Six of them define a CFG to be in Chomsky Normal Form if every production is of the form either A-->BC or A--> σ (σ a single letter). With that definition one can show that every e-free grammar can be put in Chomsky Normal Form, and using that, show that CFL ⊆ P. There is a very minor issue of what to do if the CFL has e in it.
Two of the books (Sipser and Floyd-Beigel) define a CFG to be in Chomsky Normal Form if every rule is A-->BC or A--> σ OR S-->e, and they also require that S not appear as one of the two nonterminals on the right-hand side of a production of the form A-->BC. With this definition you can show that every CFL (even those with e in them) has a grammar in Chomsky Normal Form.
The definitions are NOT equivalent mathematically, but they are equivalent in spirit and both aim towards the same thing: getting all CFL's in P. (That's what I use them for. What did Chomsky use them for?)
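To make the CFL ⊆ P payoff concrete, here is a sketch of the CYK algorithm, which is where either definition of Chomsky Normal Form earns its keep. This is my own minimal version, not taken from any of the eight textbooks, and the rule encoding is an assumption.

```python
def cyk(word, start, unit_rules, binary_rules):
    """Membership test for a grammar in Chomsky Normal Form.

    unit_rules:   set of pairs   (A, 'a')   for productions A --> a
    binary_rules: set of triples (A, B, C)  for productions A --> B C
    Runs in O(|word|^3 * |grammar|) time, which is the point of CNF."""
    n = len(word)
    if n == 0:
        return False  # the empty string is the "very minor issue" handled separately
    # table[i][j] holds the nonterminals deriving word[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, a in enumerate(word):
        table[i][0] = {A for (A, b) in unit_rules if b == a}
    for length in range(2, n + 1):              # substring length
        for i in range(n - length + 1):         # start position
            for split in range(1, length):      # where an A --> BC rule splits
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for (A, B, C) in binary_rules:
                    if B in left and C in right:
                        table[i][length - 1].add(A)
    return start in table[0][n - 1]

# e.g. with S --> AB, A --> a, B --> b:
# cyk("ab", "S", {("A", "a"), ("B", "b")}, {("S", "A", "B")}) returns True
```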
The first definition is correct historically: it's what Chomsky used (I am assuming this since it's in the earlier books). The second one could be argued to be better since when you are done you don't have to deal with the e. I still like the first one, but it's six of one, half a dozen of the other. One person's floor function is another person's ceiling function.
I don't have a strong opinion about any of this, but I will say that if you use the second definition then you should at least note that there is another definition that is used for the same purpose. Perhaps make a homework out of this.
There are many other cases where definitions change and the new one leads to more elegant theorems and a better viewpoint than the old one. There are even cases where the new definition IS equivalent to the old one but is better. IMHO the (∃) definition of NP is better than the definition that uses nondeterministic poly-time TMs, since the former leads to the definition of the polynomial hierarchy. One drawback: if you want to define NL then I think you DO need nondeterministic TMs.
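For concreteness, the quantifier characterization I have in mind is the standard one (a sketch of the usual statement, not a quote from any particular textbook):

```latex
% NP as one existential quantifier over a polynomial-time relation ...
\[
  L \in \mathrm{NP} \iff \exists\ \text{polynomial } p,\ \text{poly-time } R:\quad
  L = \{\, x \;:\; (\exists y,\ |y| \le p(|x|))\ [\, R(x,y)=1 \,]\,\}
\]
% ... which extends to the polynomial hierarchy by alternating quantifiers:
\[
  L \in \Sigma_2^{p} \iff \exists\ p,\ R:\quad
  L = \{\, x \;:\; (\exists y,\ |y| \le p(|x|))\,(\forall z,\ |z| \le p(|x|))\ [\, R(x,y,z)=1 \,]\,\}
\]
```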
The only problem I see with changing definitions is if you are reading an old paper and don't quite know which definition they are using.
What examples of a definition changing (or an equivalent one being more used) do you approve of? Disapprove of?
Thursday, January 15, 2015
The Impact Factor Disease
The Institute for Scientific Information (ISI) was founded in 1960 to help index the ever-growing collection of scientific journals. The founder of ISI, Eugene Garfield, developed a simple impact factor to give a rough estimate of quality and help highlight the more important journals. Roughly, the impact factor of a journal in year x is the average number of citations each article from years x-1 and x-2 receives in year x.
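In symbols, a sketch of the standard two-year definition (not ISI's exact fine print):

```latex
\[
  \mathrm{IF}_x(J) \;=\;
  \frac{\#\{\text{citations in year } x \text{ to articles of } J \text{ published in years } x-1,\ x-2\}}
       {\#\{\text{articles of } J \text{ published in years } x-1,\ x-2\}}
\]
```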
Thomson Scientific bought out ISI in 1992 and turned the data collection into a business. Impact factors are not only being used to measure the quality of journals but of authors (i.e. researchers) and institutions as well. In many parts of the world a person's research quality is being measured strongly or sometimes solely by their ability to publish in high impact factor journals.
This is bad news for computer science since conference proceedings in our field have historically carried more prestige than journals. We mitigate the ISI factors pretty well in the US, but in many other countries this puts computer science at a disadvantage. The need for impact factor publications is one of the reasons conferences are experimenting with a hybrid model.
In a new trend I get many announcements from journals highlighting their ISI impact factor, most of them very general and previously unknown to me. Our old friends WSEAS say "The ISI Journals (with Impact Factor from Thomson Reuters) that publish the accepted papers from our Conferences are now 32" in the subject line of their email.
It's the vanity journal with a twist: publish with us and you'll raise your official numerical prestige. So we get a set of journals whose purpose is to raise the value of researchers who should have their value lowered by publishing in these journals.
Raise your research standing the old fashioned way: Do great research that gets published in the best venues. The numbers will take care of themselves.
Tuesday, January 13, 2015
Barriers for Matrix Mult Lower Bounds
Matrix Mult:
The usual algorithm is O(n^3).
Strassen surprised people by showing an O(n^{2.81}) algorithm. (Now it's so well known that it's not surprising. I try to teach it to people who don't already know it's true so they'll be surprised.)
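For the newcomers being surprised: the trick is to multiply two n x n matrices using seven multiplications of half-size blocks instead of eight, giving T(n) = 7T(n/2) + O(n^2) and exponent log_2 7 ≈ 2.81. A minimal sketch, assuming numpy and n a power of two (real implementations switch to the classical algorithm below some cutoff):

```python
import numpy as np

def strassen(A, B):
    """Strassen's recursion: 7 half-size multiplications instead of 8."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

# sanity check: np.allclose(strassen(X, Y), X @ Y) for random 8 x 8 matrices X, Y
```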
Over the years the exponent came down, though there was a long time between Coppersmith and Winograd's exponent of 2.3755 and the recent improvements.
Are these improvements going to lead to an exponent of 2+epsilon?
Alas, the answer is probably no :-(
In Fast Matrix Mult: Limitations of the Laser Method, Ambainis, Filmus, and Le Gall show that the methods that led to the algorithms above will NOT lead to an exponent of 2+epsilon.
How to react to news like this? Barriers should not make us give up; however, they should make us look for new techniques, perhaps guided by the barrier. I would like to think that if you know where NOT to look then it helps you know where TO look.
There have been barriers that were broken (the IP=PSPACE proof didn't relativize, the recent 2-prover PIR used techniques NOT covered by the barrier result, and the Erdos distance problem was solved after it was known that the current techniques `wouldn't work'). I am sure there are other examples; feel free to leave some in the comments.
Will this result help guide researchers in the right direction? Let's hope so!
Of course, it's possible that an exponent of 2+epsilon is not achievable and the barrier result is a first step in that direction.
Which one am I rooting for? Neither: I am rooting for finding out!
bill g.
Thursday, January 08, 2015
The History of the History of the History of Computer Science
In 2007, the science historian Martin Campbell-Kelly wrote an article, The History of the History of Software, where he writes about how he initially wrote histories of the technical aspects of computer software back in the 70's but has since evolved into writing more about the applications and implications of software technologies. He argues that the whole field of the history of software has moved in the same direction.
Donald Knuth made an emotional argument against this trend last May in his Stanford Kailath lecture Let's Not Dumb Down the History of Computer Science. If you can find an hour, this is a video well worth watching.
In the January CACM Thomas Haigh gave his thoughts in The Tears of Donald Knuth. Haigh argues that Knuth conflates the History of Computer Science with the History of Computing. Haigh says that historians focus on the latter and the History of Computer Science doesn't get enough emphasis.
Let me mention two recent examples in that History of Computing category. The Imitation Game gives a great, though slightly fictionalized, portrait of the computing and computer science pioneer Alan Turing, focusing on his time at Bletchley Park breaking the Enigma code. Walter Isaacson, whose histories of Franklin, Einstein and Jobs I thoroughly enjoyed, writes The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution, which tells the stories of computers from Ada Lovelace to Google (oddly stopping before social networks).
But what can we do about the History of Computer Science, particularly for theoretical computer science? We live in a relatively young field where most of the great early researchers still roam among us. We should take this opportunity to learn and record how our field developed. I've dabbled a bit myself, talking to several of the pioneers, writing (with Steve Homer) a short history of computational complexity in the 20th Century and a history chapter in The Golden Ticket.
But I'm not a historian. How do we collect the stories and memories of the founders of the field and tell their tales while we still have a chance?
Monday, January 05, 2015
Why do students do this?
Before the midterm in my ugrad Theory of Computation course I gave the students a sheet of practice problems that I would go over in class before the exam.
One of them was: Let L be in DTIME(T(n)). Give an algorithm for L*. Try to make it efficient. What is the time complexity of your algorithm? (I had shown in class earlier in the term that if L is in P then L* is in P.)
My intention was that they do the Dynamic Programming solution. Since it wasn't being collected I didn't have to worry about what would happen if they did it by brute force. When I went over it in class I did the Dynamic Programming Solution, which is roughly T(n)^3 time.
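For the record, here is one way to write the dynamic program, assuming a membership routine in_L for L (the name is mine; this is a sketch, not the exact solution I presented):

```python
def in_star(x, in_L):
    """Decide whether x is in L*, given a membership test in_L for L.

    star[i] is True iff the prefix x[:i] is in L* (the empty prefix is).
    The loops make O(n^2) calls to in_L on strings of length at most n,
    so the total time is about n^2 * T(n), which is at most T(n)^3
    once T(n) >= n."""
    n = len(x)
    star = [False] * (n + 1)
    star[0] = True                      # epsilon is in L*
    for i in range(1, n + 1):           # prefix length
        for j in range(i):              # the last block is x[j:i]
            if star[j] and in_L(x[j:i]):
                star[i] = True
                break
    return star[n]
```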
I allow my students to bring in a sheet of notes that they make up themselves.
On the exam was the problem: Let L_1 ∈ DTIME(T_1(n)) and L_2 ∈ DTIME(T_2(n)). Give an algorithm for L_1 L_2. What is the time complexity of your algorithm?
Of my 20 students, 5 gave me, word for word, the dynamic programming solution to the L, L* problem.
Why would they do this? Speculations:
- They just copied it off of their cheat sheet with no understanding.
- They wanted pity points (they didn't get any and I told the class that if a similar thing happens on the final I will give them LESS THAN zero on the problem).
- They so hoped that the L, L* problem would be on the exam (possibly because it was on their cheat sheet) that they misread the problem.
- They thought `Dr. Gasarch wouldn't have put it on the practice exam unless it was on the real exam' (or something like that), so they misread it.
But I ask-- (1) Is this kind of thing common? For my Sophomore Discrete Math, yes, but I was very slightly surprised to see it in my senior course. (2) Do you have examples? I am sure that you do, but my point is NOT to do student-bashing, it's to ask WHY they do this.