Thursday, December 18, 2014

The NIPS Experiment

The NIPS (machine learning) conference ran an interesting experiment this year. They had two separate, disjoint program committees, with the submissions split between them. 10% of the submissions (166 papers) were given to both committees. If either committee accepted one of those papers, it was accepted to NIPS.

According to an analysis by Eric Price, of those 166 papers, about 16 (roughly 10%) were accepted by both committees, 43 (26%) were accepted by exactly one committee, and 107 (64%) were rejected by both. Price notes that, of the accepted papers, over half (57%) would not have been accepted by a different PC. On the flip side, 83% of the rejected papers would still have been rejected. More details of the experiment are here.
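
To see where those two percentages come from, here is a quick back-of-the-envelope check. It is only a sketch that assumes the 16/43/107 split above and treats the two committees symmetrically; it is not Price's actual analysis:

```python
# Back-of-the-envelope check of the 57% / 83% figures from the 16/43/107 split.
both_accept, one_accept, both_reject = 16, 43, 107

# Treat the committees symmetrically: each accepted about 16 + 43/2 of the shared papers.
accepted_per_committee = both_accept + one_accept / 2   # ~37.5
rejected_per_committee = both_reject + one_accept / 2   # ~128.5

# Of one committee's accepts, the fraction the other committee rejected:
print((one_accept / 2) / accepted_per_committee)        # ~0.57
# Of one committee's rejects, the fraction the other committee also rejected:
print(both_reject / rejected_per_committee)             # ~0.83
```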

No one who has ever served on a program committee should be surprised by these results. Nor is there anything really wrong or bad going on here. A PC will almost always accept the great papers and almost always reject the mediocre ones, but the papers in the middle are at a similar quality level, and personal taste comes into play. There is no objective perfect ordering of the papers, and that's why we task a program committee with making those tough choices. The only completely fair committees would either accept all the papers or reject all the papers.

These results can lead to a false sense of self-worth. If your paper was accepted, you might think you had a great submission; more likely you had a good submission and got lucky. If your paper was rejected, you might think you had a good submission and were unlucky; more likely you had a mediocre paper that would never have gotten in.

In the few days since NIPS announced these results, I've already seen people try to use them not only to trash program committees but also to attack many other kinds of subjective decision making. In the end we have to make choices about who to hire, who to promote and who to give grants to. We need to make subjective decisions, and those made by our peers aren't always consistent, but they work much better than the alternatives. Even the machine learning conference doesn't use machine learning to choose which papers to accept.

12 comments:

  1. Lance,

    I agree that PCs have to make choices, and let us take as a given that there are a number of "papers in the middle" that, because of limited slots, end up coming down to a matter of taste of the PC. I also accept that, wherever you draw the line, there will be papers at the boundary, and therefore some unhappiness and potential for variance in the outcome (which we sometimes label as arbitrariness).

    But there is still the question of where you draw the line, and we (as a community) have some control over that. Where you draw that line can significantly affect the variance in outcome.

    It would be interesting to look at the NIPS data, for example, and determine the "next 10 papers" each side of the PC would accept: would that non-trivially increase their overlap? If so, that would show that we could decrease variance by accepting more papers. (A sketch of this kind of comparison appears at the end of this comment.)

    Now of course there is a variance-quality tradeoff (this is your point that the only completely fair solutions -- i.e. the ones that necessarily get all decisions right -- accept all or none of the papers), but I think the NIPS data strongly suggests that we're not at the right point in this space. For many conferences, I believe the variance is too high, and the quality decrease from accepting more papers would be small. While I need to dive more into it, the NIPS data certainly suggests that to me (for that conference).

    The solution is to accept more papers. How many more papers depends on external issues like logistics as well as community goals on acceptable quality. But so far I feel the NIPS data backs what I've suggested for years -- we should be accepting more papers to our major conferences.
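
    A rough illustration of the kind of comparison described above, as a sketch only: simulate two noisy committees under an assumed quality-plus-noise model (all parameters here are made up, and this is not anything any PC actually ran) and see how the overlap of their accepted sets changes as the number of accepted papers grows:

    ```python
    import random

    def committee_overlap(n_papers=1000, accept_sizes=(50, 100, 150, 200),
                          noise=1.0, trials=20):
        """Average fraction of accepted papers two noisy committees share,
        for several conference sizes. Quality and noise are assumed Gaussian."""
        results = {k: 0.0 for k in accept_sizes}
        for _ in range(trials):
            quality = [random.gauss(0, 1) for _ in range(n_papers)]
            # Each committee sees the true quality plus its own random perturbation.
            seen_a = [q + random.gauss(0, noise) for q in quality]
            seen_b = [q + random.gauss(0, noise) for q in quality]
            order_a = sorted(range(n_papers), key=lambda i: -seen_a[i])
            order_b = sorted(range(n_papers), key=lambda i: -seen_b[i])
            for k in accept_sizes:
                shared = len(set(order_a[:k]) & set(order_b[:k]))
                results[k] += shared / k / trials
        return results

    # Overlap typically grows as the number of accepted papers grows (noise fixed).
    print(committee_overlap())
    ```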

  2. There are a few commenters and bloggers out there whose message can be summarized as "our conferences are perfect, don't change a thing". This post by Lance seems to fall into this category.

    Yes, every reviewing system we implement will be imperfect, but do we have any evidence that the present one cannot be improved, even slightly?

    Would we not be better off as a community if we made the reviewing system less sensitive to a single reviewing error, which the data suggests it presently is? Don't we know of better mechanisms than requiring unanimity in the presence of random signals? Have we not studied randomized algorithms as a field, so that we know how to extract the most information out of repeated runs of a random event? Can we not also provide guidance on ways to improve a review, e.g. Mikkel Thorup's proposal to "measure the best idea in the paper, not the worst" or SIGCOMM's guideline to "give constructive criticism"? Will we continue to ignore evidence from other fields within CS, such as the evidence that double-blind reviewing is an improvement over the present state, even though admittedly imperfect?

    We as a field pay a heavy price for attitudes like this. Look around: STOC/FOCS/CCC are shadows of their former selves. Meanwhile NIPS quadruples in size.

  3. A few comments:

    - The overall picture seems absolutely right to me, but I hesitate to focus too much on the precise "half the conference would be different" conclusion from the experiment. NIPS, like STOC, FOCS, and SODA, is a very diverse conference in its content. This is unlike more focused conferences such as CCC, COLT, SOCG, etc. For those latter conferences I would expect a much higher level of agreement. I would guess that the NIPS community is even more diverse than that of STOC/FOCS/SODA, so I expect that the consensus numbers for STOC/FOCS/SODA would be a little higher than for NIPS: certainly above the "half the papers" threshold, but nowhere close to the levels of a CCC or the like.

    - The key factor that is not called out specifically is the percentage of papers accepted. This is one really important variable we can control. When we set it, we can also get a measure of the quality balance of the program: how much is the attention to the papers diluted by the sheer quantity of work? If you look over the years, there have only been subtle, not drastic, changes in acceptance rates; as tracked on the ACM DL, the overall percentage is 31%. In the late 1980's we had acceptance rates at STOC that are very similar to current rates, in the 27-30% range. Even in the 1970's, the rates were in the mid-30% range. We are missing data from the early 1980's, but I recall a couple of conferences having acceptance rates in the 30-35% range. We often run a version of the NIPS experiment in our resubmission process, running on a twice-a-year cycle (or three times a year, including SODA for more algorithmically oriented things) rather than an annual one: how many papers (essentially unchanged) get rejected from FOCS and accepted to the next STOC, or vice versa? If that number is high, then we have a problem. When rates have dipped below 25%, I have noticed a marked increase in the rate of resubmission acceptances at the following conference. In the late 1980's, one of the arguments for going to parallel sessions was that this rate had grown significantly.

    - Note that using the percentage of submitted papers as a guide is useful only if there is similar self-selection behavior among submitters. In the early-to-mid 2000's there was a sudden increase in total submissions that I think did not reflect similar self-selection behavior. However, as part of our thinking about the future of FOCS/STOC, I think that it is good to re-evaluate whether this has stabilized, and where we think the acceptance rate should be targeted. My sense is that there is room to grow this a bit, and the structure of our meetings should expand to include the option of a modest but not drastic increase.

    - No matter what level we set the acceptance rate at, there will be similar randomness/taste in the selection process itself for papers in a wide range near the boundary, and there will be similar complaints. I strongly disagree with Michael about the variance in evaluation of the next 10 papers being the right indicator. I completely agree with Lance that this randomness/taste has nothing whatsoever to do with whether a PC did a good job or not. The PC's role is to choose papers to make as good a conference as they can, but given a fixed # of papers there are many essentially equally good conferences that they can choose.

    Replies
    1. "The PC's role is to choose papers to make as good a conference as they can, but given a fixed # of papers there are many essentially equally good conferences that they can choose."

      I completely agree. However, as a community we seem to place a lot of emphasis on having papers selected for presentation in one of these "many essentially equally good conference programmes" for a handful of conferences. The number of good programmes to choose from for these "prime conferences" is often quite large because the number of "good" submissions is typically quite a bit higher than the number of available slots. (By way of example, last year the number of submissions to ICALP Track A went up suddenly from 249 to 319, but we could only increase the number of slots from 71 to 87, leading to an actual decrease in the acceptance rate for that track.)

      I agree with Michael and you in that we should try to accept more papers for our "prime" conferences. How many more is something that needs to be investigated. At ICALP, we went from 124 accepted papers in 2013 to 136 in 2014, but since the number of submissions grew so much, the overall acceptance rate went down by 1%.
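
      For reference, the Track A arithmetic spelled out, using only the numbers quoted above:

      ```python
      # ICALP Track A acceptance rates from the submission/slot counts above.
      print(71 / 249)   # 2013: ~28.5%
      print(87 / 319)   # 2014: ~27.3% -- more papers accepted, yet a lower rate
      ```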

  4. It's the proportions that are problematic, not the fact that the process contains randomness. Taking the experiment at face value, the "parallel universe NIPS" with exactly the same submissions but a different set of random coin flips would have less than half its papers in common with the real one. In fact, by Eric's numbers, the largest possible fraction of "sure accepts" in the conference under the messy-middle model is 30%, and even that is reached only by making unrealistic assumptions. So by this model, the large majority of people at NIPS are there as lottery winners, and only a minority have obvious accepts.

    Are these really the proportions you would have guessed, if you had been asked to give an estimate?

  5. Paul,

    I think you misunderstood my argument. The indicator I suggested argues, I believe, that if it were possible to increase the conference size, it would be a good idea to do so. This is what I argued: that our current choice of the number of papers to accept is not the best one, even given concerns about quality. Strangely, it's also what you argued, so I'm not sure where you disagree with me.

    Once you decide the number of papers to accept is fixed, then I'd agree the determination of what goes in will end up being somewhat arbitrary. All I said was that I don't see why we take it as a given that the number of papers is fixed.

  6. "How many papers (essentially unchanged) get rejected from FOCS and accepted to the next STOC, or vice versa?"

    I looked at the SODA data a few years back for a given random year, and fourteen rejected papers had appeared in the subsequent STOC or FOCS.

    Additionally, I ran some simulations. The way the process is currently designed, there would be a large amount of disagreement even if PC members agreed unanimously on the underlying rankings but their scores were subject to random noise due to human factors such as misinterpretation, time pressure, etc. For example, for a conference such as STOC/FOCS, I get that around 35% of the papers would be different simply due to random noise.

    However, contrary to what people say, this is not unavoidable. How does one lower the noise in a random process? Repeat the trial!

    Say, if all papers around the boundary get an automatic fourth review, the disagreement drops by half, to a much more reasonable 18%.
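
    A minimal sketch of this kind of simulation. The score scale, noise level, boundary width, and sizes below are all assumptions chosen for illustration, not the exact setup behind the 35% and 18% figures:

    ```python
    import random

    def run_committee(true_scores, accept_n, n_reviews=3, extra_near_boundary=False):
        """One noisy committee: average n_reviews noisy scores per paper and accept
        the top accept_n. Optionally give papers near the provisional boundary one
        extra review before the final ranking."""
        noise = 1.5  # assumed reviewer noise on a roughly 1-10 score scale
        means = [sum(t + random.gauss(0, noise) for _ in range(n_reviews)) / n_reviews
                 for t in true_scores]
        if extra_near_boundary:
            order = sorted(range(len(means)), key=lambda i: -means[i])
            boundary = set(order[accept_n - 30:accept_n + 30])  # assumed boundary width
            for i in boundary:
                extra = true_scores[i] + random.gauss(0, noise)
                means[i] = (means[i] * n_reviews + extra) / (n_reviews + 1)
        return set(sorted(range(len(means)), key=lambda i: -means[i])[:accept_n])

    def disagreement(extra, n_papers=500, accept_n=100, trials=20):
        """Average fraction of accepted papers on which two independent committees differ."""
        total = 0.0
        for _ in range(trials):
            true_scores = [random.gauss(6, 1.5) for _ in range(n_papers)]  # assumed quality spread
            a = run_committee(true_scores, accept_n, extra_near_boundary=extra)
            b = run_committee(true_scores, accept_n, extra_near_boundary=extra)
            total += 1 - len(a & b) / accept_n
        return total / trials

    print(disagreement(extra=False))  # disagreement with three reviews per paper
    print(disagreement(extra=True))   # typically lower with an extra boundary review
    ```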

  7. Michael: My disagreement was about whether or not one should use your "next 10 papers" measure as a meaningful basis for deciding how large a conference should be. I don't see how this measure would have anything to do with how we want to set the conference size. The kind of thing I was thinking of might be something on the order of 10%-15% more papers on average (to pull acceptance rates up by 2.5-5% on average), which could get us up to levels that have historically seemed healthy for the community. (This is separate from questions raised in other comments about the criteria we use to select these papers.)
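
    Spelling out that arithmetic (the 25% and 30% base rates here are assumed for illustration): if submissions stay fixed, accepting 10-15% more papers scales the acceptance rate by the same factor.

    ```python
    # Assumed current acceptance rates; accepting 10-15% more papers with the
    # same number of submissions scales the rate by the same factor.
    for rate in (0.25, 0.30):
        for growth in (1.10, 1.15):
            print(f"{rate:.0%} -> {rate * growth:.1%} (+{(growth - 1) * rate:.1%})")
    # Roughly +2.5 to +4.5 percentage points under these assumptions.
    ```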

    Anonymous 12/18/14/4:14 pm: Precisely the type of extra evaluation for papers near the boundary that you suggest does happen in FOCS/STOC program committees, with extra evaluations from PC members, additional solicited external reviews, and face-to-face meetings (though the latter are not independent). One issue with your assumptions is that the "randomness" is not a random process: it reflects properties of the paper-reviewer taste match (as well as the compatibility of the way a paper is written), and these are correlated beyond any interpretation of ratings as noisy independent estimators.

  8. Anon #4: Just to point out, this is how the systems conferences I've been involved in usually run. There's a "second round" of reviewing where boundary papers get additional reviews; in many cases, more than one.

  9. One can at best hope that PCs:
    (a) reject (almost) all the really boring papers
    (b) accept (almost) all the 10-20 really excellent papers
    (c) choose a selection of middle papers
    It seems that's the case.

  10. @MM

    I'm aware some conferences use a fourth reviewer for controversial cases; however, at the end of the day they still average the scores. Let's say there is a paper whose true score(TM) is 7.3 (i.e. 7 7 8), with a conference acceptance threshold of 6.5. However, the reviewer who should have given an 8 hiccuped and gave a 4 instead. We then bring in an external reviewer who gives a guarded 6. The scores are now 4 6 7 7, for a combined score of 6, and the paper still gets rejected. If the authors are lucky, a fifth reviewer comes in and gives a 7, for a combined score of 6.2, and the paper still gets rejected.

    This is a case where the conference management software could (and should) lower the weight of the outlier score of 4 in the ranking, leading to a better decision and improving the reviewing process.
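
    For illustration, here is how a robust aggregate, such as the median or a trimmed mean, would handle the scores in the example above. (The trimmed mean/median is just one hypothetical down-weighting scheme; this is not a claim about what any existing conference software does.)

    ```python
    from statistics import mean, median

    def trimmed_mean(scores):
        """Drop the single highest and lowest score, then average the rest."""
        s = sorted(scores)
        return mean(s[1:-1])

    threshold = 6.5
    for scores in ([7, 7, 8], [4, 6, 7, 7], [4, 6, 7, 7, 7]):
        print(scores, round(mean(scores), 2), median(scores), round(trimmed_mean(scores), 2))
    # Plain mean: 7.33, 6.0, 6.2 -- the single hiccup score of 4 keeps the paper under 6.5.
    # Median: 7, 6.5, 7; trimmed mean: 7, 6.5, ~6.67 -- the outlier no longer flips the decision.
    ```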

    There are other changes that can be made. This year's STOC review form is the first one I've seen in a long time that reminds reviewers that there are papers other than theorem/proof ones that are worth publishing. For example, they explicitly mention introducing a new model as a category to keep in mind. This is a minor change that improves the outcome of the reviewing process.

  11. It might be worth having two types of tracks: a main track where a small number of exceptionally good and interesting papers are presented, and a bunch of side tracks where all papers that are good enough to be accepted are presented in shorter time slots (20 min).

    Since having papers in FOCS/STOC/SODA is considered important for our careers, heavy influence from committee-member biases is problematic. The personal preferences of committee members may cause person X's paper on the boundary to get accepted while person Y's paper of similar quality gets rejected. Keep in mind that widespread small biases can lead to significant total bias in a system. This is also in line with previous concerns about dishonesty in writing papers (trying to make results look harder to prove than they really are).

    What I like about NIPS is that they apply their knowledge to the conference itself, something I don't see in theory (where are our game theorists applying their skills to design better and fairer conferences?).
