Thursday, October 30, 2014

Metrics in Academics

Congratulations to the San Francisco Giants, who won the World Series last night. In honor of their victory, let's talk metrics. Baseball has truly embraced metrics, as evidenced by the book and movie Moneyball, about using statistics to decide which players to trade for. This year we saw a dramatic increase in the infield shift, the practice of moving the infielders to different locations for each batter based on where they tend to hit the ball, all driven by statistics.

Metrics work in baseball because we have lots of statistics, but also because we have an objective goal: winning games and ultimately the World Series. You can use machine learning techniques to predict the effects of particular players and positions, and the metrics can drive your decisions.

In the academic world we certainly have our own statistics: publication counts and citations, grant income, teaching evaluation scores, sizes of classes and majors, number of faculty and much more. We certainly draw useful information from these values, and they feed into decisions about hiring and promotion and the evaluation of departments and disciplines. But I don't like making decisions solely based on metrics, because we don't have an objective outcome.

What does it mean to be a great computer scientist? It's not just a number, not necessarily the person with a large number of citations or a high h-index, or the one who brings in huge grants, or the one with high teaching scores, or whose students get high-paying jobs. It's a much more subjective measure: the person who has a great impact, in the many various ways one can have an impact. It's why faculty applications require recommendation letters. It's why we have faculty recruiting and P&T committees, instead of just punching in a formula. It's why we have outside review committees that review departments and degrees, and peer review of grant proposals.

As you might have guessed, this post is motivated by attempts to rank departments based on metrics, such as those described in the controversial guest post last week or by Mitzenmacher. There are so many rankings based on metrics that you just need to find one that makes you look good. But metric-based rankings have many problems, most importantly that they can't capture the subjective measure of greatness and that people will disagree on which metric to use. If a ranking takes hold, you may optimize to the metric instead of to the real goals, a bad allocation of resources.

I prefer the US News & World Report approach to ranking CS departments, which is based heavily on surveys filled out by department and graduate committee chairs. For the subareas it would be better to have, for example, theory people rank the theory groups, but I still prefer the subjective approach.

In the end, the value of a program is its reputation, for a strong reputation is what attracts faculty and students. Reputation-based rankings can best capture the relative strengths of academic departments in what really matters.


  1. While I agree with the idea in principle, the main problem with established reputation-based rankings is that they tend themselves to affect reputation, and are hence very static.

    The reason we think of GATech as a top-10 department as opposed to (say) Columbia is because US News & World Report tells us so. It would be interesting to assess how far this influences the next ranking and whoever answers the surveys.

  2. You have a high reputation because you got a high rank, so the next time around you get a high rank because you have a good reputation. The typical example is the modern celebrity who is famous for being famous.

    In the end, the value of a program is its reputation,

    Whoa! I would have thought the value of the program is how much it advances science, in its many forms: research, teaching, outreach, mentoring.

    But no, as it turns out what we are really all here for is reputation and that is our true value.

  3. But I don't like making decisions solely based on metrics, because we don't have an objective outcome.

    Neither does a Google search, yet very good rankings have been developed to predict what type of page is likely to be useful and what isn't. Machine learning deals all the time with manually classified, subjective training data, on which classifiers such as SVMs are then trained.

    I would encourage people to inform themselves about how much can be achieved through numerical means before issuing statements about the impossibility of achieving a good ranking on the basis of some simplistic observation.
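The idea in the comment above, learning a ranking function from subjective human judgments, can be sketched in a few lines. This is a minimal toy illustration, not any system Google or US News actually uses: the feature names and data are made up, and a simple perceptron-style pairwise update stands in for an SVM.

```python
# Toy learning-to-rank sketch: a human judge supplies subjective pairwise
# preferences ("department A is better than B"); we fit linear weights so
# that score(A) > score(B), using a perceptron-style update.

def score(w, x):
    """Linear score: dot product of weights and feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_ranker(pairs, n_features, epochs=100, lr=0.1):
    """pairs: list of (better, worse) feature vectors from human judges."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(w, better) <= score(w, worse):  # ordered wrongly
                # Nudge the weights toward the preferred item's features.
                w = [wi + lr * (b - c) for wi, b, c in zip(w, better, worse)]
    return w

# Hypothetical features: (citations per faculty, grant income, teaching score).
pairs = [
    ((0.9, 0.8, 0.5), (0.4, 0.3, 0.6)),   # judge prefers the first dept
    ((0.7, 0.9, 0.4), (0.5, 0.2, 0.8)),
]
w = train_ranker(pairs, n_features=3)
# After training, every judged pair is ordered consistently with the judges.
assert all(score(w, b) > score(w, c) for b, c in pairs)
```

The point being debated in the thread survives the sketch: the training signal here is entirely subjective human preference, yet the learned function is a perfectly objective formula once trained.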

  4. Sick getting sicker, rich getting richer. Of course you like it. This is called a power-law distribution.

  5. > I prefer the US News & World report approach to ranking CS Departments, which are
    > based heavily on surveys filled out by department and graduate committee chairs.

    When I was just starting out as a faculty member, I was part of a committee responsible for filling out the US News & World Report survey on behalf of our department.

    The way we did the rankings was simple and totally unscientific. We went down the list of schools, and the head of the committee called out "do people like this school?" or "anyone know anybody who works at this school?" If the answer(s) were yes, the school was given a check; if not, it was not.

    Oh, and we ranked our own department as a top school, because "why not?"

    Sure, keep on believing that US News & World Report rankings mean anything.

  6. Anon @ 2:51pm - Google search has an objective, quantitative goal - make the most money for Google. Google can then use various techniques to break that goal down to micro-goals for the graphic design of its pages, the content of the search results, etc.

    I thought the post itself was clear on the difference between cases where there is an objective, quantitative goal (baseball games, Google search) and cases where the goal itself is qualitative and subjective.

  7. Regarding anon @ 3:22pm, I don't understand why people keep bringing up "rich getting richer" when it comes to reputation rankings.

    With "rich getting richer" the positive feedback loop is built on money, not reputation. In some tax/economic systems (poorly designed in my opinion), money tends to "converge" towards very few economic players at the top of the spectrum.

    It is not true that reputation tends to converge in the hands of very few already reputable players while everyone else ends up being disreputable. That just doesn't make sense. If anything, reputation is the hardest to earn and the easiest to lose.

    If the university has the money to spend, it can acquire reputation by hiring people. Once the extra money disappears, the reputation will disappear as well. The rise and (mild) fall of the mathematics department at a certain state university on the East Coast illustrates this.

    I almost wish someone would do a blog post refuting this.

  8. Honestly, I think the blog author jumped the gun with his enthusiasm for reputation based rankings. At this rate, the TCS community is going to endorse Southern-style debutante balls for assistant professors pretty soon. :)))

    Reputation based systems are terrible, but better than all known alternatives.

    The core of the problem is that if a system is not reputation based, it requires a "central authority" who will decide which department is better (maybe using Excel, data sets, etc). For example, in the most recent episode, Dr. Hajiaghayi acted as the central ranking authority.

  9. In my department a secretary filled out the form. At another one the Dean was in charge. He was not a computer scientist.

  10. If anything, reputation is the hardest to earn and the easiest to lose.

    Hardest to earn, yes; easiest to lose, no. There is data backing this up, showing that reputation is a trailing indicator that takes 10-20 years to be updated. To use an actual example, the Notre Dame football team went through a twelve-year slump in which it finished outside the top 15 every season, only to be ranked in the top 5 at the beginning of the next season.

    In fact the NCAA coaches poll has been a subject of study because it makes it easy to measure how wrong reputation can be. In almost every season there are teams in the top 25 whose computer-ranked position is six or more spots away. Certain teams, like Notre Dame, are consistently ranked higher than they deserve, while other teams, like Boise State, are consistently underrated.

    Google search has an objective, quantitative goal - make the most money for Google.

    You are really stretching the facts here to make a point. And by the way, at no time does "making money" come into the training of the ranking function. The training uses subjective data about what is good and trains the ranker on that. The same training is done by researchers in Information Retrieval and routinely published in SIGIR. Metrics are subjective and measured using user studies, where users subjectively tell you: I like page A but not page B.

    Really, if you have no clue how Google in particular and search rankings in general work, please refrain from commenting. It is getting embarrassing to see otherwise good researchers declare authoritatively what can and cannot be done with metrics when they have not spent a minute in their life thinking about ranking functions.

    I've published on the subject (sports ranking, academic ranking, information retrieval ranking) and know very well how far sabermetrics can go and their eventual limitations. As in search ranking, one can train the classifier against the gold standard of a committee of notables actually spending the time to look at each department and each researcher and come up with an overall score. In my experience it would be trivial to beat the US News ranking with such a system.

  11. Anon @ 7:39am - concerning the football program rankings, a time lag effect is different from a "rich getting richer" effect. So, the rest of your comments don't apply.

    Concerning Google Search, you are (again) confusing the goal and the steps you take to achieve the goal. The purpose of Google (the company) is objective and quantitative - to make the most money for shareholders. The purpose of Google Search (the product) is to help Google make the most money. Employees can then have various debates and use various techniques (statistics being one of them) to achieve this goal. For instance, they may decide that keeping search results and advertising results separate will help make the most money in the long run, because people will trust the search engine more. They may improve the search rankings using user studies, etc.

    With CS departments, there isn't an objective, quantitative measure that tells you which department is the best. This has been repeated many times already, so I'm not sure what else one can do to get the point across.

    About training a classifier to emulate a committee of notables - why do that, when you can just ask the committee of notables? It's not like we have thousands of top CS departments.

    I don't know why you added phrases like "if you have no clue ... please refrain from commenting" and "I've published on the subject" - these are just generic appeals to authority and don't add to the discussion.

  12. Reputation based rankings are just the best of many imperfect/terrible options.

    I wonder if the author is going to retract "In the end, the value of a program is its reputation" and "Reputation-based rankings can best capture the relative strengths of academic departments in what really matters."

    I wouldn't want to have to stand behind those comments as a scientist. :)

  13. Anonymous @ 11:38 - You claimed that reputation was easiest to lose, I showed otherwise. I don't see how the "rich getting richer" has anything to do with the point I made.

    "these are just generic appeals to authority and don't add to the discussion."

    No, they are appeals to authority in the face of enormous ignorance about how ranking is done. However, they seem to have made no impact: you seem to think that simply repeating the fact that Google makes money somehow makes the training of the ranking function different from what it actually is.

    There's no content in your argument. I'm signing off.

  14. There might not be a consensus about what is *the* objective function, but that does not mean there are no objective functions.

    For undergraduate applicants, at whom the US News ranking is aimed, there are a few important ones, and we can make good predictions for them. For example, for many it is the expected income after graduation. For some it is the likelihood of becoming a millionaire. For those with a love of science it is finding a good tenured position (in which case I would tell them that going to MIT doubles their chance compared to other top schools, and if you can attend a top-5 school you should do so because it more than quadruples your chance). Longer-term objectives like impact on the national economy or human quality of life are more difficult to measure. And of course there are objective functions that we don't know how to measure.

  15. Can the community adopt a metric-based ranking without a central authority? It seemed the last attempt suffered from a COI.

    (I find the discussion so far thoughtful. I hope Anonymous @ 1:04 signed off that particular sub-thread only.)

  16. Lior Pachter looked into the specific weights given to different parameters in the US News ranking. As one might expect, a different weighting scheme may result in different scores. Interestingly, 99% of the possible weightings resulted in one of four possible orderings of the top three, with more and more disagreement the further down the scale one goes.

    One sensible solution is to give departments a confidence interval stating that "in 85% of the weightings this department ranked between 5 and 14".
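The confidence-interval idea in the comment above can be sketched directly: sample many random weightings of the criteria, rank under each, and report the range of ranks each department attains. The department names and per-criterion scores below are entirely made up for illustration; they are not US News data.

```python
# Weighting-sensitivity sketch: rank toy departments under many random
# convex combinations of three criteria, then report an 85% rank interval.
import random

# Hypothetical per-criterion scores for four departments.
scores = {
    "A": (9.0, 7.0, 8.0),
    "B": (8.0, 9.0, 6.0),
    "C": (6.0, 6.5, 9.0),
    "D": (5.0, 5.0, 5.5),
}

def rank_under(weights):
    """Return {department: rank} for one weighting (1 = best)."""
    total = {d: sum(w * s for w, s in zip(weights, vals))
             for d, vals in scores.items()}
    ordered = sorted(total, key=total.get, reverse=True)
    return {d: i + 1 for i, d in enumerate(ordered)}

random.seed(0)
ranks = {d: [] for d in scores}
for _ in range(1000):
    w = [random.random() for _ in range(3)]
    s = sum(w)
    w = [wi / s for wi in w]          # normalize to a convex combination
    for d, r in rank_under(w).items():
        ranks[d].append(r)

for d in scores:
    rs = sorted(ranks[d])
    lo, hi = rs[int(0.075 * len(rs))], rs[int(0.925 * len(rs))]
    print(f"{d}: rank between {lo} and {hi} in 85% of weightings")
```

Note that a department dominated on every criterion (like "D" here) gets a degenerate interval, while departments that trade off strengths across criteria get genuinely wide ones, which is exactly the disagreement the comment describes.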

  17. I do not understand why people are so upset with "numeric-based" rankings. I agree with Luca, etc., that such rankings are sometimes (maybe even often?) erroneous, and occasionally make the reader say "No way X is better than Y". So what? In the majority of cases, as long as the "objective and formally defined" score function is not "crazy", good schools still come out on top and bad schools at the bottom. And when things go wrong, it's actually interesting to find an explanation why.

    Reading all the comments, it appears the major objection is not scientific, but emotional. People don't like the word "Best". So do not use the word "best". For each ranking, say "Best school according to number of pubs at these venues", or "Best school according to number of tenure-track offers to its graduates", etc.
    And then the decision makers simply take a particular ranking into account, together with other rankings, reference letters, "reputation", etc.

    Perhaps, being a cryptographer, I always live by the rule "Extra information cannot hurt, as long as you understand what this information is". In particular, whatever cool decision you can make without extra information (e.g., particular ranking), you can simulate with having the information, by simply ignoring it :).

    So let's ease up and stop attacking ranking systems. Instead, let's constructively criticize them and try to improve them. Also let's produce a lot of diverse ranking criteria, which will help eliminate errors and make it much harder for weak departments to score consistently close to the top.

    Finally, I like rankings because they help keep people/departments "honest". If you get a great reputation, it is very hard to lose it, even if most of the faculty suddenly become dead wood. Good students still go to the top school and help "cover up" lazy faculty. Having an objective (alas, imperfect) measure might be one weak but somewhat effective way for "historically weaker but objectively stronger" schools to start pointing out that "Hey, X is no longer such a hot shot. In contrast, we improved dramatically, at least according to this objective metric." If such a claim is premature, people will make fun of the person/department claiming it, so there will be some deterrent against blindly using rankings to promote yourself too soon. But, if done with modesty and proper disclaimers, I do not see why such self-promotion is bad.

  18. occasionally makes the reader say "No way X is better than Y"

    I've been following rankings like that for a long time. About 1 out of 10 times, upon close examination of departments X and Y, the outcome is the reverse: "gee, I hadn't realized that X has moved up so much/Y has fallen so far".

    This is why I care to follow the rankings (reputation based or otherwise) despite their numerous flaws. If they were all noise, what's the point?