Thursday, June 04, 2009

A Failure of Social Networks?

One of Northwestern's professors made quite a splash a month ago with his group's predictions on the spread of the Swine Flu (now called H1N1). Dirk Brockmann claimed a worst-case scenario of 1700 flu cases by the end of May. Another group in Indiana made similar predictions. The research got picked up in a front-page New York Times article. Brockmann's group used data from Where's George, a site that tracks the movement of dollar bills, to model the human interactions that would spread the disease.

So how did these groups do? Not even in the right ballpark: they were off by roughly two orders of magnitude. According to a follow-up article in Tuesday's Science Times, the CDC estimated there were well over 100,000 cases by the end of last month. What went wrong? In short, Brockmann blames faulty initial data. Nevertheless, I always worry that bad predictions from scientists make it harder for the public to trust us when we really need them to.
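
As a quick check on the "two orders of magnitude" claim, the gap can be computed from the two figures above. This is only a back-of-the-envelope sketch: the variable names are mine, and 100,000 is used as a lower bound because the CDC estimate was "well over" that.

```python
import math

predicted_cases = 1_700      # Brockmann's worst-case figure for the end of May
estimated_cases = 100_000    # lower bound; the CDC estimate was "well over" this

# How many times larger the estimate is than the prediction.
ratio = estimated_cases / predicted_cases

# The same gap expressed in powers of ten.
orders_of_magnitude = math.log10(ratio)

print(f"underestimate factor: at least {ratio:.0f}x")
print(f"orders of magnitude: at least {orders_of_magnitude:.2f}")
```

At exactly 100,000 cases the factor is a little under 60, so a full two orders of magnitude would need the true count to be around 170,000.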

Might this be an instance where prediction markets greatly outperformed the experts? In short, no. There were relevant markets, but they had two big problems:
  • No one thought to create a market for flu case counts above a couple of thousand.
  • Prediction markets require a verifiable outcome, so they were based on CDC-confirmed cases. But after the flu turned out not to be that dangerous, the CDC stopped confirming most cases, and there were fewer than 7500 confirmed cases by the end of May.

5 comments:

  1. I agree with you that whereisgeorge data is faulty for the purpose. But your blog post has the benefit of hindsight, unless I missed an earlier post where you pointed this out.

    How is dollar movement more relevant than other data, such as airline passenger data? It may also be possible for the academics to ask for anonymized credit card swipe data to figure out how people circulate.

    Since dollars mostly move within the US, and most US cell-phone plans are nationwide, even a cell-phone company might be able to model how people circulate.

    So I can't see why dollar circulation data is relevant. Since only a very small fraction of people are likely to enter the serial numbers of their dollar bills on a website, the site's data is likely to be very noisy. For example, I wouldn't expect more than 1% of people to enter this data. Even among that 1%, they may enter it only once in a while, say 10% of the time. That is 1% × 10% = 0.1%, so if a dollar bill changes hands 1000 times, it is entered about once. If you are sampling a path at 1 in 1000 hand-offs, you are going to miss major itineraries. If you check the ranking of the whereisgeorge site on Alexa, you will see that a 1 in 1000 sampling rate may be too generous by a factor of 10 to 100. The actual sampling rate could be 1 in 10K to 1 in 100K.

  2. http://www.hubdub.com/m40657/How_many_people_in_the_USA_will_be_infected_with_the_Swine_Flu_by_June_30th_2009 ?

    that's more than a few thousand?

  3. How is dollar movement more relevant than other data, such as airline passenger data? It may also be possible for the academics to ask for anonymized credit card swipe data to figure out how people circulate.


    I hope no airline is ever willing to give out such data. Given a randomly selected record, it would be essentially impossible to guess which person it was from. However, given partial information about a person's travel history, one might be able to identify their records and learn more. This isn't super-sensitive information, for most people, but I still don't want it sitting on some random university hard drive.

  4. Dollar circulation data does not tell you that the person who took a bill from A to B is the same person who took it from B to C.

    All an airline has to provide is that, on a given date, so many people traveled from city A to city B. This is more relevant than dollar bill circulation data, and it raises few privacy concerns.

    You would be surprised how (relatively) easily data is made available to university researchers on request, compared with industrial researchers.

    Data on the number of people traveling by air must also be compiled by federal agencies, given its importance to air safety.

    Basically, dollar circulation data is not useful for the purpose. Yes, if you are estimating economic indicators such as the velocity of money, then it could be useful. But even then, the sampling rate of this website is likely to be so low that its data will be very noisy.

  5. The study of social networks, especially by physicists or computer scientists, could really use a lot more hard data and rigor. It's a hype-driven field right now.

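
The sampling-rate estimate in the first comment is simple arithmetic and can be sketched as follows. The 1% and 10% figures are that commenter's guesses, not measured values, and the function name is hypothetical.

```python
def logging_rate(fraction_of_users: float, fraction_of_time: float) -> float:
    """Chance that a single hand-off of a bill gets logged on the site,
    assuming the two fractions are independent (an assumption)."""
    return fraction_of_users * fraction_of_time


# Commenter's guesses: 1% of people ever log bills, and they do so 10% of the time.
rate = logging_rate(0.01, 0.10)

# Expected number of hand-offs between two logged entries for the same bill.
handoffs_per_entry = 1 / rate

print(f"logging rate per hand-off: {rate:.4f}")
print(f"about one entry per {handoffs_per_entry:.0f} hand-offs")
```

Under these guesses a bill is logged about once per 1000 hand-offs. The comment's suggested correction of 10 to 100 times would push that to one entry per 10,000 to 100,000 hand-offs.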