The Dangers of Too Much Data

Wondering whether aspirin will protect your heart or cause internal bleeding? Or whether you should kick your coffee habit or embrace it? It’s often hard to make sense of the conflicting advice that comes out of medical research studies. John Timmer explains that our statistical tools simply haven’t kept up with the massive amounts of data researchers now have access to. In medical (and economic) research, scientists claim a “statistically significant” finding if a pattern as strong as the one they observed (between coffee and liver disease, for example) would turn up by chance less than 5% of the time when no real relationship exists. In the new age of data, that rule causes problems: “Even given a low tolerance for error, the sheer number of tests performed ensures that some of them will produce erroneous results at random.” In lay terms, all those new tests you get at the doctor’s office are translated into data sets, which researchers then pore over searching for connections and patterns. And if you have enough data to examine, eventually you’ll find a statistically significant relationship where no such relationship actually exists, by sheer coincidence. (HT: Matthew Rotkis)
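To put rough numbers on that, here is a minimal sketch of how false alarms pile up as the number of tests grows. It rests on two simplifying assumptions that are not from Timmer's piece: the tests are independent of one another, and every test looks at a relationship that does not really exist, so each one is modeled as a single uniform random draw. The function names are made up for illustration.

```python
import random

ALPHA = 0.05  # the conventional 5% significance threshold


def chance_of_any_false_positive(num_tests, alpha=ALPHA):
    """Probability that at least one of num_tests independent tests
    comes out 'significant' when no real relationship exists."""
    return 1.0 - (1.0 - alpha) ** num_tests


def count_false_positives(num_tests, alpha=ALPHA, seed=0):
    """Simulate num_tests tests where nothing real is going on.

    Assumption: with no real effect, a test's p-value is uniformly
    distributed on [0, 1), so each test is one uniform random draw.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(num_tests) if rng.random() < alpha)


if __name__ == "__main__":
    for n in (1, 20, 100, 1000):
        print(
            n,
            "tests: chance of at least one false alarm =",
            round(chance_of_any_false_positive(n), 3),
            "| simulated false alarms =",
            count_false_positives(n),
        )
```

Under these assumptions, a single test has a 5% chance of a false alarm, 20 tests have about a 64% chance of producing at least one, and 100 tests push that above 99%. Real studies often run correlated tests, so the exact figures will differ, but the direction of the effect is the same.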


COMMENTS: 32

  1. Rick says:

    The problem seems to lie not in having too much data, as having less wouldn’t make us better informed. Rather, it lies in accepting a 95% confidence level as the standard for medical analysis.


  2. John says:

    Reminds me of the Bible codes! If you have enough data, you’ll find some interesting random pattern!


  3. theDude says:

    As more studies are done, the percentage of Type I errors might not change (assuming everyone has been using the same significance level), but in absolute number terms there will be more results out there which are reporting faulty conclusions (Type I errors) because of the larger number of studies.

    As another poster mentioned, that doesn’t mean we should prefer fewer data sets or fewer studies. It just means that we need to be aware that Type I errors will exist and rigorously test the same hypotheses over time to double check results that are published in the media.


  4. Quin says:

    Reminds me of the WSJ’s provocative article: “Most Science Studies Appear to Be Tainted By Sloppy Analysis” (http://online.wsj.com/article/SB118972683557627104.html), which seems to apply to more than just medicine.


  5. GLK says:

    The problems don’t end with medicine. Mercurial statistical outcomes are everywhere. I never had a problem with them until politicians started using them to extort money from the befuddled hordes.


  6. Ben says:

    This is why you still need qualitative analysis. You need to be able to explain something, not just show a correlation. If you can only show a correlation and have no qualitative basis to use for analysis, you have little other than an impressive graph for a PowerPoint slide.

    What is unfortunate is that most people will probably believe the pretty graph on the PowerPoint slide more readily than the wall of text that would make up a qualitative analysis. This is a separate issue, however.


  7. gevin shaw says:

    Many of the reversals in medical advice so widely noticed over the years are a result of sloppy reporting. Correlations statistically significant enough to warrant research are reported as findings in lurid headlines rather than as announcements of research into a statistically significant correlation. A couple of years later, the findings of the actual research can then appear to refute a previous finding when they really just explain a previously reported correlation.

    I’d imagine researchers, in search of grants and budget increases and contract renewals, contribute to this in their press releases, not lying or deceiving, but playing up the more lurid and exciting aspects of the raw data.


  8. Michael F. Martin says:

    Why was this post not signed, I wonder?
