The Dangers of Too Much Data

Wondering whether aspirin will protect your heart or cause internal bleeding? Or whether you should kick your coffee habit or embrace it? It’s often hard to make sense of the conflicting advice that comes out of medical research studies. John Timmer explains that our statistical tools simply haven’t kept up with the massive amounts of data researchers now have access to. In medical (and economic) research, scientists claim a “statistically significant” finding if there’s a less than 5% chance that an observed pattern (between coffee and liver disease, for example) occurred at random. In the new age of data, that rule causes problems: “Even given a low tolerance for error, the sheer number of tests performed ensures that some of them will produce erroneous results at random.” In lay terms, all those new tests you get at the doctor’s office are translated into data sets, which researchers then pore over searching for connections and patterns. And, if you have enough data to examine, eventually you’ll find a statistically significant relationship where no such relationship actually exists, by sheer coincidence. (HT: Matthew Rotkis)
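
To put a rough number on that, here is a minimal sketch in Python. It assumes the tests are independent and that no real relationship exists in any of them, which is a simplification; the function name is made up for illustration.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests
    comes out "significant" at level `alpha` when no real effect exists."""
    # Each test avoids a false alarm with probability (1 - alpha);
    # all of them avoid it with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 10, 100):
    print(m, round(chance_of_false_positive(m), 3))
```

Under those assumptions, the chance of at least one spurious "finding" is 5% for a single test, roughly 40% for ten tests, and over 99% for a hundred.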

COMMENTS: 32

  1. sal says:

    that is not the problem with an abundance of data, nor is it a problem with our statistical tools. it is a problem with how people inappropriately apply those tools to the abundant data.

  2. Dzof says:

    Can somebody please explain this to me? To my understanding, the author is saying this: the more data you have, the more likely you’ll be able to find a statistically significant relationship.

    Let’s say that your hypothesis was that people who read Freakonomics are more likely to eat spinach.

    Then I agree that the larger your sample size of readers, the more people you will find who read Freakonomics and eat spinach.

    But isn’t that offset by the large number of people who don’t read Freakonomics but also eat spinach?

    i.e. that P(ppl who eat spinach | they read Freakonomics) won’t really change.

    Which means that the significance test actually won’t be different either?

    Or am I missing something?

  3. Mark B. says:

    Dzof, significance tests are computed with the help of the “standard error”, a measure that depends in part on the sample size. The larger the sample size, the smaller the standard error, and the better the chance of finding a statistically significant result (a result that differs from zero).

    I disagree with the post’s author: this is not a problem with having lots of data, or with doing lots of tests on data. Instead, it’s a problem with relying (almost) exclusively on p-values as tests of statistical significance rather than on a combination of p-values and effect sizes.
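
The point about standard errors can be sketched with a simple z-test. This is only an illustration: it assumes a known standard deviation of 1 and a tiny, practically unimportant difference of 0.02, and both numbers are invented for the example.

```python
import math
from statistics import NormalDist


def two_sided_p(mean_diff: float, sd: float, n: int) -> float:
    """Two-sided p-value for an observed mean difference, z-test with known sd."""
    standard_error = sd / math.sqrt(n)  # shrinks as the sample grows
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same tiny effect size, very different sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(mean_diff=0.02, sd=1.0, n=n))
```

The effect size never changes, but the p-value drops below 0.05 once the sample is large enough, which is why a p-value on its own says little about whether a difference matters.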

  4. Robert Grant says:

    @Dzof

    I think the point is that if you looked at the set of people who read Freakonomics, the probability that they will also all eat spinach is very low, but the probability that they all do *something* (own a dog/use Firefox/go jogging three times a week/drink Pepsi) gets higher the more things you compare them against.

    I.e. groups will sometimes overlap strongly even with there being no reason for it, if enough groupings are measured. I think it’s just the standard correlation != causation thing, restated.

  5. johnd says:

    Dzof,

    The problem isn’t with data sets containing many data points. Those data are, and will always be, the most statistically reliable.

    The point of the article is that we are generating a lot of data sets with a modest number of data points. If you have thousands of data sets, at least a few of them will erroneously show a statistical relationship at the 5% significance level by pure chance.
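
A small simulation makes this concrete. It is a sketch under stated assumptions: every data set is pure noise (standard normal draws, so there is no real effect), each has 30 points, and each is checked with a z-test against a mean of zero. The sizes and the seed are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist, mean


def count_false_positives(num_datasets: int = 1000,
                          points_per_dataset: int = 30,
                          alpha: float = 0.05,
                          seed: int = 0) -> int:
    """Count how many pure-noise data sets look "significant" at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_datasets):
        sample = [rng.gauss(0.0, 1.0) for _ in range(points_per_dataset)]
        # z-test of the sample mean against 0, with the true sd (1) known
        z = mean(sample) * math.sqrt(points_per_dataset)
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        if p_value < alpha:
            hits += 1
    return hits


print(count_false_positives(), "of 1000 noise-only data sets passed the 5% test")
```

With these settings, the count should land somewhere near 50, that is, about 5% of the data sets, even though none of them contains a real relationship.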

  6. Mark says:

    What the author is trying to say is:
    When CNN says that a new study shows a link between consuming 5 cups of coffee per day and liver disease, then CNN actually means that the study found a “statistically significant” relationship between consuming 5 cups of coffee per day and liver disease. However, the standard threshold for deciding that something is “statistically significant” is that there is a less than 5% chance the link observed in the study occurred randomly.

    So looking at just one study, it is likely (95% chance) that such a link exists. However, if you have lots of studies each year that show some “statistically significant” relationship, then you would expect about 5% of those results to have been actually produced by randomness and not by any actual causal effect. Basically, if the threshold of “statistical significance” is a less than 5% chance of randomness causing the result, then out of 100 studies that show a relationship, you would expect the relationship espoused by the study to be wrong for 5 of the studies.

    It is probably not quite as simple as that since some studies will show a much less than 5% chance of randomness causing the relationship, but that’s the gist of the post.

  7. kip says:

    @Dzof: It’s more like this: if you have data for readers of thousands of books, and what they eat, and you run enough studies, eventually you are going to find a “statistically significant” correlation between readers of one book (say, Freakonomics) and eaters of some food (say, spinach). But it may simply be one of those 5% of cases where the pattern actually did occur at random.

  8. Nicole says:

    Great article. I’m a graduate student whose thesis work involves both genomic microarray and “wet lab” data. I do lots of work with datasets involving >50,000 probesets for each of 50-100 samples. There is always a balance between a low false discovery rate and setting cutoffs so stringently that real differences are missed.

    My advisor requires the following two rules be applied after the statistical algorithm of choice has spit out the top 1000 statistically significant associations.

    1) Plot the data graphically and ask if it passes the “eyeball test.” You should be able to glance at the graph and immediately see whether one group is dramatically different from the others. This works best if you do a “dot plot,” in which every sample/patient/mouse/whatever is represented by a single point, rather than a bar graph that compresses everything to averages.

    2) Differences that are real and biologically important do not require p-values, chi squared, FDR, fancy transformations or any other statistical test to be believable. If the graph is unimpressive and you have to squint to see that maybe group A is different from group B, but golly, the p value is 0.045... it’s not likely to be real.

    It follows from these two rules that there is no substitute for looking at the data or graphs in the original study. Science reporting would be a lot clearer to the general public if graphical representations were included with articles. Not likely, but I can dream, can’t I?
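
For readers unfamiliar with the false discovery rate (FDR) mentioned in this comment, one common way to control it is the Benjamini-Hochberg procedure. The sketch below is a generic illustration with made-up p-values, not the method used in the commenter's lab.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether it is kept as a discovery
    when the false discovery rate is controlled at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Compare the rank-th smallest p-value against its own threshold.
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank
    keep = [False] * m
    for i in order[:cutoff_rank]:
        keep[i] = True
    return keep


# Hypothetical p-values: four are below 0.05, but only two survive.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```

In this example the values 0.039 and 0.041 would each count as "significant" on their own, yet they are dropped once the number of tests is taken into account.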
