A reader named Evan Schumacher wrote in with an interesting bleg. (Read about blegs here and send your own here.)
Tucked inside his bleg is the part that tickled me the most: a website Evan created to tell him whether it’s worth it to watch a basketball game he’d recorded. Anyway, I’ll give my answer below, after his bleg.
I was wondering if too much data is ever a bad thing? I ask because I thought one of the rules of life that I’ve learned is that it’s best to have as much data as possible.
Whether it be hard numbers or smart people around, and at least when you are starting, you want as much information as you can get. The smart guys are the ones who know how to analyze it.
However, in my personal life I was having a problem with too much data. I watch all of the Warriors basketball games on DVR. However, nothing is worse than watching for 1.5 hours and in the end your team gets blown out. However, I never want to see the score before I watch because that ruins the game. To solve the problem I created a little website to warn me if the games are bad (www.shouldiwatch.com), but it won’t tell me anything about the outcome (who won or the score) if the game was relatively close. Trust me, as a Warriors fan this is a huge time saver. It’s a stupid little example, but it makes me wonder if there are other cases when you are doing research when you need to turn away from some information.
Anyway, it’s a little bit backwards to think of, but I thought it might be interesting to explore.
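The kind of check Evan describes can be sketched in a few lines. This is only an illustration: how the actual site decides is not stated, and the function name `should_i_watch` and the 15-point `blowout_margin` are hypothetical choices, not his.

```python
def should_i_watch(home_score, away_score, blowout_margin=15):
    """Return a spoiler-free verdict for a recorded game.

    Only the size of the final margin is used, so the answer never
    reveals who won or what the score was. The 15-point default is an
    assumed threshold for "blowout", not one taken from the site.
    """
    margin = abs(home_score - away_score)
    return "skip" if margin >= blowout_margin else "watch"


# Hypothetical final scores, for illustration only.
print(should_i_watch(101, 99))
print(should_i_watch(88, 121))
```

Because the verdict depends only on the absolute margin, swapping the two scores gives the same answer, which is what keeps the winner hidden.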
His question sounds as if it is directed more at quants than writers, but as a writer I’ll say that I face this dilemma daily. Right now, in the middle of our writing SuperFreakonomics, I’m facing a number of short sections that require a bunch of historical reading and research. But the key thing is that these sections remain short — they are not the donut in this case, but the donut hole, and if they start getting swollen they will turn the book into a flabby monster.
The problem is that the reading and research is so much fun that it is really hard to limit yourself. Especially in this age of Google (and Google Books) and Amazon and even Wikipedia (yes, I was an early detractor but have come around on certain subjects), I am constantly trying to take a little sip from a firehose, and it’s nearly impossible. Reading too much inevitably turns into wanting to write too much; in this case, shorter will be better, but it takes a lot of effort and a long time to get the right three paragraphs (as opposed to a much easier but, to my mind, less effective 12 paragraphs).
The problem is that the more I’ve read — and the more data I’ve consumed, to get back to Evan’s question — the better those three paragraphs will be in the end. It reminds me of making maple syrup, which we did every winter as kids. You’d run around collecting all this sap, gallons and gallons of it from the trees you’d tapped, and then stay up all night boiling it down on an open fire — all to produce one little jar of syrup.
Was it worth the effort? Some people would say yes, others no. But in any case, it sure tasted good.

“A wealth of information creates a poverty of attention.”
- economist Herbert Simon
Misterb…you got the wrong guy…you meant nuclear mom…however, I take issue with the marginal utility & needing to take into account the cost of transmission. As I stated when MU goes negative why would I need the slight added cost of transmission to make me toss the next bit out the door? Unless you’re paying me to get it. Interesting idea.
Sherlock Holmes thought that too much information cluttered the mind. He described the mind as something of a room that, if too many things are brought in, you cannot put your hand on what you want.
And so, according to Watson, Holmes was quite ignorant of many seemingly important things (e.g., astronomy, philosophy, etc., if I recall). This was done on purpose so that Holmes might fill his mind with only those things that pertained to his passion for solving human puzzles.
It’s not that it’s best to have as much information as possible. Rather, it’s best to have as much truthful information as possible.
The Internet is an information addict’s paradise, but let’s be honest, it’s choking on fluff, clutter, mistakes, misinformation and actual, out-and-out falsehood. Sites like Wikipedia are especially large repositories of information (some of it accurate, perhaps a lot of it), but they have been, and still can be, gamed by people who are, for whatever reason, motivated to manipulate them.
I too love to accumulate information, but must expend a great deal of time filtering it for veracity. The more information I get, the greater the effort I must make to filter out the detritus and lies. Eventually one reaches a point of diminishing returns, where one has acquired so much information that veracity-filtration either becomes impossible or is just too unreliable.
Too much data is a great thing in many ways. However, a lot more data means a lot more work and massive specialization of human capital. The result is that actual human beings end up being less well-rounded, which arguably has strong drawbacks.
For example, massive amounts of data mean that Dr. Wolfers can write about the economics of happiness. This can be great for his career. Toiling over such data might prevent him from spending time with friends and family, though. You see the trade-off.
Data is only relevant in the context of information (I would offend this forum if I were to expound on the differences between the two). So the real question is “Can you have too much information?” My answer is — Of course! You need THE RIGHT amount of information in any setting, and too much information can be just as bad as too little.
Here’s a fairly rigorous attempt at an answer:
On Feature Selection: Learning with Exponentially many Irrelevant Features as Training Examples, Andrew Y. Ng. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.
“… in the presence of many irrelevant features, the main source of error in wrapper model feature selection is from overfitting hold-out or cross-validation data.”
In other words, if you have more irrelevant data, you’re likely to find more statistical anomalies that may cause errors. If you’re careful, you can design algorithms that are quite robust to irrelevant data, though, as described in the paper above. In which case, if you have a suspicion that a piece of data might be relevant, then it will usually help a model to include it.
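A toy simulation shows the first half of that point. It is not the algorithm from the paper, just a minimal sketch under stated assumptions: every feature is a coin flip, the labels are coin flips too, and we keep whichever feature happens to agree best with the labels on a small hold-out set. The function name and parameters are made up for the illustration.

```python
import random


def best_holdout_accuracy(n_features, n_holdout=50, seed=0):
    """Best hold-out accuracy found among purely random binary features.

    Labels and features are independent coin flips, so no feature has
    any real predictive power; anything above 0.5 is a statistical fluke.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_holdout)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_holdout)]
        # Fraction of hold-out examples where the feature matches the label.
        accuracy = sum(f == y for f, y in zip(feature, labels)) / n_holdout
        best = max(best, accuracy)
    return best


for n in (1, 10, 100, 1000):
    print(n, best_holdout_accuracy(n))
```

The more irrelevant features you search through, the better the "best" one looks on the hold-out set, even though on fresh data it would do no better than chance. That is the overfitting of hold-out data the quote refers to.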
I used to know a stunningly attractive young woman with a filthy-rich and quite generous father. When she got bored daddy would buy her a business, or whatever she wanted. She would complain that when she gained premenstrual bloat, it was only in her bustline.
So yes, I guess you can have too much of anything. And there’s too much sand at the beach, and too much water in the sea, and sooooo many stars in the sky….
So be grateful for what you have, and if having too much data is your only complaint, come here and I’ll give you something to complain about….