Attack of the Super Crunchers: Adventures in Data Mining

There’s been plenty written about the success that can be generated by an effective algorithm. Google and scores of other businesses thrive in large part because they are masters of the algorithmic mindset, gathering and analyzing data in ways previously thought impossible. As consumers, we’ve become accustomed to reaping the benefit of this revolution: using Google to find practically anything, heading to the discount airfare site that guarantees the “absolute lowest” rates, clicking through Amazon’s personal book recommendations.

Ian Ayres, Yale Law School professor, Forbes columnist, and data fanatic, has now written a book on data mining, Super Crunchers: Why Thinking-By-Numbers Is the New Way to Be Smart. (Full disclosure: Levitt is a friend and collaborator of Ayres, and he blurbed the book; Ayres also discusses Freakonomics and other research by Levitt in the book.)

Ayres writes about “a new breed of number crunchers … who have analyzed large datasets to discover empirical correlations between seemingly unrelated things.” These include hospitals predicting physician cleanliness based on infection rates and credit card companies examining a customer’s charge history to determine whether he or she will get divorced. Besides their usefulness in consumer transactions, Ayres argues, regression and data analysis can predict outcomes far better than we can, and have already had a huge impact on human behavior. Below are a few excerpts.

On determining the presence of racial discrimination in auto loan rates:

While most consumers now know that the sales price of a car can be negotiated, many do not know that auto lenders, such as Ford Motor Credit or GMAC, often give dealers the option of marking up a borrower’s interest rate. When a car buyer works with the dealer to arrange financing, the dealer normally sends the customer’s credit information to a potential lender. The lender then responds with a private message to the dealer that offers a “buy rate” — the interest rate at which the lender is willing to lend. Lenders will often pay a dealer — sometimes thousands of dollars — if the dealer can get the consumer to sign a loan with an inflated interest rate …

In a series of cases that I worked on, African-American borrowers challenged the lenders’ markup policies because they disproportionately harmed minorities. [Vanderbilt economist Mark] Cohen and I found that on average white borrowers paid what amounted to about a $300 markup on their loans, while black borrowers paid almost $700 in markup profits. Moreover, the distribution of markups was highly skewed. Over half of white borrowers paid no markup at all, because they qualified for loans where markups were not allowed. Yet 10 percent of GMAC borrowers paid more than $1,000 in markups and 10 percent of the Nissan customers paid more than a $1,600 markup. These high markup borrowers were disproportionately black. African-Americans were only 8.5 percent of GMAC borrowers, but paid 19.9 percent of the markup profits….

These studies were only possible because lenders now keep detailed electronic records of every transaction. The one variable they don’t keep track of is the borrower’s race. Once again, though, technology came to the rescue. Fourteen states … will, for a fee, make public the information from their driver’s license database — information that includes the name, race and Social Security number of the driver.
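To make the disproportion in the excerpt concrete, here is a minimal back-of-the-envelope sketch in Python. It uses only the two GMAC percentages quoted above. The function name `representation_ratio` is our own illustration, not something from Ayres's book or the underlying studies. A ratio of 1.0 would mean a group pays markup profits exactly in proportion to its share of borrowers.

```python
def representation_ratio(share_of_outcome: float, share_of_population: float) -> float:
    """Return how over- or under-represented a group is in an outcome.

    Both arguments are percentages (or both fractions); only their ratio matters.
    This helper is hypothetical and exists purely for illustration.
    """
    if share_of_population <= 0:
        raise ValueError("share_of_population must be positive")
    return share_of_outcome / share_of_population


# Figures quoted in the excerpt for GMAC borrowers.
black_share_of_borrowers = 8.5       # percent of GMAC borrowers
black_share_of_markup_profit = 19.9  # percent of markup profits paid

ratio = representation_ratio(black_share_of_markup_profit, black_share_of_borrowers)
print(f"Share of markup profits relative to share of borrowers: {ratio:.2f}x")
```

By this rough measure, African-American borrowers accounted for a bit more than twice their proportional share of GMAC's markup profits. The calculation says nothing about why, which is the question the regression work in the cases was meant to address.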

On the campaign of Don Berwick, a pediatrician and president of the Institute for Healthcare Improvement, to change hospital practices to follow the results of data analysis (a topic that Dubner and Levitt addressed here):

In December 2004, [Berwick] brazenly announced a plan to save 100,000 lives over the next year and a half. The “100,000 Lives Campaign” challenged hospitals to implement six changes in care to prevent avoidable deaths. He wasn’t looking for subtle or sophisticated changes. He wasn’t calling for increased precision in surgical operations. No … he wanted hospitals to change some of their basic procedures. For example, a lot of people after surgery develop lung infections while they’re on ventilators. Randomized studies showed that simply elevating the head of the hospital bed and frequently cleaning the patient’s mouth substantially reduces the chance of infection. Again and again, Berwick simply looked at how people were actually dying and then tried to find out whether there was large-scale statistical evidence showing interventions that might reduce these particular risks ….

Berwick’s most surprising suggestion, however, is the one with the oldest pedigree. He noticed that thousands of ICU patients die each year from infections after a central line catheter is placed in their chests. About half of all intensive care patients have central line catheters, and ICU infections are deadly (carrying mortality rates of up to 20 percent). He then looked to see if there was any statistical evidence of ways to reduce the chance of infection. He found a 2004 article in Critical Care Medicine that showed that systematic hand-washing (combined with a bundle of improved hygienic procedures such as cleaning the patient’s skin with an antiseptic called chlorhexidine) could reduce the risk of infection from central-line catheters by more than 90 percent. Berwick estimated that if all hospitals just implemented this one bundle of procedures, they might be able to save as many as 25,000 lives per year.

On predicting the success of law review articles (measured in subsequent mentions from other articles):

As a law professor, my primary publishing job is to write law review articles. I don’t get paid for them, but a central measure of an article’s success is the number of times the articles have been cited by other professors. So with the help of a full-time number-crunching assistant named Fred Vars, I went out and analyzed what caused a law review article to be cited more or less. Fred and I collected citation information on all the articles published for fifteen years in the top three law reviews. Our central statistical formula had more than fifty variables. Like Epagogix [a group that created an algorithm intended to predict whether a movie will be successful based on characteristics of its script], Fred and I found that seemingly incongruous things mattered a lot. Articles with shorter titles and fewer footnotes were cited significantly more, whereas articles that included an equation or an appendix were cited a lot less. Longer articles were cited more, but the regression formula predicted that citations per page peak for articles that were a whopping fifty-three pages long….

Law review editors who want to maximize their citation rates should also avoid publishing criminal and labor law articles, and focus instead on constitutional law. And they should think about publishing more women. White women were cited 57 percent more often than white men, and minority women were cited more than twice as often.


COMMENTS: 42

  1. DrNova says:

    Thanks for the informative blog. No doubt the “way of the algorithm” is justified when lives are saved through behavior modification in hospitals.

    An unfortunate outcome in pragmatic America, however, will likely be an ADDICTION to this “way” as “the only way.”

    Especially disturbing is the thought that the “way of the algorithm” will lead to uniform articles in medical journals or law journals, based on the “algorithm” that will “get” the author the most fame, renown, attention, or other desirable appeal to human vanity.

    I guess the coming wave of academic papers by students in top-tier universities will be churned out in the “way of the algorithm,” dealing more death yet to spontaneous human genius, checked by human discipline, in delivery of the word.

    Let’s see the algorithm predicting how much less fruit the trees will bear, when addiction to algorithms determines performance.

    The “new new thing” is always the “saving grace”–the same kind of “saving grace” that television was to General Sarnoff at RCA in the earliest days. There can be no doubt that the “way of the algorithm” will descend to its lucrative outcome as a tool for commercial and political propaganda, following in TV’s footsteps.

    Is this not the “way of all flesh?”

    God bless The New York Times for this platform for free speech.

    (ISAIAH 55, JOHN 21:17)

  3. Blue Sun says:

    One critical problem with attempting to use data-mining to build a real-world picture is making sure that your algorithm considers all of the relevant factors and places appropriate weighting on them.

    In the lending example, did the data-miners factor in the borrowers’ incomes, family size and stability, past loan history and general credit history, make of car, model and price of car, geographical region, or dozens of other factors that might have affected the decisions of the dealers and lenders?

    I once read a study that found that a disproportionate percentage of trash incinerators were located in Black neighborhoods. For weeks, leaders in the Black community were expressing their outrage. When others checked the data, however, they found that once neighborhoods were compared by median income, a poor White neighborhood was just as likely to have an incinerator as a poor Black one. It turned out that the deciding factor was not race but poverty and the resulting powerlessness against the local government.

    We must always be careful not to read what we expect to see from incomplete correlations, or to confuse correlation with causation.

  5. Jarret says:

    Interesting article. Also interesting comments. I think that using algorithms for such things as deciding which paper to publish really only works until the algorithm is commonly adopted. Once it becomes ubiquitous, it becomes analogous to any sort of arb trading when everyone uses the same model: efficacy vanishes. If every paper were predictably the same, those papers would bore people, and the predictive ability of the algorithm would be sharply reduced going forward. I guess it’s called being human. Thank God.
