
Let the Data Talk
April 9, 2010

Posted by Sarah in Uncategorized.

Here’s a report from The Economist about the new data economy. It’s interesting and gets many things right.

Extracting knowledge from data is what I’m interested in as far as mathematical research goes. How humans extract knowledge from sense data is perhaps the central question of philosophy and of neuroscience. And, on a social level, I’m interested in what will happen when we manage to tap the wealth of medical, commercial, communications, and intelligence data that’s all around us but has heretofore been too big to handle. It wouldn’t be an exaggeration to call this a revolution.

Take medical records. Most people, when they think of electronic medical records, assume they mean more efficient bookkeeping. They do, but that’s not the half of it. Medical research has, up till now, been conducted in small, experimental trials. But if you had a searchable database of every patient in the US and their treatments, you could run correlations on an enormous scale. You could determine what treatments worked, what treatments didn’t go well together, what the average cost-benefit ratio was … all without spending a cent or getting out of your seat. It would transform medicine. It might even replace the intuitive skills of the doctor as diagnostician; she could run your symptoms against the database and give you empirical probabilities of what you’ve got. And, of course, the same transformations would be at work on the insurance end. Privacy, as always, would be at risk, if not totally forfeited. A wise doctor once told me that other doctors had *completely failed* to grasp the magnitude of what’s coming, and that the people who understood it early would be at a huge advantage.
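To make that concrete, here’s a minimal sketch of the kind of query such a database could support. The records, symptoms, and diagnoses below are hypothetical stand-ins; a real system would do the same counting over millions of actual patient records.

```python
from collections import Counter

# Hypothetical records: (set of reported symptoms, eventual diagnosis).
# A real system would draw these from a national medical database.
records = [
    (frozenset({"fever", "cough"}), "flu"),
    (frozenset({"fever", "cough"}), "flu"),
    (frozenset({"fever", "rash"}), "measles"),
    (frozenset({"cough"}), "cold"),
    (frozenset({"fever", "cough"}), "cold"),
]

def empirical_probabilities(symptoms):
    """P(diagnosis | patient reports at least these symptoms),
    estimated by plain counting over matching records."""
    matches = Counter(dx for sx, dx in records if symptoms <= sx)
    total = sum(matches.values())
    return {dx: n / total for dx, n in matches.items()} if total else {}

print(empirical_probabilities({"fever", "cough"}))
# {'flu': 0.666..., 'cold': 0.333...}
```

No theory of disease anywhere in there, just counting at scale.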

Data speaks. It’s not just a library accessible by card catalog. Given the right tools, it narrates, it emphasizes, it reveals salient features. From the Economist article:

Sometimes those data reveal more than was intended. For example, the city of Oakland, California, releases information on where and when arrests were made, which is put out on a private website, Oakland Crimespotting. At one point a few clicks revealed that police swept the whole of a busy street for prostitution every evening except on Wednesdays, a tactic they probably meant to keep to themselves.

There’s a guiding principle here: let the data tell you what to look for. It’s maybe analogous to Hayek’s knowledge problem — maybe it’s the solution to that problem. No individual, coming up with categories and questions and definitions, can give a better picture of what’s going on than the data themselves do.
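As a toy version of the Oakland example: given nothing more than a list of arrest dates (hypothetical ones below), counting by weekday surfaces the Wednesday gap on its own, with no one deciding in advance that weekday is the interesting variable.

```python
from collections import Counter
from datetime import date

# Hypothetical arrest dates: four weeks of nightly sweeps, skipping Wednesdays.
arrests = [date(2010, 4, d) for d in range(1, 29)
           if date(2010, 4, d).weekday() != 2]

by_weekday = Counter(d.strftime("%A") for d in arrests)
for day in ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]:
    print(f"{day:9} {by_weekday[day]}")
# Wednesday prints 0 -- the gap itself is the finding.
```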

For example: Google’s spell-checker, which auto-corrects words based on what most people type, does far better than spell-checkers that try to encode dictionaries and rules of grammar. Google treats language as it is actually used. It’s learning to do the same with voice recognition and translation (link here).
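Google’s internals are proprietary, but the frequency-driven idea can be sketched in a few lines, in the spirit of Peter Norvig’s well-known toy spelling corrector. The corpus here is a tiny hypothetical stand-in for logs of what millions of people actually type.

```python
import re
from collections import Counter

# Tiny stand-in corpus; the real signal comes from what people actually write.
corpus = "the quick brown fox jumped over the lazy dog the quick fox"
WORDS = Counter(re.findall(r"\w+", corpus.lower()))

def edits1(word):
    """All strings one simple edit (delete/swap/replace/insert) from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    swaps = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """The most frequently observed candidate wins: no dictionary,
    no grammar rules, just counts of observed usage."""
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)

print(correct("teh"))  # -> 'the'
```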

Amazon doesn’t identify your preferences in books by labeling them with pre-set categories (“horror,” “drama”) but looks at the correlations between what you browse and what other people buy after browsing the same books. It doesn’t presume to decide what the important categories are. It lets the data talk.
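A minimal sketch of that idea, with hypothetical browse-then-buy sessions: there are no genre labels anywhere, only co-occurrence counts.

```python
from collections import defaultdict

# Hypothetical sessions: (book browsed, book bought afterwards).
sessions = [
    ("dracula", "frankenstein"),
    ("dracula", "frankenstein"),
    ("dracula", "the_shining"),
    ("hamlet", "macbeth"),
]

# Count how often each purchase follows each browse -- no pre-set categories.
co_counts = defaultdict(lambda: defaultdict(int))
for browsed, bought in sessions:
    co_counts[browsed][bought] += 1

def recommend(item, k=2):
    """Rank items by how often they were bought after browsing `item`."""
    counts = co_counts[item]
    return sorted(counts, key=counts.get, reverse=True)[:k]

print(recommend("dracula"))  # ['frankenstein', 'the_shining']
```

Real recommenders normalize these counts (cosine similarity, conditional probabilities) so bestsellers don’t dominate everything, but the principle is the same: the categories emerge from behavior.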

Yahoo used to “organize” the web into categories — sports, science, entertainment, and so on. The problem was, pre-selected categories didn’t capture all the things that people wanted to do with the web. That approach failed.

Pre-set categories are the opposite of spontaneous order, of letting the data talk. They artificially define the dimensions of the data set — but those artificially defined dimensions may not be orthogonal and may not be the dominant, “leading” dimensions. They don’t tell you the “real” shape of the data; they’re arbitrary, like inches.

We still do statistical studies, in polling and in the social sciences, along a few pre-selected categories. What is your race? What is your gender? If the categories the researcher picks happen not to be the important ones, then we haven’t learned anything. I can imagine, vaguely, someday doing polls that instead looked at your Facebook profile and your entire Internet history, huge amounts of data, and then singled out whatever was important. It might turn out that we’ve been looking at race and gender and age, but what really matters to your political views is whether you search for porn a lot or not.

In practice, “letting the data talk” usually means making a big matrix of correlations or covariances and looking at its eigenvalues. The big eigenvalues are the big dimensions, the “important” things, the factors that predominantly shape the data. There are endless variations on this theme — Google’s own PageRank is one. In the natural sciences, data analysis has often been done with ab initio estimates — scientists develop algorithms to detect features they already expect to see. Sometimes that’s the only option we have, but I think we’re going to move towards a philosophy of avoiding such preconceived categories.
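Here’s a small numpy illustration of that recipe on synthetic data: six measured variables secretly driven by a single hidden factor. The eigenvalues of the covariance matrix reveal the one-dimensional structure without being told where to look.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 respondents, 6 measured variables, but the real
# structure is one hidden factor plus a little noise.
hidden = rng.normal(size=(500, 1))
loadings = np.array([[2.0, 1.5, 1.0, 0.1, 0.1, 0.1]])
X = hidden @ loadings + 0.3 * rng.normal(size=(500, 6))

# "Let the data talk": eigendecompose the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # sort descending instead
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(np.round(eigvals, 2))        # one eigenvalue dwarfs the other five
print(np.round(eigvecs[:, 0], 2))  # the leading eigenvector recovers which
                                   # mix of variables defines that axis
```

PageRank is the same move in a different costume: the ranking is the leading eigenvector of a matrix built from the link structure of the web.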

This is interesting science and technology, but I’d like to argue that it’s far more than science and technology. It should change the way we think, on an epistemic level.