
Scholarship and morality May 2, 2010

Posted by Sarah in Uncategorized.

Two posts from Emily, a classmate of mine, “The Queer Activist — a Brief Observation” and “A Story of Emotional Poles” got me thinking. The posts are concerned with the question of whether it’s okay, whether it’s good enough, to spend your life as a scholar. Emily is more of a scholar than I am, and she’s also more of an activist, and when she cautiously, thoughtfully answers the question for herself in the affirmative, I think “Emily, why were you ever worried?”

But for myself, I do have unanswered questions. If I do math, whom do I serve? Whom do I help? I think this is work that does help, in the very long run, because anyone who wants to act based on empirical data will need to learn to interpret that data. Everyone who wants to change the world for the better needs to understand the world; and, perhaps, they need techniques for dimensionality reduction and denoising and so on. And we also need pure math to understand that. So mathematicians help people get the facts right, which is the basis of everything else. But, because it’s basic, it’s also distant.

I consider the fact that it’s all right to take into account one’s own abilities, education, and curiosities, in deciding the course of one’s life — we’re all certainly encouraged to do that — and mine very clearly point to math. I’m doing it because it’s beautiful and important to me and because I can do it. But is it enough? Is it a life? I give as much to charity as I can afford; I’m not as active a participant in my community as I should be, but I can change that; I’m trying to be a good friend to my friends, but I’m not always sure how; and still I’ll probably be wondering for a long time, “Is this a life?”

I think it can be, but I’m going to have to be very careful. To give, and contribute, and participate, in the academic community and the wider community; if I can see beauty, to share it, and if I can see something going wrong that I have a chance of fixing, to fix it. And not to mistakenly be too much of a “do-gooder” — remembering the words of Szent-Györgyi, “If any student comes to me and says he wants to be useful to mankind and go into research to alleviate human suffering, I advise him to go into charity instead. Research wants real egotists who seek their own pleasure and satisfaction, but find it in solving the puzzles of nature.”


High dimensions and policy April 28, 2010

Posted by Sarah in Uncategorized.

Robin Hanson says this:

Imagine the space of all policies, where one point in that space is the current status quo policy. To a first approximation, policy insight consists of learning which directions from that point are “up” as opposed to “down.” This space is huge – with thousands or millions of dimensions. And while some dimensions may be more important than others, because those changes are easier to implement or have a larger slope, there are a great many important dimensions.

In practice, however, most policy debate focuses on a few dimensions, such as the abortion rate, the overall tax rate, more versus less regulation, for or against more racial equality, or a pro versus anti US stance. In fact, political scientists Keith Poole and Howard Rosenthal are famous for showing that one can explain 85% of the variation in US Congressional votes by a single underlying dimension, where there are two separated clumps. Most of the remaining variation is explained by one more dimension. Similar results have since been found for many other nations and eras.

This sounds, to me, like the main insight of dimensionality reduction. How do you know you’ve picked a good basis? Is the set of coordinates you choose to measure actually the set of coordinates that most efficiently explain the data?

Maybe policy outcomes really are nearly clustered along one axis, and maybe that axis is the Democrat/Republican one. Possibly. But we’d have to check. That’s what rank estimation is for. (See Kritchman and Nadler, who do it with a matrix perturbation approach.)
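To make the check concrete, here’s a toy sketch in Python — with made-up roll-call data, not Poole and Rosenthal’s actual dataset or Kritchman and Nadler’s method. Generate votes driven by a single hidden “ideology” coordinate, then look at how much of the variance the leading eigenvalue of the covariance matrix explains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical roll-call data: 100 legislators, 50 votes (+1 yea, -1 nay).
# Each legislator has a 1-D "ideology" score in one of two clumps;
# votes mostly follow it, plus noise.
ideology = np.concatenate([rng.normal(-1, 0.2, 50), rng.normal(1, 0.2, 50)])
cutpoints = rng.normal(0, 0.5, 50)
votes = np.sign(ideology[:, None] - cutpoints[None, :]
                + rng.normal(0, 0.4, (100, 50)))

# Eigenvalues of the covariance matrix say how much variance
# each principal direction explains.
centered = votes - votes.mean(axis=0)
eigvals = np.linalg.eigvalsh(centered.T @ centered / len(votes))[::-1]
explained = eigvals / eigvals.sum()
print(f"first dimension explains {explained[0]:.0%} of the variance")
```

If the data really are one-dimensional plus noise, the first number dominates and the rest trail off; if our chosen axis were a bad one, the spectrum would be spread out instead.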

I’d like to see that particular insight trickle into society more broadly. There are objective ways to compare the usefulness of coordinate systems. Put another way, if you want to play Twenty Questions with the universe, some questions are better than others. And there is always a possibility that the ones we’re using aren’t so good.

Let the Data Talk April 9, 2010

Posted by Sarah in Uncategorized.

Here’s a report from The Economist about the new data economy. It’s interesting and gets many things right.

Extracting knowledge from data is what I’m interested in as far as mathematical research goes. How humans extract knowledge from sense data is perhaps the central question of philosophy and of neuroscience. And, on a social level, I’m interested in what will happen when we manage to tap the wealth of medical, commercial, communications, and intelligence data that’s all around us but has heretofore been too big to handle. It wouldn’t be an exaggeration to call this a revolution.

Take medical records. Most people, when they think of electronic medical records, assume they mean more efficient bookkeeping. They do, but that’s not the half of it. Medical research has, up till now, been conducted in small, experimental trials. But if you had a searchable database of every patient in the US and their treatments, you could run correlations on an enormous scale. You could determine what treatments worked, what treatments didn’t go well together, what the average cost-benefit ratio was … all without spending a cent or getting off your seat. It would transform medicine. It might even replace the intuitive skills of the doctor as diagnostician; she could run your symptoms through the database and give you empirical probabilities of what you’ve got. And, of course, the same transformations would be at work on the insurance end. Privacy, as always, would be at risk, if not totally forfeited. A wise doctor once told me that other doctors had completely failed to grasp the magnitude of what’s coming, and that the people who understood it early would be at a huge advantage.
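To be concrete about what “empirical probabilities” might mean, here’s a deliberately tiny sketch — invented records, an invented helper, nothing like real medical software. The point is just that the probabilities come from counting matching cases, not from a theory:

```python
from collections import Counter

# Hypothetical toy records: (set of observed symptoms, final diagnosis).
records = [
    ({"fever", "cough"}, "flu"),
    ({"fever", "cough", "fatigue"}, "flu"),
    ({"cough"}, "cold"),
    ({"fever", "rash"}, "measles"),
    ({"fever", "cough"}, "cold"),
]

def empirical_probabilities(symptoms, records):
    """P(diagnosis | patient shows at least these symptoms), by counting."""
    matches = Counter(dx for s, dx in records if symptoms <= s)
    total = sum(matches.values())
    return {dx: n / total for dx, n in matches.items()}

print(empirical_probabilities({"fever", "cough"}, records))
```

With a national database the counts would be in the millions instead of five, but the logic is the same: condition on what you observe, read off the frequencies.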

Data speaks. It’s not just a library accessible by card catalog. Given the right tools, it narrates, it emphasizes, it reveals salient features. From the Economist article:

Sometimes those data reveal more than was intended. For example, the city of Oakland, California, releases information on where and when arrests were made, which is put out on a private website, Oakland Crimespotting. At one point a few clicks revealed that police swept the whole of a busy street for prostitution every evening except on Wednesdays, a tactic they probably meant to keep to themselves.

There’s a guiding principle here: let the data tell you what to look for. It’s maybe analogous to Hayek’s knowledge problem — maybe it’s the solution to that problem. No individual, coming up with categories and questions and definitions, can give a better picture of what’s going on than the data themselves do.

For example: Google’s spell-checker, which auto-corrects words based on what most people type, does far better than spell-checkers that try to encode dictionaries and rules of grammar. Google treats language as what is spoken. It’s learning to do the same with voice recognition and translation.
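The frequency idea fits in a few lines — in the spirit of Peter Norvig’s well-known toy spelling corrector, certainly not Google’s actual system, and with a made-up corpus standing in for “what most people type”:

```python
from collections import Counter

# Toy corpus standing in for "what most people type".
corpus = "the data speaks the data talks data analysis the the".split()
counts = Counter(corpus)

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the most frequent known word within one edit (else keep word)."""
    candidates = [w for w in edits1(word) | {word} if w in counts] or [word]
    return max(candidates, key=counts.get)

print(correct("dta"))   # → "data"
```

No dictionary, no grammar: the corpus itself decides what counts as a word and which correction is likeliest.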

Amazon doesn’t identify your preferences in books by labeling them with pre-set categories (“horror,” “drama”) but looks at the correlations between what you browse and what other people buy after browsing the same books. It doesn’t presume to decide what the important categories are. It lets the data talk.
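A minimal sketch of that correlational approach — hypothetical purchase data, and certainly not Amazon’s real algorithm. Notice there are no categories anywhere, just co-purchase counts:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories; no genre labels in sight.
baskets = [
    {"dracula", "frankenstein"},
    {"dracula", "frankenstein", "hamlet"},
    {"hamlet", "macbeth"},
    {"dracula", "frankenstein"},
]

# Count how often each pair of books is bought together.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def also_bought(book):
    """Books ranked by how often they co-occur with `book` in a basket."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == book:
            scores[b] += n
        elif b == book:
            scores[a] += n
    return [title for title, _ in scores.most_common()]

print(also_bought("dracula"))   # co-purchases, most frequent first
```

Whatever clusters emerge — “horror,” or something no editor would have named — emerge from the data, not from a pre-set taxonomy.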

Yahoo used to “organize” the web into categories — sports, science, entertainment, and so on. The problem was, pre-selected categories didn’t capture all the things that people wanted to do with the web. That approach failed.

Pre-set categories are the opposite of spontaneous order, of letting the data talk. They artificially define the dimensions of the data set — but those artificially defined dimensions may not be orthogonal and may not be the dominant, “leading” dimensions. They don’t tell you the “real” shape of the data; they’re arbitrary, like inches.

We still do statistical studies, in polling and social sciences, along a few pre-selected categories. What is your race? What is your gender? If the categories the researcher picks happen not to be the important ones, then we haven’t learned anything. I can imagine, vaguely, someday doing polls that instead looked at your Facebook profile and your entire Internet history, huge amounts of data, and then singled out whatever was important. It might turn out that we’ve been looking at race and gender and age but what really matters to your political views is whether you search for porn a lot or not.

In practice, “letting the data talk” usually means making a big matrix of correlations or covariances and looking at its eigenvalues. The big eigenvalues are the big dimensions, the “important” things, the factors that predominantly shape the data. There are endless variations on this theme — Google’s own PageRank is one.  In the natural sciences, data analysis has often been done with ab initio estimates — scientists develop algorithms to detect features they already expect to see.  Sometimes that’s the only option we have, but I think we’re going to move towards a philosophy of avoiding such preconceived categories.
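Since PageRank comes up: it too is an eigenvalue computation. Here’s a minimal power-iteration sketch on a made-up four-page web — the textbook formulation, not Google’s production system:

```python
import numpy as np

# Tiny hypothetical web: links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = 4

# Column-stochastic link matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1 / len(outs)

# Power iteration: the PageRank vector is the dominant eigenvector
# of the damped matrix d*M + (1-d)/n.
d = 0.85
rank = np.full(n, 1 / n)
for _ in range(100):
    rank = d * (M @ rank) + (1 - d) / n
print(rank.round(3))   # page 2, linked from everywhere, ranks highest
```

Nobody labels page 2 “important” in advance; its importance is the dominant eigenvector, read straight off the link structure.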

This is interesting science and technology, but I’d like to argue that it’s far more than science and technology. It should change the way we think, on an epistemic level.