
Alternating methods for SVD November 9, 2010

Posted by Sarah in Uncategorized.

Last week’s applied math seminar was Arthur Szlam, talking about alternating methods for SVD. I’m putting up some notes here; several results are presented without proof.

Given an m x n matrix A, we want to find the best rank k approximation to A.
There is an explicit solution to this problem, known as the singular value decomposition. We write
A = U \Sigma V^T
where \Sigma is diagonal and U, V are orthogonal. The columns of U are the eigenvectors of A A^T and the columns of V are the eigenvectors of A^T A. This costs O(m^2 n) operations, and it’s a completely explicit solution. To get a rank-k approximation, we take
\hat{U} \hat{\Sigma} \hat{V}^T
where the “hat” means taking the first k columns of each matrix.
This is the best rank-k approximation, both in the sense of Frobenius norm and operator norm.
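
As a concrete reference point, here is a minimal numpy sketch of the truncated SVD just described (the matrix and the rank below are placeholders of my own, not anything from the talk):

import numpy as np

def best_rank_k(A, k):
    # Best rank-k approximation of A via the full SVD, as described above.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
    # The "hat" matrices: keep only the first k columns / singular values.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Example: a random 100 x 50 matrix, truncated to rank 5.
A = np.random.randn(100, 50)
A5 = best_rank_k(A, 5)
print(np.linalg.matrix_rank(A5))   # 5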

However, if k << m, O(m^2 n) time is wasteful. There are recent attempts to do this faster, in O(mkn) time.

We fix k, and a small oversampling parameter p, and let l = k + p. Draw an l \times m random Gaussian matrix \Omega, and set Q^T = \Omega A.
Calculate the SVD, \tilde{U} \tilde{\Sigma} \tilde{V}^T = Q, and set U to be the first k columns of \tilde{U}, etc.
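
In numpy, a literal transcription of this recipe might look like the following (the function name and the sanity check at the end are mine, not from the talk):

import numpy as np

def randomized_row_space(A, k, p=10):
    # Sketch A with an l x m Gaussian matrix, then pull an (approximate)
    # top-k row space out of the sketch via its SVD.
    m, n = A.shape
    l = k + p                                   # oversampled target rank
    Omega = np.random.randn(l, m)               # random Gaussian test matrix
    Qt = Omega @ A                              # Q^T = Omega A  (l x n)
    U_t, S_t, Vt_t = np.linalg.svd(Qt.T, full_matrices=False)
    return U_t[:, :k]                           # orthonormal; spans the approximate top-k row space

# Sanity check on an exactly rank-5 matrix: the projection A Q Q^T
# recovers A essentially exactly, as the argument below suggests.
m, n, k = 200, 100, 5
A = np.random.randn(m, k) @ np.random.randn(k, n)
Q = randomized_row_space(A, k)
print(np.linalg.norm(A - A @ Q @ Q.T) / np.linalg.norm(A))   # ~1e-15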

Why does this work?

Well, pretend A really has rank 5, for instance, and hit A with a random matrix. It is very unlikely that any of the random vectors falls in the kernel of A, and whatever survives lies in the range. So with high probability the sketch captures the full range of A, and we preserve the SVD. In real life, however, we don't have a matrix of known rank k.

What we have is a guarantee of quality, in terms of the decay of the singular values:
E ||A - A Q Q^T|| \le [ 1 + \sqrt{\frac{k}{p-1}} + \frac{e \sqrt{k + p}}{p} \sqrt{m-k}] \sigma_{k +1}

Here the expected value is taken over the random Gaussian draw, the norm is the operator norm, and \sigma_{k+1} is the (k+1)st singular value; if A really has rank k, this is zero, and the expected error is therefore zero. If \sigma_{k+1} is small, the bound is good. We're in danger if m is large and \sigma_{k+1} is not super small.
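
To get a feel for how the bracketed factor behaves, here is a quick back-of-the-envelope computation (the values of k, p, m are purely illustrative, not from the talk):

import numpy as np

def prefactor(k, p, m):
    # The bracketed factor in the expected-error bound above.
    return 1 + np.sqrt(k / (p - 1)) + np.e * np.sqrt(k + p) / p * np.sqrt(m - k)

for m in (10**3, 10**5, 10**7):
    print(m, prefactor(10, 5, m))
# roughly 69, 668, 6661: the factor grows like sqrt(m), so sigma_{k+1}
# really does have to be small when m is large.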

This actually happens, unfortunately. For example, consider term-document matrices, where a collection of text documents forms the rows and words form the columns, with each entry counting how often that word is used in that document. (A little imagination should indicate the importance of these kinds of matrices for semantic analysis.) These matrices are sparse and large, but their singular values decay slowly. So what can we do here?

We hit the sketch with more powers of A, to drive down the smaller singular values relative to the leading ones; in practice, 10 or 15 times.
Q_q^T = \Omega A (A^TA)^q
Now we have a new error bound, depending on the power q:

E ||A - A Q_q Q_q^T|| \le [1 + \sqrt{\frac{k}{p-1}} + \frac{e \sqrt{k + p}}{p} \sqrt{m-k}]^{1/(2q + 1)} \sigma_{k+1}
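
For concreteness, here is a literal numpy rendering of the powered sketch (my own transcription of the formula; in floating point one would normally be more careful and re-orthonormalize between the multiplications):

import numpy as np

def powered_sketch(A, k, p=10, q=2):
    # Literal version of Q_q^T = Omega A (A^T A)^q, followed by the same
    # SVD-based orthonormalization as before.
    m, n = A.shape
    l = k + p
    Omega = np.random.randn(l, m)
    Qt = Omega @ A                       # Omega A
    for _ in range(q):
        Qt = (Qt @ A.T) @ A              # multiply by (A^T A), q times
    U_t, _, _ = np.linalg.svd(Qt.T, full_matrices=False)
    return U_t[:, :k]                    # approximate top-k row space of A

# Same projection error ||A - A Q Q^T|| as before, but the smaller singular
# values have been driven down relative to the top ones, so the bound
# (and the approximation) improves with q.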

This works, but it can be memory intensive to save all the intermediate powers. This is where the "alternating" method comes in.

Here we look at a related problem: matrix completion. You're given a subset of the entries of a matrix A, and you know A is low rank. How do you fill in the rest of the entries? The simplest approach is alternating least squares.

Write A = PQ^T where P is m-by-k and Q is n-by-k.
Alternate between updates of P and Q:
P_{j+1} = \arg\min_P ||P Q_j^T - A||^2
Q_{j+1} = \arg\min_Q ||P_{j+1} Q^T - A||^2
These are least-squares problems, and they can be solved explicitly using QR decompositions:

Step 1: form an m \times l Gaussian matrix P_0.
Step 2: form an n \times l matrix Q_0 such that
Q_0 R_0 = (P_0^T A)^T
(a QR decomposition, with Q_0 orthogonal and R_0 upper triangular).
Step 3: iterate for j = 0, \dots, q-1:
P_{j+1} R_{2j+1} = A Q_j
Q_{j+1} R_{2j+2} = (P_{j+1}^T A)^T

Coming back to our own problem, we can use the same iteration to compute the powered subspace: the Q_j it produces satisfies
Q_j = (A^T A)^j A^T P_0 \prod_{s = 0}^{2j} R_s^{-1}.
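
Here is a minimal numpy rendering of steps 1-3; the final small SVD that turns the computed subspace back into a rank-k approximation is the usual finishing step and isn't spelled out in the notes above, so treat it as my addition:

import numpy as np

def alternating_sketch(A, k, p=10, q=2):
    m, n = A.shape
    l = k + p
    P = np.random.randn(m, l)                   # step 1: P_0, an m x l Gaussian
    Q, _ = np.linalg.qr(A.T @ P)                # step 2: Q_0 R_0 = (P_0^T A)^T
    for _ in range(q):                          # step 3: j = 0, ..., q-1
        P, _ = np.linalg.qr(A @ Q)              #   P_{j+1} R_{2j+1} = A Q_j
        Q, _ = np.linalg.qr(A.T @ P)            #   Q_{j+1} R_{2j+2} = (P_{j+1}^T A)^T
    # Only the current P and Q are kept: no intermediate powers of A are stored.
    # Finishing step (my addition): SVD of the small l x n matrix P^T A.
    Uh, s, Vt = np.linalg.svd(P.T @ A, full_matrices=False)
    return P @ Uh[:, :k], s[:k], Vt[:k, :]

# U_k, s_k, Vt_k = alternating_sketch(A, k=20)
# A_k = U_k @ np.diag(s_k) @ Vt_k    # rank-k approximation of A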

This turns out to work well in practice; the example set was a database of labeled faces, whose first 20 singular values decay fast, with the rest decaying very slowly. This kind of alternating SVD works well for faces.


A Yardstick for Smell: Thinking in PCA May 22, 2010

Posted by Sarah in Uncategorized.

I mention this Discover article because it’s really an applied math story being popularized as a biology story. Noam Sobel and his team at the Weizmann Institute made a database of thousands of smell-producing molecules, along with their properties (like size, or how tightly the molecules pack). They then ran Principal Components Analysis on the data to find the critical properties, and observed that a single dimension, or score, captured most of the variation, and could even predict which smells humans would find more or less appealing.

What I want to harp on is the use of PCA. To those of us who actually do math and science full time, it’s a pretty familiar, standard technique. It’s over a hundred years old. It’s very simple. It doesn’t even work in the difficult cases (a lot of modern random matrix theory is about identifying when PCA doesn’t work). But I don’t think Principal Components Analysis (or multidimensional eigenvector analysis generally) gets the attention it deserves in the popular imagination. Here, given a glut of data, is a tool that can tell you which ways of measuring it are most useful. We’re used to science giving us measurements; but this is science telling us which measurements are important. Sobel’s research isn’t really a biological discovery at all; it’s a data analysis discovery.
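
For anyone who hasn't seen it, the computation being described really is tiny; something like the following generic sketch (this is just vanilla PCA via the SVD, not the Sobel group's actual pipeline):

import numpy as np

def principal_components(X, k=1):
    # Rows of X are samples (e.g. molecules), columns are measured properties.
    # Returns the top-k principal directions and the fraction of variance
    # each one explains.
    Xc = X - X.mean(axis=0)                        # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Vt[:k], explained[:k]

# If the first entry of `explained` is close to 1, a single direction (a
# single "score") accounts for most of the variation in the data.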

And when you’re used to thinking in the language of dimensionality reduction, suddenly you see when it’s needed but missing. A gap, like “Oh, this really should be a dimensionality reduction problem.” For instance, I was reading some of Simon Baron-Cohen’s research about autism; for the unfamiliar, his theory is that people lie on a spectrum between “systematizing” and “empathizing” styles of thought, with autistics on the “systematizing” extreme, much better at logical puzzles than at interpersonal communication. I was reading his papers and I immediately thought “Something’s missing here.” If I wanted to know whether a single axis distinguished autistics from non-autistics, well, I’d look at all kinds of neurological and psychological properties, get a big cloud of data from autistics and non-autistics, and see if most of the variation was due to this empathizing/systematizing business.

Now, I’m not remotely a psychologist, but it looks like Baron-Cohen doesn’t do that. He gives out a survey, observes that autistics are more systematizing and less empathizing than non-autistics, and essentially says “Ta-Da!” And I go, “Huh?” How on earth does he know that this is the main difference between autistics and non-autistics? Where are all the correlations and variances? Now, maybe that’s standard practice in psychology. I’ve read a bit more in the social sciences, and it’s standard practice there. But to me, this kind of research is just crying out for a little data analysis. I think there are whole areas of research where people aren’t yet thinking in PCA. Maybe that should change.