
Multiscale SVD, K planes and geometric wavelets November 30, 2010

Posted by Sarah in Uncategorized.

I went to the applied math talk by Mauro Maggioni today; here are notes from the presentation.

Multiscale SVD

We often have to deal with high-dimensional data clouds that may lie on a low-dimensional manifold. Examples: databases of faces, collections of text documents, configurations of molecules. But we don’t know a priori what the dimension of the manifold is, so we need some mechanism for estimating it when the data are noisy.

Some terminology: the data points are x_i + \eta_i, the real data plus some white Gaussian noise of variance \sigma^2. The intrinsic dimension of the manifold M is k while the dimension of the ambient space is D.

Previous work on estimating dimension generally falls into two categories. One is volume-based: the number of points of the manifold contained in a ball should grow as r^k, where r is the radius, if you’ve chosen the right dimension k. The other possibility is PCA, done locally and patched together somehow to account for the fact that we have a curved manifold rather than a linear hyperplane. The problem with this is that you don’t know what “locally” means: how many patches do we need? There’s also the so-called “curvature dilemma” with PCA: the more highly curved the manifold, the easier it is to confuse the top singular value (representing the tangent direction) with the second-highest singular value (representing the direction of curvature).
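
To make the volume-based idea concrete, here is a minimal sketch (my own illustration, not from the talk; the function name and the toy data are made up): count how many neighbors fall inside balls of growing radius, and read the dimension off the slope of log N(r) against log r.

```python
import numpy as np

def volume_based_dimension(X, radii):
    """Estimate intrinsic dimension from the growth N(r) ~ r^k of the
    average number of neighbors within balls of radius r."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    dists = np.sqrt(d2)
    # Average neighbor count (excluding the point itself) at each radius.
    counts = [np.mean(np.sum(dists < r, axis=1) - 1) for r in radii]
    # The slope of log N(r) against log r estimates k.
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

# A 2-dimensional Gaussian cloud isometrically embedded in R^10: slope near 2.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))   # orthonormal 2-plane
X = rng.normal(size=(1000, 2)) @ basis.T
print(volume_based_dimension(X, radii=np.linspace(0.3, 1.0, 8)))
```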

Maggioni’s alternative is multiscale SVD. We fix a point. The noise around the manifold looks like a hollow tube: most of the noise is concentrated at a radius of about \sigma \sqrt{D} from the manifold itself. So looking at a small ball (smaller than the radius of the tube) picks up little noise, looking at a larger ball picks up all the noise, and looking at a still larger ball begins to pick up the curvature of the manifold. We do SVD within all these balls, and look at the singular values over a range of scales. There’s a “sweet spot,” a particular interval of scales where the signal singular values have separated from the noise singular values but the curvature singular values haven’t caught up yet; this range accurately reveals the intrinsic dimension of the manifold. (I blogged about this earlier here and the paper is here.)
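
To make the multiscale picture concrete, here is a minimal numerical sketch (my own, not Maggioni's code; the test manifold, a noisy circle in \mathbb{R}^3, and the function names are made up): fix a point and compute the local singular values over a range of radii.

```python
import numpy as np

def multiscale_singular_values(X, center, radii):
    """For each radius r, return the singular values of the (centered)
    points of X lying within distance r of `center`."""
    dists = np.linalg.norm(X - center, axis=1)
    out = {}
    for r in radii:
        local = X[dists < r]
        if len(local) < 2:
            continue
        local = local - local.mean(axis=0)
        out[r] = np.linalg.svd(local / np.sqrt(len(local)), compute_uv=False)
    return out

# Noisy circle (intrinsic dimension 1) in R^3, noise level sigma = 0.01.
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 3000)
X = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)] + 0.01 * rng.normal(size=(3000, 3))
sv = multiscale_singular_values(X, center=X[0], radii=[0.05, 0.1, 0.3, 0.8])
for r, s in sv.items():
    print(r, np.round(s, 3))
# In the "sweet spot" of radii, one singular value (the tangent direction)
# stands out above the noise, before curvature makes a second one grow.
```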

The assumption about the data is that the tangent covariance grows differently with the radius (O(r^2)) than the normal covariance does (O(r^4)). This can be verified explicitly for manifolds of codimension one, for instance.

The result is that if M is assumed to have small enough curvature and \eta is assumed to have small enough variance, then with high probability we can determine the intrinsic dimension k with only about k \log k points. In comparisons against a large number of other dimension-estimation methods, this one outperforms all of them. Isomap, in particular, really falls apart in the presence of noise, and can’t tell a 6-dimensional sphere from a 6-dimensional cube.

One interesting fact is that many real-life data sets have dimensionality that varies. A database of text documents, for instance, is low-dimensional in some regions, and very, very high-dimensional in others. (We measure dimensionality locally, so there’s no reason in principle that it shouldn’t vary.)

Measures supported on K planes
Sometimes high-dimensional data is supported on a set of planes \pi_1 \dots \pi_k, possibly of different dimensions d_1 \dots d_k. We don’t know how many planes there are, which planes they are, or what their dimensions are. This is the situation in face recognition; it’s also relevant to image and signal processing more generally (dictionary learning, sparse approximation, etc.)

Maggioni and collaborators developed an algorithm for doing this.

1. Draw n_0 random points and, for each, a set of nearest neighbors. Do multiscale SVD at these points, producing for each one an estimated plane \hat{\pi}_k and its estimated dimension \hat{d}_k. Also estimate the noise level \hat{\sigma}.

2. Construct an affinity matrix, with the (i, j)th coordinate

e^{\frac{-d(x_i, \hat{\pi}_j)^2}{\hat{\sigma}^2}}

This is almost a matrix of zeros and ones: either a point is on a plane (in which case we get 1) or it is not on that plane (in which case we get 0, because the distance is large). In practice, with noise and curvature, it’s not quite binary, but it’s close; a rough sketch of steps 2 and 3 appears after this list.

3. Denoise and cluster this matrix, and find updated estimates for the planes from this.
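
Here is a rough sketch of steps 2 and 3 under simplifying assumptions of my own (the names are made up, and the “denoise and cluster” step is replaced by a crude assign-each-point-to-its-most-plausible-plane rule, which is only a stand-in for what the actual algorithm does):

```python
import numpy as np

def point_plane_distance(X, center, basis):
    """Distance from each row of X to the affine plane through `center`
    spanned by the orthonormal columns of `basis`."""
    Y = X - center
    return np.linalg.norm(Y - (Y @ basis) @ basis.T, axis=1)

def refine_planes(X, planes, sigma, n_iter=5):
    """planes: list of (center, orthonormal basis) pairs from step 1."""
    for _ in range(n_iter):
        # Step 2: affinity between point i and candidate plane j.
        A = np.column_stack(
            [np.exp(-point_plane_distance(X, c, B) ** 2 / sigma ** 2)
             for c, B in planes])
        # Step 3 (crude stand-in for "denoise and cluster"): assign each
        # point to its most plausible plane, then re-fit each plane by SVD.
        labels = np.argmax(A, axis=1)
        new_planes = []
        for j, (c, B) in enumerate(planes):
            pts = X[labels == j]
            if len(pts) <= B.shape[1]:
                new_planes.append((c, B))
                continue
            center = pts.mean(axis=0)
            _, _, Vt = np.linalg.svd(pts - center, full_matrices=False)
            new_planes.append((center, Vt[: B.shape[1]].T))
        planes = new_planes
    return planes, labels
```

Here `planes` would be the (center, basis) pairs estimated in step 1 via multiscale SVD, and `sigma` the estimated noise level \hat{\sigma}.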

Tested on a face database, this worked very well: it misclassified only 3 faces out of 640. What’s remarkable here is that the Yale Face Database used had 64 different photos of each person, taken at different angles, under different lighting conditions, etc. This has always been a hard problem for computer vision: how do you recognize that two pictures are the same person viewed frontally and in profile, while two other pictures are of two different people? Apparently, this is a solution to such heterogeneity problems.

Geometric wavelets
The problem of dictionary learning is as follows. Given n points in \mathbb{R}^D, arranged as a data matrix X, construct a dictionary \Phi of m vectors such that

X = \Phi \alpha
where the coefficient matrix \alpha is sparse (each data point uses only a few dictionary elements).

There’s a tradeoff here between the size of \Phi and the sparsity of \alpha: obviously, you could just let \Phi be the whole of X and then \alpha would just be the identity, but that would be silly — it wouldn’t compress the data at all.

One way of doing this is something called geometric wavelets. (Paper here. Shorter, more descriptive paper with cool picture here.)
Given a manifold, we partition it into a dyadic tree. We do principal components analysis on the coarsest scale and get a plane; then we divide that into two smaller subsets, do PCA on each of them, and get two new planes. For each cell of the tree, centered at a point of the manifold, the plane \Phi_j approximates the tangent plane to the manifold at that scale. The wavelets keep track of the differences \Phi_j - \Phi_{j+1} between consecutive scales. If we have some smoothness in the manifold, these corrections decay quickly, so finitely many wavelets are a good approximator for any point.
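
As a toy, one-level illustration of the construction (my own sketch, not the authors' code): approximate a cell of points by its PCA plane at the coarse scale, split the cell dyadically, and record the finer PCA approximations through their corrections to the coarse one.

```python
import numpy as np

def pca_project(P, d):
    """Project points P onto their top-d PCA plane; return the projections."""
    c = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - c, full_matrices=False)
    B = Vt[:d].T
    return c + (P - c) @ B @ B.T

# A smooth curve (intrinsic dimension 1) sampled in R^3.
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0, 1, 400))
X = np.c_[t, np.sin(2 * t), t ** 2]

coarse = pca_project(X, d=1)                    # coarse-scale approximation
halves = [t < np.median(t), t >= np.median(t)]  # dyadic split of the cell
fine = np.vstack([pca_project(X[h], d=1) for h in halves])
wavelet_correction = fine - np.vstack([coarse[h] for h in halves])

# The fine-scale error is smaller than the coarse one; what the geometric
# wavelets store are the corrections (fine minus coarse), which shrink
# further as the scale gets finer on a smooth manifold.
print(np.linalg.norm(X - coarse), np.linalg.norm(X - fine),
      np.linalg.norm(wavelet_correction))
```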

There’s a fast algorithm for computing this, very like the wavelet transform, except for a point cloud rather than a function. This is useful for compressing databases, analyzing online data, and so forth.

The idea is very like Peter Jones’ construction of beta numbers (summary here) except that it may be more efficient in terms of storage. More on geometric wavelets from Dan Lemire.

Alternating methods for SVD November 9, 2010

Posted by Sarah in Uncategorized.

Last week’s applied math seminar was Arthur Szlam, talking about alternating methods for SVD. I’m putting up some notes here; several results are presented without proof.

Given an m x n matrix A, we want to find the best rank k approximation to A.
There is an explicit solution to this problem, known as the singular value decomposition. We write
A = U \Sigma V^T
where \Sigma is diagonal and U, V are orthogonal. The columns of U are the eigenvectors of A A^T and the columns of V are the eigenvectors of A^T A. This costs O(m^2 n) operations, and it’s a completely explicit solution. To get a rank-k approximation, we take
\hat{U} \hat{\Sigma} \hat{V}^T
where the “hat” means taking the first k columns of each matrix.
This is the best rank-k approximation, both in the sense of Frobenius norm and operator norm.
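
As a quick illustration (my own, with made-up names), here is the truncated SVD in numpy, along with a check that its operator-norm error equals \sigma_{k+1}:

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 100))
A5 = best_rank_k(A, 5)
# The operator-norm error equals the (k+1)st singular value.
print(np.linalg.norm(A - A5, 2), np.linalg.svd(A, compute_uv=False)[5])
```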

However, if k << m, O(m^2 n) time is wasteful. There are recent attempts to do this faster, in O(mkn) time.

We fix k and a small oversampling parameter p, and let l = k + p. Draw an l \times m random Gaussian matrix \Omega, and set Q^T = \Omega A (so Q is n \times l).
Calculate the SVD \tilde{U} \tilde{\Sigma} \tilde{V}^T = Q, and set U to be the first k columns of \tilde{U}, etc.
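
Here is a minimal sketch of this kind of randomized SVD, in the style of Halko, Martinsson and Tropp; the variable names are mine, and I orthonormalize the sketch with a QR step, a detail the condensed notes above gloss over.

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Approximate rank-k SVD of A from an l = k + p dimensional
    random sketch of its row space."""
    m, n = A.shape
    l = k + p
    Omega = np.random.default_rng(seed).normal(size=(l, m))
    Q, _ = np.linalg.qr((Omega @ A).T)         # n x l, orthonormal columns
    B = A @ Q                                   # m x l, small
    U, s, Wt = np.linalg.svd(B, full_matrices=False)
    V = Q @ Wt.T
    return U[:, :k], s[:k], V[:, :k]

# Quick check against the exact SVD on a matrix with fast-decaying spectrum.
rng = np.random.default_rng(1)
A = (rng.normal(size=(500, 60)) * (0.7 ** np.arange(60))) @ rng.normal(size=(60, 300))
U, s, V = randomized_svd(A, k=10)
print(np.round(s[:5], 2))
print(np.round(np.linalg.svd(A, compute_uv=False)[:5], 2))
```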

Why does this work?

Well, pretend A is really of rank 5, for instance. Hit A with a random matrix. It is unlikely that any vector will be in the kernel of A. Whatever's left is in the range. So we're very likely to have the full range of A; we probably preserve the SVD. However, in real life we don't have a known rank k matrix.

What we have is a guarantee of quality, in terms of the decay of the singular values:
E ||A - A Q Q^T|| \le [1 + \sqrt{\frac{k}{p-1}} + \frac{e \sqrt{k + p}}{p} \sqrt{m-k}] \sigma_{k+1}

Here the expected value is taken over the random Gaussian draw, the norm is the operator norm, and \sigma_{k + 1} is the (k+1)st singular value; if A is really rank k, this is zero, and the expected error is therefore zero. If \sigma_{k+1} is small, then the bound is good. We're in danger if m is large and \sigma_{k+1} is not super small.

This actually happens, unfortunately. For example, consider term-document matrices, where a collection of text documents form the rows and words form the columns, with each entry counting how often a word was used in each document. (A little imagination should indicate the importance of these kinds of matrices for semantic analysis.) These matrices are sparse and large, but their singular values decay slowly. So what can we do here?

We hit \Omega with some more powers of A, to drive down the smaller singular values; in practice, 10 or 15 times.
Q_q^T = \Omega A (A^TA)^q
Now we have a new estimate, depending on the power q.

E ||A - A Q_q Q_q^T|| \le [1 + \sqrt{\frac{k}{p-1}} + \frac{e \sqrt{k + p}}{p} \sqrt{m-k}]^{1/(2q + 1)} \sigma_{k+1}

This works, but it can be memory intensive to save all the intermediate powers. This is where the "alternating" method comes in.

Here we look at a related problem: matrix completion. You're given a subset of the entries of a matrix A, and you know A is low rank. How do you fill in the rest of the entries? The simplest approach is alternating least squares.

Write A = PQ^T where P is m-by-k and Q is n-by-k.
Alternate between updates of P and Q:
P_{j+1} = argmin_P ||P Q_j^T - A||^2
Q_{j+1} = argmin_Q ||P_{j+1} Q^T - A||^2
These are least-squares problems, and can be solved explicitly using QR decompositions, as follows.

Step 1: form an m \times l Gaussian matrix P_0.
Step 2: form an n \times l matrix Q_0 such that
Q_0 R_0 = (P_0^T A)^T
(a QR decomposition, with Q_0 having orthonormal columns and R_0 upper triangular).
Step 3: iterate for j = 0, \dots, q-1:
P_{j+1} R_{2j+1} = A Q_j
Q_{j+1} R_{2j+2} = (P_{j+1}^T A)^T

Coming back to our own problem, we can use the same iteration to calculate Q_q, since
Q_j = (A^T A)^j A^T P_0 \prod_{s = 0}^{2j} R_s^{-1}.
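
Here is a minimal sketch of the alternating QR iteration (my own code and names): it realizes the power scheme Q_q^T = \Omega A (A^T A)^q without ever forming large powers of A or keeping ill-conditioned intermediates around.

```python
import numpy as np

def alternating_range_finder(A, l, q, seed=0):
    """Alternate QR-renormalized multiplications by A and A^T, as in
    steps 1-3 above, returning an orthonormal Q_q (n x l)."""
    m, n = A.shape
    P = np.random.default_rng(seed).normal(size=(m, l))  # Step 1: P_0
    Q, _ = np.linalg.qr(A.T @ P)                          # Step 2: Q_0 R_0 = (P_0^T A)^T
    for _ in range(q):                                    # Step 3
        P, _ = np.linalg.qr(A @ Q)                        # P_{j+1} R_{2j+1} = A Q_j
        Q, _ = np.linalg.qr(A.T @ P)                      # Q_{j+1} R_{2j+2} = (P_{j+1}^T A)^T
    return Q

# Use Q_q exactly as before: project onto it, then take the small SVD.
rng = np.random.default_rng(1)
A = rng.normal(size=(400, 250)) * (0.95 ** np.arange(250))  # slowly decaying spectrum
Q = alternating_range_finder(A, l=25, q=10)
B = A @ Q
U, s, Wt = np.linalg.svd(B, full_matrices=False)
print(np.round(s[:5], 2))
print(np.round(np.linalg.svd(A, compute_uv=False)[:5], 2))
```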

This turns out to work well in practice; the example set was a database of labeled faces, whose first 20 singular values decay fast while the rest decay very slowly. This kind of alternating SVD works well for faces.

Johnson-Lindenstrauss and RIP September 10, 2010

Posted by Sarah in Uncategorized.

Via Nuit Blanche, I found a paper relating the Johnson-Lindenstrauss Theorem to the Restricted Isometry Property (RIP). It caught my eye because the authors are Rachel Ward and Felix Krahmer, whom I actually know! I met Rachel when I was an undergrad and she was a grad student. We taught high school girls together and climbed mountains in Park City, where I also met Felix.

So what’s this all about?
The Johnson-Lindenstrauss Lemma says that any p points in n-dimensional Euclidean space can be embedded in O(\epsilon^{-2} \log(p)) dimensions without distorting the distance between any two points by more than a factor between (1 - \epsilon) and (1 + \epsilon). So it gives us almost-isometric embeddings of high-dimensional data in lower dimensions.
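
As a quick numerical illustration (mine, not from the paper), we can project a random point cloud with a scaled Gaussian matrix into O(\epsilon^{-2} \log p) dimensions and check the worst-case distortion of pairwise distances; the constant 8 in the target dimension is a typical choice, not a sharp one.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, eps = 200, 1000, 0.3
m = int(np.ceil(8 * np.log(p) / eps ** 2))      # target dimension O(eps^-2 log p)

X = rng.normal(size=(p, n))                     # p points in R^n
Phi = rng.normal(size=(m, n)) / np.sqrt(m)      # scaled Gaussian projection
Y = X @ Phi.T                                   # embedded points in R^m

# Worst-case ratio of embedded to original pairwise distances.
ratios = []
for i in range(p):
    for j in range(i + 1, p):
        ratios.append(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]))
print(m, min(ratios), max(ratios))              # stays within (1 - eps, 1 + eps)
```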

The Restricted Isometry Property, familiar to students of compressed sensing, is the property of a matrix \Phi that

(1-\delta) ||x||_2^2 \le ||\Phi x||_2^2 \le (1 + \delta) ||x||_2^2
for all sufficiently sparse vectors x.
The relevance of this property is that these are the matrices for which \ell_1-minimization actually yields the sparsest vector — that is, RIP is a sufficient condition for basis pursuit to work.
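
As a small illustration (my own sketch, not from the paper), here is basis pursuit written as a linear program and solved with scipy; the Gaussian matrix stands in for an RIP matrix, since appropriately scaled Gaussian matrices satisfy RIP with high probability.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, s = 120, 50, 5
Phi = rng.normal(size=(m, n)) / np.sqrt(m)      # Gaussian: RIP with high probability
x0 = np.zeros(n)
x0[rng.choice(n, s, replace=False)] = rng.normal(size=s)   # s-sparse signal
y = Phi @ x0

# Basis pursuit: min ||x||_1 s.t. Phi x = y, as an LP in (u, v) with x = u - v.
c = np.ones(2 * n)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x_hat = res.x[:n] - res.x[n:]
print(np.max(np.abs(x_hat - x0)))               # should be essentially zero
```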

Now these two concepts involve equations that look a lot alike… and it turns out that they actually are related. The authors show that RIP matrices, after randomizing the signs of their columns, produce Johnson-Lindenstrauss embeddings. In particular, they produce improved Johnson-Lindenstrauss bounds for special types of matrices with known RIP bounds.

The proof relies upon Rademacher sequences (random signs, uniformly distributed on {-1, +1}), which I don’t know a lot about, but should probably learn.

[N.B.: corrections from the original post come from Felix.]