Multiscale SVD, K planes and geometric wavelets November 30, 2010

Posted by Sarah in Uncategorized.
Tags: , , , ,

I went to the applied math talk by Mauro Maggioni today; here are notes from the presentation.

Multiscale SVD

We often have to deal with high-dimensional data clouds that may lie on a low-dimensional manifold. Examples: databases of faces, collections of text documents, configurations of molecules. But, we don’t know a priori what the dimension of the manifold is. So we ned some mechanisms for estimating that, in the situation where we have noise in the data.

Some terminology: the data points are $x_i + \eta_i$, the real data plus some white Gaussian noise of variance $\sigma^2$. The intrinsic dimension of the manifold $M$ is $k$ while the dimension of the ambient space is $D$.

Previous work at estimating dimensions generally falls into two categories. One is volume-based (the number of points on the manifold confined within a ball should grow as $r^k$ where $r$ is the radius, if you’ve chosen the right dimension k. The other possibility is PCA, done locally and patched together somehow to account for the fact that we have a curved manifold rather than a linear hyperplane. The problem with this is that you don’t know what “locally” means — how many patches do we need? There’s also the so-called “curvature dilemma” with PCA: the more highly curved the manifold, the more likely it is to confuse the top principal value (representing the tangent direction) with the second-highest principal value (representing the direction of curvature.)

Maggioni’s alternative is multiscale SVD. We fix a point. The noise around the manifold looks like a hollow tube — most of the noise is concentrated at a radius of $\sigma \sqrt{D}$ from the surface of the manifold itself. So looking at a small ball (smaller than the radius of the tube) picks up less noise, looking at a large ball picks up all the noise, and looking at a still larger ball begins to pick up the curvature of the manifold. We do SVD within all these balls, and look at the singular values over a range of scales. There’s a “sweet spot,” a particular interval of scales where the signal singular values have separated from the noise singular values, but the curvature singular values haven’t caught up yet; this accurately represents the curvature of the manifold. (I blogged about this earlier here and the paper is here.)

The assumption about the data is that the tangent covariance grows differently ($O(r^2)$) with respect to the radius than the normal covariance ($O(r^4)$). This can be computed to be true explicitly for manifolds of co-dimension one, for instance.

The result is that if $M$ is assumed to have small enough curvature and $\eta$ is assumed to have small enough variance, then with high probability we can determine the intrinsic dimension $k$ with only $k \log k$ points. Compared to a large number of other dimension-estimating methods, this outperforms all of them. Isomap, in particular, really falls apart in the presence of noise, and can’t tell a 6-dimensional sphere from a 6-dimensional cube.

One interesting fact is that many real-life data sets have dimensionality that varies. A database of text documents, for instance, is low-dimensional in some regions, and very, very high-dimensional in others. (We measure dimensionality locally, so there’s no reason in principle that it shouldn’t vary.)

Measures supported on K planes
Sometimes high-dimensional data is supported on a set of planes $\pi_1 \dots \pi_k$, possibly of different dimensions $d_1 \dots d_k$. We don’t know how many planes there are, which planes they are, or what their dimensions are. This is the situation in face recognition; it’s also relevant to image and signal processing more generally (dictionary learning, sparse approximation, etc.)

Maggioni and collaborators developed an algorithm for doing this.

1) Draw $n_0$ random points and $n n_0$ nearest neighbors. Do multiscale SVD at these points. Produce an estimated plane $\hat{\pi}_k$ and its estimated dimension $\hat{d}_k$. Also estimate the noise level $\hat{\sigma}$.

2. Construct an affinity matrix, with the (i, j)th coordinate

$e^{\frac{-d(x_i, \hat{\pi}_k)^2}{\hat{\sigma}^2}}$

This is almost a matrix composed of zeros and ones — either a point is on a plane (in which we get one) or a point is not on that plane (in which case we get 0 because the distance is large.) In practice, with noise and curvature, it’s not quite binary, but it’s close.

3. Denoise and cluster this matrix, and find updated estimates for the planes from this.

Tested on a face database, this worked very well. It misclassified only 3 faces out of 640. What’s remarkable here is that the Yale Face Database used had 64 different photos of each person, taken at different angles, different lighting conditions, etc. This has always been a hard problem for computer vision: how do you recognize that two pictures are the same person viewed frontal and profile, while two other pictures are of two different people? Apparently, this is a solution to such heterogeneity problems.

Geometric wavelets
The problem of dictionary learning is as follows. Given $n$ points in $\mathbb{R}^D$, and a data matrix $X$, construct a dictionary $\Phi$ of m vectors such that

$X = \Phi \alpha$
where $\alpha$ is a sparse vector.

There’s a tradeoff here between the size of $\Phi$ and the sparsity of $\alpha$: obviously, you could just let $\Phi$ be the whole of $X$ and then $\alpha$ would just be the identity, but that would be silly — it wouldn’t compress the data at all.

One way of doing this is something called geometric wavelets. (Paper here. Shorter, more descriptive paper with cool picture here.)
Given a manifold, we partition it into a dyadic tree. We do principal components analysis on the coarsest scale and get a plane; then we divide that into two smaller subsets, do PCA on each of them, and get two new planes. For each plane, at the center point of that cube in the manifold grid, we have a tangent vector $\Phi_j$, which approximates the tangent to the manifold. The wavelets keep track of the differences $\Phi_j - \Phi_{j+1}$. If we have some smoothness in the manifold, these corrections decay quickly, so finitely many wavelets are a good approximator for any point.

There’s a fast algorithm for computing this, very like the wavelet transform, except for a point cloud rather than a function. This is useful for compressing databases, analyzing online data, and so forth.

The idea is very like Peter Jones’ construction of beta numbers (summary here) except that it may be more efficient in terms of storage. More on geometric wavelets from Dan Lemire.