Posted by Sarah in Uncategorized.
Tags: , , ,

Currently reading this paper by Alex Bronstein et al.

The idea here is to make searchable databases for shapes (2d and 3d objects) in the same way a text engine responds to search queries.

There are two main parts to this task: feature detection, and feature description. Different methods can be employed to define what a feature is; a dense descriptor just selects all the points in the image as descriptors, but there are other possibilities. SIFT looks for local maxima of the discrete image Laplacian that persist through different scales. MSER finds level sets that show the smallest variation of area when traversing the level-set graph.

Feature description aims to produce a “bag of words” from the features, by performing vector quantization on the feature space. Two images can then be compared by comparing their bags of features with plain old Euclidean distance.

There are extra challenges if we want to do this with 3d objects rather than images: for one, conformations of 3d objects (like a body in different poses) are much more complicated than the (usually affine) transformations seen on a given image. Also, while images are typically represented as matrices of pixels, 3d objects can be represented as meshes, point clouds, level sets, etc, so computations have to be portable between formats. Also, 3d objects don’t have a universal system of coordinates.

The feature description process used by the authors is based on heat kernel signatures.
Recall that the heat kernel $K_t(x, z)$ is the fundamental solution of the heat equation
$(\Delta_x + \frac{\partial }{\partial t} )u = 0.$
The heat kernel can also be seen as the transition density function of a Brownian motion.
For any point on the surface, we give its heat kernel signature as
$p(x) = c(x)(K_{t_1}(x, x), K_{t_2}(x, x), \dots K_{t_n}(x, x))$
This is invariant under isometric deformations of the space, since it depends only on the Riemannian metric. It captures local information in a small neighborhood of x for small t, and global information for large t. It can be proven that any continuous map preserving the heat kernel signature must be an isometry. And computing it depends on computing the eigenvalues of the Laplacian, which can be done with various formats of representing 3d shapes.

$K_t(x, x') = \sum_{l = 0}^\infty e^{-\lambda_i t} \phi_l(x) \phi_l(x')$

So, since the eigenvalues of the Laplacian decay exponentially, we can truncate this sum for computation of the heat kernel.
In the discrete case, instead of the Laplacian we would use a discretization of the form

$\hat{\Delta} (f) = \frac{1}{a_i} \sum_j w_{ij} (f_i - f_j)$

After vector quantization to get a bag of features, the authors generalize this notion to a spatiall sensitive bag of features:

$\int_{X \times X} \theta(x) \theta^T(y) K_t(x, y) d\mu(x) d\mu(y)$
giving a matrix that represents the distribution of nearby words.

From there, we can compare these matrices by embedding them into a Hamming Space. (Hamming distances happen to be very quick to compute on modern CPU architectures.

ShapeGoogle was tested against various possible transformations of shapes (isometries, topology changes, holes, noise, scale changes) and performed well, outperforming other methods like Shape DNA, part-based bag of words, and clock matching bag of features.

It’s an interesting project, and I suspect that the ability to search shape databases will prove useful. Here’s a list of current content-based image retrieval search engines: both Google and Bing use these methods (as opposed to only looking at metadata.)