Abstract
Many recognition algorithms depend on careful positioning of an object into a canonical pose, so that the position
of features relative to a fixed coordinate system can be examined. Currently, this positioning is done either manually or by training a class-specialized learning algorithm
with samples of the class that have been hand-labeled with
parts or poses. In this paper, we describe a novel method to
achieve this positioning using poorly aligned examples of a
class with no additional labeling.
Given a set of unaligned exemplars of a class, such as faces, we automatically build an alignment mechanism, without any additional labeling of parts or poses in the data set. Using this alignment mechanism, new members of the class, such as faces resulting from a face detector, can be precisely aligned for the recognition process. Our alignment method improves performance on a face recognition task, both over unaligned images and over images aligned with a face alignment algorithm specifically developed for and trained on hand-labeled face images. We also demonstrate its use on an entirely different class of objects (cars), again without providing any information about parts or pose to the learning algorithm.
1. Introduction
The identification of certain object classes, such as faces
or cars, can be dramatically improved by first transforming
a detected object into a canonical pose. Such registration
reduces the variability that an identification system or classifier must contend with in the modeling process. Subsequent identification can condition on spatial position for a
detailed analysis of the structure of the object in question.
Thus, many recognition algorithms assume the prior rough
alignment of objects to a canonical pose [1, 7, 15, 17]. In
general, the better this alignment is, the better identification results will be. In fact, alignment itself has emerged
as an important sub-problem in the face recognition literature [18], and a number of systems exist for the detailed alignment of specific categories of objects, such as faces [3, 4, 5, 6, 12, 19, 20].
We point out that it is frequently much easier to obtain
images that are roughly aligned than those that are precisely
aligned, indicating an important role for automatic alignment procedures. For example, images of people can be
taken easily with a motion detector in an indoor environment, but the resulting images will not be precisely aligned.
Although there exist many individual components to do
both detection and recognition, we believe the absence of a
complete end-to-end system capable of performing recognition from an arbitrary scene is in large part due to the
difficulty in alignment, the middle stage of the recognition
pipeline (Figure 1). Often, the middle stage is ignored, with
the assumption that the detector will perform a rough alignment, leading to suboptimal recognition performance.
A system that did attempt to address the middle stage
would suffer from two significant drawbacks of current
alignment methods:
- They are typically designed or trained for a single class of objects, such as faces.
- They require the manual labeling either of specific features of an object (like the middle of the eye or the corners of the mouth), or a description of the pose (such as orientation and position information).
As a result, these methods require significant additional
effort when applied to a new class of objects. Either they
must be redesigned from scratch, or a new data set must be
collected, identifying specific parts or poses of the new data
set before an alignment system can be built. In contrast, systems for the detection and recognition steps of the recognition pipeline require only simple, discrete labels, such
as object versus non-object or pair match versus pair non-match, which are straightforward to obtain, making these
systems significantly easier to set up than current systems
for alignment, where even the form of the supervised input
is very often class-dependent.
Some previous work has used detectors capable of returning some information about object rotation, in addition
to position and scale, such as, for faces, [8, 16]. Using the
detected rotation angle, along with the scale and position
of the detected region, one could place each detected object
into a canonical pose. However, so far, these efforts have
only provided very rough alignment due to the lack of precision in estimating the pose parameters. For example, in [8],
the rotation is only estimated to within 30 degrees, so that
one of 12 rotation-specific detectors can be used. Moreover,
even in the case of frontal faces, position and scale are only
roughly estimated, and, in fact, for face images, we use this
as a starting point and show that a more precise alignment
can be obtained.
More concretely, in this work, we describe a system that,
given a collection of images from a particular class, automatically generates an “alignment machine” for that object
class. The alignment machine, which we call an image funnel, takes as input a poorly aligned example of the class and
returns a well-aligned version of the example. The system is
fully automatic in that it is not necessary to label parts of the
objects or identify their initial poses, or even specify what
constitutes an aligned image through an explicitly labeled
canonical pose, although it is important that the objects be
roughly aligned to begin with. For example, our system
can take a set of images as output by the Viola-Jones face
detector, and return an image funnel which dramatically improves the subsequent alignment of facial images.
(We note that the term alignment has a special meaning
in the face recognition community, where it is often used to
refer to the localization of specific facial features. Here,
because we are using images from a variety of different
classes, we use the term alignment to refer to the rectification of a set of objects that places the objects into the same
canonical pose. The purpose of our alignments is not to
identify parts of objects, but rather to improve positioning
for subsequent processing, such as an identification task.)
3. Methodology
3.1. Congealing with SIFT descriptors
We now describe how we have adapted the basic congealing algorithm to work on realistic sets of images. We
consider a sequence of possible choices for the alphabet X
on which to congeal. In particular, we discuss how each
choice improves upon the previous choice, eventually leading to an appropriate feature choice for congealing on complex images.
In applying congealing to complicated images such as
faces from news photographs, a natural first attempt is to
set the alphabet X over the possible color values at each
pixel. However, the high variation in color within the
foreground object, as well as the variation due to lighting,
will cause the distribution field to have high entropy even
under a proper alignment, violating one of the necessary
conditions for congealing to work.
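To make this concrete, the following sketch (illustrative Python, with hypothetical array shapes, not code from our system) estimates the distribution field of a stack of images over a discrete alphabet and sums the per-pixel entropies, the quantity that congealing seeks to reduce; when the aligned foreground itself varies widely in color, these entropies remain high.

```python
import numpy as np

def distribution_field_entropy(images, num_symbols):
    """Total entropy of the distribution field of a stack of images.

    images: integer array of shape (N, H, W), each entry a symbol in
            {0, ..., num_symbols - 1} (e.g. quantized color values).
    Returns the sum over pixels of the entropy of the empirical
    distribution of symbols observed at that pixel across the stack.
    """
    n, h, w = images.shape
    total = 0.0
    for i in range(h):
        for j in range(w):
            counts = np.bincount(images[:, i, j], minlength=num_symbols)
            p = counts / counts.sum()
            p = p[p > 0]                      # drop zero-probability symbols
            total += -(p * np.log(p)).sum()   # entropy at this pixel
    return total
```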
Rather than considering color, one could set X to be binary, corresponding to the absence or presence of an edge
at that pixel. However, another necessary condition for congealing to work is that there must be a “basin of attraction”
at each point in the parameter space toward a low entropy
distribution.
For example, consider two binary images a and b of
the number 1, identical except for an x-translation. When
searching over possible transformations to align b to a, unless the considered transformation is close enough to the
exact displacement to cause b and a to overlap, the transformation will not cause any change in the entropy of the
resulting distribution field.
Another way of viewing the problem is that, when X
is over edge values, there will be plateaus in the objective
function that congealing is minimizing, corresponding to
neighborhoods of transformations that do not cause changes
in the amount of edge overlap between images, creating
many local minima problems in the optimization.
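The plateau can be illustrated with a small sketch (illustrative only; the array sizes and the wrap-around shift are simplifications): two binary images of a vertical stroke that differ by a horizontal translation have an overlap, and hence a distribution-field entropy, that is unchanged for every shift that does not bring the strokes into contact.

```python
import numpy as np

# Two binary "edge" images of a thin vertical stroke (a stylized "1"),
# identical up to a horizontal shift.
h, w = 20, 20
a = np.zeros((h, w))
a[:, 5] = 1
b = np.zeros((h, w))
b[:, 12] = 1

def overlap_after_shift(b, a, dx):
    """Number of 'on' pixels shared by a and b after shifting b by dx columns."""
    shifted = np.roll(b, dx, axis=1)   # wrap-around shift, fine for illustration
    return int((shifted * a).sum())

# The overlap is completely flat (zero) until the shift is large enough for the
# strokes to meet, so a local search sees no gradient to follow.
for dx in range(-8, 0):
    print(dx, overlap_after_shift(b, a, dx))
```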
Therefore, rather than simply taking the edge values, instead, to generate a basin of attraction, one could integrate
the edge values over a window for each pixel. To do this,
we calculate the SIFT descriptor [13] over an 8x8 window
for each pixel. This gives the desired property, since if a
section of one pixel's window shares similar structure with
a section of another pixel's window (not necessarily the corresponding section), then the SIFT descriptors will also be
similar. In addition, using the SIFT descriptor gives additional robustness to lighting variation.
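The sketch below shows one way such a windowed descriptor could be computed densely. It uses a simplified 2x2-cell, 8-orientation-bin gradient histogram (giving 32 dimensions) and should be read as an assumption-laden stand-in rather than the exact SIFT variant of our implementation.

```python
import numpy as np

def dense_window_descriptors(img, win=8, cells=2, bins=8):
    """Simplified SIFT-like descriptor at every pixel of a grayscale image.

    For each pixel, gradient orientations inside a win x win window are
    accumulated into a (cells x cells) grid of orientation histograms,
    giving cells*cells*bins dimensions (2*2*8 = 32 here).
    """
    img = img.astype(np.float64)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)              # in [0, 2*pi)
    obin = np.minimum((ori / (2 * np.pi) * bins).astype(int), bins - 1)

    h, w = img.shape
    half = win // 2
    cell = win // cells
    desc = np.zeros((h, w, cells * cells * bins))
    for i in range(half, h - half):
        for j in range(half, w - half):
            d = np.zeros((cells, cells, bins))
            for di in range(-half, half):
                for dj in range(-half, half):
                    ci, cj = (di + half) // cell, (dj + half) // cell
                    d[ci, cj, obin[i + di, j + dj]] += mag[i + di, j + dj]
            v = d.ravel()
            norm = np.linalg.norm(v)
            desc[i, j] = v / norm if norm > 0 else v          # normalize for lighting robustness
    return desc
```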
Congealing directly with the SIFT descriptors has its
own difficulties, as each SIFT descriptor is a 32-dimensional
vector in our implementation, which is too large a space
in which to estimate entropy without an extremely large amount of
data. Instead, we compute the SIFT descriptor for each
pixel of each image in the set, then cluster these descriptors using k-means to produce a small set of clusters (12 clusters in our experiments), and let X range
over the possible clusters. In other words, the distribution
fields consist of distributions over the possible clusters at
each pixel.
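One possible implementation of this clustering step, assuming descriptors computed as in the previous sketch and using scikit-learn's k-means (not necessarily the implementation we used), is:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_descriptor_clusters(desc_list, num_clusters=12, seed=0):
    """Cluster the per-pixel descriptors of all images.

    desc_list: one (H, W, D) descriptor array per image.
    Stacks every per-pixel descriptor into an (N*H*W, D) matrix and
    clusters it; the alphabet X then ranges over the cluster indices.
    """
    all_desc = np.concatenate([d.reshape(-1, d.shape[-1]) for d in desc_list])
    km = KMeans(n_clusters=num_clusters, random_state=seed).fit(all_desc)
    return km  # km.cluster_centers_ holds one center per symbol of X
```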
After clustering, rather than assigning a cluster for each
pixel, we instead do a soft assignment of cluster values for
each pixel. Congealing with hard assignments of pixels to
clusters would force each pixel to take one of a small number of cluster values, leading to local plateaus in the optimization landscape. For example, in the simplest case,
doing a hard assignment with two clusters would lead to the
same local minima problems as discussed before with edge
values.
This problem of local minima was borne out by preliminary experiments we ran using hard cluster assignments,
where we found that the congealing algorithm would terminate early without significantly altering the initial alignment
of any of the images.
To get around this problem, we model the pixels' SIFT
descriptors as being generated from a mixture of Gaussians,
with one Gaussian centered at each cluster center
and a standard deviation σi for each cluster chosen to maximize the likelihood of
the labeling. Then, for each pixel, we have a multinomial
distribution with size equal to the number of clusters, where
the probability of an outcome i is equal to the probability
that the pixel belongs to cluster i. So, instead of having an intensity value at each pixel, as in traditional congealing,
we have a vector of probabilities at each pixel.
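This soft assignment can be sketched as follows, assuming isotropic Gaussians with the per-cluster σi's already fit as described above; the routine is illustrative rather than our exact implementation.

```python
import numpy as np

def soft_assignments(desc, centers, sigmas):
    """Per-pixel multinomial over clusters via a Gaussian mixture.

    desc:    (H, W, D) descriptors for one image.
    centers: (K, D) cluster centers from k-means.
    sigmas:  (K,) per-cluster standard deviations (assumed already fit).
    Returns an (H, W, K) array whose last axis sums to 1: the probability
    that each pixel's descriptor was generated by each cluster.
    """
    h, w, d = desc.shape
    x = desc.reshape(-1, 1, d)                       # (H*W, 1, D)
    diff = x - centers[None, :, :]                   # (H*W, K, D)
    sq = (diff ** 2).sum(axis=-1)                    # squared distances to centers
    log_p = -0.5 * sq / sigmas[None, :] ** 2 - d * np.log(sigmas)[None, :]
    log_p -= log_p.max(axis=1, keepdims=True)        # stabilize before exponentiating
    p = np.exp(log_p)
    p /= p.sum(axis=1, keepdims=True)                # normalize to a multinomial per pixel
    return p.reshape(h, w, -1)
```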
The idea of treating each pixel as a mixture of clusters is
motivated by the analogy to gray pixels in the binary image
case. In the binary image case, a gray pixel is interpreted
as being a mixture of underlying black and white “subpixels” [10]. In the same way, rather than doing a hard assignment of a pixel to one cluster, we treat each pixel as being a
mixture of the underlying clusters.