
Wednesday, 20 October 2021

Unsupervised Joint Alignment of Complex Images


Abstract 

Many recognition algorithms depend on careful positioning of an object into a canonical pose, so the position of features relative to a fixed coordinate system can be examined. Currently, this positioning is done either manually or by training a class-specialized learning algorithm with samples of the class that have been hand-labeled with parts or poses. In this paper, we describe a novel method to achieve this positioning using poorly aligned examples of a class with no additional labeling. 

Given a set of unaligned exemplars of a class, such as faces, we automatically build an alignment mechanism, without any additional labeling of parts or poses in the data set. Using this alignment mechanism, new members of the class, such as faces resulting from a face detector, can be precisely aligned for the recognition process. Our alignment method improves performance on a face recognition task, both over unaligned images and over images aligned with a face alignment algorithm specifically developed for and trained on hand-labeled face images. We also demonstrate its use on an entirely different class of objects (cars), again without providing any information about parts or pose to the learning algorithm.

1. Introduction 

The identification of certain object classes, such as faces or cars, can be dramatically improved by first transforming a detected object into a canonical pose. Such registration reduces the variability that an identification system or classifier must contend with in the modeling process. Subsequent identification can condition on spatial position for a detailed analysis of the structure of the object in question. Thus, many recognition algorithms assume the prior rough alignment of objects to a canonical pose [1, 7, 15, 17]. In general, the better this alignment is, the better identification results will be. In fact, alignment itself has emerged as an important sub-problem in the face recognition literature [18], and a number of systems exist for the detailed alignment of specific categories of objects, such as faces [3, 4, 5, 6, 12, 19, 20].

We point out that it is frequently much easier to obtain images that are roughly aligned than ones that are precisely aligned, indicating an important role for automatic alignment procedures. For example, images of people can be taken easily with a motion detector in an indoor environment, but this will result in images that are not precisely aligned.
Although there exist many individual components to do both detection and recognition, we believe the absence of a complete end-to-end system capable of performing recognition from an arbitrary scene is in large part due to the difficulty of alignment, the middle stage of the recognition pipeline (Figure 1). Often, the middle stage is ignored, with the assumption that the detector will perform a rough alignment, leading to suboptimal recognition performance. A system that did attempt to address the middle stage would suffer from two significant drawbacks of current alignment methods:
  • They are typically designed or trained for a single class of objects, such as faces. 
  • They require manual labeling, either of specific features of an object (like the middle of the eye or the corners of the mouth) or of a description of the pose (such as orientation and position information).
As a result, these methods require significant additional effort when applied to a new class of objects. Either they must be redesigned from scratch, or a new data set must be collected, identifying specific parts or poses of the new data set, before an alignment system can be built. In contrast, systems for the detection and recognition steps of the recognition pipeline only require simple, discrete labels, such as object versus non-object or pair match versus pair non-match, which are straightforward to obtain. This makes those systems significantly easier to set up than current systems for alignment, where even the form of the supervised input is very often class-dependent.

Some previous work has used detectors capable of returning some information about object rotation, in addition to position and scale, such as, for faces, [8, 16]. Using the detected rotation angle, along with the scale and position of the detected region, one could place each detected object into a canonical pose. However, so far, these efforts have only provided very rough alignment due to the lack of precision in estimating the pose parameters. For example, in [8], the rotation is only estimated to within 30 degrees, so that one of 12 rotation-specific detectors can be used. Moreover, even in the case of frontal faces, position and scale are only roughly estimated; in fact, for face images, we use this as a starting point and show that a more precise alignment can be obtained.

More concretely, in this work we describe a system that, given a collection of images from a particular class, automatically generates an "alignment machine" for that object class. The alignment machine, which we call an image funnel, takes as input a poorly aligned example of the class and returns a well-aligned version of the example. The system is fully automatic in that it is not necessary to label parts of the objects, to identify their initial poses, or even to specify what constitutes an aligned image through an explicitly labeled canonical pose, although it is important that the objects be roughly aligned to begin with. For example, our system can take a set of images as output by the Viola-Jones face detector and return an image funnel which dramatically improves the subsequent alignment of facial images.

(We note that the term alignment has a special meaning in the face recognition community, where it is often used to refer to the localization of specific facial features. Here, because we are using images from a variety of different classes, we use the term alignment to refer to the rectification of a set of objects that places them into the same canonical pose. The purpose of our alignments is not to identify parts of objects, but rather to improve positioning for subsequent processing, such as an identification task.)

3. Methodology 

3.1. Congealing with SIFT descriptors 



We now describe how we have adapted the basic congealing algorithm to work on realistic sets of images. We consider a sequence of possible choices for the alphabet X on which to congeal. In particular, we discuss how each choice improves upon the previous one, eventually leading to an appropriate feature choice for congealing on complex images.

In applying congealing to complicated images such as faces from news photographs, a natural first attempt is to set the alphabet X over the possible color values at each pixel. However, the high variation in the color of the foreground object, as well as the variation due to lighting, will cause the distribution field to have high entropy even under a proper alignment, violating one of the necessary conditions for congealing to work.

Rather than considering color, one could set X to be binary, corresponding to the absence or presence of an edge at each pixel. However, another necessary condition for congealing to work is that there must be a "basin of attraction" at each point in the parameter space toward a low-entropy distribution. For example, consider two binary images a and b of the number 1, identical except for an x-translation. When searching over possible transformations to align b to a, unless the considered transformation is close enough to the exact displacement to cause b and a to overlap, the transformation will not cause any change in the entropy of the resulting distribution field. Another way of viewing the problem is that, when X is over edge values, there will be plateaus in the objective function that congealing is minimizing, corresponding to neighborhoods of transformations that do not change the amount of edge overlap between images, creating many local minima in the optimization.

Therefore, rather than simply taking the edge values, one could instead integrate the edge values over a window around each pixel, generating a basin of attraction. To do this, we calculate the SIFT descriptor [13] over an 8x8 window for each pixel. This gives the desired property: if a section of one pixel's window shares similar structure with a section of another pixel's window (not necessarily the corresponding section), then the SIFT descriptors will also be similar. In addition, using the SIFT descriptor gives additional robustness to lighting.

Congealing directly with the SIFT descriptors has its own difficulties, as each SIFT descriptor is a 32-dimensional vector in our implementation, which is too large a space in which to estimate entropy without an extremely large amount of data. Instead, we compute the SIFT descriptors for each pixel of each image in the set, cluster them using k-means to produce a small set of clusters (in our experiments, we use 12 clusters), and let X range over the possible clusters. In other words, the distribution fields consist of distributions over the possible clusters at each pixel.

After clustering, rather than assigning a cluster to each pixel, we instead do a soft assignment of cluster values for each pixel. Congealing with hard assignments of pixels to clusters would force each pixel to take one of a small number of cluster values, leading to plateaus in the optimization landscape. For example, in the simplest case, doing a hard assignment with two clusters would lead to the same local minima problems discussed above with edge values.
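To make the preceding pipeline concrete, below is a minimal Python sketch of the dense-SIFT-and-clustering step. It is not the authors' implementation: it assumes OpenCV's standard 128-dimensional SIFT in place of the paper's 32-dimensional variant and scikit-learn's KMeans for the clustering, and the helper names (dense_sift, build_cluster_alphabet) are our own. The 8x8 windows and 12 clusters follow the text; the soft assignment is sketched after the next paragraph.

# A sketch of the per-pixel SIFT + k-means alphabet, under the assumptions
# noted above. In practice one would subsample pixels before clustering,
# since stacking a descriptor for every pixel of every image is large.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift(gray_img, step=1, window=8.0):
    """SIFT descriptor centered on every pixel (stride `step`).
    Note: OpenCV may drop keypoints too close to the image border."""
    sift = cv2.SIFT_create()
    h, w = gray_img.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), window)
                 for y in range(0, h, step)
                 for x in range(0, w, step)]
    _, desc = sift.compute(gray_img, keypoints)
    return desc  # shape: (num_kept_pixels, 128)

def build_cluster_alphabet(gray_images, n_clusters=12):
    """Cluster per-pixel descriptors from the whole image set with k-means;
    the cluster centers define the alphabet X used for congealing."""
    all_desc = np.vstack([dense_sift(img) for img in gray_images])
    return KMeans(n_clusters=n_clusters, n_init=10).fit(all_desc)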
This problem of local minima was borne out by preliminary experiments we ran using hard cluster assignments, in which the congealing algorithm would terminate early without significantly altering the initial alignment of any of the images. To get around this problem, we model the pixels' SIFT descriptors as being generated from a mixture of Gaussians, with one Gaussian centered at each cluster center and a σ_i for each cluster chosen to maximize the likelihood of the labeling. Then, for each pixel, we have a multinomial distribution of size equal to the number of clusters, where the probability of outcome i is the probability that the pixel belongs to cluster i. So, instead of having an intensity value at each pixel, as in traditional congealing, we have a vector of probabilities at each pixel.

The idea of treating each pixel as a mixture of clusters is motivated by the analogy to gray pixels in the binary image case. In the binary image case, a gray pixel is interpreted as a mixture of underlying black and white "subpixels" [10]. In the same way, rather than doing a hard assignment of a pixel to one cluster, we treat each pixel as a mixture of the underlying clusters.
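Continuing the sketch, the soft assignment can be written directly from the mixture description above. The helper names (fit_sigmas, soft_assign, distribution_field_entropy) are again our own; the per-cluster sigma estimate below is a simple stand-in for the maximum-likelihood fit the text describes, and the last function states the congealing objective (the summed per-pixel entropy of the distribution field) as used in the congealing framework of [10].

# Soft cluster assignment: each pixel carries a multinomial over the 12
# clusters instead of a hard label. Pure NumPy; assumes the k-means fit
# from the previous sketch (its cluster_centers_ and labels_).
import numpy as np

def fit_sigmas(descriptors, labels, centers):
    """Isotropic per-cluster std dev, estimated from the hard k-means
    assignments (a simple proxy for the ML fit described in the text)."""
    d = centers.shape[1]
    sigmas = np.empty(len(centers))
    for i, c in enumerate(centers):
        diffs = descriptors[labels == i] - c
        sigmas[i] = np.sqrt(np.mean(np.sum(diffs ** 2, axis=1)) / d)
    return sigmas

def soft_assign(desc, centers, sigmas):
    """Multinomial over clusters for one descriptor:
    p_i proportional to an isotropic Gaussian N(desc; mu_i, sigma_i^2 I)."""
    sq_dist = np.sum((centers - desc) ** 2, axis=1)
    log_p = -sq_dist / (2 * sigmas ** 2) - desc.shape[0] * np.log(sigmas)
    log_p -= log_p.max()          # subtract max for numerical stability
    p = np.exp(log_p)
    return p / p.sum()

def distribution_field_entropy(prob_maps):
    """Congealing objective: summed entropy of the mean distribution at
    each pixel. prob_maps: (n_images, n_pixels, n_clusters) soft labels."""
    field = prob_maps.mean(axis=0)          # the distribution field
    return -np.sum(field * np.log(field + 1e-12))

Minimizing this entropy over per-image transformation parameters is what drives the alignment; with hard labels the objective would be piecewise constant, which is exactly the plateau problem the soft assignment avoids.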

