{"title": "Learning Appearance Based Models: Mixtures of Second Moment Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 845, "page_last": 851, "abstract": "", "full_text": "Learning Appearance Based Models: \nMixtures of Second Moment Experts \n\nChristoph 8regler and Jitendra Malik \n\nComputer Science Division \n\nUniversity of California at Berkeley \n\nemail: bregler@cs.berkeley.edu, malik@cs.berkeley.edu \n\nBerkeley, CA 94720 \n\nAbstract \n\nThis paper describes a new technique for object recognition based on learning \nappearance models. The image is decomposed into local regions which are \ndescribed by a new texture representation called \"Generalized Second Mo(cid:173)\nments\" that are derived from the output of multiscale, multiorientation filter \nbanks. Class-characteristic local texture features and their global composition \nis learned by a hierarchical mixture of experts architecture (Jordan & Jacobs). \nThe technique is applied to a vehicle database consisting of 5 general car \ncategories (Sedan, Van with back-doors, Van without back-doors, old Sedan, \nand Volkswagen Bug). This is a difficult problem with considerable in-class \nvariation. The new technique has a 6.5% misclassification rate, compared to \neigen-images which give 17.4% misclassification rate, and nearest neighbors \nwhich give 15 .7% misclassification rate. \n\n1 Introduction \n\nUntil a few years ago neural network and other statistical learning techniques were not very \npopular in computer vision domains. Usually such techniques were only applied to artificial \nvisual data or non-mainstream problems such as handwritten digit recognition. \nA significant shift has occurred recently with the successful application of appearance-based \nor viewer-centered techniques for object recognition, supplementing the use of 3D models. \nAppearance-based schemes rely on collections of images of the object. 
A principal advantage \nis that they implicitly capture both shape and photometric information (e.g. surface reflectance \nvariation). They have been most successfully applied in the domain of human faces [15, 11, 1, \n14], though other 3D objects under fixed lighting have also been considered [13]. View-based \nrepresentations lend themselves very naturally to learning from examples: principal component \nanalysis [15, 13] and radial basis functions [1] have been used. \nApproaches such as principal component analysis (or \"eigen-images\") use global representations \nat the image level. The objective of our research was to develop a representation which would \n\n\f846 \n\nC. Bregler and J. Malik \n\nbe more 'localist', where representations of different 'parts' of the object would be composed \ntogether to form the representation of the object as a whole. This appears to be essential in order \nto obtain robustness to occlusion and ease of generalization when different objects in a class may \nhave variations in particular parts but not others. A part-based view is also more consistent with \nwhat is known about human object recognition (Tanaka and collaborators). \nIn this paper, we propose a domain-independent part decomposition using a 2D grid representation \nof overlapping local image regions. The image features of each local patch are represented using a \nnew texture descriptor that we call \"Generalized Second Moments\". Related representations have \nalready been successfully applied to other early-vision tasks like stereopsis, motion, and texture \ndiscrimination. Class-based local texture features and their global relationships are induced using \nthe \"Hierarchical Mixtures of Experts\" architecture (HME) [8]. \nWe apply this technique to the domain of vehicle classification. The vehicles are seen from behind \nby a camera mounted above a freeway (Figure 1). 
We urge the reader to examine Figure 3 to see \nexamples of the in-class variations in the 5 different categories. Our technique could classify the five \nbroad categories with a misclassification error as low as 6.5%, while the best results using \neigen-images and nearest-neighbor techniques were 17.4% and 15.7% misclassification error. \n\nFigure 1: Typical shot of the freeway segment \n\n2 Representation \n\nAn appearance-based representation should be able to capture features that discriminate the \ndifferent object categories. It should capture both local textural and global structural information. \nThis corresponds roughly to the notion in 3D object models of (i) parts and (ii) relationships between \nparts. \n\n2.1 Structural Description \n\nObjects can usually be decomposed into parts. A face consists of eyes, nose, and mouth. Cars \nare made out of window screens, taillights, license plates, etc. The question is what granularity is \nappropriate and how much domain knowledge should be exploited. A car could be a single part in \na scene, a license plate could be a part, or the letters in the license plate could be the decomposed \nparts. Eyes, nose, and mouth could be the most important parts of a face for recognition, but \nmaybe other parts are important as well. \nIt would be advantageous if each part could be described in a decoupled way using a representation \nthat was most appropriate for it. Object classification should be based on these local part \ndescriptions and the relationships between the parts. The partitioning greatly reduces the complexity, \nand invariance to the precise relation between the parts could be achieved. \nFor our domain of vehicle classification we don't believe it is appropriate to explicitly code any \npart decomposition. The kind and number of useful parts might vary across different car makes. 
\nThe resolution of the images (100x100 pixels) restricts us to a certain degree of granularity. We \ndecided to decompose the image using a 2D grid of overlapping tiles (Gaussian windows), with \nonly local feature extraction done for each tile region. The content of each local tile is represented \nby a feature vector (next section). The generic grid representation allows the mixture of experts \narchitecture to induce class-based part decompositions, and to extract local texture and global shape \nfeatures. For example, the outline of a face could be represented by certain orientation dominances \nin the local tiles at positions of the face boundary. The eyes are other characteristic features in \nthe tiles. \n\n2.2 Local Features \n\nWe wanted to extract characteristic features from each local tile. The traditional computer vision \napproach would be to find edges and junctions. The weakness of these representations is that \nthey do not capture the richness of textured regions, and the hard decision thresholds make the \nmeasurement process non-robust. \nAn alternative view is motivated by our understanding of processing in biological vision systems. \nWe start by convolving image regions with a large number of spatial filters, at various orientations, \nphases, and scales. The response values of such filters contain much more general information \nabout the local neighborhood, a fact that has now been recognized and exploited in a number of \nearly vision tasks like stereopsis, motion, and texture analysis [16, 9, 6, 12, 7]. \nAlthough this approach is loosely inspired by the current understanding of processing in the \nearly stages of the primate visual system, the use of spatial filters has many advantages from a \npure analytical viewpoint [9, 7]. As filter kernels, we use orientation-selective elongated Gaussian \nderivatives. 
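Such an oriented, elongated Gaussian derivative filter bank can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the kernel size, the 3:1 elongation default, the normalization, and the toy image are our assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_deriv_kernel(theta, sigma, elong=3.0, size=15):
    # Elongated first Gaussian derivative, oriented at angle theta.
    # Kernel size and elongation are illustrative choices, not the paper's values.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    u = np.cos(theta) * x + np.sin(theta) * y    # across-orientation coordinate
    v = -np.sin(theta) * x + np.cos(theta) * y   # along-orientation coordinate
    g = np.exp(-(u ** 2 / (2 * sigma ** 2) + v ** 2 / (2 * (elong * sigma) ** 2)))
    k = -u / sigma ** 2 * g                      # derivative across the orientation
    return k / np.abs(k).sum()                   # normalize total absolute weight

# Toy image with a vertical step edge: the theta=0 filter responds strongly,
# while the theta=pi/2 filter (sensitive to horizontal edges) stays near zero.
img = np.zeros((32, 32))
img[:, 16:] = 1.0
thetas = np.linspace(0.0, np.pi, 4, endpoint=False)
bank = [gaussian_deriv_kernel(t, sigma=2.0) for t in thetas]
responses = np.stack([convolve2d(img, k, mode='same', boundary='symm')
                      for k in bank])
```

Each pixel now carries a vector of per-orientation responses; the text below discusses how to compress these vectors rather than threshold them into edges.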
This enables one to gain the power of orientation-specific features, such as edges, \nwithout the disadvantage of non-robustness due to hard thresholds. If multiple orientations are \npresent at a single point (e.g. junctions), they are represented in a natural way. Since multiple \nscales are used for the filters, no ad hoc choices have to be made for the scale parameters of \nthe feature detectors. Interestingly, the choices of these filter kernels can also be motivated in a \nlearning paradigm, as they provide very useful intermediate-layer units in convolutional neural \nnetworks [3]. \nThe straightforward approach would then be to characterize each image pixel by such a vector \nof feature responses. However, note that there is considerable redundancy in the filter responses: \nparticularly at coarse scales, the responses of filters at neighboring pixels are strongly correlated. \nWe would like to compress the representation in some way. One approach might be to subsample \nat coarse scales; another might be to choose feature locations with local magnitude maxima or \nhigh responses across several directions. However, there might be many such interesting points \nin an image region. It is unclear how to pick the right number of points and how to order them. \nLeaving this issue of compressing the filter response representation aside for the moment, let \nus study other possible representations of low-level image data. One way of representing the \ntexture in a local region is to calculate a windowed second moment matrix [5]. Instead of finding \nmaxima of filter responses, the second moments of brightness gradients in the local neighborhood \nare weighted and averaged with a circular Gaussian window. The gradient is a special case of \noriented Gaussian filter banks. The windowed second moment matrix takes into account the \nresponse of all filters in this neighborhood. 
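A minimal sketch of that gradient-based windowed second moment matrix (the Gaussian window scale here is an arbitrary illustrative choice; the paper's version follows [5]):

```python
import numpy as np

def windowed_second_moment(patch, sigma=3.0):
    # Gaussian-weighted average of the outer products [Ix, Iy][Ix, Iy]^T
    # of the brightness gradients over the patch.
    Iy, Ix = np.gradient(patch.astype(float))
    h, w = patch.shape
    y, x = np.mgrid[:h, :w]
    W = np.exp(-(((x - (w - 1) / 2.0) ** 2 + (y - (h - 1) / 2.0) ** 2)
                 / (2 * sigma ** 2)))
    W /= W.sum()                                  # normalized circular window
    return np.array([[(W * Ix * Ix).sum(), (W * Ix * Iy).sum()],
                     [(W * Ix * Iy).sum(), (W * Iy * Iy).sum()]])

# A horizontal ramp has a pure x-gradient: all energy lands in M[0, 0].
patch = np.tile(np.arange(16.0), (16, 1))
M = windowed_second_moment(patch)
```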
The disadvantage is that gradients are not very \norientation-selective, and a certain scale has to be selected beforehand. Averaging the gradients \n\"washes\" out the detailed orientation information in complex texture regions. \nOrientation histograms would avoid this effect and have been applied successfully for classification [4]. Elongated families of oriented and scaled kernels could be used to estimate the \norientation at each point. But as pointed out already, there might be more than one orientation at \neach point, and significant information is lost. \n\nFigure 2: Left image: The black rectangle outlines the selected area of interest. Right image: The \nreconstructed scale and rotation distribution of the Generalized Second Moments. The horizontal \naxis shows angles between 0 and 180 degrees, and the vertical axis shows different scales. \n\n3 Generalized Second Moments \n\nWe propose a new way to represent the texture in a local image patch by combining the filter \nbank approach with the idea of second moment matrices. \n\nThe goal is to compute a feature vector for a local image patch that contains information about \nthe orientation and scale distribution. For each pixel in the image patch, we compute the R basis \nkernel responses (using X-Y separable, steerable, scalable approximations of a rich filter family). \nGiven a spatial weighting function of the patch (e.g. Gaussian), we compute the covariance \nmatrix of the weighted set of R-dimensional vectors. In [2] we show that this covariance matrix \ncan be used to reconstruct, for any desired oriented and scaled version of the filter family, the \nweighted sum of all filter response energies: \n\nE(θ, σ) = Σ_{x,y} W(x, y) [F_{θ,σ}(x, y)]^2 \n\n(1) \n\nUsing elongated kernels produces orientation/scale peaks; therefore the sum of all orientation/scale responses doesn't \"wash\" out high peaks. 
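The covariance computation behind Equation (1) can be sketched as follows. The steerable basis of [2] is not reproduced here; random arrays stand in for the R filter response planes, so only the spatial weighting, the covariance, and the trace normalization (the illumination-invariance step discussed in the text) are shown:

```python
import numpy as np

def generalized_second_moments(responses, weights):
    # responses: (R, h, w) basis kernel outputs for one patch;
    # weights: (h, w) spatial weighting function, e.g. a Gaussian window.
    R = responses.shape[0]
    V = responses.reshape(R, -1)          # one R-dimensional vector per pixel
    w = weights.ravel() / weights.sum()
    mu = V @ w
    Vc = V - mu[:, None]
    C = (Vc * w) @ Vc.T                   # weighted R x R covariance matrix
    return C / np.trace(C)                # trace-normalized covariance

rng = np.random.default_rng(0)
resp = rng.normal(size=(5, 10, 10))       # stand-in for 5 basis kernel responses
win = np.ones((10, 10))
C = generalized_second_moments(resp, win)
C_bright = generalized_second_moments(3.0 * resp, win)  # global intensity scaling
```

Because filter responses scale linearly with image intensity, C and C_bright coincide, which is exactly the invariance the trace division is meant to provide.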
The height of each individual peak corresponds to the intensity in the image. Weak, noisy orientations have no high-energy responses in the \nsum. E(θ, σ) is in effect a \"soft\" orientation/scale histogram of the local image patch. Figure \n2 shows an example of such a scale/orientation reconstruction based on the covariance matrix \n(see [2] for details). Three peaks are seen, representing the edge lines along three directions and \nscales in the local image patch. \nThis representation greatly reduces the dimensionality without being domain-specific or applying \nany hard decisions. It is shift-invariant in the local neighborhood and decouples scale in a nice \nway. Dividing the R x R covariance matrix by its trace also makes this representation illumination-invariant. \nUsing a 10x10 grid and a kernel basis of 5 first Gaussian derivatives and 5 second Gaussian \nderivatives represents each input image as a 10 · 10 · (5 + 1) · 5 = 3000 dimensional vector (a \n5 x 5 covariance matrix has (5 + 1) · 5 independent parameters). Potentially we could represent \nthe full image with one generalized second moment matrix of dimension 20 if we don't care \nabout capturing the part decomposition. \n\n4 Mixtures of Experts \n\nEven if we only deal with the restricted domain of man-made object categories (e.g. cars), the \nextracted features still have considerable in-class variation. Different car shapes and poses \nproduce nonlinear class subspaces. Hierarchical Mixtures of Experts (HME, by Jordan & Jacobs) \nare able to model such nonlinear decision surfaces with a soft hierarchical decomposition of the \nfeature space and local linear classification experts. \n\nFigure 3: Example images of the five vehicle classes (Sedan, Van1, Bug, Old, Van2). \n\nPotentially, different experts are \"responsible\" \nfor different object poses or sub-categories. 
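As a sketch of the flat (one-level) case of such a mixture, the following shows how softmax gating blends the class posteriors of linear experts; the hierarchical version nests this gating recursively, and all weights here are random placeholders rather than EM-fitted values:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_predict(x, gate_W, expert_Ws):
    # Flat mixture of linear softmax experts: the gate softly assigns each
    # feature vector to experts, whose class posteriors are then blended.
    g = softmax(x @ gate_W)                                        # (n, E) gating
    preds = np.stack([softmax(x @ W) for W in expert_Ws], axis=1)  # (n, E, C)
    return (g[:, :, None] * preds).sum(axis=1)                     # (n, C) mixture

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 64))                  # 4 feature vectors, 64-dim subspace
gate_W = rng.normal(size=(64, 8))             # 8 experts, as in the experiments
experts = [rng.normal(size=(64, 5)) for _ in range(8)]   # 5 vehicle classes
p = moe_predict(x, gate_W, experts)
```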
\nThe gating functions decompose the feature space into a nested set of regions using a hierarchical \nstructure of soft decision boundaries. Each region is the domain of a specific expert that classifies \nthe feature vectors into object categories. We used generalized linear models (GLIM). Given the \ntraining data and output labels, the gating functions and expert functions can be estimated using \nan iterative version of the EM algorithm. For more detail see [8]. \n\nIn order to reduce training time and storage requirements, we trained such nonlinear decision \nsurfaces embedded in one global linear subspace. We chose the dimension of this linear \nsubspace to be large enough that it captures most of the lower-dimensional nonlinearity (the \n3000-dimensional feature space is projected into a 64-dimensional subspace estimated by principal \ncomponents analysis). \n\n5 Experiments \n\nWe experimented with a database consisting of images taken from a surveillance camera on a \nbridge covering normal daylight traffic on a freeway segment (Figure 1). The goal is to classify \ndifferent types of vehicles. We are able to segment each moving object based on motion cues \n[10]. We chose the following 5 vehicle classes: Modern Sedan, Old Sedan, Van with back-doors, \nVan without back-doors, and Volkswagen Bug. The images show the rear of the car across a small \nset of poses (Figure 3). All images are normalized to 100x100 pixels using bilinear interpolation. \nFor this reason the size or aspect ratio cannot be used as a feature. \nWe ran our experiments using two different image representations: \n\n• Generalized Second Moments: A 10 x 10 grid was used. Generalized second moments \nwere computed1 using a window of σ = 6 pixels, and 5 filter bases of 3:1 elongated first \nand second Gaussian derivatives on a scale range between 0.25 and 1.0. \n\n• Principal Components Analysis (\"Eigen-Images\"): We used no grid decomposition and \nprojected the global gray-level vector into a 64-dimensional linear space. \n\nTwo different classifiers were used: \n\n• HME architecture with 8 local experts. \n• A simple 1-Nearest-Neighbor classifier (1-NN). \n\nFigure 4: The classification errors of four different techniques. The X-axis shows the size of \nthe training set, and the Y-axis shows the percentage of misclassified test images. HME stands \nfor Hierarchical Mixtures of Experts, GSM stands for Generalized Second Moments, and 1-NN \nstands for Nearest Neighbors. \n\nFigure 4 shows the classification error rate for all 4 combinations as a function of the size of the \ntraining set. Each experiment is run 5 times with different samplings of training and test images2. \nThe database consists of 285 example images. Therefore the number of test images is (285 - \nnumber of training images). \nAcross all experiments the HME architecture based on Generalized Second Moments was superior to all \nother techniques. The best performance, with a misclassification of 6.5%, was achieved using 228 \ntraining images. When fewer than 120 training images are used, the HME architecture performed \nworse than nearest neighbors. \nThe most common confusion was between sedans and \"old\" sedans. The second most common confusion \nwas among vans with back-doors, vans without back-doors, and old sedans. 
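The two baselines (eigen-image projection and 1-NN classification) can be sketched on synthetic data as follows; the 285-image vehicle database is not reproduced, so the cluster layout and subspace dimension below are illustrative:

```python
import numpy as np

def pca_fit(X, k):
    # Eigen-image style fit: mean vector plus the top-k principal directions.
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def nn1_classify(train_Z, train_y, test_Z):
    # 1-nearest-neighbor in the projected subspace (squared Euclidean distance).
    d = ((test_Z[:, None, :] - train_Z[None, :, :]) ** 2).sum(axis=-1)
    return train_y[d.argmin(axis=1)]

rng = np.random.default_rng(2)
train = np.vstack([rng.normal(0.0, 0.1, (20, 10)),
                   rng.normal(3.0, 0.1, (20, 10))])   # two well-separated classes
labels = np.array([0] * 20 + [1] * 20)
mu, P = pca_fit(train, k=2)
test = np.vstack([rng.normal(0.0, 0.1, (5, 10)),
                  rng.normal(3.0, 0.1, (5, 10))])
pred = nn1_classify((train - mu) @ P.T, labels, (test - mu) @ P.T)
```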
\n\n1We also experimented with grid sizes between 6x6 and 16x16, and with 8 filter bases and a rectangular \nwindow for the second moment statistics, without significant improvement. \n\n2For a given training size n, we trained 5 classifiers on 5 different training and test sets and computed \nthe average error rate. The training and test set for each classifier was generated from the same database. The \nn training examples were randomly sampled from the database, and the remaining examples were used for \nthe test set. \n\n6 Conclusion \n\nWe have demonstrated a new technique for appearance-based object recognition based on a 2D \ngrid representation, generalized second moments, and hierarchical mixtures of experts. Experiments have shown that this technique has significantly better performance than other representation \ntechniques like eigen-images and other classification techniques like nearest neighbors. \n\nWe believe that learning such appearance-based representations offers a very attractive methodology. Hand-coding features that could discriminate object categories like the different car types \nin our database seems to be a nearly impossible task. The only choice in such domains is to \nestimate discriminating features from a set of example images automatically. \n\nThe proposed technique can be applied to other domains as well. We are planning to experiment \nwith face databases, as well as larger car databases and categories, to further investigate the utility \nof hierarchical mixtures of experts and generalized second moments. \n\nAcknowledgments We would like to thank Leo Breiman, Jerry Feldman, Thomas Leung, Stuart Russell, \nand Jianbo Shi for helpful discussions and Michael Jordan, Lawrence Saul, and Doug Shy for providing \ncode. \n\nReferences \n[1] D. Beymer, A. Shashua, and T. Poggio. Example based image analysis and synthesis. M.I.T. A.I. 
\n\nMemo No. 1431, Nov 1993. \n\n[2] C. Bregler and J. Malik. Learning Appearance Based Models: Hierarchical Mixtures of Experts \nApproach based on Generalized Second Moments. Technical Report UCB/CSD-96-897, Comp. Sci. \nDep., U.C. Berkeley, http://www.cs/breglerlsoft.html, 1996. \n\n[3] Y. Le Cun, B. Boser, J.S. Denker, S. Solla, R. Howard, and L. Jackel. Back-propagation applied to \nhandwritten zipcode recognition. Neural Computation, 1(4), 1990. \n\n[4] W. Freeman and M. Roth. Orientation histograms for hand gesture recognition. In International \nWorkshop on Automatic Face- and Gesture-Recognition, 1995. \n\n[5] J. Garding and T. Lindeberg. Direct computation of shape cues using scale-adapted spatial derivative \noperators. Int. J. of Computer Vision, 17, February 1996. \n\n[6] D.J. Heeger. Optical flow using spatiotemporal filters. Int. J. of Computer Vision, 1, 1988. \n[7] D. Jones and J. Malik. Computational framework for determining stereo correspondence from a set \nof linear spatial filters. Image and Vision Computing, 10(10), 1992. \n\n[8] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural \nComputation, 6(2), March 1994. \n\n[9] J.J. Koenderink. Operational significance of receptive field assemblies. Biol. Cybern., 58:163-171, \n1988. \n\n[10] D. Koller, J. Weber, and J. Malik. Robust multiple car tracking with occlusion reasoning. In Proc. \nThird European Conference on Computer Vision, pages 189-196, May 1994. \n\n[11] M. Lades, J.C. Vorbrueggen, J. Buhmann, J. Lange, C. von der Malsburg, and R.P. Wuertz. Distortion \ninvariant object recognition in the dynamic link architecture. In IEEE Transactions on Computers, \nvolume 42, 1993. \n\n[12] J. Malik and P. Perona. Preattentive texture discrimination with early vision mechanisms. J. Opt. Soc. \nAm. A, 7(5):923-932, 1990. \n\n[13] H. Murase and S.K. Nayar. Visual learning and recognition of 3-d objects from appearance. Int. J. 
\n\nComputer Vision, 14(1):5-24, January 1995. \n\n[14] H.A. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. In NIPS, volume 8, \n1996. \n\n[15] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, \n1991. \n\n[16] R.A. Young. The Gaussian derivative theory of spatial vision: Analysis of cortical cell receptive field \nline-weighting profiles. Technical Report GMR-4920, General Motors Research, 1985. \n\n\f", "award": [], "sourceid": 1223, "authors": [{"given_name": "Christoph", "family_name": "Bregler", "institution": null}, {"given_name": "Jitendra", "family_name": "Malik", "institution": null}]}