{"title": "Using a neural net to instantiate a deformable model", "book": "Advances in Neural Information Processing Systems", "page_first": 965, "page_last": 972, "abstract": null, "full_text": "Using a neural net to instantiate a deformable model \n\nChristopher K. I. Williams*, Michael D. Revow and Geoffrey E. Hinton \n\nDepartment of Computer Science, University of Toronto \n\nToronto, Ontario, Canada M5S 1A4 \n\nAbstract \n\nDeformable models are an attractive approach to recognizing non-rigid objects which have considerable within-class variability. However, there are severe search problems associated with fitting the models to data. We show that by using neural networks to provide better starting points, the search time can be significantly reduced. The method is demonstrated on a character recognition task. \n\nIn previous work we have developed an approach to handwritten character recognition based on the use of deformable models (Hinton, Williams and Revow, 1992a; Revow, Williams and Hinton, 1993). We have obtained good performance with this method, but a major problem is that the search procedure for fitting each model to an image is very computationally intensive, because there is no efficient algorithm (like dynamic programming) for this task. In this paper we demonstrate that it is possible to \"compile down\" some of the knowledge gained while fitting models to data to obtain better starting points that significantly reduce the search time. \n\n1 DEFORMABLE MODELS FOR DIGIT RECOGNITION \n\nThe basic idea in using deformable models for digit recognition is that each digit has a model, and a test image is classified by finding the model which is most likely to have generated it. The quality of the match between model and test image depends on the deformation of the model, the amount of ink that is attributed to noise and the distance of the remaining ink from the deformed model. 
\n\n*Current address: Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, UK. \n\nMore formally, the two important terms in assessing the fit are the prior probability distribution for the instantiation parameters of a model (which penalizes very distorted models), and the imaging model that characterizes the probability distribution over possible images given the instantiated model (footnote 1). Let $I$ be an image, $M$ be a model and $z$ be its instantiation parameters. Then the evidence for model $M$ is given by \n\n$$P(I | M) = \\int P(z | M) P(I | M, z) dz$$ (1) \n\nThe first term in the integrand is the prior on the instantiation parameters and the second is the imaging model, i.e., the likelihood of the data given the instantiated model. $P(M | I)$ is directly proportional to $P(I | M)$, as we assume a uniform prior on each digit. \n\nEquation 1 is formally correct, but if $z$ has more than a few dimensions the evaluation of this integral is very computationally intensive. However, it is often possible to make an approximation based on the assumption that the integrand is strongly peaked around a (global) maximum value $z^*$. In this case, the evidence can be approximated by the highest peak of the integrand times a volume factor $\\delta(z | I, M)$, which measures the sharpness of the peak (footnote 2). \n\n$$P(I | M) \\approx P(z^* | M) P(I | z^*, M) \\delta(z | I, M)$$ (2) \n\nBy Taylor expanding around $z^*$ to second order it can be shown that the volume factor depends on the determinant of the Hessian of $\\log P(z, I | M)$. Taking logs of equation 2, and defining $E_{def}$ as the negative log of $P(z^* | M)$ and $E_{fit}$ as the corresponding term for the imaging model, the aim of the search is to find the minimum of $E_{tot} = E_{def} + E_{fit}$. 
Of course the total energy will have many local minima; for the character recognition task we aim to find the global minimum by using a continuation method (see section 1.2). \n\n1.1 SPLINES, AFFINE TRANSFORMS AND IMAGING MODELS \n\nThis section presents a brief overview of our work on using deformable models for digit recognition. For a fuller treatment, see Revow, Williams and Hinton (1993). \n\nEach digit is modelled by a cubic B-spline whose shape is determined by the positions of the control points in the object-based frame. The models have eight control points, except for the one model which has three, and the seven model which has five. To generate an ideal example of a digit the control points are positioned at their \"home\" locations. Deformed characters are produced by perturbing the control points away from their home locations. The home locations and covariance matrix for each model were adapted in order to improve the performance. \n\nThe deformation energy only penalizes shape deformations. Affine transformations, i.e., translation, rotation, dilation, elongation, and shear, do not change the underlying shape of an object so we want the deformation energy to be invariant under them. We achieve this by giving each model its own \"object-based frame\" and computing the deformation energy relative to this frame. \n\nFootnote 1: This framework has been used by many authors, e.g. Grenander et al. (1991). \n\nFootnote 2: The Gaussian approximation has been popularized in the neural net community by MacKay (1992). \n\nThe data we used consists of binary-pixel images of segmented handwritten digits. The general flavour of an imaging model for this problem is that there should be a high probability of inked pixels close to the spline, and lower probabilities further away. 
This can be achieved by spacing out a number of Gaussian \"ink generators\" uniformly along the contour; we have found that it is also useful to have a uniform background noise process over the area of the image that is able to account for pixels that occur far away from the generators. The ink generators and background process define a mixture model. Using the assumption that each data point is generated independently given the instantiated model, $P(I | z^*, M)$ factors into the product of the probability density of each black pixel under the mixture model. \n\n1.2 RECOGNIZING ISOLATED DIGITS \n\nFor each model, the aim of the search is to find the instantiation parameters that minimize $E_{tot}$. The search starts with zero deformations and an initial guess for the affine parameters which scales the model so as to lie over the data with zero skew and rotation. A small number of generators with the same large variance are placed along the spline, forming a broad, smooth ridge of high ink-probability along the spline. We use a search procedure similar to the (iterative) Expectation Maximization (EM) method of fitting an unconstrained mixture of Gaussians, except that (i) the Gaussians are constrained to lie on the spline, (ii) there is a deformation energy term, and (iii) the affine transformation must be recalculated on each iteration. During the search the number of generators is gradually increased while their variance decreases according to a predetermined \"annealing\" schedule (footnote 3). \n\nAfter fitting all the models to a particular image, we wish to evaluate which of the models best \"explains\" the data. The natural measure is the sum of $E_{fit}$, $E_{def}$ and the volume factor. However, we have found that performance is improved by including four additional terms which are easily obtained from the final fits of the model to the image. 
These are (i) a measure which penalizes matches in which there are beads far from any inked pixels (the \"beads in white space\" problem), and (ii) the rotation, shear and elongation of the affine transform. It is hard to decide in a principled way on the correct weightings for all of these terms in the evaluation function. We estimated the weightings from the data by training a simple postprocessing neural network. These terms form the inputs, which are connected directly to the ten output units. The output units compete using the \"softmax\" function which guarantees that they form a probability distribution, summing to one. \n\n2 PREDICTING THE INSTANTIATION PARAMETERS \n\nThe search procedure described above is very time consuming. However, given many examples of images and the corresponding instantiation parameters obtained by the slow method, it is possible to train a neural network to predict the instantiation parameters of novel images. These predictions provide better starting points, so the search time can be reduced. \n\nFootnote 3: The schedule starts with 8 beads increasing to 60 beads in six steps, with the variance decreasing from 0.04 to 0.0006 (measured in the object frame). The scale is set in the object-based frame so that each model is 1 unit high. \n\n2.1 PREVIOUS WORK \n\nPrevious work on hypothesizing instantiation parameters can be placed into two broad classes, correspondence based search and parameter space search. In correspondence based search, the idea is to extract features from the image and identify corresponding features in the model. Using sufficient correspondences the instantiation parameters of the model can be determined. The problem is that simple, easily detectable image features have many possible matches, and more complex features require more computation and are more difficult to detect. 
Grimson (1990) shows how to search the space of possible correspondences using an interpretation tree. \n\nAn alternative approach, which is used in Hough transform techniques, is to work directly in parameter space. The Hough transform was originally designed for the detection of straight lines in images, and has been extended to cover a number of geometric shapes, notably conic sections. Ballard (1981) further extended the approach to arbitrary shapes with the Generalized Hough Transform. The parameter space for each model is divided into cells (\"binned\"), and then for each image feature a vote is added to each parameter space bin that could have produced that feature. After collecting votes from all image features we then search for peaks in the parameter space accumulator array, and attempt to verify pose. The Hough transform can be viewed as a crude way of approximating the logarithm of the posterior distribution $P(z | I, M)$ (e.g. Hunt et al., 1988). \n\nHowever, these two techniques have only been used on problems involving rigid models, and are not readily applicable to the digit recognition problem. For the Hough space method, binning and vote collection is impractical in the high dimensional parameter space, and for the correspondence based approach there is a lack of easily identified and highly discriminative features. The strengths of these two techniques, namely their ability to deal with arbitrary scalings, rotations and translations of the data, and their tolerance of extraneous features, are not really required for a task where the input data is fairly well segmented and normalized. \n\nOur approach is to use a neural network to predict the instantiation parameters for each model, given an input image. 
Zemel and Hinton (1991) used a similar method with simple 2-d objects, and more recently, Beymer et al. (1993) have constructed a network which maps from a face image to a 2-d parameter space spanning head rotations and a smile/no-smile dimension. However, their method does not directly map from images to instantiation parameters; they use a computer vision correspondence algorithm to determine the displacement field of pixels in a novel image relative to a reference image, and then use this field as the input to the network. This step limits the use of the approach to images that are sufficiently similar so that the correspondence algorithm functions well. \n\n2.2 INSTANTIATING DIGIT MODELS USING NEURAL NETWORKS \n\nThe network which is used to predict the model instantiation parameters is shown in figure 1. The (unthinned) binary images are normalized to give 16 x 16 8-bit greyscale images which are fed into the neural network. The network uses a standard three-layer architecture; each hidden unit computes a weighted sum of its inputs, and then feeds this value through a sigmoidal nonlinearity $\\sigma(x) = 1/(1 + e^{-x})$. The output values are a weighted linear combination of the hidden unit activities plus output biases. The targets are the locations of the control points in the normalized image, found from fitting models as described in section 1.2. \n\nFigure 1: The prediction network architecture, with one group of outputs per digit model (\"cps for 0 model\", \"cps for 1 model\", ..., \"cps for 9 model\"). \"cps\" stands for control points. \n\nThe network was trained with backpropagation to minimize the squared error, using 900 training images and 200 validation images of each digit drawn from the br set of the CEDAR CDROM 1 database of Cities, States, ZIP Codes, Digits, and Alphabetic Characters (footnote 4). 
Two test sets were used; one was obtained from data in the br dataset, and the other was the (official) bs test set. After some experimentation we chose a network with twenty hidden units, which means that the net has over 8,000 weights. With such a large number of weights it is important to regularize the solution obtained by the network by using a complexity penalty; we used a weight penalty $\\lambda \\sum_j w_j^2$ and optimized $\\lambda$ on a validation set. Targets were only set for the correct digit at the output layer; nothing was backpropagated from the other output units. The net took 440 epochs to train using the default conjugate gradient search method in the Xerion neural network simulator (footnote 5). It would be possible to construct ten separate networks to carry out the same task as the net described above, but this would intensify the danger of overfitting, which is reduced by giving the network a common pool of hidden units which it can use as it sees fit. \n\nFor comparison with the prediction net described above, a trivial network which just consisted of output biases was trained; this network simply learns the average value of the control point locations. On a validation set the squared error of the prediction net was over three times smaller than that of the trivial net. Although this is encouraging, the acid test is to compare the performance of elastic models settled from the predicted positions using a shortened annealing schedule; if the predictions are good, then only a short amount of settling will be required. \n\nFootnote 4: Made available by the United States Postal Service Office of Advanced Technology. \n\nFootnote 5: Xerion was designed and implemented by Drew van Camp, Tony Plate and Geoffrey Hinton at the University of Toronto. \n\nFigure 2: A comparison of the initial instantiations due to the prediction net (top row) and the trivial net (bottom row) on an image of a 2. Notice that for the two model the prediction net is much closer to the data. The other digit models may or may not be greatly affected by the input data; for example, the predictions from both nets seem essentially the same for the zero, but for the seven the prediction net puts the model nearer to the data. \n\nThe feedforward net predicts the position of the control points in the normalized image. By inverting the normalization process, the positions of the control points in the un-normalized image are determined. The model deformation and affine transformation corresponding to these image control point locations can then be determined by running a part of one iteration of the search procedure. Experiments were then conducted with a number of shortened annealing schedules; for each one, data obtained from settling on a part of the training data was used to train the postprocessing net. The performance was then evaluated on the br test set. \n\nThe full annealing schedule has six stages. The shortened annealing schedules are: \n\n1. No settling at all \n2. Two iterations at the final variance of 0.0006 \n3. One iteration at 0.0025 and two at 0.0006 \n4. The full annealing schedule (for comparison) \n\nThe results on the br test set are shown in table 1. The general trends are that the performance obtained using the prediction net is consistently better than the trivial net, and that longer annealing schedules lead to better performance. A comparison of schedules 3 and 4 in table 1 indicates that the performance of the prediction net/schedule 3 combination is similar to (or slightly better than) that obtained with the full annealing schedule, and is more than a factor of two faster. 
The results with the full schedule are almost identical to the results obtained with the default \"box\" initialization described in section 1.2. Figure 2 compares the outputs of the prediction and trivial nets on a particular example. Judging from the weight vectors and activity patterns of the hidden units, it does not seem that some of the units are specialized for a particular digit class. \n\nSchedule number | Trivial net | Prediction net | Average time required to settle one model (s) \n1 | 427 | 200 | 0.12 \n2 | 329 | 58 | 0.25 \n3 | 160 | 32 | 0.49 \n4 | 40 | 36 | 1.11 \n\nTable 1: Errors on the internal test set of 2000 examples for different annealing schedules. The timing trials were carried out on an R-4400 machine. \n\nA run on the bs test set using schedule 3 gave an error rate of 4.76% (129 errors), which is very similar to the 125 errors obtained using the full annealing schedule and the box initialization. A comparison of the errors made on the two runs shows that only 67 out of the 129 errors were common to the two sets. This suggests that it would be very sensible to reject cases where the two methods do not agree. \n\n3 DISCUSSION \n\nThe prediction net used above can be viewed as an interpolation scheme in the control point position space of each digit, $z(I) = z_0 + \\sum_i a_i(I) z_i$, where $z(I)$ is the predicted position in the control point space, $z_0$ is the contribution due to the biases, $a_i$ is the activity of hidden unit $i$ and $z_i$ is its location in the control point position space (learned from the data). If there are more hidden units than output dimensions, then for any particular image there are an infinite number of ways to make this equation hold exactly. However, the network will tend to find solutions so that the $a_i(I)$'s will vary smoothly as the image is perturbed. 
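As a rough illustration of the interpolation view above, the following Python sketch computes z(I) = z_0 + sum_i a_i(I) z_i for a toy network. It is a minimal sketch under stated assumptions, not the network used in the experiments: the function names, the tiny layer sizes and all weight values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def sigmoid(x):
    # Logistic nonlinearity 1 / (1 + exp(-x)) applied by each hidden unit.
    return 1.0 / (1.0 + math.exp(-x))


def hidden_activities(image, weights, biases):
    # a_i(I): one activity per hidden unit, computed from the flattened image.
    return [
        sigmoid(sum(w * p for w, p in zip(row, image)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_control_points(image, weights, biases, z0, z_units):
    # z(I) = z0 + sum_i a_i(I) * z_i, where z_units[i] is the output weight
    # vector of hidden unit i (its location in control point position space).
    activities = hidden_activities(image, weights, biases)
    z = list(z0)
    for a_i, z_i in zip(activities, z_units):
        for k, component in enumerate(z_i):
            z[k] += a_i * component
    return z


# Hypothetical toy sizes: 4 pixels, 2 hidden units, 2 output coordinates.
image = [0.0, 1.0, 1.0, 0.0]
weights = [[0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0]]
biases = [0.0, 0.0]
z0 = [0.5, 0.5]
z_units = [[0.2, 0.0], [0.0, -0.4]]

prediction = predict_control_points(image, weights, biases, z0, z_units)
print(prediction)
```

In this toy case both hidden units receive a net input of zero, so both activities are 0.5 and the prediction is the bias vector shifted halfway along each unit's output weight vector. Because the activities are smooth functions of the pixels, a small change to the image moves the prediction by a small amount.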
\n\nThe nets described above output just one set of instantiation parameters for a given model. However, it may be preferable to be able to represent a number of guesses about model instantiation parameters; one way of doing this is to train a network that has multiple sets of output parameters, as in the \"mixture of experts\" architecture of Jacobs et al. (1991). The outputs can be interpreted as a mixture distribution in the control point position space, conditioned on the input image. Another approach to providing more information about the posterior distribution is described in (Hinton, Williams and Revow, 1992b), where $P(z | I)$ is approximated using a fixed set of basis functions whose weighting depends on the input image $I$. \n\nThe strategies described above directly predict the instantiation parameters in parameter space. It is also possible to use neural networks to hypothesize correspondences, i.e. to predict an inked pixel's position on the spline given a local window of context in the image. With sufficient matches it is then possible to compute the instantiation parameters of the model. We have conducted some preliminary experiments with this method (described in Williams, 1994), which indicate that good performance can be achieved for the correspondence prediction task. \n\nWe have shown that we can obtain significant speedup using the prediction net. The schemes outlined above which allow multimodal predictions in instantiation parameter space may improve performance and deserve further investigation. We are also interested in improving the performance of the prediction net, for example by outputting a confidence measure which could be used to adjust the length of the elastic models' search appropriately. 
We believe that using machine learning techniques like neural networks to help reduce the amount of search required to fit complex models to data may be useful for many other problems. \n\nAcknowledgements \n\nThis research was funded by Apple and by the Ontario Information Technology Research Centre. We thank Allan Jepson, Richard Durbin, Rich Zemel, Peter Dayan, Rob Tibshirani and Yann Le Cun for helpful discussions. Geoffrey Hinton is the Noranda Fellow of the Canadian Institute for Advanced Research. \n\nReferences \n\nBallard, D. H. (1981). Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111-122. \n\nBeymer, D., Shashua, A., and Poggio, T. (1993). Example Based Image Analysis and Synthesis. AI Memo 1431, AI Laboratory, MIT. \n\nGrenander, U., Chow, Y., and Keenan, D. M. (1991). Hands: A pattern theoretic study of biological shapes. Springer-Verlag. \n\nGrimson, W. E. L. (1990). Object recognition by computer. MIT Press, Cambridge, MA. \n\nHinton, G. E., Williams, C. K. I., and Revow, M. D. (1992a). Adaptive elastic models for hand-printed character recognition. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann. \n\nHinton, G. E., Williams, C. K. I., and Revow, M. D. (1992b). Combining two methods of recognizing hand-printed digits. In Aleksander, I. and Taylor, J., editors, Artificial Neural Networks 2. Elsevier Science Publishers. \n\nHunt, D. J., Nolte, L. W., and Ruedger, W. H. (1988). Performance of the Hough Transform and its Relationship to Statistical Signal Detection Theory. Computer Vision, Graphics and Image Processing, 43:221-238. \n\nJacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1). \n\nMacKay, D. J. C. (1992). Bayesian Interpolation. Neural Computation, 4(3):415-447. \n\nRevow, M. 
D., Williams, C. K. I., and Hinton, G. E. (1993). Using mixtures of deformable models to capture variations in hand printed digits. In Srihari, S., editor, Proceedings of the Third International Workshop on Frontiers in Handwriting Recognition, pages 142-152, Buffalo, New York, USA. \n\nWilliams, C. K. I. (1994). Combining deformable models and neural networks for handprinted digit recognition. PhD thesis, Dept. of Computer Science, University of Toronto. \n\nZemel, R. S. and Hinton, G. E. (1991). Discovering viewpoint-invariant relationships that characterize objects. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 299-305. Morgan Kaufmann Publishers.", "award": [], "sourceid": 1002, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}, {"given_name": "Michael", "family_name": "Revow", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}