{"title": "Recognizing Hand-written Digits Using Hierarchical Products of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 953, "page_last": 959, "abstract": null, "full_text": "Recognizing Hand-written Digits Using \n\nHierarchical Products of Experts \n\nGuy Mayraz & Geoffrey E. Hinton \n\nGatsby Computational Neuroscience Unit \n\nUniversity College London \n\n17 Queen Square, London WCIN 3AR, u.K. \n\nAbstract \n\nThe product of experts learning procedure [1] can discover a set of \nstochastic binary features that constitute a non-linear generative model of \nhandwritten images of digits. The quality of generative models learned \nin this way can be assessed by learning a separate model for each class of \ndigit and then comparing the unnormalized probabilities of test images \nunder the 10 different class-specific models. To improve discriminative \nperformance, it is helpful to learn a hierarchy of separate models for each \ndigit class. Each model in the hierarchy has one layer of hidden units and \nthe nth level model is trained on data that consists of the activities of the \nhidden units in the already trained (n -\nl)th level model. After train(cid:173)\ning, each level produces a separate, unnormalized log probabilty score. \nWith a three-level hierarchy for each of the 10 digit classes, a test image \nproduces 30 scores which can be used as inputs to a supervised, logis(cid:173)\ntic classification network that is trained on separate data. On the MNIST \ndatabase, our system is comparable with current state-of-the-art discrimi(cid:173)\nnative methods, demonstrating that the product of experts learning proce(cid:173)\ndure can produce effective generative models of high-dimensional data. \n\n1 Learning products of stochastic binary experts \n\nHinton [1] describes a learning algorithm for probabilistic generative models that are com(cid:173)\nposed of a number of experts. 
Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing: \n\np(d | θ_1, ..., θ_n) = Π_m p_m(d | θ_m) / Σ_c Π_m p_m(c | θ_m)   (1) \n\nwhere d is a data vector in a discrete space, θ_m is all the parameters of individual model m, p_m(d | θ_m) is the probability of d under model m, and c is an index over all possible vectors in the data space. \n\nA Restricted Boltzmann machine [2, 3] is a special case of a product of experts in which each expert is a single, binary stochastic hidden unit that has symmetrical connections to a set of visible units, and connections between the hidden units are forbidden. Inference in an RBM is much easier than in a general Boltzmann machine, and it is also much easier than in a causal belief net because there is no explaining away. There is therefore no need to perform any iteration to determine the activities of the hidden units. The hidden states, s_j, are conditionally independent given the visible states, s_i, and the distribution of s_j is given by the standard logistic function: \n\np(s_j = 1) = 1 / (1 + exp(-Σ_i w_ij s_i))   (2) \n\nConversely, the hidden states of an RBM are marginally dependent, so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence. \n\nAn RBM can be trained using the standard Boltzmann machine learning algorithm, which follows a noisy but unbiased estimate of the gradient of the log likelihood of the data. One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel. 
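Concretely, one such pair of parallel updates can be sketched in NumPy (a minimal illustration, not the paper's code; the 784/500 sizes are taken from the experiments reported later, while the weight scale and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, rng):
    # Eq. 2: p(s_j = 1) = logistic(sum_i w_ij s_i); then pick binary states
    p_h = logistic(v @ W)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_visible(h, W, rng):
    # the connections are symmetric, so the visible units reuse the same w_ij
    p_v = logistic(h @ W.T)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)

# one alternating update: data -> hidden -> reconstructed visible
n_vis, n_hid = 784, 500
W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))  # small zero-mean init
v0 = rng.random(n_vis)                          # pixel intensities in [0, 1]
p_h0, h0 = sample_hidden(v0, W, rng)
p_v1, v1 = sample_visible(h0, W, rng)
```

Running this step repeatedly, alternating between the two layers, is the alternating Gibbs sampling described in the text.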
Each update picks a binary state for a unit from its posterior distribution given the current states of all the units in the other set. If this alternating Gibbs sampling is run to equilibrium, there is a very simple way to update the weights so as to minimize the Kullback-Leibler divergence, Q^0 || Q^∞, between the data distribution, Q^0, and the equilibrium distribution of fantasies over the visible units, Q^∞, produced by the RBM [4]: \n\nΔw_ij ∝ <s_i s_j>_{Q^0} - <s_i s_j>_{Q^∞}   (3) \n\nwhere <s_i s_j>_{Q^0} is the expected value of s_i s_j when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and <s_i s_j>_{Q^∞} is the expected value of s_i s_j after prolonged Gibbs sampling. \n\nThis learning rule does not work well because it can take a long time to approach thermal equilibrium and the sampling noise in the estimate of <s_i s_j>_{Q^∞} can swamp the gradient. [1] shows that it is far more effective to minimize the difference between Q^0 || Q^∞ and Q^1 || Q^∞, where Q^1 is the distribution of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The exact gradient of this \"contrastive divergence\" is complicated because the distribution Q^1 depends on the weights, but [1] shows that this dependence can safely be ignored to yield a simple and effective learning rule for following the approximate gradient of the contrastive divergence: \n\nΔw_ij ∝ <s_i s_j>_{Q^0} - <s_i s_j>_{Q^1}   (4) \n\nFor images of digits, it is possible to apply Eq. 4 directly if we use stochastic binary pixel intensities, but it is more effective to normalize the intensities to lie in the range [0, 1] and then to use these real values as the inputs to the hidden units. During reconstruction, the stochastic binary pixel intensities required by Eq. 4 are also replaced by real-valued probabilities. Finally, the learning rule can be made less noisy by replacing the stochastic binary activities of the hidden units by their expected values. So the learning rule we actually use is: \n\nΔw_ij ∝ <p_i p_j>_{Q^0} - <p_i p_j>_{Q^1}   (5) \n\nStochastically chosen binary states of the hidden units are still used for computing the probabilities of the reconstructed pixels. This prevents each real-valued hidden probability from conveying more than 1 bit of information to the reconstruction. \n\n2 The MNIST database \n\nMNIST, a standard database for testing digit recognition algorithms, is available at http://www.research.att.com/~yann/ocr/mnist/index.html. MNIST has 60,000 training images and 10,000 test images. Images are highly variable in style but are size-normalized and translated so that the center of gravity of their intensity lies at the center of a fixed-size image of 28 by 28 pixels. \n\nMETHOD                                                      % ERRORS \nLinear classifier (1-layer NN)                              12.0 \nK-nearest-neighbors, Euclidean                              5.0 \n1000 RBF + linear classifier                                3.6 \nBest Back-Prop: 3-layer NN, 500+150 hidden units            2.95 \nReduced Set SVM deg 5 polynomial                            1.0 \nLeNet-1 [with 16x16 input]                                  1.7 \nLeNet-5                                                     0.95 \nProduct of Experts (separate 3-layer net for each model)    1.7 \n\nTable 1: Performance of various learning methods on the MNIST test set. \n\nA number of well-known learning algorithms have been run on the MNIST database [5], so it is easy to assess the relative performance of a novel algorithm. Some of the experiments in [5] included deskewing images or augmenting the training set with distorted versions of the original images. We did not use deskewing or distortions in our main experiments, so we only compare our results with other methods that did not use them. The results in Table 1 should be treated with caution. 
Some attempts to replicate the degree 5 polynomial SVM have produced slightly higher error rates of 1.4% [6], and standard backpropagation can be carefully tuned to achieve under 2% (John Platt, personal communication). \n\nTable 1 shows that it is possible to achieve a result that is comparable with the best discriminative techniques by using multiple PoE models of each digit class to extract scores that represent unnormalized log probabilities. These scores are then used as the inputs to a simple logistic classifier. The fact that a system based on generative models can come close to the very best discriminative systems suggests that the generative models are doing a good job of capturing the distributions. \n\n3 Training the individual PoE models \n\nThe MNIST database contains an average of 6,000 training examples per digit, but these examples are unevenly distributed among the digit classes. In order to simplify the research, we produced a balanced database by using only 5,400 examples of each digit. The first 4,400 examples were the unsupervised training set used for training the individual PoE models. The remaining examples of each of the 10 digits constituted the supervised training set used for training the logistic classification net that converts the scores of all the PoE models into a classification. \n\nThe original intensity range in the MNIST images was 0 to 255. This was normalized to the range 0 to 1 so that we could treat intensities as probabilities. The normalized pixel intensities were used as the initial activities of the 784 visible units corresponding to the 28 by 28 pixels. The visible units were fully connected to a single layer of hidden units. The weights between the input and hidden layer were initialized to small, zero-mean, Gaussian-distributed random values. The 4,400 training examples were divided into 44 minibatches. 
\nOne epoch of learning consisted of a pass through all 44 minibatches in fixed order, with the weights being updated after each minibatch. \n\nFigure 1: The areas of the blobs show the mean goodness of validation set digits using only the first-level models with 500 hidden units (white is positive). A different constant is added to all the goodness scores of each model so that rows sum to zero. Successful discrimination depends on models being better on their own class than other models are. The converse is not true: models can be better at reconstructing other, easier classes of digits than their own class. \n\nFigure 2: Cross reconstruction of 7s and 9s with models containing 25 hidden units (top) and 100 hidden units (bottom). The central horizontal line in each block contains originals, and the lines above and below are reconstructions by the 7s and 9s models respectively. Both models produce stereotyped digits in the small net and much better reconstructions in the large one for both digit classes. The 9s model sometimes tries to close the loop in 7s, and the 7s model tries to open the loop in 9s. \n\nWe used a momentum method with a small amount of weight decay, so the change in a weight after the t-th minibatch was: \n\nΔw_ij^t = μ Δw_ij^{t-1} + 0.1 (<p_i p_j>_{Q_t^0} - <p_i p_j>_{Q_t^1} - 0.0001 w_ij^t)   (6) \n\nwhere Q_t^0 and Q_t^1 are averages over the data or the one-step reconstructions for minibatch t, and the momentum, μ, was 0 for the first 50 weight changes and 0.9 thereafter. The hidden and visible biases, b_i and b_j, were initialized to zero. 
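In code, the weight change of Eq. 6 for one minibatch might look like this (a minimal NumPy sketch; the function name and array layout are our own, while the 0.1 learning rate and 0.0001 weight-decay coefficient are the values quoted above):

```python
import numpy as np

def cd1_weight_update(W, dW_prev, v0, p_h0, v1, p_h1, mu):
    """One application of Eq. 6: momentum term, contrastive-divergence
    gradient estimated from a minibatch, and a small weight-decay term.
    v0, p_h0 are the data-phase pixel and hidden probabilities; v1, p_h1
    come from the one-step reconstructions.  Rows index training cases."""
    n_cases = v0.shape[0]
    corr_data = v0.T @ p_h0 / n_cases    # <p_i p_j> under Q_t^0
    corr_recon = v1.T @ p_h1 / n_cases   # <p_i p_j> under Q_t^1
    dW = mu * dW_prev + 0.1 * (corr_data - corr_recon - 0.0001 * W)
    return W + dW, dW
```

The returned dW would be fed back as dW_prev on the next minibatch, with mu switched from 0 to 0.9 after the first 50 weight changes as described above.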
Their values were similarly altered (by treating them like connections to a unit that was always on) but with no weight decay. \n\nRather than picking one particular number of hidden units, we trained networks with various numbers of units and then used discriminative performance on the validation set to decide on the most effective number of hidden units. The largest network was the best, even though each digit model contains 392,500 parameters trained on only 4,400 images. The receptive fields learned by the hidden units are quite local. Since the hidden units are fully connected and have random initial weights, the learning procedure must infer the spatial proximity of pixels from the statistics of their joint activities. Figure 1 shows the mean goodness scores of all 10 models on all 10 digit classes. \n\nFigure 2 shows reconstructions produced by the bottom-level models on previously unseen data from the digit class they were trained on and also on data from a different digit class. With 500 hidden units, the 7s model is almost perfect at reconstructing 9s. This is because a model gets better at reconstructing more or less any image as its set of available features becomes more varied and more local. Despite this, the larger networks give better discriminative information. \n\n3.1 Multi-layer models \n\nNetworks that use a single layer of hidden units and do not allow connections within a layer have some major advantages over more general networks. With an image clamped on the visible units, the hidden units are conditionally independent, so it is possible to compute an unbiased sample of the binary states of the hidden units without any iteration. This property makes PoEs easy to train, and it is lost in more general architectures. 
If, for example, we introduce a second hidden layer that is symmetrically connected to the first hidden layer, it is no longer straightforward to compute the posterior expected activity of a unit in the first hidden layer when given an image that is assumed to have been generated by the multilayer model at thermal equilibrium. The posterior distribution can be computed by alternating Gibbs sampling between the two hidden layers, but this is slow and noisy. \n\nFortunately, if our ultimate goal is discrimination, there is a computationally convenient alternative to using a multilayer Boltzmann machine. Having trained a one-hidden-layer PoE on a set of images, it is easy to compute the expected activities of the hidden units on each image in the training set. These hidden activity vectors will themselves have interesting statistical structure because a PoE is not attempting to find independent causes and has no implicit penalty for using hidden units that are marginally highly correlated. So we can learn a completely separate PoE model in which the activity vectors of the hidden units are treated as the observed data and a new layer of hidden units learns to model the structure of this \"data\". It is not entirely clear how this second-level PoE model helps as a way of modelling the original image distribution, but it is clear that if a first-level PoE is trained on images of 2s, we would expect the vectors of hidden activities to be very different when it is presented with a 3, even if the features it has learned are quite good at reconstructing the 3. So a second-level model should be able to assign high scores to the vectors of hidden activities that are typical of the 2 model when it is given images of 2s, and low scores to the hidden activities of the 2 model when it is given images that contain combinations of features that are not normally present at the same time in a 2. 
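The greedy level-by-level scheme described above can be sketched as follows; here `train_rbm` is a hypothetical stand-in for any contrastive-divergence trainer that maps (data, number of hidden units) to a weight matrix, and is not a function from the paper:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_hierarchy(images, layer_sizes, train_rbm):
    """Train level n on the expected hidden activities of level n-1.
    `train_rbm` is a stand-in CD-1 trainer: (data, n_hidden) -> weights."""
    data, weights = images, []
    for n_hidden in layer_sizes:
        W = train_rbm(data, n_hidden)
        weights.append(W)
        # the hidden activity vectors become the "data" for the next level
        data = logistic(data @ W)
    return weights
```

With layer_sizes such as [500, 500, 500] this yields the three frozen levels whose goodness scores are used for classification below.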
\n\nWe used a three-level hierarchy of PoEs for each digit class. The levels were trained sequentially and, to simplify the research, we always used the same number of hidden units at each level. We trained models of five different sizes with 25, 100, 200, 400, and 500 hidden units per level. \n\n4 The logistic classification network \n\nAn attractive aspect of PoEs is that it is easy to compute the numerator in Eq. 1, so it is easy to compute a goodness score which is equal to the log probability of a data vector up to an additive constant. Figure 3 shows the goodness of the 7s and 9s models (the most difficult pair of digits to discriminate) when presented with test images of both 7s and 9s. It can be seen that a line can be drawn that separates the two digit sets almost perfectly. It is also encouraging that all of the errors are close to the decision boundary, so there are no confident misclassifications. \n\nThe classification network had 10 output units, each of which computed a logit, x, that was a linear function of the goodness scores, g, of the various PoE models, m, on an image, c. The probability assigned to class j was then computed by taking a \"softmax\" of the logits: \n\np_j^c = exp(x_j^c) / Σ_k exp(x_k^c),   x_j^c = b_j + Σ_m g_m^c w_mj   (7) \n\nThere were 10 digit classes, each with a three-level hierarchy of PoE models, so the classification network had 30 inputs and therefore 300 weights and 10 output biases. Both weights and biases were initialized to zero. The weights were learned by a momentum version of \n\nFigure 3: Validation set cross goodness results of (a) the first-level model and (b) the third-level model of 7s and 9s. All models have 500 hidden units. 
The third-level models clearly give higher goodness scores for second-level hidden activities in their own hierarchy than for the hidden activities in the other hierarchy. \n\ngradient ascent in the log probability assigned to the correct class. Since there were only 310 weights to train, little effort was devoted to making the learning efficient. \n\nΔw_mj(t) = μ Δw_mj(t-1) + 0.0002 Σ_c g_m^c (t_j^c - p_j^c)   (8) \n\nwhere t_j^c is 1 if class j is the correct answer for training case c and 0 otherwise. The momentum, μ, was 0.9. The biases were treated as if they were weights from an input that always had a value of 1 and were learned in exactly the same way. \n\nIn each training epoch the weight changes were averaged over the whole supervised training set¹. We used separate data for training the classification network because we expect the goodness score produced by a PoE of a given class to be worse and more variable on exemplars of that class that were not used to train the PoE, and it is these poor and noisy scores that are relevant for the real, unseen test data. \n\nThe training algorithm was run using goodness scores from PoE networks with different numbers of hidden units. The results in Table 2 show a consistent improvement in classification error as the number of units in the hidden layers of each PoE increases. There is no evidence of over-fitting, even though large PoEs are very good at reconstructing images of other digit classes or the hidden activity vectors of lower-level models in other hierarchies. It is possible to reduce the error rate by a further 0.1% by averaging together the goodness scores of corresponding levels of model hierarchies with 100 or more units per layer, but this model averaging is not nearly as effective as using extra levels. \n\n5 Model-based normalization \n\nThe results of our current system are still not nearly as good as human performance. 
In particular, it appears the network has only a very limited understanding of image invariances. This is not surprising since it is trained on prenormalized data. Dealing with image invariances better will be essential for approaching human performance. \n\n¹We held back part of the supervised training set to use as a validation set in determining the optimal number of epochs to train the classification net, but once this was decided we retrained on all the supervised training data for that number of epochs. \n\nNetwork size   Learning epochs   % Errors \n25             25                3.8 \n100            100               2.3 \n200            200               2.2 \n400            200               2.0 \n500            500               1.7 \n\nTable 2: MNIST test set error rate as a function of the number of hidden units per level. There is no evidence of overfitting even when over 250,000 parameters are trained on only 4,400 examples. \n\nThe fact that we are using generative models suggests an interesting way of refining the image normalization. If the normalization of an image is slightly wrong, we would expect it to have lower probability under the correct class-specific model. So we should be able to use the gradient of the goodness score to iteratively adjust the normalization so that the data fits the model better. Using x translation as an example, \n\n∂G/∂x = Σ_i (∂s_i/∂x)(∂G/∂s_i),   ∂G/∂s_i = b_i + Σ_j s_j w_ji \n\nwhere s_i is the intensity of pixel i, ∂s_i/∂x is easily computed from the intensities of the left and right neighbors of pixel i, and ∂G/∂s_i is just the top-down input to a pixel during reconstruction. Preliminary simulations by Yee Whye Teh on poorly normalized data show that this type of model-based renormalization improves the score of the correct model much more than the scores of the incorrect ones and thus eliminates most of the classification errors. 
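Under the assumptions stated above (a 28x28 intensity array, top-down inputs to the pixels available from reconstruction, and a central-difference approximation to the horizontal derivative of each pixel from its left and right neighbors), the translation gradient could be sketched as follows; the function name and array names are our own illustrative choices:

```python
import numpy as np

def goodness_gradient_wrt_x(image, top_down):
    """dG/dx = sum_i (ds_i/dx)(dG/ds_i).  `image` is a 28x28 array of
    pixel intensities s_i; `top_down` holds dG/ds_i, the top-down input
    to each pixel during reconstruction.  ds_i/dx is approximated by a
    central difference over each pixel's left and right neighbors."""
    ds_dx = np.zeros_like(image)
    ds_dx[:, 1:-1] = (image[:, 2:] - image[:, :-2]) / 2.0
    return float((ds_dx * top_down).sum())

# a small gradient-ascent step on the horizontal shift could then be:
# x_shift += step * goodness_gradient_wrt_x(image, top_down)
```

Analogous gradients for vertical translation, scale, or rotation would follow the same pattern with a different ds_i/d(parameter) term.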
\n\nAcknowledgments \n\nWe thank Yann Le Cun, Mike Revow and members of the Gatsby Unit for helpful discussions. This research was funded by the Gatsby Charitable Foundation. \n\nReferences \n\n[1] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004, Gatsby Computational Neuroscience Unit, University College London, 2000. \n\n[2] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press, 1986. \n\n[3] Yoav Freund and David Haussler. Unsupervised learning of distributions of binary vectors using 2-layer networks. In John E. Moody, Steve J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 912-919. Morgan Kaufmann Publishers, Inc., 1992. \n\n[4] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press, 1986. \n\n[5] Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition. In F. Fogelman and P. Gallinari, editors, International Conference on Artificial Neural Networks, pages 53-60, Paris, 1995. EC2 & Cie. \n\n[6] Chris J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector machines. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 375. The MIT Press, 1997. 
\n\n\f", "award": [], "sourceid": 1807, "authors": [{"given_name": "Guy", "family_name": "Mayraz", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}