{"title": "3D Object Recognition with Deep Belief Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 1339, "page_last": 1347, "abstract": "We introduce a new type of Deep Belief Net and evaluate it on a 3D object recognition task. The top-level model is a third-order Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database(normalized-uniform version), which contains stereo-pair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5% error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has built-in knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semi-supervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2% error, making it the current best result for NORB.", "full_text": "3D Object Recognition with Deep Belief Nets\n\nVinod Nair and Geo\ufb00rey E. Hinton\n\nDepartment of Computer Science, University of Toronto\n\n10 King\u2019s College Road, Toronto, M5S 3G5 Canada\n\n{vnair,hinton}@cs.toronto.edu\n\nAbstract\n\nWe introduce a new type of top-level model for Deep Belief Nets and evalu-\nate it on a 3D object recognition task. The top-level model is a third-order\nBoltzmann machine, trained using a hybrid algorithm that combines both\ngenerative and discriminative gradients. Performance is evaluated on the\nNORB database (normalized-uniform version), which contains stereo-pair\nimages of objects under di\ufb00erent lighting conditions and viewpoints. 
Our\nmodel achieves 6.5% error on the test set, which is close to the best pub-\nlished result for NORB (5.9%) using a convolutional neural net that has\nbuilt-in knowledge of translation invariance. It substantially outperforms\nshallow models such as SVMs (11.6%). DBNs are especially suited for\nsemi-supervised learning, and to demonstrate this we consider a modi\ufb01ed\nversion of the NORB recognition task in which additional unlabeled images\nare created by applying small translations to the images in the database.\nWith the extra unlabeled data (and the same amount of labeled data as\nbefore), our model achieves 5.2% error.\n\n1 Introduction\n\nRecent work on deep belief nets (DBNs) [10], [13] has shown that it is possible to learn\nmultiple layers of non-linear features that are useful for object classi\ufb01cation without requir-\ning labeled data. The features are trained one layer at a time as a restricted Boltzmann\nmachine (RBM) using contrastive divergence (CD) [4], or as some form of autoencoder [20],\n[16], and the feature activations learned by one module become the data for training the\nnext module. After a pre-training phase that learns layers of features which are good at\nmodeling the statistical structure in a set of unlabeled images, supervised backpropagation\ncan be used to \ufb01ne-tune the features for classi\ufb01cation [7]. Alternatively, classi\ufb01cation can\nbe performed by learning a top layer of features that models the joint density of the class\nlabels and the highest layer of unsupervised features [6]. These unsupervised features (plus\nthe class labels) then become the penultimate layer of the deep belief net [6].\n\nEarly work on deep belief nets was evaluated using the MNIST dataset of handwritten digits\n[6] which has the advantage that a few million parameters are adequate for modeling most of\nthe structure in the domain. 
For 3D object classification, however, many more parameters are probably required to allow a deep belief net with no prior knowledge of spatial structure to capture all of the variations caused by lighting and viewpoint. It is not yet clear how well deep belief nets perform at 3D object classification when compared with shallow techniques such as SVMs [19], [3] or deep discriminative techniques like convolutional neural networks [11].\n\nIn this paper, we describe a better type of top-level model for deep belief nets that is trained using a combination of generative and discriminative gradients [5], [8], [9]. We evaluate the model on NORB [12], which is a carefully designed object recognition task that requires generalization to novel object instances under varying lighting conditions and viewpoints. Our model significantly outperforms SVMs, and it also outperforms convolutional neural nets when given additional unlabeled data produced by small translations of the training images.\n\nFigure 1: The Third-Order Restricted Boltzmann Machine. (a) Every clique in the model contains a visible unit, hidden unit, and label unit. (b) Our shorthand notation for representing the clique in (a). (c) A model with two of each unit type. There is one clique for every possible triplet of units created by selecting one of each type. The \u201crestricted\u201d architecture precludes cliques with multiple units of the same type. (d) Our shorthand notation for representing the model in (c). (e) The 3D tensor of parameters for the model in (c). The architecture is the same as that of an implicit mixture of RBMs [14], but the inference and learning algorithms have changed.\n\n
We use restricted Boltzmann machines trained with one-step contrastive divergence as our basic module for learning layers of features. These are fully described elsewhere [6], [1] and the reader is referred to those sources for details.\n\n2 A Third-Order RBM as the Top-Level Model\n\nUntil now, the only top-level model that has been considered for a DBN is an RBM with two types of observed units (one for the label, another for the penultimate feature vector). We now consider an alternative model for the top-level joint distribution in which the class label multiplicatively interacts with both the penultimate layer units and the hidden units to determine the energy of a full configuration. It is a Boltzmann machine with three-way cliques [17], each containing a penultimate layer unit vi, a hidden unit hj, and a label unit lk. See figure 1 for a summary of the architecture. Note that the parameters now form a 3D tensor, instead of a matrix as in the earlier, bipartite model.\n\nConsider the case where the components of v and h are stochastic binary units, and l is a discrete variable with K states represented by 1-of-K encoding. The model can be defined in terms of its energy function\n\nE(v, h, l) = \u2212 \u03a3i,j,k Wijk vi hj lk,  (1)\n\nwhere Wijk is a learnable scalar parameter. (We omit bias terms from all expressions for clarity.) The probability of a full configuration {v, h, l} is then\n\nP (v, h, l) = exp(\u2212E(v, h, l)) / Z,  (2)\n\nwhere Z = \u03a3v\u2032,h\u2032,l\u2032 exp(\u2212E(v\u2032, h\u2032, l\u2032)) is the partition function. Marginalizing over h gives the distribution over v and l alone.\n\nThe main difference between the new top-level model and the earlier one is that now the class label multiplicatively modulates how the visible and hidden units contribute to the energy of a full configuration. 
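As a concrete sketch of equation 1 (numpy assumed; the array sizes and random states are illustrative, not from the paper), the energy is a three-way tensor contraction, and clamping the label to a single class picks out one slice of the tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
Nv, Nh, Nl = 4, 3, 2
W = rng.standard_normal((Nv, Nh, Nl))      # parameter tensor W_ijk
v = rng.integers(0, 2, Nv).astype(float)   # binary visible units
h = rng.integers(0, 2, Nh).astype(float)   # binary hidden units
l = np.eye(Nl)[0]                          # 1-of-K label with l_0 = 1

# E(v, h, l) = -sum_{i,j,k} W_ijk v_i h_j l_k   (biases omitted, as in the text)
E = -np.einsum('ijk,i,j,k->', W, v, h, l)

# With l_k = 1, the energy reduces to that of the k-th slice RBM:
E_slice = -(v @ W[:, :, 0] @ h)
assert np.isclose(E, E_slice)
```

The slice property in the last two lines is what makes inference in this model tractable.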
If the label\u2019s kth unit is 1 (and the rest are 0), then the kth\nslice of the tensor determines the energy function. In the case of soft activations (i.e. more\nthan one label has non-zero probability), a weighted blend of the tensor\u2019s slices speci\ufb01es\nthe energy function. The earlier top-level (RBM) model limits the label\u2019s e\ufb00ect to changing\nthe biases into the hidden units, which modi\ufb01es only how the hidden units contribute to\nthe energy of a full con\ufb01guration. There is no direct interaction between the label and the\nvisible units.\nIntroducing direct interactions among all three sets of variables allows the\nmodel to learn features that are dedicated to each class. This is a useful property when the\nobject classes have substantially di\ufb00erent appearances that require very di\ufb00erent features\nto describe. Unlike an RBM, the model structure is not bipartite, but it is still \u201crestricted\u201d\nin the sense that there are no direct connections between two units of the same type.\n\n2.1\n\nInference\n\nThe distributions that we would like to be able to infer are P (l|v) (to classify an input), and\nP (v, l|h) and P (h|v, l) (for CD learning). Fortunately, all three distributions are tractable\nto sample from exactly. The simplest case is P (h|v, l). Once l is observed, the model\nreduces to an RBM whose parameters are the kth slice of the 3D parameter tensor. As a\nresult P (h|v, l) is a factorized distribution that can be sampled exactly.\n\nFor a restricted third-order model with Nv visible units, Nh hidden units and Nl class labels,\nthe distribution P (l|v) can be exactly computed in O(NvNhNl) time. 
This result follows from two observations: 1) setting lk = 1 reduces the model to an RBM defined by the kth slice of the tensor, and 2) the negative log probability of v, up to an additive constant, under this RBM is the free energy:\n\nFk(v) = \u2212 \u03a3j=1..Nh log(1 + exp(\u03a3i=1..Nv Wijk vi)).  (3)\n\nThe idea is to first compute Fk(v) for each setting of the label, and then convert them to a discrete distribution by taking the softmax of the negative free energies:\n\nP (lk = 1|v) = exp(\u2212Fk(v)) / \u03a3k\u2032=1..Nl exp(\u2212Fk\u2032(v)).  (4)\n\nEquation 3 requires O(NvNh) computation, which is repeated Nl times for a total of O(NvNhNl) computation.\n\nWe can use the same method to compute P (l|h). Simply switch the role of v and h in equation 3 to compute the free energy of h under the kth RBM. (This is possible since the model is symmetric with respect to v and h.) Then convert the resulting Nl free energies to the probabilities P (lk = 1|h) with the softmax function.\n\nNow it becomes possible to exactly sample P (v, l|h) by first sampling \u02dcl \u223c P (l|h). Suppose \u02dclk = 1. Then the model reduces to its kth-slice RBM from which \u02dcv \u223c P (v|h, \u02dclk = 1) can be easily sampled. The final result {\u02dcv, \u02dcl} is an unbiased sample from P (v, l|h).\n\n2.2 Learning\n\nGiven a set of N labeled training cases {(v1, l1), (v2, l2), ..., (vN , lN )}, we want to learn the 3D parameter tensor W for the restricted third-order model. When trained as the top-level model of a DBN, the visible vector v is a penultimate layer feature vector. We can also train the model directly on images as a shallow model, in which case v is an image (in row vector form). In both cases the label l represents the Nl object categories using 1-of-Nl encoding. For the same reasons as in the case of an RBM, maximum likelihood learning is intractable here as well, so we rely on Contrastive Divergence learning instead. 
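Before turning to learning, the classification rule of equations 3 and 4 can be sketched in a few lines (numpy assumed; bias terms are omitted as in the text, and subtracting the minimum free energy is just the usual softmax stabilization):

```python
import numpy as np

def class_posterior(W, v):
    """P(l_k = 1 | v) for the third-order model (equations 3 and 4).

    W has shape (Nv, Nh, Nl); v has shape (Nv,).
    """
    x = np.einsum('ijk,i->jk', W, v)        # input to each hidden unit under each slice RBM
    F = -np.logaddexp(0.0, x).sum(axis=0)   # free energies F_k(v), O(Nv*Nh) per class
    p = np.exp(-(F - F.min()))              # softmax of the negative free energies
    return p / p.sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4, 3))
v = rng.integers(0, 2, 6).astype(float)
p = class_posterior(W, v)
assert np.isclose(p.sum(), 1.0) and p.shape == (3,)
```

The single einsum plus the column sums account for the O(NvNhNl) cost stated in the text.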
CD was originally formulated in the context of the RBM and its bipartite architecture, but here we extend it to the non-bipartite architecture of the third-order model.\n\nAn unbiased estimate of the maximum likelihood gradient can be computed by running a Markov chain that alternately samples P (h|v, l) and P (v, l|h) until it reaches equilibrium. Contrastive divergence uses the parameter updates given by three half-steps of this chain, with the chain initialized from a training case (rather than a random state). As explained in section 2.1, both of these distributions are easy to sample from. The steps for computing the CD parameter updates are summarized below:\n\nContrastive divergence learning of P (v, l):\n\n1. Given a labeled training pair {v+, l+_k = 1}, sample h+ \u223c P (h|v+, l+_k = 1).\n2. Compute the outer product D+_k = v+(h+)T.\n3. Sample {v\u2212, l\u2212} \u223c P (v, l|h+). Let m be the index of the component of l\u2212 set to 1.\n4. Sample h\u2212 \u223c P (h|v\u2212, l\u2212_m = 1).\n5. Compute the outer product D\u2212_m = v\u2212(h\u2212)T.\n\nLet W\u00b7,\u00b7,k denote the Nh \u00d7 Nv matrix of parameters corresponding to the kth slice along the label dimension of the 3D tensor. Then the CD update for W\u00b7,\u00b7,k is:\n\n\u2206W\u00b7,\u00b7,k = D+_k \u2212 D\u2212_k,  (5)\n\nW\u00b7,\u00b7,k \u2190 W\u00b7,\u00b7,k + \u03b7\u2206W\u00b7,\u00b7,k,  (6)\n\nwhere \u03b7 is a learning rate parameter. Typically, the updates computed from a \u201cmini-batch\u201d of training cases (a small subset of the entire training set) are averaged together into one update and then applied to the parameters.\n\n3 Combining Gradients for Generative and Discriminative Models\n\nIn practice the Markov chain used in the learning of P (v, l) can suffer from slow mixing. 
In particular, the label l\u2212 generated in step 3 above is unlikely to be different from the true label l+ of the training case used in step 1. Empirically, the chain has a tendency to stay \u201cstuck\u201d on the same state for the label variable because in the positive phase the hidden activities are inferred with the label clamped to its true value. So the hidden activities contain information about the true label, which gives it an advantage over the other labels.\n\nConsider the extreme case where we initialize the Markov chain with a training pair {v+, l+_k = 1} and the label variable never changes from its initial state during the chain\u2019s entire run. In effect, the model that ends up being learned is a class-conditional generative distribution P (v|lk = 1), represented by the kth slice RBM. The parameter updates are identical to those for training Nl independent RBMs, one per class, with only the training cases of each class being used to learn the RBM for that class. Note that this is very different from the model in section 2: here the energy functions implemented by the class-conditional RBMs are learned independently and their energy units are not commensurate with each other.\n\nAlternatively, we can optimize the same set of parameters to represent yet another distribution, P (l|v). The advantage in this case is that the exact gradient needed for maximum likelihood learning, \u2202 log P (l|v)/\u2202W, can be computed in O(NvNhNl) time. The gradient expression can be derived with some straightforward differentiation of equation 4. The disadvantage is that it cannot make use of unlabeled data. Also, as the results show, learning a purely discriminative model at the top level of a DBN gives much worse performance.\n\nHowever, now a new way of learning P (v, l) becomes apparent: we can optimize the parameters by using a weighted sum of the gradients for log P (v|l) and log P (l|v). 
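As a minimal numpy sketch of this weighted combination (assuming binary units, no biases, and illustrative values for the learning rate and weighting; the function and variable names are ours, not the paper's): the generative term is one-step CD on the slice RBM picked out by the true label, and the discriminative term is the exact gradient of log P(l|v) derived from equation 4.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_update(W, v_pos, k, rng, eta=0.01, lam=1.0):
    """One hybrid parameter update for a labeled case (v_pos, l_k = 1)."""
    Nv, Nh, Nl = W.shape
    # Generative part: one-step CD on the k-th slice RBM, i.e. on P(v | l_k = 1).
    h_pos = (sigmoid(v_pos @ W[:, :, k]) > rng.random(Nh)).astype(float)
    v_neg = (sigmoid(W[:, :, k] @ h_pos) > rng.random(Nv)).astype(float)
    h_neg = sigmoid(v_neg @ W[:, :, k])              # probabilities for the last half-step
    dW_gen = np.zeros_like(W)
    dW_gen[:, :, k] = np.outer(v_pos, h_pos) - np.outer(v_neg, h_neg)
    # Discriminative part: exact gradient of log P(l_k = 1 | v_pos).
    x = np.einsum('ijc,i->jc', W, v_pos)             # hidden inputs under each slice RBM
    F = -np.logaddexp(0.0, x).sum(axis=0)            # free energies F_c(v)
    p = np.exp(-(F - F.min())); p /= p.sum()         # P(l_c = 1 | v), equation 4
    target = (np.arange(Nl) == k).astype(float)
    dW_dis = np.einsum('i,jc,c->ijc', v_pos, sigmoid(x), target - p)
    return W + eta * (dW_gen + lam * dW_dis)         # weighted sum, as in equation 7

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4, 3)) * 0.1
v = rng.integers(0, 2, 6).astype(float)
W_new = hybrid_update(W, v, k=1, rng=rng)
assert W_new.shape == W.shape
```

Setting lam to zero recovers pure class-conditional CD on the slice RBMs; a positive lam adds the discriminative pressure described in the text.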
As explained below, this approach 1) avoids the slow mixing of the CD learning for P (v, l), and 2) allows learning with both labeled and unlabeled data. It resembles pseudo-likelihood in how it optimizes the two conditional distributions in place of the joint distribution, except here one of the conditionals (P (v|l)) is still learned only approximately. In our experiments, a model trained with this hybrid learning algorithm has the highest classification accuracy, beating both a generative model trained using CD as well as a purely discriminative model. The main steps of the algorithm are listed below.\n\nHybrid learning algorithm for P (v, l):\n\nLet {v+, l+_k = 1} be a labeled training case.\n\nGenerative update: CD learning of P (v|l)\n\n1. Sample h+ \u223c P (h|v+, l+_k = 1).\n2. Compute the outer product D+_k = v+(h+)T.\n3. Sample v\u2212 \u223c P (v|h+, l+_k = 1).\n4. Sample h\u2212 \u223c P (h|v\u2212, l+_k = 1).\n5. Compute the outer product D\u2212_k = v\u2212(h\u2212)T.\n6. Compute the update \u2206W^g\u00b7,\u00b7,k = D+_k \u2212 D\u2212_k.\n\nDiscriminative update: ML learning of P (l|v)\n\n1. Compute log P (lc = 1|v+) for c \u2208 {1, ..., Nl}.\n2. Using the result from step 1 and the true label l+_k = 1, compute the update \u2206W^d\u00b7,\u00b7,c = \u2202 log P (l|v)/\u2202W\u00b7,\u00b7,c for c \u2208 {1, ..., Nl}.\n\nThe two types of update for the cth slice of the tensor W\u00b7,\u00b7,c are then combined by a weighted sum:\n\nW\u00b7,\u00b7,c \u2190 W\u00b7,\u00b7,c + \u03b7(\u2206W^g\u00b7,\u00b7,c + \u03bb\u2206W^d\u00b7,\u00b7,c),  (7)\n\nwhere \u03bb is a parameter that sets the relative weighting of the generative and discriminative updates, and \u03b7 is the learning rate. As before, the updates from a mini-batch of training cases can be averaged together and applied as a single update to the parameters. In experiments, we set \u03bb by trying different values and evaluating classification accuracy on a validation set.\n\nNote that the generative part in the above algorithm is simply CD learning of the RBM for the kth class. The earlier problem of slow mixing does not appear in the hybrid algorithm because the chain in the generative part does not involve sampling the label.\n\nSemi-supervised learning: The hybrid learning algorithm can also make use of unlabeled training cases by treating their labels as missing inputs. The model first infers the missing label by sampling P (l|vu) for an unlabeled training case vu. The generative update is then computed by treating the inferred label as the true label. (The discriminative update will always be zero in this case.) Therefore the unlabeled training cases contribute an extra generative term to the parameter update.\n\n4 Sparsity\n\nDiscriminative performance is improved by using binary features that are only rarely active. Sparse activities are achieved by specifying a desired probability of being active, p << 1, and then adding an additional penalty term that encourages an exponentially decaying average, q, of the actual probability of being active to be close to p. The natural error measure to use is the cross entropy between the desired and actual distributions: p log q + (1 \u2212 p) log(1 \u2212 q). For logistic units this has a simple derivative of p \u2212 q with respect to the total input to a unit. This derivative is used to adjust both the bias and the incoming weights of each hidden unit. We tried various values for p and 0.1 worked well. In addition to specifying p it is necessary to specify how fast the estimate of q decays. We used qnew = 0.9 \u2217 qold + 0.1 \u2217 qcurrent, where qcurrent is the average probability of activation for the current mini-batch of 100 training cases. 
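The running estimate and its penalty gradient are simple to state in code (a sketch with numpy; p = 0.1 and the 0.9/0.1 decay follow the text, while the mini-batch values are made up for illustration):

```python
import numpy as np

def sparsity_step(q_old, h_probs, p=0.1, decay=0.9):
    """Update the decaying activity estimate q and return the penalty gradient.

    h_probs: (batch_size, Nh) hidden unit activation probabilities.
    The cross-entropy penalty has derivative (p - q) with respect to each
    hidden unit's total input; it is added to the bias and weight updates.
    """
    q_new = decay * q_old + (1.0 - decay) * h_probs.mean(axis=0)
    return q_new, p - q_new

rng = np.random.default_rng(0)
q = np.full(4, 0.5)                       # initial activity estimates
for _ in range(50):
    # illustrative mini-batches whose units are active roughly 20% of the time
    q, grad = sparsity_step(q, rng.random((100, 4)) ** 4)
# units that are more active than p receive a negative adjustment
assert (grad < 0).all()
```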
It is also necessary to specify how strong the penalty term should be, but this is easy to set empirically. We multiply the penalty gradient by a coefficient that is chosen to ensure that, on average, q is close to p but there is still significant variation among the q values for different hidden units. This prevents the penalty term from dominating the learning. One added advantage of this sparseness penalty is that it revives any hidden units whose average activities are much lower than p.\n\n5 Evaluating DBNs on the NORB Object Recognition Task\n\n5.1 NORB Database\n\nFor a detailed description see [12]. The five object classes in NORB are animals, humans, planes, trucks, and cars. The dataset comes in two different versions, normalized-uniform and jittered-cluttered. In this paper we use the normalized-uniform version, which has objects centred in the images with a uniform background. There are 10 instances of each object class, imaged under 6 illuminations and 162 viewpoints (18 azimuths \u00d7 9 elevations). The instances are split into two disjoint sets (pre-specified in the database) of five each to define the training and test sets, both containing 24,300 cases. So at test time a trained model has to recognize unseen instances of the same object classes.\n\nPre-processing: A single training (and test) case is a stereo-pair of grayscale images, each of size 96\u00d796. To speed up experiments, we reduce dimensionality by using a \u201cfoveal\u201d image representation. The central 64 \u00d7 64 portion of an image is kept at its original resolution. The remaining 16 pixel-wide ring around it is compressed by replacing non-overlapping square blocks of pixels with the average value of a block. We split the ring into four smaller ones: the outermost ring has 8 \u00d7 8 blocks, followed by a ring of 4 \u00d7 4 blocks, and finally two innermost rings of 2 \u00d7 2 blocks. 
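As a check on the arithmetic, the feature count of this foveal representation can be reproduced from the block sizes (a sketch; the sub-ring boundaries at 80, 72, and 68 pixels are implied by the block widths rather than stated explicitly):

```python
# 96x96 image: central 64x64 fovea kept at full resolution, the surrounding
# 16-pixel ring compressed into four sub-rings of block averages (outermost first).
def ring_blocks(outer, inner, block):
    """Number of block averages in a square ring with the given block size."""
    return (outer * outer - inner * inner) // (block * block)

sides = [96, 80, 72, 68, 64]      # square boundaries between the sub-rings
blocks = [8, 4, 2, 2]             # block size used in each sub-ring
ring_feats = sum(ring_blocks(o, i, b)
                 for o, i, b in zip(sides[:-1], sides[1:], blocks))
per_image = 64 * 64 + ring_feats  # fovea plus compressed surround
assert 2 * per_image == 8976      # stereo-pair dimensionality given in the text
```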
The foveal representation reduces the dimensionality\nof a stereo-pair from 18432 to 8976. All our models treat the stereo-pair images as 8976-\ndimensional vectors1.\n\n5.2 Training Details\n\nModel architecture: The two main decisions to make when training DBNs are the number\nof hidden layers to greedily pre-train and the number of hidden units to use in each layer.\nTo simplify the experiments we constrain the number of hidden units to be the same at\nall layers (including the top-level model). We have tried hidden layer sizes of 2000, 4000,\nand 8000 units. We have also tried models with two, one, or no greedily pre-trained hidden\nlayers. To avoid clutter, only the results for the best settings of these two parameters are\ngiven. The best classi\ufb01cation results are given by the DBN with one greedily pre-trained\nsparse hidden layer of 4000 units (regardless of the type of top-level model).\n\nA DBN trained on the pre-processed input with one greedily pre-trained layer of 4000\nhidden units and a third-order model on top of it, also with 4000 hidden units, has roughly\n116 million learnable parameters in total. This is roughly two orders of magnitude more\nparameters than some of the early DBNs trained on the MNIST images [6], [10]. Training\nsuch a model in Matlab on an Intel Xeon 3GHz machine takes almost two weeks. See a\nrecent paper by Raina et al.\n[15] that uses GPUs to train a deep model with roughly the\nsame number of parameters much more quickly.\n\nWe put Gaussian units at the lowest (pixel) layer of the DBN, which have been shown to be\ne\ufb00ective for modelling grayscale images [7]. See [7], [21] for details about Gaussian units.\n\n6 Results\n\nThe results are presented in three parts: part 1 compares deep models to shallow ones,\nall trained using CD. Part 2 compares CD to the hybrid learning algorithm for training\nthe top-level model of a DBN. 
Part 3 compares DBNs trained with and without unlabeled data, using either CD or the hybrid algorithm at the top level. For comparison, here are some published results for discriminative models on normalized-uniform NORB (without any pre-processing) [2], [12]: logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%.\n\n1 Knowledge about image topology is used only along the (mostly empty) borders, and not in the central portion that actually contains the object.\n\n6.1 Deep vs. Shallow Models Trained with CD\n\nWe consider here DBNs with one greedily pre-trained layer and a top-level model that contains the greedily pretrained features as its \u201cvisible\u201d layer. The corresponding shallow version trains the top-level model directly on the pixels (using Gaussian visible units), with no pre-trained layers in between. Using CD as the learning algorithm (for both greedy pre-training and at the top-level) with the two types of top-level models gives us four possibilities to compare. The test error rates for these four models (see table 1) show that one greedily pre-trained layer reduces the error substantially, even without any subsequent fine-tuning of the pre-trained layer.\n\nModel     RBM with label unit    Third-order RBM\nShallow   22.8%                  20.8%\nDeep      11.9%                  7.6%\n\nTable 1: NORB test set error rates for deep and shallow models trained using CD with two types of top-level models.\n\nThe third-order RBM outperforms the standard RBM top-level model when they both have the same number of hidden units, but a better comparison might be to match the number of parameters by increasing the hidden layer size of the standard RBM model by five times (i.e. 20000 hidden units). We have tried training such an RBM, but the error rate is worse than the RBM with 4000 hidden units.\n\n6.2 Hybrid vs. 
CD Learning for the Top-level Model\n\nWe now compare the two alternatives for training the top-level model of a DBN. There are four possible combinations of top-level models and learning algorithms, and table 2 lists their error rates. All these DBNs share the same greedily pre-trained first layer \u2013 only the top-level model differs among them.\n\nLearning algorithm   RBM with label unit   Third-order RBM\nCD                   11.9%                 7.6%\nHybrid               10.4%                 6.5%\n\nTable 2: NORB test set error rates for top-level models trained using CD and the hybrid learning algorithms.\n\nThe lower error rates of hybrid learning are partly due to its ability to avoid the poor mixing of the label variable when CD is used to learn the joint density P (v, l) and partly due to its greater emphasis on discrimination (but with strong regularization provided by also learning P (v|l)).\n\n6.3 Semi-supervised vs. Supervised Learning\n\nIn this final part, we create additional images from the original NORB training set by applying global translations of 2, 4, and 6 pixels in eight directions (two horizontal, two vertical and four diagonal directions) to the original stereo-pair images. The same translation is applied to both images in the stereo-pair. These \u201cjittered\u201d images are treated as extra unlabeled training cases that are combined with the original labeled cases to form a much larger training set. Note that we could have assigned the jittered images the same class label as their source images. By treating them as unlabeled, the goal is to test whether improving the unsupervised, generative part of the learning alone can improve discriminative performance.\n\nThere are two ways to use unlabeled data:\n\n1. Use it for greedy pre-training of the lower layers only, and then train the top-level model as before, with only labeled data and the hybrid algorithm.\n\n2. 
Use it for learning the top-level model as well, now with the semi-supervised variant of the hybrid algorithm at the top-level.\n\nTable 3 lists the results for both options.\n\nTop-level model (hybrid learning only)   Unlabeled jitter for pre-training lower layer?   Unlabeled jitter at the top-level?   Error\nRBM with label unit                      No                                               No                                   10.4%\nRBM with label unit                      Yes                                              No                                   9.0%\nThird-order model                        No                                               No                                   6.5%\nThird-order model                        Yes                                              No                                   5.3%\nThird-order model                        Yes                                              Yes                                  5.2%\n\nTable 3: NORB test set error rates for DBNs trained with and without unlabeled data, and using the hybrid learning algorithm at the top-level.\n\nThe key conclusion from table 3 is that simply using more unlabeled training data in the unsupervised, greedy pre-training phase alone can significantly improve the classification accuracy of the DBN. It allows a third-order top-level model to reduce its error from 6.5% to 5.3%, which beats the current best published result for normalized-uniform NORB without using any extra labeled data. Using more unlabeled data also at the top level further improves accuracy, but only slightly, to 5.2%.\n\nNow consider a discriminative model at the top, representing the distribution P (l|v). Unlike in the generative case, the exact gradient of the log-likelihood is tractable to compute. Table 4 shows the results of some discriminative models. These models use the same greedily pre-trained lower layer, learned with unlabeled jitter. They differ in how the top-level parameters are initialized, and whether they use the jittered images as extra labeled cases for learning P (l|v).\n\nWe compare training the discriminative top-level model \u201cfrom scratch\u201d (random initialization) versus initializing its parameters to those of a generative model learned by the hybrid algorithm. We also compare the effect of using the jittered images as extra labeled cases. As mentioned before, it is possible to assign the jittered images the same labels as the original NORB images they are generated from, which expands the labeled training set by 25 times. The bottom two rows of table 4 compare a discriminative third-order model initialized with and without pre-training. Pre-trained initialization (5.0%) significantly improves accuracy over random initialization (7.1%). But note that discriminative training only makes a small additional improvement (5.2% to 5.0%) over the accuracy of the pre-trained model itself.\n\nInitialization of top-level parameters   Use jittered images as labeled?   Error\nRandom                                   No                                13.4%\nRandom                                   Yes                               7.1%\nModel with 5.2% error from table 3       Yes                               5.0%\n\nTable 4: NORB test set error rates for discriminative third-order models at the top level.\n\n7 Conclusions\n\nOur results make a strong case for the use of generative modeling in object recognition. The main two points are: 1) Unsupervised, greedy, generative learning can extract an image representation that supports more accurate object recognition than the raw pixel representation. 2) Including P (v|l) in the objective function for training the top-level model results in better classification accuracy than using P (l|v) alone. In future work we plan to factorize the third-order Boltzmann machine as described in [18] so that some of the top-level features can be shared across classes.\n\nReferences\n\n[1] Y. Bengio, P. Lamblin, P. Popovici, and H. Larochelle. Greedy Layer-Wise Training of Deep Networks. In NIPS, 2006.\n\n[2] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.\n\n[3] D. DeCoste and B. Scholkopf. Training Invariant Support Vector Machines. Machine Learning, 46:161\u2013190, 2002.\n\n[4] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771\u20131800, 2002.\n\n[5] G. E. Hinton. 
To Recognize Shapes, First Learn to Generate Images. Technical Report UTML TR 2006-04, Dept. of Computer Science, University of Toronto, 2006.\n\n[6] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527\u20131554, 2006.\n\n[7] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504\u2013507, 2006.\n\n[8] M. Kelm, C. Pal, and A. McCallum. Combining Generative and Discriminative Methods for Pixel Classification with Multi-Conditional Learning. In ICPR, 2006.\n\n[9] H. Larochelle and Y. Bengio. Classification Using Discriminative Restricted Boltzmann Machines. In ICML, pages 536\u2013543, 2008.\n\n[10] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473\u2013480, 2007.\n\n[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, November 1998.\n\n[12] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, Washington, D.C., 2004.\n\n[13] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. In ICML, 2009.\n\n[14] V. Nair and G. E. Hinton. Implicit Mixtures of Restricted Boltzmann Machines. In NIPS, 2008.\n\n[15] R. Raina, A. Madhavan, and A. Ng. Large-scale Deep Unsupervised Learning using Graphics Processors. In ICML, 2009.\n\n[16] Marc\u2019Aurelio Ranzato, Fu-Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern Recognition Conference (CVPR\u201907). 
IEEE Press, 2007.\n\n[17] T. J. Sejnowski. Higher-order Boltzmann Machines. In AIP Conference Proceedings, pages 398\u2013403, 1987.\n\n[18] G. Taylor and G. E. Hinton. Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style. In ICML, 2009.\n\n[19] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.\n\n[20] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In ICML, 2008.\n\n[21] M. Welling, M. Rosen-Zvi, and G. E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS 17, 2005.\n", "award": [], "sourceid": 807, "authors": [{"given_name": "Vinod", "family_name": "Nair", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}