{"title": "Learning Stochastic Feedforward Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 530, "page_last": 538, "abstract": "Multilayer perceptrons (MLPs) or neural networks are popular models used for nonlinear regression and classification tasks. As regressors, MLPs model the conditional distribution of the predictor variables Y given the input variables X. However, this predictive distribution is assumed to be unimodal (e.g. Gaussian). For tasks such as structured prediction problems, the conditional distribution should be multimodal, forming one-to-many mappings. By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. However, previously proposed learning algorithms for SBNs are very slow and do not work well for real-valued data. In this paper, we propose a stochastic feedforward network with hidden layers having \\emph{both deterministic and stochastic} variables. A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. We demonstrate the superiority of our model to conditional Restricted Boltzmann Machines and Mixture Density Networks on synthetic datasets and on modeling facial expressions. Moreover, we show that latent features of our model improves classification and provide additional qualitative results on color images.", "full_text": "Learning Stochastic Feedforward Neural Networks\n\nYichuan Tang\n\nRuslan Salakhutdinov\n\nDepartment of Computer Science\n\nDepartment of Computer Science and Statistics\n\nUniversity of Toronto\n\nToronto, Ontario, Canada.\ntang@cs.toronto.edu\n\nUniversity of Toronto\n\nToronto, Ontario, Canada.\n\nrsalakhu@cs.toronto.edu\n\nAbstract\n\nMultilayer perceptrons (MLPs) or neural networks are popular models used for\nnonlinear regression and classi\ufb01cation tasks. 
As regressors, MLPs model the conditional distribution of the predictor variables Y given the input variables X. However, this predictive distribution is assumed to be unimodal (e.g. Gaussian). For tasks involving structured prediction, the conditional distribution should be multimodal, resulting in one-to-many mappings. By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. However, previously proposed learning algorithms for SBNs are not efficient and are unsuitable for modeling real-valued data. In this paper, we propose a stochastic feedforward network with hidden layers composed of both deterministic and stochastic variables. A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. Our model achieves superior performance on synthetic and facial expression datasets compared to conditional Restricted Boltzmann Machines and Mixture Density Networks. In addition, the latent features of our model improve classification and can learn to generate colorful textures of objects.\n\n1 Introduction\n\nMultilayer perceptrons (MLPs) are general purpose function approximators. The outputs of an MLP can be interpreted as the sufficient statistics of a member of the exponential family (conditioned on the input X), thereby inducing a distribution over the output space Y. Since the nonlinear activations are all deterministic, MLPs model the conditional distribution p(Y|X) with a unimodal assumption (e.g. an isotropic Gaussian)1.\n\nFor many structured prediction problems, we are interested in a conditional distribution p(Y|X) that is multimodal and may have complicated structure2. One way to model the multi-modality is to make the hidden variables stochastic. 
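To see why stochastic hidden variables yield multimodality, here is a minimal numpy sketch (the weights are made-up toy values, not from the paper): with binary hiddens sampled from a Bernoulli layer, repeated forward passes at the *same* input x land on distinct output modes, whereas a deterministic MLP would always return one mean.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# toy weights (hypothetical): 1 input, 2 binary stochastic hiddens, 1 output
W1 = np.array([[4.0], [-4.0]]); b1 = np.array([-2.0, 2.0])
W2 = np.array([[1.0, -1.0]]);   b2 = np.array([0.0])

x = np.array([0.5])
samples = []
for _ in range(1000):
    # h ~ Bernoulli(sigmoid(W1 x + b1)): the stochastic hidden layer
    h = (rng.random(2) < sigmoid(W1 @ x + b1)).astype(float)
    # p(y|h) is a small-variance Gaussian centered at W2 h + b2
    y = W2 @ h + b2 + 0.05 * rng.standard_normal(1)
    samples.append(y[0])
# with 2 binary hiddens there are up to 2^2 Gaussian components in p(y|x);
# here the hidden configurations place modes near -1, 0, and +1
```

Collecting `samples` into a histogram shows several separated clusters at a single x, which is exactly the one-to-many behavior a deterministic MLP cannot express.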
Conditioned on a particular input X, different hidden configurations lead to different Y. Sigmoid Belief Nets (SBNs) [3, 2] are models capable of satisfying the multi-modality requirement. With binary input, hidden, and output variables, they can be viewed as directed graphical models where the sigmoid function is used to compute the degrees of \u201cbelief\u201d of a child variable given the parent nodes. Inference in such models is generally intractable. The original paper by Neal [2] proposed a Gibbs sampler which cycles through the hidden nodes one at a time. This is problematic as Gibbs sampling can be very slow when learning large models or fitting moderately-sized datasets. In addition, slow mixing of the Gibbs chain would typically lead to a biased estimation of gradients during learning.\n\n1For example, in an MLP with one input, one output and one hidden layer: p(y|x) \u223c N(y|\u00b5y, \u03c3\u00b2y), \u00b5y = \u03c3(W2\u03c3(W1x)), where \u03c3(a) = 1/(1 + exp(\u2212a)) is the sigmoid function. Note that the Mixture Density Network is an exception to the unimodal assumption [1].\n2An equivalent problem is learning one-to-many functions from X \u21a6 Y.\n\n\fFigure 1: Stochastic Feedforward Neural Networks. Left: Network diagram. Red nodes are stochastic and binary, while the rest of the hiddens are deterministic sigmoid nodes. Right: motivation as to why multimodal outputs are needed. Given the top half of the face x, the mouth in y can be different, leading to different expressions.\n\nA variational learning algorithm based on the mean-field approximation was proposed in [4] to improve the learning of SBNs. A drawback of the variational approach is that, similar to Gibbs, it has to cycle through the hidden nodes one at a time. 
Moreover, besides the standard mean-field variational parameters, additional parameters must be introduced to lower-bound an intractable term that shows up in the expected free energy, making the lower bound looser. Gaussian fields are used in [5] for inference by making Gaussian approximations to the units\u2019 inputs, but there is no longer a lower bound on the likelihood.\n\nIn this paper, we introduce the Stochastic Feedforward Neural Network (SFNN) for modeling conditional distributions p(y|x) over a continuous real-valued Y output space. Unlike SBNs, to better model continuous data, SFNNs have hidden layers with both stochastic and deterministic units. The left panel of Fig. 1 shows a diagram of a SFNN with multiple hidden layers. Given an input vector x, different states of the stochastic units can generate different modes in Y. For learning, we present a novel Monte Carlo variant of the Generalized Expectation Maximization algorithm. Importance sampling is used in the E-step for inference, while error backpropagation is used in the M-step to improve a variational lower bound on the data log-likelihood. SFNNs have several attractive properties, including:\n\u2022 We can draw samples from the exact model distribution without resorting to MCMC.\n\u2022 Stochastic units form a distributed code to represent an exponential number of mixture components in the output space.\n\u2022 As a directed model, learning does not need to deal with a global partition function.\n\u2022 The combination of stochastic and deterministic hidden units can be jointly trained using the backpropagation algorithm, as in standard feed-forward neural networks.\n\nThe two main alternative models are Conditional Gaussian Restricted Boltzmann Machines (C-GRBMs) [6] and Mixture Density Networks (MDNs) [1]. 
Note that Gaussian Processes [7] and Gaussian Random Fields [8] are unimodal and therefore incapable of modeling a multimodal Y. Conditional Random Fields [9] are widely used in NLP and vision, but often assume Y to be discrete rather than continuous. C-GRBMs are popular models used for human motion modeling [6], structured prediction [10], and as a higher-order potential in image segmentation [11]. While C-GRBMs have the advantage of exact inference, they are energy based models that define different partition functions for different inputs X. Learning also requires Gibbs sampling, which is prone to poor mixing. MDNs use a mixture of Gaussians to represent the output Y. The components\u2019 means, mixing proportions, and the output variances are all predicted by an MLP conditioned on X. As with SFNNs, the backpropagation algorithm can be used to train MDNs efficiently. However, the number of mixture components in the output Y space must be pre-specified, and the number of parameters is linear in the number of mixture components. In contrast, with Nh stochastic hidden nodes, SFNNs can use their distributed representation to model up to 2^Nh mixture components in the output Y.\n\n2 Stochastic Feedforward Neural Networks\nSFNNs contain binary stochastic hidden variables h \u2208 {0, 1}^Nh, where Nh is the number of hidden nodes. For clarity of presentation, we construct a SFNN from a one-hidden-layer MLP by replacing the sigmoid nodes with stochastic binary ones. Note that other types of stochastic units can also be used. The conditional distribution of interest, p(y|x), is obtained by marginalizing out the latent stochastic hidden variables: p(y|x) = \u2211h p(y, h|x). SFNNs are directed graphical models where the generative process starts from x, flows through h, and then generates the output y. Thus, we can factorize the joint distribution as: p(y, h|x) = p(y|h)p(h|x). To model real-valued y, we have p(y|h) = N(y|W2h + b2, \u03c3\u00b2y) and p(h|x) = \u03c3(W1x + b1), where b is the bias. Since h \u2208 {0, 1}^Nh is a vector of Bernoulli random variables, p(y|x) has potentially 2^Nh different modes3, one for every possible binary configuration of h. The fact that h can take on different states in a SFNN is the reason why we can learn one-to-many mappings, which would be impossible with standard MLPs.\n\nThe modeling flexibility of SFNNs comes with computational costs. Since we have a mixture model with potentially 2^Nh components conditioned on any x, p(y|x) does not have a closed-form expression. We can use a Monte Carlo approximation with M samples for its estimation:\n\np(y|x) \u2248 (1/M) \u2211m=1..M p(y|h(m)), h(m) \u223c p(h|x). (1)\n\nThis estimator is unbiased and has relatively low variance, because its accuracy does not depend on the dimensionality of h and because p(h|x) is factorial, meaning that we can draw samples from the exact distribution.\n\nIf y is discrete, it is sufficient for all of the hiddens to be discrete. However, using only discrete hiddens is suboptimal when modeling real-valued output Y. This is due to the fact that while y is continuous, there are only a finite number of discrete hidden states, each of which (e.g. h\u2032) leads to a Gaussian component: p(y|h\u2032) = N(y|\u00b5(h\u2032), \u03c3\u00b2y). The mean of a Gaussian component is a function of the hidden state: \u00b5(h\u2032) = W2^T h\u2032 + b2. When x varies, only the probability of choosing a specific hidden state h\u2032 changes via p(h\u2032|x), not \u00b5(h\u2032). However, if we allow \u00b5(h\u2032) to be a deterministic function of x as well, we can learn a smoother p(y|x), even when it is desirable to learn small residual variances \u03c3\u00b2y. This can be accomplished by allowing both stochastic and deterministic units in a single SFNN hidden layer, so that the mean \u00b5(h\u2032, x) has contributions from two components: one from the hidden state h\u2032, and another defining a deterministic mapping from x. As we demonstrate in our experimental results, this is crucial for learning good density models of the real-valued Y.\n\nIn SFNNs with only one hidden layer, p(h|x) is a factorial Bernoulli distribution. If p(h|x) has low entropy, only a few discrete h states out of the 2^Nh total states would have any significant probability mass. We can increase the entropy over the stochastic hidden variables by adding a second hidden layer. The second hidden layer takes the stochastic and any deterministic hidden nodes of the first layer as its input. This leads to our proposed SFNN model, shown in Fig. 1.\n\nIn our SFNNs, we assume a conditional diagonal Gaussian distribution for the output Y: log p(y|h, x) \u221d \u2212\u00bd \u2211i log \u03c3\u00b2i \u2212 \u00bd \u2211i (yi \u2212 \u00b5(h, x))\u00b2/\u03c3\u00b2i. We note that we can also use any other parameterized distribution (e.g. Student\u2019s t) for the output variables. This is a win compared to the Boltzmann Machine family of models, which require the output distribution to be from the exponential family.\n\n2.1 Learning\nWe present a Monte Carlo variant of the Generalized EM algorithm [12] for learning SFNNs. Specifically, importance sampling is used during the E-step to approximate the posterior p(h|y, x), while the backprop algorithm is used during the M-step to calculate the derivatives of the parameters of both the stochastic and deterministic nodes. Gradient ascent using these derivatives guarantees that the variational lower bound on the model log-likelihood improves. 
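Both model evaluation (Eq. 1) and the E-step rely on drawing exact samples h(m) from the factorial p(h|x). A small numpy sketch of the Eq. 1 estimator for a one-stochastic-layer SFNN follows; the weights and shapes are hypothetical toys, not the paper's trained models:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def estimate_py_given_x(x, y, W1, b1, W2, b2, var, M=1000, rng=rng):
    """Monte Carlo estimate of p(y|x) (Eq. 1): average p(y|h(m)) over h(m) ~ p(h|x)."""
    p = sigmoid(W1 @ x + b1)                           # factorial Bernoulli probabilities
    H = (rng.random((M, p.size)) < p).astype(float)    # M exact samples from p(h|x)
    mu = H @ W2.T + b2                                 # per-sample Gaussian means mu(h(m))
    # diagonal-Gaussian log p(y|h(m)) for every sample
    log_py = -0.5 * np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var, axis=1)
    return np.exp(log_py).mean()
```

Because p(h|x) factorizes, each `H` row is an exact draw (no MCMC), which is what makes the estimator unbiased; for a tiny Nh one can verify it against brute-force enumeration over all 2^Nh hidden states.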
The drawback of our learning algorithm is the requirement of sampling the stochastic nodes M times for every weight update. However, as we will show in the experimental results, 20 samples are sufficient for learning good SFNNs.\n\nThe requirement of sampling is typical for models capable of structured learning. As a comparison, energy based models, such as conditional Restricted Boltzmann Machines, require MCMC sampling per weight update to estimate the gradient of the log-partition function. These MCMC samples do not converge to the true distribution, resulting in a biased estimate of the gradient.\n\nFor clarity, we provide the following derivations for SFNNs with one hidden layer containing only stochastic nodes4.\n\n3In practice, due to weight sharing, we will not be able to have close to that many modes for a large Nh.\n4It is straightforward to extend the model to multiple and hybrid hidden-layered SFNNs.\n\nFor any approximating distribution q(h), we can write down the following variational lower bound on the data log-likelihood:\n\nlog p(y|x) = log \u2211h p(y, h|x) = \u2211h p(h|y, x) log[p(y, h|x)/p(h|y, x)] \u2265 \u2211h q(h) log[p(y, h|x; \u03b8)/q(h)], (2)\n\nwhere q(h) can be any arbitrary distribution. For the tightest lower bound, q(h) needs to be the exact posterior p(h|y, x). While the posterior p(h|y, x) is hard to compute, the \u201cconditional prior\u201d p(h|x) is easy (it corresponds to a simple feedforward pass). We can therefore set q(h) \u225c p(h|x). However, this would be a very bad approximation as learning proceeds, since the learning of the likelihood p(y|h, x) will increase the KL divergence between the conditional prior and the posterior. 
Instead, it is critical to use importance sampling with the conditional prior as the proposal distribution. Let Q be the expected complete data log-likelihood, a lower bound on the log-likelihood that we wish to maximize:\n\nQ(\u03b8, \u03b8old) = \u2211h [p(h|y, x; \u03b8old)/p(h|x; \u03b8old)] p(h|x; \u03b8old) log p(y, h|x; \u03b8) \u2248 (1/M) \u2211m=1..M w(m) log p(y, h(m)|x; \u03b8), (3)\n\nwhere h(m) \u223c p(h|x; \u03b8old) and w(m) is the importance weight of the m-th sample from the proposal distribution p(h|x; \u03b8old). Using Bayes\u2019 theorem, we have\n\nw(m) = p(h(m)|y, x; \u03b8old)/p(h(m)|x; \u03b8old) = p(y|h(m), x; \u03b8old)/p(y|x; \u03b8old) \u2248 p(y|h(m); \u03b8old)/[(1/M) \u2211m\u2032=1..M p(y|h(m\u2032); \u03b8old)]. (4)\n\nEq. 1 is used to approximate p(y|x; \u03b8old). For convenience, we define the partial objective of the m-th sample as Q(m) \u225c w(m)(log p(y|h(m); \u03b8) + log p(h(m)|x; \u03b8)). We can then approximate our objective function Q(\u03b8, \u03b8old) with M samples from the proposal: Q(\u03b8, \u03b8old) \u2248 (1/M) \u2211m=1..M Q(m)(\u03b8, \u03b8old). For our generalized M-step, we seek to perform gradient ascent on Q:\n\n\u2202Q/\u2202\u03b8 \u2248 (1/M) \u2211m=1..M \u2202Q(m)(\u03b8, \u03b8old)/\u2202\u03b8 = (1/M) \u2211m=1..M w(m) \u2202/\u2202\u03b8{log p(y|h(m); \u03b8) + log p(h(m)|x; \u03b8)}. (5)\n\nThe gradient term \u2202/\u2202\u03b8{\u00b7} is computed using error backpropagation of two sub-terms. The first part, \u2202/\u2202\u03b8{log p(y|h(m); \u03b8)}, treats y as the targets and h(m) as the input data, while the second part, \u2202/\u2202\u03b8{log p(h(m)|x; \u03b8)}, treats h(m) as the targets and x as the input data. In SFNNs with a mixture of deterministic and stochastic units, backprop will additionally propagate error information from the first part to the second part.\n\nThe full gradient is a weighted summation of the M partial derivatives, where the weighting comes from how well a particular state h(m) can generate the data y. This is intuitively appealing, since learning adjusts both the \u201cpreferred\u201d states\u2019 abilities to generate the data (first part in the braces) and their probability of being picked conditioned on x (second part in the braces). The detailed EM learning algorithm for SFNNs is listed in Alg. 1 of the Supplementary Materials.\n\n2.2 Cooperation during learning\nWe note that for importance sampling to work well in general, a key requirement is that the proposal distribution is not small where the true distribution has significant mass. However, things are slightly different when using importance sampling during learning. Our proposal distribution p(h|x) and the posterior p(h|y, x) are not fixed but rather governed by the model parameters. Learning adapts these distributions in a synergistic and cooperative fashion.\n\nLet us hypothesize that at a particular learning iteration, the conditional prior p(h|x) is small in certain regions where the posterior p(h|y, x) is large, which is undesirable for importance sampling. The E-step will draw M samples and weight them according to Eq. 4. While all samples h(m) will have very low log-likelihood due to the bad conditional prior, there will be a certain preferred state \u02c6h with the largest weight. Learning using Eq. 5 will accomplish two things: (1) it will adjust the generative weights to allow preferred states to better generate the observed y; (2) it will make the conditional prior better by making it more likely to predict \u02c6h given x. Since the generative weights are shared, the fact that \u02c6h generates y accurately will probably reduce the likelihood of y under another state \u02dch. The updated conditional prior tends to be a better proposal distribution for the updated model. This cooperative interaction between the conditional prior and posterior during learning provides some robustness to the importance sampler.\n\n\fFigure 3: Three synthetic datasets of 1-dimensional one-to-many mappings ((a) Dataset A, (b) Dataset B, (c) Dataset C). For any given x, multiple modes in y exist. Blue stars are the training data, red pluses are exact samples from SFNNs. Best viewed in color.\n\nEmpirically, we can see this effect as learning progresses on Dataset A of Sec. 3.1 in Fig. 2. The plot shows the model log-likelihood given the training data over the course of 3000 weight updates. 30 importance samples are used during learning, with 2 hidden layers of 5 stochastic nodes. We chose 5 nodes because it is small enough that the true log-likelihood can be computed using brute-force integration. As learning progresses, the Monte Carlo approximation is very close to the true log-likelihood using only 30 samples. As expected, the KL divergence between the posterior and the prior grows as the generative weights better model the multi-modalities around x = 0.5. We also compared the KL divergence between our empirical weighted importance-sampled distribution and the true posterior, which converges toward zero.\n\nFigure 2: KL divergence and log-likelihoods. Best viewed in color.\n\nThis demonstrates that the prior distribution has learned to not be small in regions of large posterior. 
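The E- and M-steps above (Eqs. 3-5) can be sketched in a few lines of numpy for a one-stochastic-layer SFNN. The model, shapes, and learning rate here are hypothetical toys, and a real implementation would also backpropagate through deterministic hidden layers; note that self-normalizing the weights to sum to 1 is equivalent to the (1/M) factor with Eq. 4's normalization:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def em_step(x, y, W1, b1, W2, b2, var, M=30, lr=0.01, rng=rng):
    """One importance-weighted generalized EM update (sketch of Eqs. 3-5)."""
    p = sigmoid(W1 @ x + b1)                             # Bernoulli probs of p(h|x)
    H = (rng.random((M, p.size)) < p).astype(float)      # E-step: h(m) ~ p(h|x)
    mu = H @ W2.T + b2                                   # Gaussian means mu(h(m))
    log_py = -0.5 * np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var, axis=1)
    w = np.exp(log_py - log_py.max()); w /= w.sum()      # importance weights (Eq. 4)
    # M-step (Eq. 5): weighted gradient ascent on Q, in place
    dmu = w[:, None] * (y - mu) / var                    # weighted d log p(y|h)/d mu
    dh = w[:, None] * (H - p)                            # weighted d log p(h|x)/d logits
    W2 += lr * dmu.T @ H;  b2 += lr * dmu.sum(0)         # "generate y better" part
    W1 += lr * np.outer(dh.sum(0), x); b1 += lr * dh.sum(0)  # "prefer h-hat" part
```

The two weighted gradients mirror the two braced terms in Eq. 5: `dmu` improves the preferred states' ability to generate y, while `dh` raises their probability under the conditional prior.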
In other words, this shows that the E-step in the learning of SFNNs is close to exact for this dataset and model.\n\n3 Experiments\nWe first demonstrate the effectiveness of SFNNs on synthetic one-dimensional one-to-many mapping data. We then use SFNNs to model face images with varying facial expressions and emotions. SFNNs outperform other competing density models by a large margin. We also demonstrate the usefulness of the latent features learned by SFNNs for expression classification. Finally, we train SFNNs on a dataset with in-depth head rotations, a database of colored objects, and an image segmentation database. By drawing samples from these trained SFNNs, we obtain qualitative results and insights into the modeling capacity of SFNNs. We provide computation times for learning in the Supplementary Materials.\n\n3.1 Synthetic datasets\nAs a proof of concept, we used three one-dimensional one-to-many mapping datasets, shown in Fig. 3. Our goal is to model p(y|x). Dataset A was used by [1] to evaluate the performance of Mixture Density Networks (MDNs). Dataset B has a large number of tight modes conditioned on any given x, which is useful for testing a model\u2019s ability to learn many modes and a small residual variance. Dataset C is used for testing whether a model can learn modes that are far apart from each other. We randomly split the data into a training, a validation, and a test set. We report the test set log-probability averaged over 5 folds for different models in Table 1. The method called \u2018Gaussian\u2019 is a 2D Gaussian estimated on (x, y) jointly, and we report log p(y|x), which can be obtained easily in closed form. For the Conditional Gaussian Restricted Boltzmann Machine (C-GRBM) we used 25-step Contrastive Divergence [13] (CD-25) to estimate the gradient of the log partition function. We used Annealed Importance Sampling [14, 15] with 50,000 intermediate temperatures to estimate the partition function. 
SBN is a Sigmoid Belief Net with three hidden stochastic binary layers between the input and the output layer. It is trained in the same way as the SFNN, but there are no deterministic units. Finally, SFNN has four hidden layers, with the inner two being hybrid stochastic/deterministic layers (see Fig. 1). We used 30 importance samples to approximate the posterior during the E-step. All other hyper-parameters for all of the models were chosen to maximize the validation performance.\n\nSBN MDN Gaussian C-GRBM SFNN\nA 0.078\u00b10.02 1.05\u00b10.02 0.57\u00b10.01 0.79\u00b10.03 1.04\u00b10.03\nB -2.40\u00b10.07 -1.58\u00b10.11 -2.14\u00b10.04 -1.33\u00b10.10 -0.98\u00b10.06\nC 0.37\u00b10.07 2.03\u00b10.05 1.36\u00b10.05 1.74\u00b10.08 2.21\u00b10.16\nTable 1: Average test log-probability density on synthetic 1D datasets.\n\nTable 1 reveals that SFNNs consistently outperform all other methods. Fig. 3 further shows samples drawn from SFNNs as red \u2018pluses\u2019. Note that SFNNs can learn small residual variances to accurately model Dataset B. Comparing SBNs to SFNNs, it is clear that having deterministic hidden nodes is a big win for modeling continuous y.\n\n3.2 Modeling Facial Expression\nConditioned on a subject\u2019s face with a neutral expression, the distribution of all possible emotions or expressions of this particular individual is multimodal in pixel space. We learn SFNNs to model facial expressions in the Toronto Face Database [16]. The Toronto Face Database consists of 4000 images of 900 individuals with 7 different expressions. Of the 900 subjects, there are 124 with 10 or more images per subject, which we used as our data. We randomly selected 100 subjects with 1385 total images for training, while 24 subjects with a total of 344 images were selected as the test set. For each subject, we take the average of their face images as x (the mean face), and learn to model this subject\u2019s varying expressions y. Both x and y are grayscale and downsampled to a resolution of 48\u00d748. We trained a SFNN with 4 hidden layers of size 128 on these facial expression images. The second and third \u201chybrid\u201d hidden layers contained 32 stochastic binary and 96 deterministic hidden nodes, while the first and the fourth hidden layers consisted of only deterministic sigmoids. We refer to this model as SFNN2. We also tested the same model but with only one hybrid hidden layer, which we call SFNN1. We used mini-batches of size 100 and 30 importance samples for the E-step. A total of 2500 weight updates were performed. Weights were randomly initialized with a standard deviation of 0.1, and the residual variance \u03c3\u00b2y was initialized to the variance of y.\n\nFor comparisons with other models, we trained a Mixture of Factor Analyzers (MFA) [17], Mixture Density Networks (MDN), and Conditional Gaussian Restricted Boltzmann Machines (C-GRBM) on this task. For the Mixture of Factor Analyzers model, we trained a mixture with 100 components, one for each training individual. Given a new test face xtest, we first find the training face \u02c6x which is closest in Euclidean distance. We then take the parameters of \u02c6x\u2019s FA component, while replacing the FA\u2019s mean with xtest. Mixture Density Networks are trained using code provided by the NETLAB package [18]. The number of Gaussian mixture components and the number of hidden nodes were selected using a validation set. Optimization is performed using the scaled conjugate gradient algorithm until convergence. For C-GRBMs, we used CD-25 for training. The optimal number of hidden units, selected via validation, was 1024. 
A population sparsity objective on the hidden activations was also part of the objective [19]. The residual diagonal covariance matrix is also learned. Optimization used stochastic gradient descent with mini-batches of 100 samples each.\n\nTable 2 displays the average log-probabilities along with standard errors of the 344 test images. We also recorded the total training time of each algorithm, although this depends on the number of weight updates and whether or not GPUs are used (see the Supplementary Materials for more details). For MFA and MDN, the log-probabilities were computed exactly. For SFNNs, we used Eq. 1 with 1000 samples. We can see that SFNNs substantially outperform all other models. Having two hybrid hidden layers (SFNN2) improves model performance over SFNN1, which has only one hybrid hidden layer.\n\nMFA MDN C-GRBM SFNN1 SFNN2\nNats 1406\u00b152 1321\u00b116 1146\u00b1113 1488\u00b118 1534\u00b127\nTime 10 secs. 6 mins. 158 mins. 112 secs. 113 secs.\nTable 2: Average test log-probability and total training time on facial expression images. Note that for continuous data, these are probability densities and can be positive.\n\nQualitatively, Fig. 4 shows samples drawn from the trained models. The leftmost column shows the mean faces of 3 test subjects, followed by 7 samples from the distribution p(y|x). For C-GRBM, samples are generated from a Gibbs chain, where each successive image is taken after 1000 steps. For the other 2 models, displayed samples are exact. MFAs overfit on the training set, generating samples with significant artifacts.\n\nFigure 4: Samples generated from various models: (a) Conditional Gaussian RBM, (b) MFA, (c) SFNN.\nFigure 5: Plots demonstrate how hyperparameters affect the evaluation and learning of SFNNs.\n\nSamples produced by C-GRBMs suffer from poor mixing and get stuck at a local mode. 
SFNN samples show that the model was able to capture a combination of multi-modality and preserved much of the identity of the test subjects. We also note that SFNN-generated faces are not simple memorizations of the training data. This is validated by the model\u2019s superior performance on the test set in Table 2.\n\nWe further explored how different hyperparameters (e.g. the number of stochastic layers, the number of Monte Carlo samples) affect the learning and evaluation of SFNNs. We used face images and SFNN2 for these experiments. First, we wanted to know how large M in Eq. 1 must be to give a reasonable estimate of the log-probabilities. Fig. 5(a) shows the estimates of the log-probability as a function of the number of samples. We can see that having about 500 samples is reasonable, but more samples provide a slightly better estimate. The general shape of the plot is similar for all other datasets and SFNN models. When M is small, we typically underestimate the true log-probabilities. While 500 or more samples are needed for accurate model evaluation, only 20 or 30 samples are sufficient for learning good models (as shown in Fig. 5(b)). This is because while M = 20 gives a suboptimal approximation to the true posterior, learning still improves the variational lower bound. In fact, the difference between using 30 and 200 samples during learning amounts to only about 20 nats in the final average test log-probability. In Fig. 5(c), we varied the number of binary stochastic hidden variables in the 2 inner hybrid layers. We did not observe significant improvements beyond 32 nodes. With more hidden nodes, overfitting can also be a problem.\n\n3.2.1 Expression Classification\nThe internal hidden representations learned by SFNNs are also useful for classification of facial expressions. For each {x, y} image pair, there are 7 possible expression types: neutral, angry, happy, sad, surprised, fear, and disgust. 
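Using the learned hidden representations for classification amounts to concatenating inferred hidden activations with the standardized pixels and training a regularized linear softmax classifier. A minimal numpy sketch, with synthetic stand-in "pixels" and "hidden features" (all shapes and data here are hypothetical, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, labels, n_classes=7, lr=0.1, l2=1e-3, steps=500):
    """L2-regularized linear softmax classifier trained by gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[labels]          # one-hot targets
    for _ in range(steps):
        P = softmax(X @ W)
        W -= lr * (X.T @ (P - Y) / n + l2 * W)
    return W

# stand-in data: standardized "pixels" plus class-informative "hidden features"
n, d_pix = 210, 20
labels = rng.integers(0, 7, size=n)
pixels = rng.standard_normal((n, d_pix))
hidden = np.eye(7)[labels] + 0.5 * rng.standard_normal((n, 7))
X = np.hstack([pixels, hidden])            # append hidden features to the pixels
W = train_softmax(X, labels)
acc = (softmax(X @ W).argmax(axis=1) == labels).mean()
```

Appending an informative feature block raises the accuracy of the same linear classifier, which is the effect measured in the first row of Table 3.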
As baselines, we used regularized linear softmax classifiers and a multilayer perceptron classifier taking pixels as input. The mean of every pixel across all cases was set to 0 and its standard deviation was set to 1.0. We then append the learned hidden features of SFNNs and C-GRBMs to the image pixels and re-train the same classifiers. The results are shown in the first row of Table 3. Adding hidden features from the SFNN trained in an unsupervised manner (without expression labels) improves accuracy for both linear and nonlinear classifiers.\n\nLinear C-GRBM+Linear SFNN+Linear MLP SFNN+MLP\nclean 80.0% 81.4% 82.4% 83.2% 83.8%\n10% noise 78.9% 79.7% 80.8% 82.0% 81.7%\n50% noise 72.4% 74.3% 71.8% 79.1% 78.5%\n75% noise 52.6% 58.1% 59.8% 71.9% 73.1%\n10% occl. 76.2% 79.5% 80.1% 80.3% 81.5%\n50% occl. 54.1% 59.9% 62.5% 58.5% 63.4%\n75% occl. 28.2% 33.9% 37.5% 33.2% 39.2%\nTable 3: Recognition accuracy over 5 folds. Bold numbers indicate that the difference in accuracy is statistically significant compared to the competitor models, for both linear and nonlinear classifiers.\n\nFigure 6: Left: noisy test images y ((a) random noise, (b) block occlusion). Posterior inference in SFNN finds Ep(h|x,y)[h]. Right: generated y images from the expected hidden activations.\n\nFigure 7: Samples generated from a SFNN after training on object ((a) Generated Objects) and horse ((b) Generated Horses) databases. Conditioned on a given foreground mask, the appearance is multimodal (different color and texture). Best viewed in color.\n\nSFNNs are also useful when dealing with noise. As a generative model of y, a SFNN is somewhat robust to noisy and occluded pixels. For example, the left panels of Fig. 6 show corrupted test images y. Using the importance sampler described in Sec. 2.1, we can compute the expected values of the binary stochastic hidden variables given the corrupted test y images5. In the right panels of Fig. 
6, we show the corresponding generated y from the inferred average hidden states. After this denoising process, we can feed the denoised y and E[h] to the classifiers. This compares favorably to simply filling in the missing pixels with the average value of that pixel over the training set. Classification accuracies under noise are also presented in Table 3. For example, 10% noise means that 10 percent of the pixels of both x and y are corrupted, selected at random; 50% occlusion means that a square block covering 50% of the original image area is randomly positioned in both x and y. Gains in recognition performance from using SFNNs are particularly pronounced when dealing with large amounts of random noise and occlusion.

3.3 Additional Qualitative Experiments

Not only are SFNNs capable of modeling facial expressions of aligned face images, they can also model more complex real-valued conditional distributions. Here, we present qualitative samples drawn from SFNNs trained on more complicated distributions (an additional example on rotated faces is presented in the Supplementary Materials).
We trained SFNNs to generate colorful images of common objects from the Amsterdam Library of Objects database [20], conditioned on the foreground masks. This is a database of 1000 everyday objects under various lightings, rotations, and viewpoints. Every object also comes with a foreground segmentation mask. For every object, we selected the image under frontal lighting without any rotation, and trained a SFNN conditioned on the foreground mask. Our goal is to model the appearance (color and texture) of these objects. Among the 1000 objects, many have similar foreground masks (e.g. round or rectangular). Conditioned on the test foreground masks, Fig. 7(a) shows random samples from the learned SFNN model.
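The noise and occlusion protocols reported in Table 3 can be sketched as follows. This is a minimal NumPy illustration; the fill values for corrupted pixels (uniform random values for noise, zeros for the occluding block) are assumptions, since the exact corruption values are not specified here:

```python
import numpy as np

def corrupt_noise(img, frac, rng):
    """Corrupt a random fraction `frac` of pixels.
    Fill value (uniform random in [0, 1)) is an assumption."""
    out = img.copy()
    n = out.size
    idx = rng.choice(n, size=int(frac * n), replace=False)
    out.flat[idx] = rng.random(idx.size)
    return out

def corrupt_occlusion(img, frac, rng):
    """Place a square block covering `frac` of the image area
    at a uniformly random position. Block fill value (0) is an assumption."""
    out = img.copy()
    h, w = out.shape
    side = int(round(np.sqrt(frac * h * w)))  # side of a square with frac of the area
    r = rng.integers(0, h - side + 1)
    c = rng.integers(0, w - side + 1)
    out[r:r + side, c:c + side] = 0.0
    return out
```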
We also tested on the Weizmann segmentation database [21] of horses, learning a conditional distribution of horse appearances conditioned on the segmentation mask. The results are shown in Fig. 7(b).

4 Discussion
In this paper we introduced a novel model with hybrid stochastic and deterministic hidden nodes. We also proposed an efficient learning algorithm that allows us to learn rich multi-modal conditional distributions, supported by quantitative and qualitative empirical results. The major drawback of SFNNs is that inference is nontrivial and M samples are needed for the importance sampler. While this is sufficiently fast for our experiments, inference could potentially be accelerated by learning a separate recognition network that performs inference in one feedforward pass. Such techniques have previously been used with success [22, 23].

5For this task we assume that we have knowledge of which pixels are corrupted.

References
[1] C. M. Bishop. Mixture density networks. Technical Report NCRG/94/004, Aston University, 1994.
[2] R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56:71–113, July 1992.
[3] R. M. Neal. Learning stochastic feedforward networks. Technical report, University of Toronto, 1990.
[4] Lawrence K. Saul, Tommi Jaakkola, and Michael I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.
[5] David Barber and Peter Sollich. Gaussian fields for approximate inference in layered sigmoid belief networks. In Sara A. Solla, Todd K. Leen, and Klaus-Robert Müller, editors, NIPS, pages 393–399. The MIT Press, 1999.
[6] G. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In NIPS, 2006.
[7] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[8] H. Rue and L. Held.
Gaussian Markov Random Fields: Theory and Applications, volume 104 of Monographs on Statistics and Applied Probability. Chapman & Hall, London, 2005.
[9] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289. Morgan Kaufmann, 2001.
[10] Volodymyr Mnih, Hugo Larochelle, and Geoffrey Hinton. Conditional restricted Boltzmann machines for structured output prediction. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2011.
[11] Yujia Li, Daniel Tarlow, and Richard Zemel. Exploring compositional high order pattern potentials for structured output learning. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2013.
[12] R. M. Neal and G. E. Hinton. A new view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. 1998.
[13] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
[14] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[15] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the Intl. Conf. on Machine Learning, volume 25, 2008.
[16] J. M. Susskind. The Toronto Face Database. Technical report, 2011. http://aclab.ca/users/josh/TFD.html.
[17] Zoubin Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996.
[18] Ian Nabney. NETLAB: Algorithms for Pattern Recognition. Advances in Pattern Recognition. Springer-Verlag, 2002.
[19] V. Nair and G. E. Hinton. 3-D object recognition with deep belief nets. In NIPS 22, 2009.
[20] J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders.
The Amsterdam Library of Object Images. International Journal of Computer Vision, 61(1), January 2005.
[21] Eran Borenstein and Shimon Ullman. Class-specific, top-down segmentation. In ECCV, pages 109–124, 2002.
[22] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
[23] R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. In AISTATS, 2010.