{"title": "Learning Structured Output Representation using Deep Conditional Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3483, "page_last": 3491, "abstract": "Supervised deep learning has been successfully applied for many recognition problems in machine learning and computer vision. Although it can approximate a complex many-to-one function very well when large number of training data is provided, the lack of probabilistic inference of the current supervised deep learning methods makes it difficult to model a complex structured output representations. In this work, we develop a scalable deep conditional generative model for structured output variables using Gaussian latent variables. The model is trained efficiently in the framework of stochastic gradient variational Bayes, and allows a fast prediction using stochastic feed-forward inference. In addition, we provide novel strategies to build a robust structured prediction algorithms, such as recurrent prediction network architecture, input noise-injection and multi-scale prediction training methods. In experiments, we demonstrate the effectiveness of our proposed algorithm in comparison to the deterministic deep neural network counterparts in generating diverse but realistic output representations using stochastic inference. 
Furthermore, the proposed schemes in training methods and architecture design were complementary, leading to strong pixel-level object segmentation and semantic labeling performance on Caltech-UCSD Birds 200 and the subset of the Labeled Faces in the Wild dataset.", "full_text": "Learning Structured Output Representation\nusing Deep Conditional Generative Models\n\nKihyuk Sohn\u2217\u2020\n\nXinchen Yan\u2020\n\nHonglak Lee\u2020\n\n\u2217 NEC Laboratories America, Inc.\n\u2020 University of Michigan, Ann Arbor\n\nksohn@nec-labs.com, {xcyan,honglak}@umich.edu\n\nAbstract\n\nSupervised deep learning has been successfully applied to many recognition problems. Although it can approximate a complex many-to-one function well when a large amount of training data is provided, it is still challenging to model complex structured output representations that effectively perform probabilistic inference and make diverse predictions. In this work, we develop a deep conditional generative model for structured output prediction using Gaussian latent variables. The model is trained efficiently in the framework of stochastic gradient variational Bayes, and allows for fast prediction using stochastic feed-forward inference. In addition, we provide novel strategies to build robust structured prediction algorithms, such as input noise-injection and a multi-scale prediction objective at training. In experiments, we demonstrate the effectiveness of our proposed algorithm in comparison to the deterministic deep neural network counterparts in generating diverse but realistic structured output predictions using stochastic inference. 
Furthermore, the proposed training methods are complementary, leading to strong pixel-level object segmentation and semantic labeling performance on Caltech-UCSD Birds 200 and the subset of the Labeled Faces in the Wild dataset.\n\n1 Introduction\n\nIn structured output prediction, it is important to learn a model that can perform probabilistic inference and make diverse predictions. This is because we are not simply modeling a many-to-one function as in classification tasks, but we may need to model a mapping from a single input to many possible outputs. Recently, convolutional neural networks (CNNs) have been highly successful for large-scale image classification tasks [17, 30, 27] and have also demonstrated promising results for structured prediction tasks (e.g., [4, 23, 22]). However, CNNs are not suitable for modeling a distribution with multiple modes [32].\nTo address this problem, we propose novel deep conditional generative models (CGMs) for output representation learning and structured prediction. In other words, we model the distribution of the high-dimensional output space as a generative model conditioned on the input observation. Building upon recent developments in variational inference and learning of directed graphical models [16, 24, 15], we propose a conditional variational auto-encoder (CVAE). The CVAE is a conditional directed graphical model whose input observations modulate the prior on Gaussian latent variables that generate the outputs. It is trained to maximize the conditional log-likelihood, and we formulate the variational learning objective of the CVAE in the framework of stochastic gradient variational Bayes (SGVB) [16]. 
In addition, we introduce several strategies, such as input noise-injection and multi-scale prediction training methods, to build a more robust prediction model. In experiments, we demonstrate the effectiveness of our proposed algorithm in comparison to the deterministic neural network counterparts in generating diverse but realistic output predictions using stochastic inference. We demonstrate the importance of stochastic neurons in modeling the structured output when the input data is partially provided. Furthermore, we show that the proposed training schemes are complementary, leading to strong pixel-level object segmentation and labeling performance on Caltech-UCSD Birds 200 and the subset of the Labeled Faces in the Wild dataset.\n\nIn summary, the contributions of this paper are as follows:\n\n\u2022 We propose the CVAE and its variants that are trainable efficiently in the SGVB framework, and introduce novel strategies to enhance robustness of the models for structured prediction.\n\u2022 We demonstrate the effectiveness of our proposed algorithm with Gaussian stochastic neurons in modeling the multi-modal distribution of structured output variables.\n\u2022 We achieve strong semantic object segmentation performance on the CUB and LFW datasets.\n\nThe paper is organized as follows. We first review related work in Section 2. We provide preliminaries in Section 3 and develop our deep conditional generative model in Section 4. In Section 5, we evaluate our proposed models and report experimental results. Section 6 concludes the paper.\n\n2 Related work\n\nSince the recent success of supervised deep learning on large-scale visual recognition [17, 30, 27], there have been many approaches to tackle mid-level computer vision tasks, such as object detection [6, 26, 31, 9] and semantic segmentation [4, 3, 23, 22], using supervised deep learning techniques. 
Our work falls into this category of research in developing advanced algorithms for structured output prediction, but we incorporate stochastic neurons to model the conditional distributions of complex output representations, which possibly have multiple modes. In this sense, our work shares a similar motivation to the recent work on image segmentation tasks using hybrid models of CRFs and Boltzmann machines [13, 21, 37]. Compared to these, our proposed model is an end-to-end system for segmentation using a convolutional architecture and achieves significantly improved performance on challenging benchmark tasks.\nAlong with the recent breakthroughs in supervised deep learning methods, there has been progress in deep generative models, such as deep belief networks [10, 20] and deep Boltzmann machines [25]. Recently, advances in inference and learning algorithms for various deep generative models have significantly enhanced this line of research [2, 7, 8, 18]. In particular, the variational learning framework of deep directed graphical models with Gaussian latent variables (e.g., the variational auto-encoder [16, 15] and deep latent Gaussian models [24]) has recently been developed. Using the variational lower bound of the log-likelihood as the training objective and the reparameterization trick, these models can be easily trained via stochastic optimization. Our model builds upon this framework, but we focus on modeling the conditional distribution of output variables for structured prediction problems. Here, the main goal is not only to model the complex output representation but also to make a discriminative prediction. In addition, our model can effectively handle large-sized images by exploiting the convolutional architecture.\nThe stochastic feed-forward neural network (SFNN) [32] is a conditional directed graphical model with a combination of real-valued deterministic neurons and binary stochastic neurons. 
The SFNN is trained using the Monte Carlo variant of generalized EM by drawing multiple samples from the feed-forward proposal distribution and weighting them with importance weights. Although our proposed Gaussian stochastic neural network (which will be described in Section 4.2) looks similar on the surface, there are practical advantages in optimization to using Gaussian latent variables rather than binary stochastic neurons. In addition, thanks to the recognition model used in our framework, it is sufficient to draw only a few samples during training, which is critical in training very deep convolutional networks.\n\n3 Preliminary: Variational Auto-encoder\n\nThe variational auto-encoder (VAE) [16, 24] is a directed graphical model with certain types of latent variables, such as Gaussian latent variables. A generative process of the VAE is as follows: a set of latent variables z is generated from the prior distribution p\u03b8(z) and the data x is generated by the generative distribution p\u03b8(x|z) conditioned on z: z \u223c p\u03b8(z), x \u223c p\u03b8(x|z).\nIn general, parameter estimation of directed graphical models is often challenging due to intractable posterior inference. However, the parameters of the VAE can be estimated efficiently in the stochastic gradient variational Bayes (SGVB) [16] framework, where the variational lower bound of the log-likelihood is used as a surrogate objective function. The variational lower bound is written as:\n\nlog p\u03b8(x) = KL(q\u03c6(z|x) || p\u03b8(z|x)) + E_{q\u03c6(z|x)}[\u2212log q\u03c6(z|x) + log p\u03b8(x, z)]   (1)\n\u2265 \u2212KL(q\u03c6(z|x) || p\u03b8(z)) + E_{q\u03c6(z|x)}[log p\u03b8(x|z)]   (2)\n\nIn this framework, a proposal distribution q\u03c6(z|x), which is also known as a \u201crecognition\u201d model, is introduced to approximate the true posterior p\u03b8(z|x). 
Multilayer perceptrons (MLPs) are used to model the recognition and generation models. Assuming Gaussian latent variables, the first term of Equation (2) can be marginalized, while the second term cannot. Instead, the second term can be approximated by drawing samples z(l) (l = 1, ..., L) from the recognition distribution q\u03c6(z|x), and the empirical objective of the VAE with Gaussian latent variables is written as follows:\n\n\u02dcLVAE(x; \u03b8, \u03c6) = \u2212KL(q\u03c6(z|x) || p\u03b8(z)) + (1/L) \u03a3_{l=1}^{L} log p\u03b8(x|z(l)),   (3)\n\nwhere z(l) = g\u03c6(x, \u03b5(l)), \u03b5(l) \u223c N(0, I). Note that the recognition distribution q\u03c6(z|x) is reparameterized with a deterministic, differentiable function g\u03c6(\u00b7,\u00b7), whose arguments are the data x and the noise variable \u03b5. This trick allows error backpropagation through the Gaussian latent variables, which is essential in VAE training as the model is composed of multiple MLPs for the recognition and generation models. As a result, the VAE can be trained efficiently using stochastic gradient descent (SGD).\n\n4 Deep Conditional Generative Models for Structured Output Prediction\n\nAs illustrated in Figure 1, there are three types of variables in a deep conditional generative model (CGM): input variables x, output variables y, and latent variables z. The conditional generative process of the model is given in Figure 1(b) as follows: for a given observation x, z is drawn from the prior distribution p\u03b8(z|x), and the output y is generated from the distribution p\u03b8(y|x, z). Compared to the baseline CNN (Figure 1(a)), the latent variables z allow for modeling multiple modes in the conditional distribution of output variables y given input x, making the proposed CGM suitable for modeling a one-to-many mapping. 
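The reparameterized sampling and the marginalized KL term of Equation (3) are simple enough to sketch in a few lines. Below is a minimal illustration in plain Python (the function names and the scalar, per-dimension form are ours, not from the paper):

```python
import math
import random

def reparameterize(mu, log_var, eps=None):
    # z = g(mu, log_var, eps) = mu + sigma * eps with eps ~ N(0, I).
    # The randomness lives entirely in eps, so z is a deterministic,
    # differentiable function of the Gaussian parameters and errors
    # can backpropagate through the latent variables.
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)) for one latent
    # dimension, i.e., the marginalized first term of the VAE objective.
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)
```

For example, `reparameterize(1.0, 0.0, eps=0.0)` returns the mean 1.0, and `kl_to_standard_normal(0.0, 0.0)` returns 0, since the approximate posterior then coincides with the prior.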
The prior of the latent variables z is modulated by the input x in our formulation; however, the constraint can be easily relaxed to make the latent variables statistically independent of the input variables, i.e., p\u03b8(z|x) = p\u03b8(z) [15].\nDeep CGMs are trained to maximize the conditional log-likelihood. Often the objective function is intractable, and we apply the SGVB framework to train the model. The variational lower bound of the model is written as follows (the complete derivation can be found in the supplementary material):\n\nlog p\u03b8(y|x) \u2265 \u2212KL(q\u03c6(z|x, y) || p\u03b8(z|x)) + E_{q\u03c6(z|x,y)}[log p\u03b8(y|x, z)]   (4)\n\nand the empirical lower bound is written as:\n\n\u02dcLCVAE(x, y; \u03b8, \u03c6) = \u2212KL(q\u03c6(z|x, y) || p\u03b8(z|x)) + (1/L) \u03a3_{l=1}^{L} log p\u03b8(y|x, z(l)),   (5)\n\nwhere z(l) = g\u03c6(x, y, \u03b5(l)), \u03b5(l) \u223c N(0, I) and L is the number of samples. We call this model the conditional variational auto-encoder (CVAE).1 The CVAE is composed of multiple MLPs, such as the recognition network q\u03c6(z|x, y), the (conditional) prior network p\u03b8(z|x), and the generation network p\u03b8(y|x, z). In designing the network architecture, we build the network components of the CVAE on top of the baseline CNN. Specifically, as shown in Figure 1(d), not only the direct input x, but also the initial guess \u02c6y made by the CNN are fed into the prior network. Such a recurrent connection has been applied for structured output prediction problems [23, 13, 28] to sequentially update the prediction by revising the previous guess while effectively deepening the convolutional network. 
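When both the recognition distribution and the conditional prior are diagonal Gaussians, the KL term shared by Equations (4) and (5) also has a closed form. A minimal sketch in plain Python (the function name and the list-based parameterization are ours):

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # KL(q || p) between two diagonal Gaussians, summed over latent
    # dimensions; q plays the role of the recognition distribution
    # q(z|x,y) and p plays the role of the conditional prior p(z|x).
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq
                     + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                     - 1.0)
    return kl
```

Note that the term vanishes when the two distributions coincide, which is exactly the setting of the GSNN described in Section 4.2.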
We also found that a recurrent connection, even with one iteration, showed a significant performance improvement. Details about the network architectures can be found in the supplementary material.\n\n4.1 Output inference and estimation of the conditional likelihood\n\nOnce the model parameters are learned, we can make a prediction of an output y from an input x by following the generative process of the CGM. To evaluate the model on structured output prediction tasks (i.e., at test time), we can measure prediction accuracy by performing deterministic inference without sampling z, i.e., y\u2217 = arg max_y p\u03b8(y|x, z\u2217), z\u2217 = E[z|x].2\n\n1Although the model is not trained to reconstruct the input x, our model can be viewed as a type of VAE that performs auto-encoding of the output variables y conditioned on the input x at training time.\n2Alternatively, we can draw multiple z\u2019s from the prior distribution and use the average of the posteriors to make a prediction, i.e., y\u2217 = arg max_y (1/L) \u03a3_{l=1}^{L} p\u03b8(y|x, z(l)), z(l) \u223c p\u03b8(z|x).\n\nFigure 1: Illustration of the conditional graphical models (CGMs). (a) the predictive process of output Y for the baseline CNN; (b) the generative process of CGMs; (c) an approximate inference of Z (also known as the recognition process [16]); (d) the generative process with a recurrent connection.\n\nAnother way to evaluate the CGMs is to compare the conditional likelihoods of the test data. A straightforward approach is to draw samples z\u2019s using the prior network and take the average of the likelihoods. 
We call this method Monte Carlo (MC) sampling:\n\np\u03b8(y|x) \u2248 (1/S) \u03a3_{s=1}^{S} p\u03b8(y|x, z(s)), z(s) \u223c p\u03b8(z|x)   (6)\n\nIt usually requires a large number of samples for the Monte Carlo log-likelihood estimation to be accurate. Alternatively, we use importance sampling to estimate the conditional likelihoods [24]:\n\np\u03b8(y|x) \u2248 (1/S) \u03a3_{s=1}^{S} p\u03b8(y|x, z(s)) p\u03b8(z(s)|x) / q\u03c6(z(s)|x, y), z(s) \u223c q\u03c6(z|x, y)   (7)\n\n4.2 Learning to predict structured output\n\nAlthough the SGVB learning framework has been shown to be effective in training deep generative models [16, 24], the conditional auto-encoding of output variables at training may not be optimal for making a prediction at testing in deep CGMs. In other words, the CVAE uses the recognition network q\u03c6(z|x, y) at training, but it uses the prior network p\u03b8(z|x) at testing to draw samples z\u2019s and make an output prediction. Since y is given as an input to the recognition network, the objective at training can be viewed as a reconstruction of y, which is an easier task than prediction. The negative KL divergence term in Equation (5) tries to close the gap between the two pipelines, and one could consider allocating more weight to the negative KL term of the objective function to mitigate the discrepancy in the encoding of latent variables at training and testing, i.e., \u2212(1 + \u03b2)KL(q\u03c6(z|x, y) || p\u03b8(z|x)) with \u03b2 \u2265 0. However, we found this approach ineffective in our experiments.\nInstead, we propose to train the networks in a way that the prediction pipelines at training and testing are consistent. 
This can be done by setting the recognition network to be the same as the prior network, i.e., q\u03c6(z|x, y) = p\u03b8(z|x), and we get the following objective function:\n\n\u02dcLGSNN(x, y; \u03b8, \u03c6) = (1/L) \u03a3_{l=1}^{L} log p\u03b8(y|x, z(l)), where z(l) = g\u03b8(x, \u03b5(l)), \u03b5(l) \u223c N(0, I)   (8)\n\nWe call this model the Gaussian stochastic neural network (GSNN).3 Note that the GSNN can be derived from the CVAE by setting the recognition network and the prior network equal. Therefore, the learning tricks of the CVAE, such as the reparameterization trick, can be used to train the GSNN. Similarly, the inference (at testing) and the conditional likelihood estimation are the same as those of the CVAE. Finally, we combine the objective functions of the two models to obtain a hybrid objective:\n\n\u02dcLhybrid = \u03b1\u02dcLCVAE + (1 \u2212 \u03b1)\u02dcLGSNN,   (9)\n\nwhere \u03b1 balances the two objectives. Note that when \u03b1 = 1, we recover the CVAE objective; when \u03b1 = 0, the trained model will simply be a GSNN without the recognition network.\n\n4.3 CVAE for image segmentation and labeling\n\nSemantic segmentation [5, 23, 6] is an important structured output prediction task. In this section, we provide strategies to train a robust prediction model for semantic segmentation problems. Specifically, to learn a high-capacity neural network that generalizes well to unseen data, we propose to train the network with 1) a multi-scale prediction objective and 2) structured input noise.\n\n3If we set the covariance matrix of the auxiliary Gaussian latent variables \u03b5 to 0, we have a deterministic counterpart of the GSNN, which we call the Gaussian deterministic neural network (GDNN).\n\n4.3.1 Training with multi-scale prediction objective\n\nAs the image size gets larger (e.g., 128 \u00d7 128), it 
becomes more challenging to make a fine-grained pixel-level prediction (e.g., image reconstruction, semantic label prediction). Multi-scale approaches have been used in the sense of forming a multi-scale image pyramid for an input [5], but not much for multi-scale output prediction. Here, we propose to train the network to predict outputs at different scales. By doing so, we can make a global-to-local, coarse-to-fine-grained prediction of pixel-level semantic labels. Figure 2 describes the multi-scale prediction at 3 different scales (1/4, 1/2, and original) used for training.\n\nFigure 2: Multi-scale prediction.\n\n4.3.2 Training with input omission noise\n\nAdding noise to neurons is a widely used technique to regularize deep neural networks during training [17, 29]. Similarly, we propose a simple regularization technique for semantic segmentation: corrupt the input data x into \u02dcx according to a noise process and optimize the network with the following objective: \u02dcL(\u02dcx, y). The noise process could be arbitrary, but for semantic image segmentation, we consider random block omission noise. Specifically, we randomly generate a square mask of width and height less than 40% of the image width and height, respectively, at a random position, and set the pixel values of the input image inside the mask to 0. This can be viewed as providing a more challenging output prediction task during training that simulates block occlusion or missing input. The proposed training strategy is also related to denoising training methods [34], but in our case, we inject noise into the input data only and do not reconstruct the missing input.\n\n5 Experiments\n\nWe demonstrate the effectiveness of our approach in modeling the distribution of the structured output variables. As a proof of concept, we create an artificial experimental setting for structured output prediction using the MNIST database [19]. 
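The block omission noise of Section 4.3.2 is easy to implement. Below is a minimal sketch in plain Python on a list-of-rows image (the exact way the mask size is sampled is our assumption; the text only specifies a mask of width and height below 40% of the image size placed at a random position):

```python
import random

def block_omission_noise(image, max_frac=0.4, rng=None):
    # Corrupt a copy of `image` (a list of rows) with random block
    # omission noise: a rectangular mask of width/height up to
    # max_frac of the image size is zeroed out at a random position.
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    mh = rng.randint(1, max(1, int(max_frac * h)))
    mw = rng.randint(1, max(1, int(max_frac * w)))
    top = rng.randint(0, h - mh)
    left = rng.randint(0, w - mw)
    out = [row[:] for row in image]
    for r in range(top, top + mh):
        for c in range(left, left + mw):
            out[r][c] = 0  # omitted pixel
    return out
```

Training then optimizes the usual objective on the corrupted input and the clean target, simulating block occlusion or missing input.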
Then, we evaluate the proposed CVAE models on several benchmark datasets for visual object segmentation and labeling, such as Caltech-UCSD Birds (CUB) [36] and Labeled Faces in the Wild (LFW) [12]. Our implementation is based on MatConvNet [33], a MATLAB toolbox for convolutional neural networks, and Adam [14] as the adaptive learning rate scheduling algorithm for SGD optimization.\n\n5.1 Toy example: MNIST\n\nTo highlight the importance of probabilistic inference through stochastic neurons for structured output variables, we perform an experiment using the MNIST database. Specifically, we divide each digit image into four quadrants, and take one, two, or three quadrant(s) as an input and the remaining quadrants as an output.4 As we increase the number of quadrants for an output, the input-to-output mapping becomes more diverse (in terms of one-to-many mapping).\nWe trained the proposed models (CVAE, GSNN) and the baseline deep neural network and compared their performance. The same network architecture, an MLP with two layers of 1,000 ReLUs for the recognition, conditional prior, and generation networks, followed by 200 Gaussian latent variables, was used for all the models in the various experimental settings. Early stopping was used during training based on the estimated conditional likelihoods on the validation set.\n\nnegative CLL                  1 quadrant (val / test)   2 quadrants (val / test)   3 quadrants (val / test)\nNN (baseline)                 100.03 / 99.75            62.14 / 62.18              26.01 / 25.99\nGSNN (Monte Carlo)            100.03 / 99.82            62.48 / 62.41              26.20 / 26.29\nCVAE (Monte Carlo)            68.62 / 68.39             45.57 / 45.34              20.97 / 20.96\nCVAE (Importance Sampling)    64.05 / 63.91             44.96 / 44.73              20.97 / 20.95\nPerformance gap               35.98 / 35.91             17.51 / 17.68              5.23 / 5.33\n- per pixel                   0.061 / 0.061             0.045 / 0.045              0.027 / 0.027\n\nTable 1: The negative CLL on the MNIST database. We increase the number of quadrants for an input from 1 to 3. 
The performance gap between CVAE (importance sampling) and NN is reported.\n\n4A similar experimental setting has been used in the multimodal learning framework, where the left and right halves of the digit images are used as two data modalities [1, 28].\n\nFigure 3: Visualization of generated samples with (left) 1 quadrant and (right) 2 quadrants for an input. We show in each row the input and the ground truth output overlaid with gray color (first), samples generated by the baseline NNs (second), and samples drawn from the CVAEs (rest).\n\nFor qualitative analysis, we visualize the generated output samples in Figure 3. As we can see, the baseline NNs can only make a single deterministic prediction, and as a result the output looks blurry and does not look realistic in many cases. In contrast, the samples generated by the CVAE models are more realistic and diverse in shape; sometimes they can even change their identity (digit labels), such as from 3 to 5 or from 4 to 9, and vice versa.\nWe also provide quantitative evidence by estimating the conditional log-likelihoods (CLLs) in Table 1. The CLLs of the proposed models are estimated in two ways as described in Section 4.1. For the MC estimation, we draw 10,000 samples per example to get an accurate estimate. For importance sampling, however, 100 samples per example were enough to obtain an accurate estimate of the CLL. We observed that the estimated CLLs of the CVAE significantly outperform those of the baseline NN. 
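The importance-sampling estimator of Equation (7) used above can be sketched in a few lines. Below is a minimal version in plain Python that works in log space with the log-sum-exp trick for numerical stability; all the callables (log-densities and the sampler) are hypothetical stand-ins, not the paper's networks:

```python
import math

def importance_sampled_cll(log_p_y_given_xz, log_p_z_given_x,
                           log_q_z_given_xy, sample_z, num_samples=100):
    # Estimate log p(y|x) by drawing z ~ q(z|x,y) and averaging the
    # importance weights p(y|x,z) p(z|x) / q(z|x,y) in log space.
    log_w = []
    for _ in range(num_samples):
        z = sample_z()
        log_w.append(log_p_y_given_xz(z)
                     + log_p_z_given_x(z)
                     - log_q_z_given_xy(z))
    m = max(log_w)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(lw - m) for lw in log_w) / num_samples)
```

As a sanity check, when the proposal equals the prior and the likelihood is a constant c, the estimator returns log c exactly.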
Moreover, as measured by the per-pixel performance gap, the performance improvement becomes more significant as we use a smaller number of quadrants for an input, which is expected as the input-output mapping becomes more diverse.\n\n5.2 Visual Object Segmentation and Labeling\n\nThe Caltech-UCSD Birds (CUB) database [36] includes 6,033 images of birds from 200 species with annotations such as bounding boxes of birds and segmentation masks. Later, Yang et al. [37] annotated these images with more fine-grained segmentation masks by cropping the bird patches using the bounding boxes and resizing them to 128 \u00d7 128 pixels. The training/test split proposed in [36] was used in our experiment, and for validation purposes, we partitioned the training set into 10 folds and cross-validated with the mean intersection over union (IoU) score over the folds. The final prediction on the test set was made by averaging the posterior from an ensemble of 10 networks, each trained on one of the 10 folds separately. We increased the number of training examples via \u201cdata augmentation\u201d by horizontally flipping the input and output images.\nWe extensively evaluate the variations of our proposed methods, such as the CVAE, GSNN, and the hybrid model, and provide summary results on the segmentation mask prediction task in Table 2. Specifically, we report the performance of the models with different network architectures and training methods (e.g., multi-scale prediction or noise-injection training).\nFirst, we note that the baseline CNN already beats the previous state of the art obtained by the max-margin Boltzmann machine (MMBM; pixel accuracy: 90.42, IoU: 75.92 with GraphCut for post-processing) [37] even without post-processing. 
On top of that, we observed a significant performance improvement with our proposed deep CGMs.5 In terms of prediction accuracy, the GSNN performed the best among our proposed models, and performed even better when trained with the hybrid objective function. In addition, the noise-injection training (Section 4.3) further improves the performance. Compared to the baseline CNN, the proposed deep CGMs significantly reduce the prediction error, e.g., a 21% reduction in test pixel-level error at the expense of 60% more time for inference.6 Finally, the performances of our two winning entries (GSNN and hybrid) on the validation sets are both significantly better than that of their deterministic counterpart (GDNN) with p-values less than 0.05, which suggests the benefit of stochastic latent variables.\n\n5As in the case of the baseline CNNs, we found that using the multi-scale prediction was consistently better than the single-scale counterpart for all our models. So, we used the multi-scale prediction by default.\n6Mean inference time per image: 2.32 ms for the CNN and 3.69 ms for the deep CGMs, measured using a GeForce GTX TITAN X card with MatConvNet; we provide more information in the supplementary material.\n\nModel (training)    CUB (val) pixel   CUB (val) IoU   CUB (test) pixel   CUB (test) IoU   LFW pixel (val)   LFW pixel (test)\nMMBM [37]           \u2013                 \u2013               90.42              75.92            \u2013                 \u2013\nGLOC [13]           \u2013                 \u2013               \u2013                  \u2013                \u2013                 90.70\nCNN (baseline)      91.17 \u00b10.09       79.64 \u00b10.24     92.30              81.90            92.09 \u00b10.13       91.90 \u00b10.08\nCNN (msc)           91.37 \u00b10.09       80.09 \u00b10.25     92.52              82.43            92.19 \u00b10.10       92.05 \u00b10.06\nGDNN (msc)          92.25 \u00b10.09       81.89 \u00b10.21     93.24              83.96            92.72 \u00b10.12       92.54 \u00b10.04\nGSNN (msc)          92.46 \u00b10.07       82.31 \u00b10.19     93.39              84.26            92.88 \u00b10.08       92.61 \u00b10.09\nCVAE (msc)          92.24 \u00b10.09       81.86 \u00b10.23     93.03              83.53            92.80 \u00b10.30       92.62 \u00b10.06\nhybrid (msc)        92.60 \u00b10.08       82.57 \u00b10.26     93.35              84.16            92.95 \u00b10.21       92.77 \u00b10.06\nGDNN (msc, NI)      92.92 \u00b10.07       83.20 \u00b10.19     93.78              85.07            93.59 \u00b10.12       93.25 \u00b10.06\nGSNN (msc, NI)      93.09 \u00b10.09       83.62 \u00b10.21     93.91              85.39            93.71 \u00b10.09       93.51 \u00b10.07\nCVAE (msc, NI)      92.72 \u00b10.08       82.90 \u00b10.22     93.48              84.47            93.29 \u00b10.17       93.22 \u00b10.08\nhybrid (msc, NI)    93.05 \u00b10.07       83.49 \u00b10.19     93.78              85.07            93.69 \u00b10.12       93.42 \u00b10.07\n\nTable 2: Mean and standard error of labeling accuracy on the CUB and LFW databases. The performance of the best or statistically similar (i.e., p-value \u2265 0.05 to the best performing model) models is bold-faced. \u201cmsc\u201d refers to multi-scale prediction training and \u201cNI\u201d refers to noise-injection training.\n\nModels             CUB (val)          CUB (test)         LFW (val)            LFW (test)\nCNN (baseline)     4269.43 \u00b1130.90    4329.94 \u00b191.71     6370.63 \u00b1790.53      6434.09 \u00b1756.57\nGDNN (msc, NI)     3386.19 \u00b144.11     3450.41 \u00b133.36     4710.46 \u00b1192.77      5170.26 \u00b1166.81\nGSNN (msc, NI)     3400.24 \u00b159.42     3461.87 \u00b125.57     4582.96 \u00b1225.62      4829.45 \u00b196.98\nCVAE (msc, NI)     801.48 \u00b14.34       801.31 \u00b11.86       1262.98 \u00b164.43       1267.58 \u00b157.92\nhybrid (msc, NI)   1019.93 \u00b18.46      1021.44 \u00b14.81      1836.98 \u00b1127.53      1867.47 \u00b1111.26\n\nTable 3: Mean and standard error of the negative CLL on the CUB and LFW databases. The performance of the best and statistically similar models is bold-faced.\n\nWe also evaluate the negative CLL and summarize the results in Table 3. 
As expected, the proposed CGMs significantly outperform the baseline CNN, while the CVAE showed the highest CLL.\nThe Labeled Faces in the Wild (LFW) database [12] has been widely used as a face recognition and verification benchmark. As mentioned in [11], face images that are segmented and labeled into semantically meaningful region labels (e.g., hair, skin, clothes) can greatly help the understanding of the image through visual attributes, which can be easily obtained from the face shape.\nFollowing the region labeling protocols [35, 13], we evaluate the performance of face parts labeling on the subset of the LFW database [35], which contains 1,046 images labeled into 4 semantic categories: hair, skin, clothes, and background. We resized the images to 128 \u00d7 128 and used the same network architecture as in the CUB experiment.\nWe provide summary results of pixel-level segmentation accuracy in Table 2 and the negative CLL in Table 3. We observe a similar trend as previously shown for the CUB database; the proposed deep CGMs outperform the baseline CNN in terms of segmentation accuracy as well as CLL. However, although the accuracies of the CGM variants are higher, the performance of the GDNN was not significantly behind those of the GSNN and hybrid models. This may be because the level of variation in the output space of the LFW database is less than that of the CUB database, as the face shapes are more similar and better aligned across examples. Finally, our methods significantly outperform other existing methods, which report 90.0% in [35] or 90.7% in [13], setting the state-of-the-art performance on the LFW segmentation benchmark.\n\n5.3 Object Segmentation with Partial Observations\n\nWe experimented on object segmentation under uncertainties (e.g., partial input and output observations) to highlight the importance of the recognition network in the CVAE and the stochastic neurons for missing value imputation. 
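One way to exploit a partial output observation, used in the experiments below, is to alternate between drawing the unobserved output from the generation network and re-inferring the latent variables from the recognition network. A minimal sketch of this alternating loop (the callables draw_z and draw_y_u are hypothetical stand-ins for q(z | x, y_o, y_u) and p(y_u | x, z)):

```python
def iterative_inference(y_obs, draw_z, draw_y_u, y_u_init, num_iters=20):
    # Alternate a recognition step z ~ q(z | x, y_o, y_u) with a
    # generation step y_u ~ p(y_u | x, z), starting from an initial
    # guess y_u_init for the unobserved output.
    y_u = y_u_init
    z = None
    for _ in range(num_iters):
        z = draw_z(y_obs, y_u)
        y_u = draw_y_u(z)
    return y_u, z
```

With deterministic toy stand-ins such as `draw_z = lambda yo, yu: 0.5 * (yo + yu)` and `draw_y_u = lambda z: z`, the imputed output converges toward the observed value, illustrating how the partial observation progressively refines the prediction.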
We randomly omit input pixels at different levels of omission noise (25%, 50%, 70%) and different block sizes (1, 4, 8); the task is to predict the output segmentation labels at the omitted pixel locations given the partial labels at the observed input pixels. This can also be viewed as a segmentation task with noisy or partial observations (e.g., occlusions). To make a prediction with the CVAE given a partial output observation (y_o), we perform iterative inference over the unobserved output (y_u) and the latent variables (z), in a similar fashion to [24], i.e.,

    y_u ∼ p_θ(y_u | x, z)  ⟷  z ∼ q_φ(z | x, y_o, y_u).    (10)

Figure 4: Visualization of the conditionally generated samples: (first row) input image with omission noise (noise level: 50%, block size: 8), (second row) ground-truth segmentation, (third) prediction by the GDNN, and (fourth to sixth) samples generated by the CVAE on CUB (left) and LFW (right).

noise   block      CUB (IoU)           LFW (pixel)
level   size     GDNN     CVAE       GDNN     CVAE
25%       1      89.37    98.52      96.93    99.22
          4      88.74    98.07      96.55    99.09
          8      90.72    96.78      97.14    98.73
50%       1      74.95    95.95      91.84    97.29
          4      70.48    94.25      90.87    97.08
          8      76.07    89.10      92.68    96.15
70%       1      62.11    89.44      85.27    89.71
          4      57.68    84.36      85.70    93.16
          8      63.59    76.87      87.83    92.06

Table 4: Segmentation results with omission noise on the CUB and LFW databases. We report the pixel-level accuracy on the first validation set.

We report the summary results in Table 4. The CVAE performs well even when the noise level is high (e.g., 50%), where the GDNN fails significantly. This is because the CVAE utilizes the partial segmentation information to iteratively refine the prediction of the remaining labels. We visualize the generated samples at a noise level of 50% in Figure 4. The prediction made by the GDNN is blurry, but the samples generated by the CVAE are sharper while maintaining reasonable shapes. This suggests that the CVAE can also be potentially useful for interactive segmentation (i.e., by iteratively incorporating partial output labels).

6 Conclusion

Modeling the multi-modal distribution of structured output variables is an important research question for achieving good performance on structured output prediction problems. In this work, we proposed stochastic neural networks for structured output prediction based on a conditional deep generative model with Gaussian latent variables. The proposed model is scalable and efficient in inference and learning. We demonstrated the importance of probabilistic inference when the distribution of the output space has multiple modes, and showed strong performance in terms of segmentation accuracy, estimation of the conditional log-likelihood, and visualization of generated samples.

Acknowledgments This work was supported in part by ONR grant N00014-13-1-0762 and NSF CAREER grant IIS-1453651. We thank NVIDIA for donating a Tesla K40 GPU.

References
[1] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In ICML, 2013.
[2] Y. Bengio, E. Thibodeau-Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. In ICML, 2014.
[3] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, 2012.
[4] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In ICML, 2012.
[5] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. T.
T.\n\nPAMI, 35(8):1915\u20131929, 2013.\n\n[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate\n\nobject detection and segmentation. T. PAMI, PP(99):1\u20131, 2015.\n\n[7] I. Goodfellow, M. Mirza, A. Courville, and Y. Bengio. Multi-prediction deep Boltzmann machines. In\n\nNIPS, 2013.\n\n8\n\nInputground-truthCNNCVAEInputground-truthCNNCVAE\f[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Ben-\n\ngio. Generative adversarial nets. In NIPS, 2014.\n\n[9] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual\n\nrecognition. In ECCV, 2014.\n\n[10] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation,\n\n18(7):1527\u20131554, 2006.\n\n[11] G. B. Huang, M. Narayana, and E. Learned-Miller. Towards unconstrained face recognition. In CVPR\n\nWorkshop on Perceptual Organization in Computer Vision, 2008.\n\n[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for\nstudying face recognition in unconstrained environments. Technical Report 07-49, University of Mas-\nsachusetts, Amherst, 2007.\n\n[13] A. Kae, K. Sohn, H. Lee, and E. Learned-Miller. Augmenting CRFs with Boltzmann machine shape\n\npriors for image labeling. In CVPR, 2013.\n\n[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n[15] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep genera-\n\ntive models. In NIPS, 2014.\n\n[16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2013.\n[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classi\ufb01cation with deep convolutional neural\n\nnetworks. In NIPS, 2012.\n\n[18] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. JMLR, 15:29\u201337, 2011.\n[19] Y. LeCun, L. Bottou, Y. 
Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[20] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM, 54(10):95–103, 2011.
[21] Y. Li, D. Tarlow, and R. Zemel. Exploring compositional high order pattern potentials for structured output learning. In CVPR, 2013.
[22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[23] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene parsing. In ICML, 2013.
[24] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[25] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
[26] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2013.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
[28] K. Sohn, W. Shang, and H. Lee. Improved multimodal deep learning with variation of information. In NIPS, 2014.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[31] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, 2013.
[32] Y. Tang and R. Salakhutdinov. Learning stochastic feedforward neural networks. In NIPS, 2013.
[33] A. Vedaldi and K. Lenc.
MatConvNet – convolutional neural networks for MATLAB. In ACMMM, 2015.
[34] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[35] N. Wang, H. Ai, and F. Tang. What are good parts for hair shape modeling? In CVPR, 2012.
[36] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[37] J. Yang, S. Sáfár, and M.-H. Yang. Max-margin Boltzmann machines for object segmentation. In CVPR, 2014.