{"title": "Toward Robustness against Label Noise in Training Deep Discriminative Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5596, "page_last": 5605, "abstract": "Collecting large training datasets, annotated with high-quality labels, is costly and time-consuming. This paper proposes a novel framework for training deep convolutional neural networks from noisy labeled datasets that can be obtained cheaply. The problem is formulated using an undirected graphical model that represents the relationship between noisy and clean labels, trained in a semi-supervised setting. In our formulation, the inference over latent clean labels is tractable and is regularized during training using auxiliary sources of information. The proposed model is applied to the image labeling problem and is shown to be effective in labeling unseen images as well as reducing label noise in training on CIFAR-10 and MS COCO datasets.", "full_text": "Toward Robustness against Label Noise in\n\nTraining Deep Discriminative Neural Networks\n\nArash Vahdat\n\nD-Wave Systems Inc.\nBurnaby, BC, Canada\n\navahdat@dwavesys.com\n\nAbstract\n\nCollecting large training datasets, annotated with high-quality labels, is costly\nand time-consuming. This paper proposes a novel framework for training deep\nconvolutional neural networks from noisy labeled datasets that can be obtained\ncheaply. The problem is formulated using an undirected graphical model that\nrepresents the relationship between noisy and clean labels, trained in a semi-\nsupervised setting. 
In our formulation, the inference over latent clean labels is\ntractable and is regularized during training using auxiliary sources of information.\nThe proposed model is applied to the image labeling problem and is shown to be\neffective in labeling unseen images as well as reducing label noise in training on\nCIFAR-10 and MS COCO datasets.\n\n1\n\nIntroduction\n\nThe availability of large annotated data collections such as ImageNet [1] is one of the key reasons\nwhy deep convolutional neural networks (CNNs) have been successful in the image classi\ufb01cation\nproblem. However, collecting training data with such high-quality annotation is very costly and time\nconsuming. In some applications, annotators are required to be trained before identifying classes in\ndata, and feedback from many annotators is aggregated to reduce labeling error. On the other hand,\nmany inexpensive approaches for collecting labeled data exist, such as data mining on social media\nwebsites, search engines, querying fewer annotators per instance, or the use of amateur annotators\ninstead of experts. However, all these low-cost approaches have one common side effect: label noise.\nThis paper tackles the problem of training deep CNNs for the image labeling task from datapoints\nwith noisy labels. Most previous work in this area has focused on modeling label noise for multiclass\nclassi\ufb01cation1 using a directed graphical model similar to Fig. 1.a. 
It is typically assumed that the clean labels are hidden during training, and they are marginalized by enumerating all possible classes. These techniques cannot be extended to the multilabel classification problem, where exponentially many configurations exist for labels, and the explaining-away phenomenon makes inference over latent clean labels difficult.
We propose a conditional random field (CRF) [2] model to represent the relationship between noisy and clean labels, and we show how modern deep CNNs can gain robustness against label noise using our proposed structure. We model the clean labels as latent variables during training, and we design our structure such that the latent variables can be inferred efficiently.
The main challenge in modeling clean labels as latent is the lack of semantics on latent variables. In other words, latent variables may not semantically correspond to the clean labels when the joint probability of clean and noisy labels is parameterized such that latent clean labels can take any configuration. To solve this problem, most previous work relies on either carefully initializing the conditionals [3], fine-tuning the model on the noisy set after pretraining on a clean set [4], or regularizing the transition parameters [5].

1 Each sample is assumed to belong to only one class.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: a) The general directed graphical model used for modeling noisy labels. x, ŷ, y represent a data instance, its clean label, and its noisy label, respectively. b) We represent the interactions between clean and noisy labels using an undirected graphical model with hidden binary random variables (h).
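In the directed models above, the class-conditional formulation marginalizes the latent clean label through a noise-transition matrix. A minimal numpy sketch (the 3-class setup, symmetric noise level, and function name are illustrative choices of ours, not from the paper):

```python
import numpy as np

def noisy_label_marginal(clean_probs, transition):
    """Class-conditional noise: p(y|x) = sum_yhat p(yhat|x) * p(y|yhat).

    clean_probs: (C,) classifier output p(yhat|x) over clean classes.
    transition:  (C, C) matrix with transition[i, j] = p(y=j | yhat=i).
    """
    return clean_probs @ transition

# Toy 3-class example with symmetric label noise (diagonal 0.8, off-diagonal 0.1).
T = np.full((3, 3), 0.1) + 0.7 * np.eye(3)
p_clean = np.array([0.8, 0.15, 0.05])
p_noisy = noisy_label_marginal(p_clean, T)  # a valid distribution over noisy classes
```

This enumeration over C classes is exactly what becomes intractable in the multilabel case, where the clean label ranges over exponentially many binary configurations.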
In contrast, we inject semantics into the latent variables by formulating the training problem as a semi-supervised learning problem, in which the model is trained using a large set of noisy training examples and a small set of clean training examples. To overcome the problem of inferring clean labels, we introduce a novel framework equipped with an auxiliary distribution that represents the relation between noisy and clean labels while relying on information sources other than the image content.
This paper makes the following contributions: i) A generic CRF model is proposed for training deep neural networks that is robust against label noise. The model can be applied to both multiclass and multilabel classification problems, and it can be understood as a robust loss layer, which can be plugged into any existing network. ii) We propose a novel objective function for training the deep structured model that benefits from sources of information representing the relation between clean and noisy labels. iii) We demonstrate that the model outperforms previous techniques.

2 Previous Work

Learning from Noisy Labels: Learning discriminative models from noisy-labeled data is an active area of research. A comprehensive overview of previous work in this area can be found in [6]. Previous research on modeling label noise can be grouped into two main groups: class-conditional and class-and-instance-conditional label noise models. In the former group, the label noise is assumed to be independent of the instance, and the transition probability from clean classes to the noisy classes is modeled. For example, class-conditional models for binary classification problems are considered in [7, 8], whereas multiclass counterparts are targeted in [9, 5]. In the class-and-instance-conditional group, label noise is explicitly conditioned on each instance. For example, Xiao et al.
[3] developed a model in which the noisy observed annotation is conditioned on binary random variables indicating whether an instance's label is mistaken. Reed et al. [10] fix noisy labels by "bootstrapping" on the labels predicted by a neural network. These techniques are all applied to either binary or multiclass classification problems in which marginalization over classes is possible. Among methods proposed for noise-robust training, Misra et al. [4] target the image multilabeling problem but model the label noise for each label independently. In contrast, our proposed CRF model represents the relation between all noisy and clean labels while the inference over latent clean labels is still tractable.
Many works have focused on semi-supervised learning using a small clean dataset combined with noisy labeled data, typically obtained from the web. Zhu et al. [11] used a pairwise similarity measure to propagate labels from a labeled dataset to an unlabeled one. Fergus et al. [12] proposed a graph-based label propagation, and Chen and Gupta [13] employed a weighted cross entropy loss. Recently, Veit et al. [14] proposed a multi-task network containing i) a regression model that maps noisy labels and image features to clean labels, and ii) an image classification model that labels the input. However, the model in this paper is trained using a principled objective function that regularizes the inference model using extra sources of information, without the requirement for oversampling clean instances.
Deep Structured Models: Conditional random fields (CRFs) [2] are discriminative undirected graphical models, originally proposed for modeling sequential and structured data. Recently, they have shown state-of-the-art results in segmentation [15, 16] when combined with deep neural networks [17, 18, 19].
The main challenge in training deep CNN-CRFs is how to do inference and back-propagate gradients of the loss function through the inference. Previous approaches have focused on mean-field approximation [16, 20], belief propagation [21, 22], unrolled inference [23, 24], and sampling [25]. The CNN-CRFs used in this work are extensions of hidden CRFs introduced in [26, 27].

3 Robust Discriminative Neural Network

Our goal in this paper is to train deep neural networks given a set of noisy labeled data and a small set of clean data. A datapoint (an image in our case) is represented by x, and its noisy annotation by a binary vector y = {y1, y2, ..., yN} ∈ Y^N, where yi ∈ {0, 1} indicates whether the ith label is present in the noisy annotation. We are interested in inferring a set of clean labels for each datapoint. The clean labels may be defined on a set different from the set of noisy labels. This is typically the case in the image annotation problem, where noisy labels obtained from user tags are defined over a large set of textual tags (e.g., "cat", "kitten", "kitty", "puppy", "pup", etc.), whereas clean labels are defined on a small set of representative labels (e.g., "cat", "dog", etc.). In this paper, the clean label is represented by a stochastic binary vector ŷ = {ŷ1, ŷ2, ..., ŷC} ∈ Y^C.
We use the CRF model shown in Fig. 1.b. In our formulation, both ŷ and y may conditionally depend on the image x. The link between ŷ and y captures the correlations between clean and noisy labels. These correlations help us infer latent clean labels when only the noisy labels are observed.
Since noisy labels are defined over a large set of overlapping (e.g., "cat" and "pet") or co-occurring (e.g., "road" and "car") entities, p(y|ŷ, x) may have a multimodal form. To keep the inference simple and still be able to model these correlations, we introduce a set of hidden binary variables represented by h ∈ H. In this case, the correlations between components of y are modeled through h. These hidden variables are not connected to ŷ in order to keep the CRF graph bipartite.
The CRF model shown in Fig. 1.b defines the joint probability distribution of y, ŷ, and h conditioned on x using a parameterized energy function E_θ : Y^N × Y^C × H × X → R. The energy function assigns a potential score E_θ(y, ŷ, h, x) to the configuration of (y, ŷ, h, x), and is parameterized by a parameter vector θ. This conditional probability distribution is defined using a Boltzmann distribution:

    p_θ(y, ŷ, h|x) = (1/Z_θ(x)) exp(−E_θ(y, ŷ, h, x))    (1)

where Z_θ(x) is the partition function defined by Z_θ(x) = Σ_{y∈Y^N} Σ_{ŷ∈Y^C} Σ_{h∈H} exp(−E_θ(y, ŷ, h, x)). The energy function in Fig. 1.b is defined by the quadratic function:

    E_θ(y, ŷ, h, x) = −a_φ(x)^T ŷ − b_φ(x)^T y − c^T h − ŷ^T W y − h^T W′ y    (2)

where the vectors a_φ(x), b_φ(x), c are the bias terms and the matrices W and W′ are the pairwise interactions.
In our formulation, the bias terms on the clean and noisy labels are functions of the input x and are defined using a deep CNN parameterized by φ. The deep neural network together with the introduced CRF forms our CNN-CRF model, parameterized by θ = {φ, c, W, W′}. Note that in order to regularize W and W′, these matrices are not a function of x.
The structure of this graph is designed such that the conditional distribution p_θ(ŷ, h|y, x) takes a simple factorial form that can be calculated analytically given θ using p_θ(ŷ, h|y, x) = Π_i p_θ(ŷ_i|y, x) · Π_j p_θ(h_j|y), where p_θ(ŷ_i = 1|y, x) = σ(a_φ(x)_(i) + W_(i,:) y) and p_θ(h_j = 1|y) = σ(c_(j) + W′_(j,:) y), in which σ(u) = 1/(1 + exp(−u)) is the logistic function, and a_φ(x)_(i) or W_(i,:) indicate the ith element and row in the corresponding vector or matrix, respectively.

3.1 Semi-Supervised Learning Approach

The main challenge here is how to train the parameters of the CNN-CRF model defined in Eq. 1. To tackle this problem, we define the training problem as a semi-supervised learning problem where clean labels are observed in a small subset of a larger training set annotated with noisy labels. In this case, one can form an objective function by combining the marginal data likelihood defined on both the fully labeled clean set and the noisy labeled set, and using the maximum likelihood method to learn the parameters of the model. Assume that D_N = {(x^(n), y^(n))} and D_C = {(x^(c), ŷ^(c), y^(c))} are two disjoint sets representing the noisy labeled and clean labeled training datasets, respectively.
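Because the graph is bipartite, the factorial conditional above involves only independent sigmoids over ŷ and h; a minimal numpy sketch (sizes, shapes, and variable names are our own illustrative choices):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def posterior_clean_and_hidden(a_x, c, W, W_prime, y):
    """Factorial conditional p(yhat, h | y, x) of the bipartite CRF.

    a_x: (C,) image-dependent bias a_phi(x); c: (H,) hidden-unit bias.
    W: (C, N) clean-noisy pairwise terms; W_prime: (H, N) hidden-noisy terms.
    y: (N,) binary noisy-label vector.
    Returns per-unit Bernoulli means; note p(h_j|y) does not depend on x.
    """
    p_clean = sigmoid(a_x + W @ y)        # p(yhat_i = 1 | y, x)
    p_hidden = sigmoid(c + W_prime @ y)   # p(h_j = 1 | y)
    return p_clean, p_hidden

# Hypothetical sizes: N=4 noisy labels, C=2 clean labels, H=3 hidden units.
rng = np.random.default_rng(0)
y = np.array([1, 0, 1, 0], dtype=float)
p_clean, p_hidden = posterior_clean_and_hidden(
    rng.normal(size=2), rng.normal(size=3),
    rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), y)
```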
In the maximum likelihood method, the parameters are trained by maximizing the marginal log likelihood:

    max_θ (1/|D_N|) Σ_n log p_θ(y^(n)|x^(n)) + (1/|D_C|) Σ_c log p_θ(y^(c), ŷ^(c)|x^(c))    (3)

where p_θ(y^(n)|x^(n)) = Σ_{ŷ,h} p_θ(y^(n), ŷ, h|x^(n)) and p_θ(y^(c), ŷ^(c)|x^(c)) = Σ_h p_θ(y^(c), ŷ^(c), h|x^(c)). Due to the marginalization of hidden variables inside the log terms, the objective function cannot be analytically optimized. A common approach to optimizing the log marginals is to use the stochastic maximum likelihood method, which is also known as persistent contrastive divergence (PCD) [28, 29, 25]. The stochastic maximum likelihood method, or equivalently PCD, can be fundamentally viewed as an Expectation-Maximization (EM) approach to training. The EM algorithm maximizes the variational lower bound that is formed by subtracting the Kullback–Leibler (KL) divergence between a variational approximating distribution q and the true conditional distribution from the log marginal probability. For example, consider the bound for the first term in the objective function:

    log p_θ(y|x) ≥ log p_θ(y|x) − KL[q(ŷ, h|y, x) || p_θ(ŷ, h|y, x)]    (4)
                = E_{q(ŷ,h|y,x)}[log p_θ(y, ŷ, h|x)] − E_{q(ŷ,h|y,x)}[log q(ŷ, h|y, x)] = U_θ(x, y).    (5)

If the incremental EM approach [30] is taken for training the parameters θ, the lower bound U_θ(x, y) is maximized over the noisy training set by iterating between two steps.
In the Expectation step (E step), θ is fixed and the lower bound is optimized with respect to the conditional distribution q(ŷ, h|y, x). Since this distribution is only present in the KL term in Eq. 4, the lower bound is maximized simply by setting q(ŷ, h|y, x) to the analytic p_θ(ŷ, h|y, x). In the Maximization step (M step), q is fixed, and the bound is maximized with respect to the model parameters θ, which occur only in the first expectation term in Eq. 5. This expectation can be written as E_{q(ŷ,h|y,x)}[−E_θ(y, ŷ, h, x)] − log Z_θ(x), which is maximized by updating θ in the direction of its gradient, computed using −E_{q(ŷ,h|x,y)}[∂E_θ(y, ŷ, h, x)/∂θ] + E_{p(y,ŷ,h|x)}[∂E_θ(y, ŷ, h, x)/∂θ]. Noting that q(ŷ, h|y, x) is set to p_θ(ŷ, h|y, x) in the E step, it becomes clear that the M step is equivalent to the parameter updates in PCD.

3.2 Semi-Supervised Learning Regularized by Auxiliary Distributions

The semi-supervised approach infers the latent variables using the conditional q(ŷ, h|y, x) = p_θ(ŷ, h|y, x). However, at the beginning of training, when the model's parameters are not trained yet, sampling from the conditional distribution p(ŷ, h|y, x) does not necessarily generate the clean labels accurately. The problem is more severe with the strong representation power of CNN-CRFs, as they can easily fit to poor conditional distributions that occur at the beginning of training.
That is why the impact of the noisy set on training must be reduced by oversampling clean instances [14, 3].
In contrast, there may exist auxiliary sources of information that can be used to extract the relationship between noisy and clean labels. For example, non-image-related sources may be formed from the semantic relatedness of labels [31]. We assume that, using such sources, we can form an auxiliary distribution p_aux(y, ŷ, h) representing the joint probability of noisy and clean labels and some hidden binary states. Here, we propose a framework to use this distribution to train parameters in the semi-supervised setting by guiding the variational distribution to infer the clean labels more accurately. To do so, we add a new regularization term to the lower bound that penalizes the variational distribution for being different from the conditional distribution resulting from the auxiliary distribution, as follows:

    log p_θ(y|x) ≥ U^aux_θ(x, y) = log p_θ(y|x) − KL[q(ŷ, h|y, x) || p_θ(ŷ, h|y, x)] − α KL[q(ŷ, h|y, x) || p_aux(ŷ, h|y)]

where α is a non-negative scalar hyper-parameter that controls the impact of the added KL term. Setting α = 0 recovers the original variational lower bound defined in Eq. 4, whereas α → ∞ forces the variational distribution q to ignore the p_θ(ŷ, h|y, x) term. A value between these two extremes makes the inference distribution intermediate between p_θ(ŷ, h|y, x) and p_aux(ŷ, h|y). Note that this new lower bound is actually looser than the original bound. This may be undesired if we were actually interested in predicting noisy labels. However, our goal is to predict clean labels, and the proposed framework benefits from the regularization that is imposed on the variational distribution.
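For a single Bernoulli variable, the q minimizing the two weighted KL terms is the normalized weighted geometric mean of the two conditionals (the closed form is derived in Sec. 3.3). A quick numerical check of this fact, with toy probabilities of our own choosing:

```python
import numpy as np

def weighted_kl(q, p, p_aux, alpha):
    """KL[q||p] + alpha * KL[q||p_aux] for Bernoulli distributions with means q, p, p_aux."""
    def kl(a, b):
        return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))
    return kl(q, p) + alpha * kl(q, p_aux)

def geometric_mean_q(p, p_aux, alpha):
    """Normalized weighted geometric mean: q proportional to (p * p_aux^alpha)^(1/(alpha+1))."""
    on = (p * p_aux ** alpha) ** (1.0 / (alpha + 1))
    off = ((1 - p) * (1 - p_aux) ** alpha) ** (1.0 / (alpha + 1))
    return on / (on + off)

p, p_aux, alpha = 0.9, 0.3, 2.0
q_star = geometric_mean_q(p, p_aux, alpha)
# Brute-force minimization over a fine grid should land on (almost) the same point.
grid = np.linspace(1e-4, 1 - 1e-4, 9999)
q_grid = grid[np.argmin(weighted_kl(grid, p, p_aux, alpha))]
```

At α = 0 the minimizer is p itself; as α grows it moves toward p_aux, matching the interpolation behavior described above.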
Similar ideas have been explored in the posterior regularization approach [32].
Similarly, we also define a new lower bound on the second log marginal in Eq. 3 by:

    log p_θ(y, ŷ|x) ≥ L^aux_θ(x, y, ŷ) = log p_θ(y, ŷ|x) − KL[q(h|y) || p_θ(h|y)] − α KL[q(h|y) || p_aux(h|y)].

Auxiliary Distribution: In this paper, the auxiliary joint distribution p_aux(y, ŷ, h) is modeled by an undirected graphical model in a special form of a restricted Boltzmann machine (RBM), and is trained on the clean training set. The structure of the RBM is similar to the CRF model shown in Fig. 1.b with the fundamental difference that the parameters of the model do not depend on x:

    p_aux(y, ŷ, h) = (1/Z_aux) exp(−E_aux(y, ŷ, h))    (6)

where the energy function is defined by the quadratic function:

    E_aux(y, ŷ, h) = −a_aux^T ŷ − b_aux^T y − c_aux^T h − ŷ^T W_aux y − h^T W′_aux y    (7)

and Z_aux is the partition function, defined similarly to the CRF's partition function. The number of hidden variables is set to 200, and the parameters of this generative model are trained using the PCD algorithm [28]; they are fixed while the CNN-CRF model is being trained.

3.3 Training Robust CNN-CRF

In training, we seek θ that maximizes the proposed lower bounds on the noisy and clean training sets:

    max_θ (1/|D_N|) Σ_n U^aux_θ(x^(n), y^(n)) + (1/|D_C|) Σ_c L^aux_θ(x^(c), y^(c), ŷ^(c)).    (8)

The optimization problem is solved in a two-step iterative procedure as follows:
E step: The objective function is optimized with respect to q(ŷ, h|y, x) for a fixed θ. For U^aux_θ(x, y), this is done by solving the following problem:

    min_q KL[q(ŷ, h|y, x) || p_θ(ŷ, h|y, x)] + α KL[q(ŷ, h|y, x) || p_aux(ŷ, h|y)].    (9)

The weighted average of KL terms above is minimized with respect to q when:

    q(ŷ, h|y, x) ∝ [p_θ(ŷ, h|y, x) · p^α_aux(ŷ, h|y)]^(1/(α+1)),    (10)

which is a weighted geometric mean of the true conditional distribution and the auxiliary distribution. Given the factorial structure of these distributions, q(ŷ, h|y, x) is also a factorial distribution:

    q(ŷ_i = 1|y, x) = σ( (1/(α+1)) (a_φ(x)_(i) + W_(i,:) y + α a_aux(i) + α W_aux(i,:) y) )
    q(h_j = 1|y) = σ( (1/(α+1)) (c_(j) + W′_(j,:) y + α c_aux(j) + α W′_aux(j,:) y) ).    (11)

Optimizing L^aux_θ(x, y, ŷ) w.r.t. q(h|y) gives a similar factorial result:

    q(h|y) ∝ [p_θ(h|y) · p^α_aux(h|y)]^(1/(α+1)).

M step: Holding q fixed, the objective function is optimized with respect to θ. This is achieved by updating θ in the direction of the gradient of E_{q(ŷ,h|x,y)}[log p_θ(y, ŷ, h|x)], which is:

    ∂U^aux_θ(x, y)/∂θ = ∂/∂θ E_{q(ŷ,h|x,y)}[log p_θ(y, ŷ, h|x)]
                      = −E_{q(ŷ,h|x,y)}[∂E_θ(y, ŷ, h, x)/∂θ] + E_{p(y,ŷ,h|x)}[∂E_θ(y, ŷ, h, x)/∂θ],    (12)

where the first expectation (the positive phase) is defined under the variational distribution q and the second expectation (the negative phase) is defined under the CRF model p(y, ŷ, h|x). With the factorial form of q, the first expectation is analytically tractable. The second expectation is estimated by PCD [28, 29, 25].
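The PCD negative phase exploits the bipartite structure of the CRF: one block Gibbs sweep samples (ŷ, h) given y, then y given (ŷ, h). A minimal numpy sketch (all sizes and parameter values below are random placeholders of ours, not trained weights):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

def gibbs_sweep(y, a_x, b_x, c, W, Wp):
    """One block Gibbs sweep on the bipartite CRF.

    First samples yhat and h given y, then samples y given (yhat, h) using
    p(y_k = 1 | yhat, h, x) = sigmoid(b_x[k] + (W.T @ yhat)[k] + (Wp.T @ h)[k]).
    """
    yhat = (rng.random(a_x.shape) < sigmoid(a_x + W @ y)).astype(float)
    h = (rng.random(c.shape) < sigmoid(c + Wp @ y)).astype(float)
    y_new = (rng.random(b_x.shape) < sigmoid(b_x + W.T @ yhat + Wp.T @ h)).astype(float)
    return y_new, yhat, h

# Toy sizes: N=5 noisy labels, C=3 clean labels, H=4 hidden units.
N, C, H = 5, 3, 4
y = rng.integers(0, 2, N).astype(float)
y, yhat, h = gibbs_sweep(y, rng.normal(size=C), rng.normal(size=N),
                         rng.normal(size=H), rng.normal(size=(C, N)),
                         rng.normal(size=(H, N)))
```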
This approach requires maintaining a set of particles for each training instance that are used for seeding the Markov chains at each iteration of training.
The gradient of the lower bound on the clean set is defined similarly:

    ∂L^aux_θ(x, y, ŷ)/∂θ = ∂/∂θ E_{q(h|y)}[log p_θ(y, ŷ, h|x)]
                         = −E_{q(h|y)}[∂E_θ(y, ŷ, h, x)/∂θ] + E_{p(y,ŷ,h|x)}[∂E_θ(y, ŷ, h, x)/∂θ]    (13)

with the minor difference that in the positive phase the clean label ŷ is given for each instance and the variational distribution is defined over only the hidden variables.
Scheduling α: Instead of setting α to a fixed value during training, it is set to a very large value at the beginning of training and is slowly decreased to smaller values. The rationale behind this is that at the beginning of training, when p_θ(ŷ, h|y, x) cannot predict the clean labels accurately, it is intuitive to rely more on the pretrained p_aux(ŷ, h|y) when inferring the latent variables. As training proceeds, we shift the variational distribution q more toward the true conditional distribution.
Algorithm 1 summarizes the learning procedure proposed for training our CNN-CRF. The training is done end-to-end for both CNN and CRF parameters together.
At test time, samples generated by Gibbs sampling from p_θ(y, ŷ, h|x) for the test image x are used to compute the marginal p_θ(ŷ|x).

Algorithm 1: Train robust CNN-CRF with simple gradient descent
Input: Noisy dataset D_N and clean dataset D_C, auxiliary distribution p_aux(y, ŷ, h), a learning rate parameter ε, and a schedule for α
Output: Model parameters θ = {φ, c, W, W′}
Initialize model parameters
while stopping criteria is not met do
    foreach minibatch {(x^(n), y^(n)), (x^(c), ŷ^(c), y^(c))} = getMinibatch(D_N, D_C) do
        Compute q(ŷ, h|y^(n), x^(n)) by Eq. 10 for each noisy instance
        Compute q(h|y^(c)) by Eq. 11 for each clean instance
        Do Gibbs sweeps to sample from the current p_θ(y, ŷ, h|x^(·)) for each clean/noisy instance
        (m_n, m_c) ← (# noisy instances in minibatch, # clean instances in minibatch)
        θ ← θ + ε( (1/m_n) Σ_n ∂U^aux_θ(x^(n), y^(n))/∂θ + (1/m_c) Σ_c ∂L^aux_θ(x^(c), y^(c), ŷ^(c))/∂θ ) by Eq. 12 and 13
    end
end

4 Experiments

In this section, we examine the proposed robust CNN-CRF model for the image labeling problem.

4.1 Microsoft COCO Dataset

The Microsoft COCO 2014 dataset is one of the largest publicly available datasets that contains both noisy and clean object labels. Created from challenging Flickr images, it is annotated with 80 object categories as well as captions describing the images. Following [4], we use the 1000 most common words in the captions as the set of noisy labels. We form a binary vector of this length for each image representing the words present in the caption.
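Constructing such a binary noisy-label vector from a caption can be sketched as follows (the 4-word vocabulary and caption are toy stand-ins for the 1000-word set):

```python
def caption_to_noisy_labels(caption, vocab):
    """Binary noisy-label vector: 1 if the vocabulary word occurs in the caption.

    `vocab` stands in for the most common caption words; real preprocessing
    (tokenization, stemming, etc.) may differ from this simple split.
    """
    words = set(caption.lower().replace(".", "").split())
    return [1 if w in words else 0 for w in vocab]

vocab = ["dog", "frisbee", "grass", "car"]
y = caption_to_noisy_labels("A dog catches a frisbee on the grass.", vocab)
# y == [1, 1, 1, 0]
```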
We use 73 object categories as the set of clean\nlabels, and form binary vectors indicating whether the object categories are present in the image.\nWe follow the same 87K/20K/20K train/validation/test split as [4], and use mean average precision\n(mAP) measure over these 73 object categories as the performance assessment. Finally, we use 20%\nof the training data as the clean labeled training set (DC). The rest of data was used as the noisy\ntraining set (DN ), in which clean labels were ignored in training.\nNetwork Architectures: We use the implementation of ResNet-50 [33] and VGG-16 [34] in Tensor-\nFlow as the neural networks that compute the bias coef\ufb01cients in the energy function of our CRF\n(Eq. 2). These two networks are applied in a fully convolutional setting to each image. Their features\nin the \ufb01nal layer are pooled in the spatial domain using an average pooling operation, and these are\npassed through a fully connected linear layer to generate the bias terms. VGG-16 is used intentionally\nin order to compare our method directly with [4] that uses the same network. ResNet-50 experiments\nenable us to examine how our model works with other modern architectures. Misra et al. [4] have\nreported results when the images were upsampled to 565 pixels. Using upsampled images improves\nthe performance signi\ufb01cantly, but they make cross validation signi\ufb01cantly slower. Here, we report\nour results for image sizes of both 224 (small) and 565 pixels (large).\nParameters Update: The parameters of all the networks were initialized from ImageNet-trained\nmodels that are provided in TensorFlow. 
The other terms in the energy function of our CRF were all initialized to zero.

Figure 2: Visualization of different variations of the model examined in the experiments: (a) Clean, (b) Noisy, (c) No link, (d) CRF w/o h, (e) CRF w/ h, (f) CRF w/o x − y.

Our gradient estimates can have high variance as they are based on a Monte Carlo estimate. For training, we use Adam [35] updates that are shown to be robust against noisy gradients. The learning rate and epsilon for the optimizer are set to (0.001, 1) and (0.0003, 0.1) respectively in VGG-16 and ResNet-50. We anneal α from 40 to 5 in 11 epochs.
Sampling Overhead: Fifty Markov chains per datapoint are maintained for PCD. In each iteration of the training, the chains are retrieved for the instances in the current minibatch, and 100 iterations of Gibbs sampling are applied for negative phase samples. After parameter updates, the final state of chains is stored in memory for the next epoch. Note that we are only required to store the state of the chains for either (ŷ, h) or y. In this experiment, since the size of h is 200, the former case is more memory efficient. Storing persistent chains in this dataset requires only about 1 GB of memory. In ResNet-50, sampling increases the training time only by 16% and 8% for small and large images respectively. The overhead is 9% and 5% for small and large images in VGG-16.
Baselines: Our proposed method is compared against several baselines visualized in Fig. 2:

• Cross entropy loss with clean labels: The networks are trained using cross entropy loss with all the clean labels. This defines a performance upper bound for each network.
• Cross entropy loss with noisy labels: The model is trained using only noisy labels.
Then, predictions on the noisy labels are mapped to clean labels using the manual mapping in [4].
• No pairwise terms: All the pairwise terms are removed and the model is trained using analytic gradients, without any sampling, using our proposed objective function in Eq. 8.
• CRF without hidden: W is trained but W′ is omitted from the model.
• CRF with hidden: Both W and W′ are present in the model.
• CRF without x − y link: Same as the previous model but b is not a function of x.
• CRF without x − y link (α = 0): Same as the previous model but trained with α = 0.

The experimental results are reported in Table 1 under "Caption Labels." A performance increase is observed after adding each component to the model. However, removing the x − y link generally improves the performance significantly. This may be because removing this link forces the model to rely on ŷ and its correlations with y for predicting y on the noisy labeled set. This can translate to better recognition of clean labels. Last but not least, the CRF model with no x − y connection trained using α = 0 performed very poorly on this dataset. This demonstrates the importance of the introduced regularization in training.

4.2 Microsoft COCO Dataset with Flickr Tags

The images in the COCO dataset were originally gathered and annotated from the Flickr website. This means that these images have actual noisy Flickr tags. To examine the performance of our model on actual noisy labels, we collected these tags for the COCO images using Flickr's public API. Similar to the previous section, we used the 1024 most common tags as the set of noisy labels.
We observed that these tags have significantly more noise than the noisy labels in the previous section; therefore, it is more challenging to predict clean labels from them using the auxiliary distribution. In this section, we only examine the ResNet-50 architecture, for both small and large image sizes. The different baselines introduced in the previous section are compared against each other in Table 1 under "Flickr Tags."

Table 1: The performance of different baselines on the COCO dataset in terms of mAP (%).

                                       Caption Labels (Sec. 4.1)        Flickr Tags (Sec. 4.2)
                                       ResNet-50        VGG-16          ResNet-50
  Baseline                             Small  Large     Small  Large    Small  Large
  Cross entropy loss w/ clean          68.57  78.38     71.99  75.50    68.57  78.38
  Cross entropy loss w/ noisy          56.88  64.13     58.59  62.75    -      -
  No pairwise link                     63.67  73.19     66.18  71.78    58.01  67.84
  CRF w/o hidden                       64.26  73.23     67.73  71.78    59.04  67.22
  CRF w/ hidden                        65.73  74.04     68.35  71.92    59.19  67.33
  CRF w/o x−y link                     66.61  75.00     69.89  73.16    60.97  67.57
  CRF w/o x−y link (α = 0)             48.53  56.53     56.76  56.39    47.25  58.74
  Misra et al. [4]                     -      -         66.8   -        -      -
  Fang et al. [36] reported in [4]     -      -         63.7   -        -      -

Auxiliary Distribution vs. Variational Distribution: As the auxiliary distribution paux is fixed and the variational distribution q is updated using Eq. 10 in each iteration, a natural question is how q differs from paux. Since we have access to the clean labels in the COCO dataset, we examine the accuracy of q in predicting clean labels on the noisy training set (DN) using the mAP measure, at the beginning and end of training the CRF-CNN model (ResNet-50 on large images). We observed that at the beginning of training, when α is large, q is almost equal to paux, which obtains 49.4% mAP on this set.
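The mAP measure used throughout these experiments is the mean, over classes, of the per-class average precision. A minimal sketch of that computation (the function names are our own, not the paper's code):

```python
import numpy as np

def average_precision(scores, targets):
    """AP for one class: precision averaged over the ranks of the positives."""
    order = np.argsort(-scores)                # sort by descending confidence
    hits = targets[order] > 0
    cum_hits = np.cumsum(hits)
    precision_at_k = cum_hits / (np.arange(len(scores)) + 1)
    return precision_at_k[hits].mean() if hits.any() else 0.0

def mean_average_precision(scores, targets):
    """mAP over classes; scores and targets are (num_images, num_classes)."""
    aps = [average_precision(scores[:, c], targets[:, c])
           for c in range(scores.shape[1]) if targets[:, c].any()]
    return float(np.mean(aps))
```

Classes with no positive instance are skipped so they do not bias the mean.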
As training iterations proceed, the accuracy of q increases to 69.4% mAP. Note that this 20.0% gain in mAP is very significant, and it demonstrates that combining the auxiliary distribution with our proposed CRF can yield a substantial performance gain in inferring latent clean labels. In other words, our proposed model is capable of cleaning the noisy labels, proposing more accurate labels on the noisy set as training continues. Please refer to our supplementary material for a qualitative comparison between q and paux.

4.3 CIFAR-10 Dataset

We also apply our proposed learning framework to the object classification problem on the CIFAR-10 dataset. This dataset contains 32x32-pixel images of 10 object classes. We follow the settings in [9] and inject synthesized noise into the original labels during training. Moreover, we implement the forward and backward losses proposed in [9] and use them to train a ResNet [33] of depth 32 with the ground-truth noise transition matrix.
Here, we only train the variant of our model shown in Fig. 2.c that can be trained analytically. For the auxiliary distribution, we trained a simple linear multinomial logistic regression representing the conditional paux(ŷ|y) with no hidden variables (h). We trained this distribution such that its output probabilities match the ground-truth noise transition matrix. We trained all models for 200 epochs. For our model, we anneal α from 8 to 1 over 10 epochs. As in the previous section, we empirically observed that it is better to stop annealing α before it reaches zero.
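For concreteness, synthetic label noise of this kind can be injected by sampling each noisy label from the row of a noise transition matrix indexed by the clean label. A minimal sketch follows; the uniform off-diagonal structure is an illustrative assumption, not necessarily the noise model of [9]:

```python
import numpy as np

def uniform_noise_matrix(num_classes, noise_rate):
    """Row-stochastic transition matrix T with T[i, j] = p(noisy=j | clean=i):
    keep the clean label with prob. 1 - noise_rate, else flip uniformly."""
    off = noise_rate / (num_classes - 1)
    T = np.full((num_classes, num_classes), off)
    np.fill_diagonal(T, 1.0 - noise_rate)
    return T

def inject_label_noise(clean_labels, T, rng):
    """Sample a noisy label for each clean label from the matching row of T."""
    return np.array([rng.choice(len(T), p=T[y]) for y in clean_labels])
```

With noise_rate = 0.3 and 10 classes, about 30% of the labels are flipped in expectation, and the same matrix T can then play the role of the ground-truth noise transition matrix mentioned above.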
Here, to compare our method with the previous work, we do not work in a semi-supervised setting, and we assume that we have access only to the noisy training dataset.
Our goal in this experiment is to demonstrate that a simple variant of our model can be used for training from images with only noisy labels, and to show that our model can clean the noisy labels. To do so, we report not only the average accuracy on the clean test dataset but also the recovery accuracy. The recovery accuracy of our method is defined as the accuracy of q in predicting the clean labels of the noisy training set at the end of learning. For the baselines, we measure the accuracy of the trained neural network p(ŷ|x) on the same set. The results are reported in Table 2. Overall, our method achieves slightly better prediction accuracy on the CIFAR-10 dataset than the baselines, and in terms of recovering clean labels on the noisy training set, our model significantly outperforms the baselines. Examples of the recovered clean labels from the CIFAR-10 experiment are visualized in the supplementary material.

5 Conclusion

We have proposed a general undirected graphical model for modeling label noise in training deep neural networks. We formulated the problem as a semi-supervised learning problem, and we proposed
a novel objective function equipped with a regularization term that helps our variational distribution infer latent clean labels more accurately using auxiliary sources of information. Our model not only predicts clean labels on unseen instances more accurately, but also recovers clean labels on noisy training sets with higher precision. We believe the ability to clean noisy annotations is a very valuable property of our framework that will be useful in many application domains.

Table 2: Prediction and recovery accuracy of different baselines on the CIFAR-10 dataset.

                          Prediction Accuracy (%)          Recovery Accuracy (%)
  Noise (%)               10    20    30    40    50       10    20    30    40    50
  Cross entropy loss      91.2  90.0  89.1  87.1  80.2     94.1  92.4  89.6  85.2  74.6
  Backward [9]            87.4  87.4  84.6  76.5  45.6     88.0  87.4  84.0  75.3  44.0
  Forward [9]             90.9  90.3  89.4  88.4  80.0     94.6  93.6  92.3  91.1  83.1
  Our model               91.6  91.0  90.6  89.4  84.3     97.7  96.4  95.1  93.5  88.1

Acknowledgments

The author thanks Jason Rolfe, William Macready, Zhengbing Bian, and Fabian Chudak for their helpful discussions and comments. This work would not be possible without the excellent technical support provided by Mani Ranjbar and Oren Shklarsky.

References

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009.

[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.

[3] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Computer Vision and Pattern Recognition (CVPR), 2015.

[4] Ishan Misra, C.
Lawrence Zitnick, Margaret Mitchell, and Ross Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In CVPR, 2016.

[5] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.

[6] B. Frenay and M. Verleysen. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014.

[7] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196–1204, 2013.

[8] Volodymyr Mnih and Geoffrey E. Hinton. Learning to label aerial images from noisy data. In International Conference on Machine Learning (ICML), pages 567–574, 2012.

[9] Giorgio Patrini, Alessandro Rozza, Aditya Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Computer Vision and Pattern Recognition (CVPR), 2017.

[10] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.

[11] Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.

[12] Rob Fergus, Yair Weiss, and Antonio Torralba. Semi-supervised learning in gigantic image collections. In Advances in Neural Information Processing Systems, pages 522–530, 2009.

[13] Xinlei Chen and Abhinav Gupta. Webly supervised learning of convolutional networks. In International Conference on Computer Vision (ICCV), 2015.

[14] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie.
Learning from noisy large-scale datasets with minimal supervision. arXiv preprint arXiv:1701.01619, 2017.

[15] Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Ian Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2016.

[16] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.

[17] Jian Peng, Liefeng Bo, and Jinbo Xu. Conditional neural fields. In Advances in Neural Information Processing Systems, pages 1419–1427, 2009.

[18] Thierry Artieres et al. Neural conditional random fields. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 177–184, 2010.

[19] Rohit Prabhavalkar and Eric Fosler-Lussier. Backpropagation training for multilayer conditional random field based phone recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 5534–5537. IEEE, 2010.

[20] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems (NIPS), pages 109–117, 2011.

[21] Liang-Chieh Chen, Alexander G. Schwing, Alan L. Yuille, and Raquel Urtasun. Learning deep structured models. In ICML, pages 1785–1794, 2015.

[22] Alexander G. Schwing and Raquel Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.

[23] Zhiwei Deng, Arash Vahdat, Hexiang Hu, and Greg Mori. Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition.
In CVPR, 2016.

[24] Stephane Ross, Daniel Munoz, Martial Hebert, and J. Andrew Bagnell. Learning message-passing inference machines for structured prediction. In Computer Vision and Pattern Recognition (CVPR), 2011.

[25] Alexander Kirillov, Dmitrij Schlesinger, Shuai Zheng, Bogdan Savchynskyy, Philip H. S. Torr, and Carsten Rother. Joint training of generic CNN-CRF models with stochastic optimization. arXiv preprint arXiv:1511.05067, 2015.

[26] Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 2007.

[27] Laurens van der Maaten, Max Welling, and Lawrence K. Saul. Hidden-unit conditional random fields. In International Conference on Artificial Intelligence and Statistics, pages 479–488, 2011.

[28] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pages 1064–1071. ACM, 2008.

[29] Laurent Younes. Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 1989.

[30] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models. 1998.

[31] Marcus Rohrbach, Michael Stark, György Szarvas, Iryna Gurevych, and Bernt Schiele. What helps where – and why? Semantic relatedness for knowledge transfer. In Computer Vision and Pattern Recognition (CVPR), 2010.

[32] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 2010.

[33] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

[34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[35] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[36] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K. Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, et al. From captions to visual concepts and back. In Computer Vision and Pattern Recognition (CVPR), 2015.