{"title": "DISCO Nets : DISsimilarity COefficients Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 352, "page_last": 360, "abstract": "We present a new type of probabilistic model which we call DISsimilarity COefficient Networks (DISCO Nets). DISCO Nets allow us to efficiently sample from a posterior distribution parametrised by a neural network. During training, DISCO Nets are learned by minimising the dissimilarity coefficient between the true distribution and the estimated distribution. This allows us to tailor the training to the loss related to the task at hand. We empirically show that (i) by modeling uncertainty on the output value, DISCO Nets outperform equivalent non-probabilistic predictive networks and (ii) DISCO Nets accurately model the uncertainty of the output, outperforming existing probabilistic models based on deep neural networks.", "full_text": "DISCO Nets: DISsimilarity COef\ufb01cient Networks\n\nDiane Bouchacourt\nUniversity of Oxford\n\nM. Pawan Kumar\nUniversity of Oxford\n\ndiane@robots.ox.ac.uk\n\npawan@robots.ox.ac.uk\n\nSebastian Nowozin\n\nMicrosoft Research Cambridge\n\nsebastian.nowozin@microsoft.com\n\nAbstract\n\nWe present a new type of probabilistic model which we call DISsimilarity COef\ufb01-\ncient Networks (DISCO Nets). DISCO Nets allow us to ef\ufb01ciently sample from a\nposterior distribution parametrised by a neural network. During training, DISCO\nNets are learned by minimising the dissimilarity coef\ufb01cient between the true distri-\nbution and the estimated distribution. This allows us to tailor the training to the loss\nrelated to the task at hand. 
We empirically show that (i) by modeling uncertainty on\nthe output value, DISCO Nets outperform equivalent non-probabilistic predictive\nnetworks and (ii) DISCO Nets accurately model the uncertainty of the output,\noutperforming existing probabilistic models based on deep neural networks.\n\nIntroduction\n\n1\nWe are interested in the class of problems that require the prediction of a structured output y \u2208 Y\ngiven an input x \u2208 X . Complex applications often have large uncertainty on the correct value of y.\nFor example, consider the task of hand pose estimation from depth images, where one wants to\naccurately estimate the pose y of a hand given a depth image x. The depth image often has some\nocclusions and missing depth values and this results in some uncertainty on the pose of the hand. It is,\ntherefore, natural to use probabilistic models that are capable of representing this uncertainty. Often,\nthe capacity of the model is restricted and cannot represent the true distribution perfectly. In this case,\nthe choice of the learning objective in\ufb02uences \ufb01nal performance. Similar to Lacoste-Julien et al. [12],\nwe argue that the learning objective should be tailored to the evaluation loss in order to obtain the best\nperformance with respect to this loss. In details, we denote by \u2206training the loss function employed\nduring model training, and by \u2206task the loss employed to evaluate the model\u2019s performance.\n\nWe present a simple example to illustrate the point made above. We consider a data distri-\nbution that is a mixture of two bidimensional Gaussians. We now consider two models to capture\nthe data probability distribution. 
Each model is able to represent a bidimensional Gaussian distribution with diagonal covariance parametrised by (\u00b51, \u00b52, \u03c31, \u03c32). In this case, neither of the models will be able to recover the true data distribution since they do not have the ability to represent a mixture of Gaussians. In other words, we cannot avoid model error, similarly to the real data scenario. Each model uses its own training loss \u2206training. Model A employs a loss that emphasises the \ufb01rst dimension of the data, speci\ufb01ed for x = (x1, x2), x\u2032 = (x\u20321, x\u20322) \u2208 R2 by \u2206A(x \u2212 x\u2032) = (10 \u00d7 (x1 \u2212 x\u20321)2 + 0.1 \u00d7 (x2 \u2212 x\u20322)2)1/2. Model B does the opposite and employs the loss function \u2206B(x \u2212 x\u2032) = (0.1 \u00d7 (x1 \u2212 x\u20321)2 + 10 \u00d7 (x2 \u2212 x\u20322)2)1/2. Each model performs a grid search over the best parameter values for (\u00b51, \u00b52, \u03c31, \u03c32). Figure 1 shows the contours of the Mixture of Gaussians distribution of the data (in black), and the contour of the Gaussian \ufb01tted by each model (in red and green). The detailed setting of this example is available in the supplementary material.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fTable 1: \u2206task \u00b1 SEM (standard error of the mean) with respect to the \u2206training employed. Evaluation is done on the test set.\n\n\u2206training \\ \u2206task | \u2206A | \u2206B\n\u2206A | 11.6 \u00b1 0.287 | 13.7 \u00b1 0.331\n\u2206B | 12.1 \u00b1 0.305 | 11.0 \u00b1 0.257\n\nFigure 1: Contour lines of the Gaussian distribution \ufb01tted by each model on the Mixture of Gaussians data distribution. Best viewed in color.\n\nAs expected, the \ufb01tted Gaussian distributions differ according to the \u2206training employed. 
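The two task-specific losses of this toy example can be written down directly; a minimal sketch in Python (the function names and test points are ours, the weights 10 and 0.1 are those of the example):

```python
import numpy as np

def delta_A(x, xp):
    """Loss of Model A: emphasises errors on the first dimension."""
    d = np.asarray(x, float) - np.asarray(xp, float)
    return float(np.sqrt(10.0 * d[0] ** 2 + 0.1 * d[1] ** 2))

def delta_B(x, xp):
    """Loss of Model B: emphasises errors on the second dimension."""
    d = np.asarray(x, float) - np.asarray(xp, float)
    return float(np.sqrt(0.1 * d[0] ** 2 + 10.0 * d[1] ** 2))

# A unit error on the first coordinate is penalised 10x more by Model A's loss:
a = delta_A((1.0, 0.0), (0.0, 0.0))  # sqrt(10)
b = delta_B((1.0, 0.0), (0.0, 0.0))  # sqrt(0.1)
```

Because each model grid-searches its Gaussian parameters under its own loss, Model A trades second-dimension accuracy for first-dimension accuracy, and Model B does the opposite.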
Table 1 shows\nthat the loss on the test set, evaluated with \u2206task, is minimised if \u2206training = \u2206task. This simple\nexample illustrates the advantage to being able to tailor the model\u2019s training objective function to\nhave \u2206training = \u2206task. This is in contrast to the commonly employed learning objectives we present\nin Section 2, that are agnostic of the evaluation loss.\nIn order to alleviate the aforementioned de\ufb01ciency of the state-of-the-art, we introduce DISCO Nets,\na new class of probabilistic model. DISCO Nets represent P , the true posterior distribution of the\ndata, with a distribution Q parametrised by a neural network. We design a learning objective based\non a dissimilarity coef\ufb01cient between P and Q. The dissimilarity coef\ufb01cient we employ was \ufb01rst\nintroduced by Rao [23] and is de\ufb01ned for any non-negative symmetric loss function. Thus, any such\nloss can be incorporated in our setting, allowing the user to tailor DISCO Nets to his or her needs.\nFinally, contrarily to existing probabilistic models presented in Section 2, DISCO Nets do not require\nany speci\ufb01c architecture or training procedure, making them an ef\ufb01cient and easy-to-use class of\nmodel.\n2 Related Work\nDeep neural networks, and in particular, Convolutional Neural Networks (CNNs) are comprised of\nseveral convolutional layers, followed by one or more fully connected (dense) layers, interleaved by\nnon-linear function(s) and (optionally) pooling. Recent probabilistic models use CNNs to represent\nnon-linear functions of the data. We observe that such models separate into two types. The \ufb01rst\ntype of model does not explicitly compute the probability distribution of interest. Rather, these\nmodels allow the user to sample from this distribution by feeding the CNN with some noise z.\nAmong such models, Generative Adversarial Networks (GAN) presented in Goodfellow et al. 
[7] are\nvery popular and have been used in several computer vision applications, for example in Denton\net al. [1], Radford et al. [22], Springenberg [25] and Yan et al. [28]. A GAN model consists of\ntwo networks, simultaneously trained in an adversarial manner. A generative model, referred as\nthe Generator G, is trained to replicate the data from noise, while an adversarial discriminative\nmodel, referred as the Discriminator D, is trained to identify whether a sample comes from the\ntrue data or from G. The GAN training objective is based on a minimax game between the two\nnetworks and approximately optimizes a Jensen-Shannon divergence. However, as mentioned\nin Goodfellow et al. [7] and Radford et al. [22], GAN models require very careful design of the\nnetworks\u2019 architecture. Their training procedure is tedious and tends to oscillate. GAN models have\nbeen generalized to conditional GAN (cGAN) in Mirza and Osindero [16], where some additional\ninput information can be fed to the Generator and the Discriminator. For example in Mirza and\nOsindero [16] a cGAN model generates tags corresponding to an image. Gauthier [4] applies cGAN\nto face generation. Reed et al. [24] propose to generate images of \ufb02owers with a cGAN model, where\nthe conditional information is a word description of the \ufb02ower to generate1. While the application of\ncGAN is very promising, little quantitative evaluation has been done. Furthermore, cGAN models\nsuffer from the same dif\ufb01culties we mentioned for GAN. Another line of work has developed towards\nthe use of statistical hypothesis testing to learn probabilistic models. In Dziugaite et al. [2] and Li\net al. [14], the authors propose to train generative deep networks with an objective function based on\nthe Maximum Mean Discrepancy (MMD) criterion. The MMD method (see Gretton et al. [8, 9]) is\na statistical hypothesis test assessing if two probabilistic distributions are similar. As mentioned\nin Dziugaite et al. 
[2], the MMD test can be seen as playing the role of an adversary.\n\n1At the time of writing, we do not have access to the full paper of Reed et al. [24] and therefore cannot take advantage of this work in our experimental comparison.\n\n2\n\n\fThe second type of model approximates intractable posterior distributions with the use of variational inference. The Variational Auto-Encoder (VAE) presented in Kingma and Welling [10] is composed of a probabilistic encoder and a probabilistic decoder. The probabilistic encoder is fed with the input x \u2208 X and produces a posterior distribution P (z|x) over the possible values of noise z that could have generated x. The probabilistic decoder learns to map the noise z back to the data space X . The training of a VAE uses an objective function based on a Kullback-Leibler divergence. VAE and GAN models have been combined in Makhzani et al. [15], where the authors propose to regularise autoencoders with an adversarial network. The adversarial network ensures that the posterior distribution P (z|x) matches an arbitrary prior P (z).\n\nIn hand pose estimation, imagine the user wants to obtain accurate positions of the thumb and the index \ufb01nger but does not need accurate locations of the other \ufb01ngers. The task loss \u2206task might be based on a weighted L2-norm between the predicted and the ground-truth poses, with high weights on the thumb and the index. Existing probabilistic models cannot be tailored to task-speci\ufb01c losses and we propose the DISsimilarity COef\ufb01cient Networks (DISCO Nets) to alleviate this de\ufb01ciency.\n3 DISCO Nets\nWe begin the description of our model by specifying how it can be used to generate samples from the posterior distribution, and how the samples can in turn be employed to provide a pointwise estimate. In the subsequent subsection, we describe how to estimate the parameters of the model.\n\n3.1 Prediction\nSampling. 
A DISCO Net consists of several convolutional and dense layers (interleaved by non-\nlinear function(s) and possibly pooling) and takes as input a pair (x, z) \u2208 X \u00d7 Z, where x is input\ndata and z is some random noise. Given one pair (x, z), the DISCO Net produces a value for the\noutput y. In the example of hand pose estimation, the input depth image x is fed to the convolutional\nlayers. The output of the last convolutional layer is \ufb02attened and concatenated with a noise sample z.\nThe resulting vector is fed to several dense layers, and the last dense layer outputs a pose y. From\na single depth image x, by using different noise samples, the DISCO Net produces different pose\ncandidates for the depth image. This process is illustrated in Figure 2. Importantly, DISCO Nets are\n\ufb02exible in the choice of the architecture. For example, the noise could be concatenated at any stage\nof the network, including at the start.\n\nFigure 2: For a single depth image x, using 3 different noise samples (z1, z2, z3), DISCO Nets output 3 different\ncandidate poses (y1, y2, y3) (shown superimposed on the depth image). The depth image is from the NYU Hand\nPose Dataset of Tompson et al. [27], preprocessed as in Oberweger et al. [17]. Best viewed in color.\n\nWe denote Q the distribution that is parametrised by the DISCO Net\u2019s neural network. For a given\ninput x, DISCO Nets provide the user with samples y drawn from Q(y|x) without requiring the\nexpensive computation of the (often intractable) partition function. In the remainder of the paper we\nconsider x \u2208 Rdx , y \u2208 Rdy and z \u2208 Rdz.\nPointwise Prediction.\nIn order to obtain a single prediction y for a given input x, DISCO Nets\nuse the principle of Maximum Expected Utility (MEU), similarly to Premachandran et al. [21].\nThe prediction y\u2206task maximises the expected utility, or rather minimises the expected task-speci\ufb01c\nloss \u2206task, estimated using the sampled candidates. 
Formally, the prediction is made as follows:\n\n3\n\n\fy\u2206task = argmax\nk\u2208[1,K]\n\nEU(yk) = argmin\nk\u2208[1,K]\n\nK(cid:88)\n\nk(cid:48)=1\n\n\u2206task(yk, y(cid:48)\nk)\n\n(1)\n\nwhere (y1, ..., yK) are the candidate outputs sampled for the single input x. Details on the MEU\nmethod are in the supplementary material.\n\n3.2 Learning DISCO Nets\nObjective Function. We want DISCO Nets to accurately model the true probability P (y|x)\nvia Q(y|x). In other words, Q(y|x) should be as similar as possible to P (y|x). This similar-\nity is evaluated with respect to the loss speci\ufb01c to the task at hand. Given any non-negative symmetric\nloss function between two outputs \u2206(y, y(cid:48)) with (y, y(cid:48)) \u2208 Y \u00d7 Y, we employ a diversity coef\ufb01cient\nthat is the expected loss between two samples drawn randomly from the two distributions. Formally,\nthe diversity coef\ufb01cient is de\ufb01ned as:\n\nDIV\u2206(P, Q, D) = Ex\u223cD(x)[Ey\u223cP (y|x)[Ey(cid:48)\u223cQ(y(cid:48)|x)[\u2206(y, y(cid:48))]]]\n\n(2)\nIntuitively, we should minimise DIV\u2206(P, Q, D) so that Q(y|x) is as similar as possible to P (y|x).\nHowever there is uncertainty on the output y to predict for a given x. In other words, P (y|x) is\ndiverse and Q(y|x) should be diverse as well. Thus we encourage Q(y|x) to provide sample outputs,\nfor a given x, that are diverse by minimising the following dissimilarity coef\ufb01cient:\n\nDISC\u2206(P, Q, D) = DIV\u2206(P, Q, D) \u2212 \u03b3DIV\u2206(Q, Q, D) \u2212 (1 \u2212 \u03b3)DIV\u2206(P, P, D)\n\n(3)\nwith \u03b3 \u2208 [0, 1]. The dissimilarity DISC\u2206(P, Q, D) is the difference between the diversity between P\nand Q, and an af\ufb01ne combination of the diversity of each distribution, given x \u223c D(x). These\ncoef\ufb01cients were introduced by Rao [23] with \u03b3 = 1/2 and used for latent variable models by Kumar\net al. [11]. 
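The MEU prediction of equation (1) amounts to selecting, among the K sampled candidates, the one with the lowest total task loss to all candidates (a medoid under \u2206task). A minimal sketch, assuming a hypothetical scalar candidate set and a Euclidean task loss:

```python
import numpy as np

def meu_prediction(candidates, task_loss):
    """Return the candidate minimising sum over k' of task_loss(y_k, y_k') (equation (1))."""
    K = len(candidates)
    totals = [sum(task_loss(candidates[k], candidates[j]) for j in range(K))
              for k in range(K)]
    return candidates[int(np.argmin(totals))]

euclidean = lambda a, b: float(np.linalg.norm(np.atleast_1d(a) - np.atleast_1d(b)))

# Three candidates sampled for one input; the outlying candidate 10.0 is not selected.
y_meu = meu_prediction([0.0, 0.1, 10.0], euclidean)
```

Unlike a plain mean of the candidates, the MEU pick is itself one of the sampled outputs, and the selection criterion changes with the chosen \u2206task.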
We do not need to consider the term DIV\u2206(P, P, D) as it is a constant in our problem,\nand thus the DISCO Nets objective function is de\ufb01ned as follows:\n\nF = DIV\u2206(P, Q, D) \u2212 \u03b3DIV\u2206(Q, Q, D)\n\n(4)\nWhen minimising F , the term \u03b3DIV\u2206(Q, Q, D) encourages Q(y|x) to be diverse. The value of \u03b3\nbalances between the two goals of Q(y|x) that are providing accurate outputs while being diverse.\nOptimisation. Let us consider a training dataset composed of N examples input-output pairs D =\n{(xn, yn), n = 1..N}. In order to train DISCO Nets, we need to compute the objective func-\ntion of equation (4). We do not have knowledge of the true probability distributions P (y, x)\nand P (x). To overcome this de\ufb01ciency, we construct estimators of each diversity term DIV\u2206(P, Q)\nand DIV\u2206(Q, Q). First, we take an empirical distribution of the data, that is, taking ground-truth\npairs (xn, yn). We then estimate each distribution Q(y|xn) by sampling K outputs from our model\nfor each xn. This gives us an unbiased estimate of each diversity term, de\ufb01ned as:\n\n(cid:100)DIV\u2206(P, Q, D) =\n(cid:100)DIV\u2206(Q, Q, D) =\n\nK(cid:88)\n\nk=1\n\n1\nK\n\nN(cid:88)\nN(cid:88)\n\nn=1\n\nn=1\n\n1\nN\n\n1\nN\n\n1\n\nK(K \u2212 1)\n\n\u2206(yn, G(zk, xn; \u03b8))\n\nK(cid:88)\n\nK(cid:88)\n\nk=1\n\nk(cid:48)=1,k(cid:48)(cid:54)=k\n\n(5)\n\n(6)\n\n\u2206(G(zk, xn; \u03b8), G(zk(cid:48), xn; \u03b8))\n\nWe have an unbiased estimate of the DISCO Nets\u2019 objective function of equation (4):\n\n(cid:98)F (\u2206, \u03b8) = (cid:100)DIV\u2206(P, Q, D) \u2212 \u03b3(cid:100)DIV\u2206(Q, Q, D)\n\nwhere yk = G(zk, xn; \u03b8) is a candidate output sampled from DISCO Nets for (xn,zk), and \u03b8 are the\nparameters of DISCO Nets. 
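The estimators of equations (5) and (6) and the resulting objective estimate can be sketched as follows, assuming for simplicity scalar outputs, an absolute-difference loss, and samples already drawn from the model:

```python
import numpy as np

def div_pq_hat(y_true, samples, loss):
    """Estimate of DIV(P, Q): mean loss between y_n and each of its K samples (eq. (5))."""
    N, K = samples.shape
    return sum(loss(y_true[n], samples[n, k])
               for n in range(N) for k in range(K)) / (N * K)

def div_qq_hat(samples, loss):
    """Estimate of DIV(Q, Q): mean loss over ordered pairs of distinct samples (eq. (6))."""
    N, K = samples.shape
    total = sum(loss(samples[n, k], samples[n, j])
                for n in range(N) for k in range(K) for j in range(K) if j != k)
    return total / (N * K * (K - 1))

def disco_objective(y_true, samples, loss, gamma=0.5):
    """Estimate of F = DIV(P, Q) - gamma * DIV(Q, Q)."""
    return div_pq_hat(y_true, samples, loss) - gamma * div_qq_hat(samples, loss)

abs_loss = lambda a, b: abs(a - b)
y = np.array([0.0])             # one ground-truth output (N = 1)
s = np.array([[1.0, -1.0]])     # K = 2 samples from the model for that input
f_hat = disco_objective(y, s, abs_loss)   # 1.0 - 0.5 * 2.0 = 0.0
```

Note that the accuracy term rewards samples close to the ground truth, while the subtracted diversity term rewards samples spread apart, with gamma balancing the two.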
It is important to note that the second term of equation (6) is summing\nover k and k(cid:48) (cid:54)= k to have an unbiased estimate, therefore we compute the loss between pairs of\ndifferent samples G(zk, xn; \u03b8) and G(zk(cid:48), xn; \u03b8). The parameters \u03b8 are learned by Gradient Descent.\nAlgorithm 1 shows the training of DISCO Nets. In steps 4 and 5 of Algorithm 1, we draw K random\nnoise vectors (zn,1, ...zn,k) per input example xn, and generate K candidate outputs G(zn,k, xn; \u03b8)\nper input. This allow us to compute an unbiased estimate of the gradient in step 7. For clarity, in the\nremainder of the paper we do not explicitely write the parameters \u03b8 and write G(zk, xn).\n\n4\n\n\fAlgorithm 1: DISCO Nets Training algorithm.\nfor t=1...T epochs do\n\nSample minibatch of b training example pairs {(x1, y1)...(xb, yb)}.\nfor n=1...b do\n\nSample K random noise vectors (zn,1, ...zn,k) for training example xn\nGenerate K candidate outputs G(zn,k, xn; \u03b8), k = 1..K for training example xn\n\nUpdate parameters \u03b8t \u2190 \u03b8t\u22121 by descending the gradient of equation (6) : \u2207\u03b8(cid:98)F (\u2206, \u03b8).\n\nend\n\nend\n\n1\n2\n3\n4\n5\n6\n\n7\n8\n\n3.3 Strictly Proper Scoring Rules.\nScoring Rule for Learning. A scoring rule S(Q, P ), as de\ufb01ned in Gneiting and Raftery [5],\nevaluates the quality of a predictive distribution Q with respect to a true distribution P . When using\na scoring rule one should ensure that it is proper, which means it is maximised when P = Q. A\nscoring rule is said to be strictly proper if P = Q is the unique maximiser of S. Hence maximising a\nproper scoring rule ensures that the model aims at predicting relevant forecast. 
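Algorithm 1 can be illustrated end-to-end on toy data. The sketch below is not the paper's CNN: it trains a hypothetical two-parameter generator G(z, x; theta) = theta1 * x + theta2 * z with Delta(y, y') = |y - y'| and gamma = 0.5, and replaces back-propagation with central finite differences so the example stays self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=32)            # toy inputs
Y = 2.0 * X + rng.normal(0.0, 0.2, size=32)   # toy ground truth: y ~ 2x + noise

def G(Z, X, theta):
    """Hypothetical linear generator standing in for the DISCO network."""
    return theta[0] * X + theta[1] * Z

def objective(theta, seed, K=8, gamma=0.5):
    """Empirical F = DIV(P,Q) - gamma * DIV(Q,Q) with Delta(y, y') = |y - y'|."""
    noise = np.random.default_rng(seed).uniform(-1.0, 1.0, size=(len(X), K))
    S = G(noise, X[:, None], theta)               # N x K candidate outputs
    div_pq = np.abs(Y[:, None] - S).mean()        # eq. (5)
    pair = np.abs(S[:, :, None] - S[:, None, :])  # all ordered sample pairs
    div_qq = pair.sum() / (len(X) * K * (K - 1))  # eq. (6); diagonal terms are zero
    return div_pq - gamma * div_qq

theta = np.array([0.0, 0.1])
f_init = objective(theta, seed=999)
for t in range(200):                              # gradient descent as in Algorithm 1
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = 1e-3
        # same noise seed for both evaluations (common random numbers)
        grad[i] = (objective(theta + e, t) - objective(theta - e, t)) / 2e-3
    theta = theta - 0.05 * grad
f_final = objective(theta, seed=999)
```

On this toy problem the objective decreases and theta[0] approaches the true slope 2, while theta[1] settles at a non-zero value so that the candidate spread mimics the noise in the data.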
Gneiting and Raftery\n[5] de\ufb01ne score divergences corresponding to a proper scoring rule S:\n\nd(Q, P ) = S(P, P ) \u2212 S(Q, P )\n\n(7)\nIf S is proper, d is a valid non-negative divergence function, with value 0 if (and only if, in the case\nof strictly proper) Q = P . For example the MMD criterion (see Gretton et al. [8, 9]) mentioned\nin Section 2 is an example of this type of divergence. In our case, any loss \u2206 expressed with an\nuniversal kernel will de\ufb01ne the DISCO Nets\u2019 objective function as such divergence (see Zawadzki\nand Lahaie [29]). For example, by Theorem 5 of Gneiting and Raftery [5], if we take as loss\ni=1 |(yi \u2212 y(cid:48)i|2)\u03b2/2 with \u03b2 \u2208 [0, 2] excluding 0 and 2, our\nfunction \u2206\u03b2(y, y(cid:48)) = ||y \u2212 y(cid:48)||\u03b2\n(cid:105)\ntraining objective is (the negative of) a strictly proper scoring rule, that is:\n(cid:98)F (\u2206, \u03b8) =\n\n(cid:80)\nk(cid:48)(cid:54)=k ||G(zk(cid:48), xn) \u2212 G(zk, xn)||\u03b2\n2\n(8)\nThis score has been referred in the litterature as the Energy Score in Gneiting and Raftery\n[5], Gneiting et al. [6], Pinson and Tastu [19].\n\n(cid:80)\nk ||yn \u2212 G(zk, xn)||\u03b2\n\n2 =(cid:80)dy\n\nK(K \u2212 1)\n\n(cid:104) 1\n\nK\n\n(cid:80)\n\nk\n\n2 \u2212 1\n2\n\n(cid:80)N\n\nn=1\n\n1\nN\n\n1\n\nBy employing a (strictly) proper scoring rule we ensure that our objective function is (only)\nminimised at the true distribution P , and expect DISCO Nets to generalise better on unseen data.\nWe show below that strictly proper scoring rules are also relevant to assess the quality of the\ndistribution Q captured by the model.\nDiscriminative power of proper scoring rules. As observed in Fukumizu et al. [3], kernel density\nestimation (KDE) fails in high dimensional output spaces. Our goal is to compare the quality of the\ndistribution captured between two models, Q1 and Q2. 
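The score divergence d(Q, P) associated with the Energy Score can be estimated from samples alone; a small sketch with beta = 1, univariate Gaussian samples of our own choosing, and a slightly biased (but consistent) V-statistic estimator:

```python
import numpy as np

def energy_divergence(q, p, beta=1.0):
    """Sample estimate of 2 E|X - Y|^beta - E|X - X'|^beta - E|Y - Y'|^beta,
    non-negative for beta in (0, 2) and near zero when both samples share a distribution."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    cross = np.abs(q[:, None] - p[None, :]) ** beta
    qq = np.abs(q[:, None] - q[None, :]) ** beta
    pp = np.abs(p[:, None] - p[None, :]) ** beta
    return 2.0 * cross.mean() - qq.mean() - pp.mean()

rng = np.random.default_rng(0)
p_samples = rng.normal(0.0, 1.0, 2000)   # stand-in for samples from the true P
q_good = rng.normal(0.0, 1.0, 2000)      # a model close to P
q_bad = rng.normal(3.0, 1.0, 2000)       # a model far from P
```

With these samples, d(q_good, P) is close to zero while d(q_bad, P) is large, which is exactly the ranking behaviour we rely on when comparing models with ProbLoss.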
In our setting Q1 better models P than Q2\naccording to the scoring rule S and its associated divergence d if d(Q1, P ) < d(Q2, P ). As noted\nin Pinson and Tastu [19], S being proper does not ensure d(Q1, y) < d(Q2, y) for all observations y\ndrawn from P . However if the scoring rule is strictly proper scoring rule, this property should be\nensured in the neighbourhood of the true distribution.\n4 Experiments : Hand Pose Estimation\nGiven a depth image x, which often contains occlusions and missing values, we wish to predict the\nhand pose y. We use the NYU Hand Pose dataset of Tompson et al. [27] to estimate the ef\ufb01ciency of\nDISCO Nets for this task.\n4.1 Experimental Setup\nNYU Hand Pose Dataset. The NYU Hand Pose dataset of Tompson et al. [27] contains 8252\ntesting and 72,757 training frames of captured RGBD data with ground-truth hand pose information.\nThe training set is composed of images of one person whereas the testing set gathers samples from\ntwo persons. For each frame, the RGBD data from 3 Kinects is provided: a frontal view and 2 side\nviews. In our experiments we use only the depth data from the frontal view. While the ground truth\n\n5\n\n\fcontains J = 36 annotated joints, we follow the evaluation protocol of Oberweger et al. [17, 18] and\nuse the same subset of J = 14 joints. We also perform the same data preprocessing as in Oberweger\net al. [17, 18], and extract a \ufb01xed-size metric cube around the hand from the depth image. We resize\nthe depth values within the cube to a 128 \u00d7 128 patch and normalized them in [\u22121, 1]. Pixels deeper\nthan the back of the cube and missing depth values are both set to a depth of 1.\n\nMethods. We employ loss functions between two outputs of the form of the Energy score (8), that\nis, \u2206training = \u2206\u03b2(y, y(cid:48)) = ||y \u2212 y(cid:48)||\u03b2\n2 . Our \ufb01rst goal is to assess the advantages of DISCO Nets\nwith respect to non-probabilistic deep networks. 
One model, referred to as DISCO\u03b2,\u03b3, is a DISCO Nets probabilistic model, with \u03b3 \u2260 0 in the dissimilarity coef\ufb01cient of equation (6). The model DISCO\u03b2,\u03b3=0 takes \u03b3 = 0, but noise is still injected, so its capacity is the same as that of DISCO\u03b2,\u03b3\u22600. The model BASE\u03b2 is a non-probabilistic model, obtained by taking \u03b3 = 0 in the objective function of equation (6) and concatenating no noise. This corresponds to a classic deep network which for a given input x generates a single output y = G(x). Note that we write G(x) and not G(z, x) since no noise is concatenated.\n\nEvaluation Metrics. We report the classic non-probabilistic metrics for hand pose estimation employed in Oberweger et al. [17, 18] and Taylor et al. [26], namely the Mean Joint Euclidean Error (MeJEE), the Max Joint Euclidean Error (MaJEE) and the Fraction of Frames within distance (FF). We refer the reader to the supplementary material for the detailed expressions of these metrics. These metrics use the Euclidean distance between the prediction and the ground truth and require a single pointwise prediction. This pointwise prediction is chosen with the MEU method among K candidates. We also add the probabilistic metric ProbLoss. ProbLoss is de\ufb01ned as in Equation 8 with the Euclidean norm and is the divergence associated with a strictly proper scoring rule. Thus, ProbLoss ranks the ability of the models to represent the true distribution. ProbLoss is computed using K candidate poses for a given depth image. For the non-probabilistic model BASE\u03b2, only a single pointwise predicted output y is available. We construct the K candidates by adding Gaussian random noise of mean 0 and diagonal covariance \u03a3 = \u03c31, with \u03c3 \u2208 {1mm, 5mm, 10mm}, and refer to the model as BASE\u03b2,\u03c3. 2\nLoss functions. 
As we employ standard evaluation metrics based on the Euclidean norm, we train\nwith the Euclidean norm (that is, \u2206training(y, y(cid:48)) = ||y \u2212 y(cid:48)||\u03b2\n2 our\nobjective function coincides with ProbLoss.\n\n2 with \u03b2 = 1). When \u03b3 = 1\n\nArchitecture. The novelty of DISCO Nets resides in their objective function. They do not require\nthe use of a speci\ufb01c network architecture. This allows us to design a simple network architecture\ninspired by Oberweger et al. [18]. The architecture is shown in Figure 2. The input depth image x\nis fed to 2 convolutional layers, each having 8 \ufb01lters, with kernels of size 5 \u00d7 5, with stride 1,\nfollowed by Recti\ufb01ed Linear Units (ReLUs) and Max Pooling layers of kernel size 3 \u00d7 3. A third\nand last convolutional layer has 8 \ufb01lters, with kernels of size 5 \u00d7 5, with stride 1, followed by a\nRecti\ufb01ed Linear Unit. The ouput of the convolution is concatenated to the random noise vector z\nof size dz = 200, drawn from a uniform distribution in [\u22121, 1]. The result of the concatenation\nis fed to 2 dense layers of output size 1024, with ReLUs, and a third dense layer that outputs the\ncandidate pose y \u2208 R3\u00d7J. For the non-probabilistic BASE\u03b2,\u03c3 model no noise is concatenated as\nonly a pointwise estimate is produced.\n\nTraining. We use 10,000 examples from the 72,757 training frames to construct a validation\ndataset and train only on 62,757 examples. Back-propagation is used with Stochastic Gradient\nDescent with a batchsize of 256. The learning rate is \ufb01xed to \u03bb = 0.01 and we use a momentum\nof m = 0.9 (see Polyak [20]). We also add L2-regularisation controlled by the parameter C. We\nuse C = [0.0001, 0.001, 0.01] which is a relevant range as the comparative model BASE\u03b2 is best\nperforming for C = 0.001. Note that DISCO Nets report consistent performances across the different\nvalues C, contrarily to BASE\u03b2. 
We use 3 different random seeds to initialize each model network\nparameters. We report the performance of each model with its best cross-validated seed and C. We\ntrain all models for 400 epochs as it results in a change of less than 3% in the value of the loss on the\nvalidation dataset for BASE\u03b2. We refer the reader to the supplementary material for details on the\nsetting.\n\n2We also evaluate the non-probabilistic model BASE\u03b2 using its pointwise prediction rather than the MEU\n\nmethod. Results are consistent and detailed in the supplementary material.\n\n6\n\n\fTable 2: Metrics values on the test set \u00b1 SEM. Best\nperformances in bold.\nModel\nProbLoss (mm) MeJEE (mm) MaJEE (mm) FF (80mm)\n103.8\u00b10.627\nBASE\u03b2=1,\u03c3=1\n99.3\u00b10.620\nBASE\u03b2=1,\u03c3=5\n96.3\u00b10.612\nBASE\u03b2=1,\u03c3=10\n92.9\u00b10.533\nDISCO\u03b2=1,\u03b3=0\nDISCO\u03b2=1,\u03b3=0.25 89.9\u00b10.510\n83.8 \u00b10.503\nDISCO\u03b2=1,\u03b3=0.5\n\n25.2\u00b10.152 52.7\u00b10.290\n25.5\u00b10.151 52.9\u00b10.289\n25.7\u00b10.149 53.2\u00b10.288\n21.6\u00b10.128 46.0\u00b10.251\n21.2\u00b10.122 46.4\u00b10.252\n20.9\u00b10.124 45.1\u00b10.246\n\n86.040\n85.773\n85.664\n92.971\n93.262\n94.438\n\nTable 3: Metrics values on the test set \u00b1 SEM for\ncGAN.\nModel\ncGAN\ncGANinit, \ufb01xed 128.9\u00b10.480\n\nProbLoss (mm) MeJEE (mm) MaJEE (mm) FF (80mm)\n442.7\u00b10.513 109.8\u00b10.128 201.4\u00b10.320\n31.8\u00b10.117 64.3\u00b10.230\n\n0.000\n78.454\n\n4.2 Results.\nQuantitative Evaluation. Table 2 reports performances on the test dataset, with parameters cross-\nvalidated on the validation set. All versions of the DISCO Net model outperform the BASE\u03b2 model.\nAmong the different values of \u03b3, we see that \u03b3 = 0.5 better captures the true distribution (lower\nProbLoss) while retaining accurate performance on the standard pointwise metrics. Interestingly,\nusing an all-zero noise at test-time gives similar performances on pointwise metrics. 
We link this to the observation that both the MEAN and the MEU method perform equivalently on these metrics (see supplementary material).\nQualitative Evaluation. In Figure 3 we show candidate poses generated by DISCO\u03b2=1,\u03b3=0.5 for 3 testing examples. The left image shows the input depth image, and the right image shows the ground-truth pose (in grey) with 100 candidate outputs (superimposed in transparent red). The model predicts the joint locations and we interpolate the joints with edges. If an edge is thinner and more opaque, it means the different predictions overlap and that the uncertainty on the location of the edge\u2019s joints is low. We can see that DISCO\u03b2=1,\u03b3=0.5 captures relevant information on the structure of the hand.\n\n(a) When there are no occlusions, DISCO Nets model low uncertainty on all joints.\n\n(b) When the hand is half-\ufb01sted, DISCO Nets model the uncertainty on the location of the \ufb01ngertips.\n\n(c) Here the \ufb01ngertips of all \ufb01ngers but the fore\ufb01nger are occluded and DISCO Nets model high uncertainty on them.\n\nFigure 3: Visualisation of DISCO\u03b2=1,\u03b3=0.5 predictions for 3 examples from the testing dataset. The left image shows the input depth image, and the right image shows the ground-truth pose in grey with 100 candidate outputs superimposed in transparent red. Best viewed in color.\nFigure 4 shows the matrices of Pearson product-moment correlation coef\ufb01cients between joints. 
We note that the DISCO Net with \u03b3 = 0.5 better captures the correlation between the joints of a \ufb01nger and between the \ufb01ngers.\n\n(Figure 4 panels: two correlation matrices over the joints P PR PL TR TM TT IM IT MM MT RM RT PM PT, for \u03b3 = 0 (left) and \u03b3 = 0.5 (right).)\n\nFigure 4: Pearson coef\ufb01cient matrices of the joints: Palm (no value as the empirical variance is null), Palm Right, Palm Left, Thumb Root, Thumb Mid, Thumb Tip, Index Mid, Index Tip, Middle Mid, Middle Tip, Ring Mid, Ring Tip, Pinky Mid, Pinky Tip.\n\n7\n\n\f4.3 Comparison with existing probabilistic models.\nTo the best of our knowledge, the conditional Generative Adversarial Network (cGAN) of Mirza and Osindero [16] has not been applied to pose estimation. In order to compare cGAN to DISCO Nets, several issues must be overcome. First, we must design a network architecture for the Discriminator. This is a \ufb01rst disadvantage of cGAN compared to DISCO Nets, which require no adversary. Second, as mentioned in Goodfellow et al. [7] and Radford et al. [22], GAN (and thus cGAN) models require very careful design of the networks\u2019 architecture and training procedure. In order to do a fair comparison, we followed the work in Mirza and Osindero [16] and the practical advice for GAN presented in Larsen and S\u00f8nderby [13]. We try (i) cGAN, initialising all layers of D and G randomly, and (ii) cGANinit, \ufb01xed, initialising the convolutional layers of G and D with the trained best-performing DISCO\u03b2=1,\u03b3=0.5 of Section 4.2, and keeping these layers \ufb01xed. That is, the convolutional parts of G and D are \ufb01xed feature extractors for the depth image. This is a setting similar to the one employed for tag-annotation of images in Mirza and Osindero [16]. Details on the setting can be found in the supplementary material. 
Table 3 shows that the cGAN model obtains relevant results only when the convolutional\nlayers of G and D are initialised with our trained model and kept \ufb01xed, that is cGANinit, \ufb01xed. These\nresults are still worse than DISCO Nets performances. While there may be a better architecture for\ncGAN, our experiments demonstrate the dif\ufb01culty of training cGAN over DISCO Nets.\n4.4 Reference state-of-the-art values.\nWe train the best-performing DISCO\u03b2=1,\u03b3=0.5 of Section 4.2 on the entire dataset, and compare\nperformances with state-of-the-art methods in Table 4 and Figure 5. These state-of-the-art methods\nare speci\ufb01cally designed for hand pose estimation. In Oberweger et al. [17] a constrained prior hand\nmodel, referred as NYU-Prior, is re\ufb01ned on each hand joint position to increase accuracy, referred\nas NYU-Prior-Re\ufb01ned. In Oberweger et al. [18], the input depth image is fed to a \ufb01rst network\nNYU-Init, that outputs a pose used to synthesize an image with a second network. The synthesized\nimage is used with the input depth image to derive a pose update. We refer to the whole model as\nNYU-Feedback. On the contrary, DISCO Nets uses a single network whose architecture is similar\nto NYU-Prior (without constraining on a pose prior). By accurately modeling the distribution of\nthe pose given the depth image, DISCO Nets obtain comparable performances to NYU-Prior and\nNYU-Prior-Re\ufb01ned. 
Without any extra effort, DISCO Nets could be embedded in the presented refinement and feedback methods, possibly boosting state-of-the-art performance.

Table 4: DISCO Nets compared to state-of-the-art performances ± SEM.

Model              | MeJEE (mm)  | MaJEE (mm)  | FF (80mm)
NYU-Prior          | 20.7±0.150  | 44.8±0.289  | 91.190
NYU-Prior-Refined  | 19.7±0.157  | 44.7±0.327  | 88.148
NYU-Init           | 27.4±0.152  | 55.4±0.265  | 86.537
NYU-Feedback       | 16.0±0.096  | 36.1±0.208  | 97.334
DISCOβ=1,γ=0.5     | 20.7±0.121  | 45.1±0.246  | 93.250

Figure 5: Fractions of frames within distance d in mm (by 5 mm). Best viewed in color.

5 Discussion.
We presented DISCO Nets, a new family of probabilistic models based on deep networks. DISCO Nets employ a prediction and training procedure based on the minimisation of a dissimilarity coefficient. Theoretically, this ensures that DISCO Nets accurately capture the uncertainty on the correct output to predict given an input. Experimental results on the task of hand pose estimation consistently support our theoretical hypothesis, as DISCO Nets outperform equivalent non-probabilistic models as well as existing probabilistic models. Furthermore, DISCO Nets can be tailored to the task at hand, which allows a user to train them to tackle different problems of interest. As their novelty resides mainly in their objective function, DISCO Nets do not require any specific architecture and can be easily applied to new problems. We contemplate several directions for future work. First, we will apply DISCO Nets to other prediction problems where there is uncertainty on the output. Second, we would like to extend DISCO Nets to latent variable models, allowing us to apply DISCO Nets to diverse datasets where ground-truth annotations are missing or incomplete.
6 Acknowledgements.
This work is funded by the Microsoft Research PhD Scholarship Programme.
We would like to thank Pankaj Pansari, Leonard Berrada and Ondra Miksik for their useful discussions and insights.

References.
[1] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[2] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.
[3] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR, 2013.
[4] J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, 2014.
[5] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 2007.
[6] T. Gneiting, L. I. Stanberry, E. P. Grimit, L. Held, and N. A. Johnson. Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 2008.
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[8] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. J. Smola. A kernel method for the two-sample problem. In NIPS, 2007.
[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. J. Smola. A kernel two-sample test. JMLR, 2012.
[10] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[11] M. P. Kumar, B. Packer, and D. Koller. Modeling latent variable uncertainty for loss-based learning. In ICML, 2012.
[12] S. Lacoste-Julien, F. Huszar, and Z. Ghahramani. Approximate inference for the loss-calibrated Bayesian. In AISTATS, 2011.
[13] A. B. L. Larsen and S. K. Sønderby.
URL http://torch.ch/blog/2015/11/13/gan.html.
[14] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015.
[15] A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow. Adversarial autoencoders. ICLR Workshop, 2015.
[16] M. Mirza and S. Osindero. Conditional generative adversarial nets. In NIPS Deep Learning Workshop, 2014.
[17] M. Oberweger, P. Wohlhart, and V. Lepetit. Hands deep in deep learning for hand pose estimation. In Computer Vision Winter Workshop, 2015.
[18] M. Oberweger, P. Wohlhart, and V. Lepetit. Training a feedback loop for hand pose estimation. In ICCV, 2015.
[19] P. Pinson and J. Tastu. Discrimination ability of the Energy score. 2013.
[20] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. 1964.
[21] V. Premachandran, D. Tarlow, and D. Batra. Empirical minimum Bayes risk prediction: How to extract an extra few % performance from vision models with just three more parameters. In CVPR, 2014.
[22] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2015.
[23] C. R. Rao. Diversity and dissimilarity coefficients: A unified approach. Theoretical Population Biology, 21(1):24-43, 1982.
[24] S. Reed, Z. Akata, X. Yan, L. Logeswaran, H. Lee, and B. Schiele. Generative adversarial text to image synthesis. In ICML, 2016.
[25] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. ICLR, 2016.
[26] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon. The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In CVPR, 2012.
[27] J. Tompson, M. Stein, Y. Lecun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics, 2014.
[28] X. Yan, J. Yang, K.
Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. 2016.
[29] E. Zawadzki and S. Lahaie. Nonparametric scoring rules. In AAAI Conference on Artificial Intelligence, 2015.