{"title": "Learning about an exponential amount of conditional distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 13703, "page_last": 13714, "abstract": "We introduce the Neural Conditioner (NC), a self-supervised machine able to learn about all the conditional distributions of a random vector X. The NC is a function NC(x\u22c5a,a,r) that leverages adversarial training to match each conditional distribution P(Xr|Xa=xa). After training, the NC generalizes to sample from conditional distributions never seen, including the joint distribution. The NC is also able to auto-encode examples, providing data representations useful for downstream classification tasks. In sum, the NC integrates different self-supervised tasks (each being the estimation of a conditional distribution) and levels of supervision (partially observed data) seamlessly into a single learning experience.", "full_text": "Learning about an exponential amount of\n\nconditional distributions\n\nMohamed Ishmael Belghazi1,2\nishmael.belghazi@gmail.com\n\nMaxime Oquab1\n\nqas@fb.com\n\nYann Lecun1\nyann@fb.com\n\nDavid Lopez-Paz1\n\ndlp@fb.com\n\n1Facebook AI Research, Paris, France\n\n2Montr\u00e9al Institute for Learning Algorithms, Montr\u00e9al, Canada\n\nAbstract\n\nWe introduce the Neural Conditioner (NC), a self-supervised machine able to learn\nabout all the conditional distributions of a random vector X. The NC is a function\nNC(x \u00b7 a, a, r) that leverages adversarial training to match each conditional distri-\nbution P (Xr|Xa = xa). After training, the NC generalizes to sample conditional\ndistributions never seen, including the joint distribution. The NC is also able to\nauto-encode examples, providing data representations useful for downstream classi-\n\ufb01cation tasks. 
In sum, the NC integrates different self-supervised tasks (each being\nthe estimation of a conditional distribution) and levels of supervision (partially\nobserved data) seamlessly into a single learning experience.\n\n1\n\nIntroduction\n\nSupervised learning estimates the conditional distribution of a target variable given values for a feature\nvariable [63]. Supervised learning is the backbone to build state-of-the-art prediction models using\nlarge amounts of labeled data, with unprecedented success in domains spanning image classi\ufb01cation,\nspeech recognition, and language translation [35]. Unfortunately, collecting large amounts of labeled\ndata is an expensive task painstakingly performed by humans (for instance, consider labeling the\nobjects appearing in millions of images). If our ambition to transition from machine learning to\narti\ufb01cial intelligence is to be met, we must build algorithms capable of learning effectively from\ninexpensive unlabeled data without human supervision (for instance, millions of unlabeled images).\nFurthermore, we are interested in the case where the available unlabeled data is partially observed.\nThus, the goal of this paper is unsupervised learning, de\ufb01ned as understanding the underlying process\ngenerating some partially observed unlabeled data.\n\nCurrently, unsupervised learning strategies come in many \ufb02avors, including component analysis,\nclustering, energy modeling, and density estimation [23]. Each of these strategies targets the\nestimation of a particular statistic from high-dimensional data. For example, principal component\nanalysis extracts a set of directions under which the data exhibits maximum variance [28]. 
However,\npowerful unsupervised learning should not commit to the estimation of a particular statistic from\ndata, but extract general-purpose features useful for downstream tasks.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAn emerging, more general strategy for unsupervised learning is self-supervised learning\n[24, for instance]. The guiding principle behind self-supervised learning is to set up a supervised\nlearning problem based on unlabeled data, such that solving that supervised learning problem leads\nto partial understanding of the data generating process [32]. More specifically, self-supervised\nlearning algorithms transform the unlabeled data into one set of input features and one set of output\nfeatures. Then, a supervised learning model is trained to predict the output features from the input\nfeatures. Finally, the trained model is later leveraged to solve subsequent learning tasks efficiently. As\nsuch, self-supervision turns unsupervised learning into the supervised learning problem of estimating\nthe conditional expectation of the output features given the input features. A common example of a\nself-supervised problem is image in-painting. Here, the central patch of an image (output feature) is\npredicted from its surrounding pixel values (input feature), with the hope that learning to in-paint leads\nto the learning of non-trivial image features [50, 38]. Another example of a self-supervised learning\nproblem extracts a pair of patches from one image as the input feature, and requests their relative\nposition as the target output feature [10]. 
These examples hint at one potential pitfall of \u201cspecialized\u201d\nself-supervised learning algorithms: in order to learn a single conditional distribution from the many\ndescribing the data, it may be acceptable to throw away most of the information about the sought\ngenerative process, which in fact we would like to keep for subsequent learning tasks.\n\nThus, a general-purpose unsupervised learning machine should not commit to the estimation of a\nparticular conditional distribution from data, but attempt to learn as much structure (i.e., interactions\nbetween variables) as possible. This is a daunting task, since joint distributions can be described in\nterms of an exponential amount of conditional distributions. Thus, learning the joint distribution, a\nproblem usually associated with unsupervised learning, can be understood as analogous to an exponential\namount of supervised learning problems. Our challenges do not end here. Being realistic, learning\nagents never observe the entire world. For instance, occlusions and camera movements hide portions\nof the world that we would otherwise observe. Therefore, we are interested in unsupervised learning\nalgorithms able to learn about the structure of unlabeled data from partial observations.\n\nIn this paper, we address the task of unsupervised learning from partial data by introducing the\nNeural Conditioner (NC). In a nutshell, the NC is a function NC(x \u00b7 a, a, r) that leverages adversarial\ntraining to match each conditional distribution P (Xr|Xa = xa). The set of available variables a,\nthe set of requested variables r, and the set of available values x \u00b7 a can be either determined by the\npattern of missing values in data, or randomly by the self-supervised learning process. 
The set of\navailable variables a and the set of requested variables r are not necessarily complementary, and\nindex an exponential amount of conditional distributions (each associated with a single self-supervised\nlearning problem). After training, the NC generalizes to sample from conditional distributions never\nseen during training, including the joint distribution. Furthermore, trained NCs are also able to\nauto-encode examples, providing data representations useful for downstream classification tasks.\nSince the NC does not commit to a particular conditional distribution but attempts to learn a large\nnumber of them, we argue that our model is a small step towards general-purpose unsupervised\nlearning. Our contributions are as follows:\n\n\u2022 We introduce the Neural Conditioner (NC) (Section 2), a method to perform unsupervised\nlearning from partially observed data.\n\n\u2022 We explain the multiple uses of NCs (Section 3), including the generation of conditional\nsamples, unconditional samples, and feature extraction from partially observed data.\n\n\u2022 We provide insights on how NCs work and should be regularized (Section 4).\n\n\u2022 Through a variety of experiments on synthetic and image data, we show the efficacy of\nNCs in generation and prediction tasks (Sections 5 and 7).\n\n2 The Neural Conditioner (NC)\n\nConsider the dataset (x1, . . . , xn), where each xi \u2208 R^d is an identically and independently distributed\n(iid) example drawn from some joint probability distribution P (X). Without any further information,\nwe could consider O(3^d) different prediction problems about the random vector X, where each\nprediction problem partitions the coordinates of xi into features, targets, or unobserved variables. We\nmay index this exponential amount of supervised learning problems using binary vectors of available\nfeatures a \u2208 {0, 1}^d and requested features r \u2208 {0, 1}^d. 
In statistical terms, a pair of available and\nrequested vectors (r, a) instantiates the supervised learning problem of estimating the conditional\ndistribution P (Xr|Xa = xa), where xr = (xi : ri = 1), and xa = (xi : ai = 1).\n\nBy making use of the notations above, we can design a single supervised learning problem to estimate\nall the conditional distributions contained in the random vector X. Since learning algorithms are\noften designed to deal with inputs and outputs with a fixed number of dimensions, we will consider\nthe augmented supervised learning problem of mapping the feature vector (x \u00b7 a, a, r) into the target\nvector x \u00b7 r, where the operation \u201c\u00b7\u201d denotes entry-wise multiplication. In short, our goal is to learn a\nNeural Conditioner (NC) producing samples:\n\n\u02c6x \u223c NC(x \u00b7 a, a, r) : \u02c6xr \u223c P (Xr|Xa = xa) \u2200 (x, a, r).\n\nFigure 1: The proposed NC, where data x \u223c P (X), available/requested masks a, r \u223c P (a, r), and\nnoise z \u223c N (0, I).\n\nFigure 2: Example of masks and masked images. NC learns to predict x \u00b7 r from x \u00b7 a.\n\nThe previous equation manifests the ambition of NC to model the entire conditional distribution\nP (Xr|Xa = xa) when given a triplet (x, a, r). Therefore, given the dataset (x1, . . . , xn), learning a\nNC translates into minimizing the distance between the estimated conditional distributions NC(x \u00b7\na, a, r) and the true conditional distributions P (Xr|Xa = xa), based on their samples. In particular,\nwe will follow recent advances in implicit generative modeling, and implement NC training using\ntools from generative adversarial networks [18]. 
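As a concrete illustration of the setup above, the following sketch (hypothetical names, plain NumPy; not the authors' code) builds the fixed-size input (x \u00b7 a, a, r) and target x \u00b7 r for one example:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)                     # one example x in R^d

# Binary masks: which coordinates are available and which are requested.
a = np.array([1.0, 1.0, 0.0, 0.0, 0.0])    # available features
r = np.array([0.0, 0.0, 1.0, 1.0, 0.0])    # requested features; the last
                                           # coordinate stays unobserved

# Fixed-size input and target via entry-wise multiplication.
nc_input = np.concatenate([x * a, a, r])   # (x.a, a, r), always 3d-dimensional
target = x * r                             # x.r

# Each coordinate is available, requested, or unobserved, so the masks
# index O(3^d) distinct prediction problems.
n_problems = 3 ** d
```

Re-sampling the masks for each example is what turns one dataset into this family of prediction problems.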
Other alternatives to train NCs would include\nmaximum mean discrepancy metrics [21], energy distances [61], or variational inference [31]. If\nthe practitioner is only interested in recovering a particular statistic from the exponentially many\nconditional distributions (e.g. the conditional means), training a NC with a scoring rule D for that\nstatistic (e.g. the mean squared error loss) would suffice.\n\nTraining a NC is an iterative process involving six steps, illustrated in Figures 1 and 2:\n\n1. A data sample x is drawn from P (X).\n\n2. Available and requested masks (r, a) are drawn from some data-defined or user-defined distribution P (R, A). These masks are not necessarily complementary, enabling the existence\nof unobserved (neither requested nor available) variables. If a coordinate equals one in\nboth r and a, we zero it in the requested mask.\n\n3. A noise vector z is sampled from an external source of noise with distribution P (Z).\n\n4. A sample is generated as \u02c6x = NC(x \u00b7 a, a, r, z).\n\n5. A discriminator D provides the final scalar objective function by distinguishing between data\nsamples (scored as D(x \u00b7 r, x \u00b7 a, a, r)) and generated samples (scored as D(\u02c6x \u00b7 r, x \u00b7 a, a, r)).\n\n6. The NC parameters are updated to minimize the objective function, while the parameters of\nthe discriminator are updated to maximize it, following adversarial training [18].\n\nMathematically, our general objective function is:\n\nmin_NC max_D E_{x,a,r} log D(x \u00b7 r, x \u00b7 a, a, r) + E_{x,a,r,z} log(1 \u2212 D(NC(x \u00b7 a, a, r, z) \u00b7 r, x \u00b7 a, a, r)). (1)\n\n3 Using NCs\n\nOnce trained, one NC serves many purposes. The most direct use is perhaps the multimodal prediction\nof any subset of variables given any subset of variables. More specifically, a NC is able to leverage\nany partially observed vector xa to predict any partially requested vector xr. 
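The six training steps can be sketched for one minibatch as follows; `nc` and `disc` are stand-in callables (in the paper both are neural networks), so this only illustrates the data flow of objective (1), not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
batch, d = 8, 4

def sample_masks(batch, d, rng):
    """Step 2: draw non-complementary masks; a coordinate present in both
    a and r is zeroed in the requested mask r."""
    a = rng.integers(0, 2, size=(batch, d)).astype(float)
    r = rng.integers(0, 2, size=(batch, d)).astype(float)
    return a, r * (1.0 - a)

def nc(x_a, a, r, z):                      # stand-in for NC(x.a, a, r, z)
    return x_a + z                         # placeholder for a neural network

def disc(v_r, x_a, a, r):                  # stand-in for D(v.r, x.a, a, r)
    s = (v_r + x_a + a + r).sum(axis=1)
    return 1.0 / (1.0 + np.exp(-s))        # sigmoid score in (0, 1)

x = rng.normal(size=(batch, d))            # step 1: data x ~ P(X)
a, r = sample_masks(batch, d, rng)         # step 2: masks (a, r)
z = rng.normal(size=(batch, d))            # step 3: noise z ~ P(Z)
x_hat = nc(x * a, a, r, z)                 # step 4: generated sample x_hat

# Step 5: minibatch value of the minimax objective (1).
real = disc(x * r, x * a, a, r)
fake = disc(x_hat * r, x * a, a, r)
value = np.mean(np.log(real) + np.log(1.0 - fake))
# Step 6: update D to increase `value` and NC to decrease it (omitted here).
```

Note that the discriminator only ever scores the requested coordinates (v \u00b7 r) together with the available context (x \u00b7 a, a, r), matching step 5 above.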
Importantly, the\ncombination of test values, available, and requested masks (x, a, r) could be novel and never seen\nduring training. Since NCs leverage an external source of noise z to make their predictions, NCs\nprovide a conditional distribution for each triplet (x, a, r).\n\nTwo special cases of masks deserve special attention. First, properly regularized NCs are able to\ncompress and reconstruct samples when provided with the full requested mask r = 1 and the full\navailable mask a = 1. This turns NCs into autoencoders able to extract feature representations of\ndata, as well as allowing latent interpolations between pairs of examples. Second, when provided\nwith the full requested mask r = 1 and the empty available mask a = 0, NCs are able to generate\nfull samples from the data joint distribution P (X), even in the case when the training never provided\nthe NC with this mask combination, as our experiments verify.\n\nNCs are able to seamlessly deal with missing features and/or labels during both training and testing\ntime. Such \u201cmissingness\u201d of features and labels can be real (as given by incomplete or unlabeled\nexamples) or simulated by designing an appropriate distribution of masks P (A, R). This blurs the\nlines that often separate unsupervised, semi-supervised, and supervised learning, integrating all types\nof data and supervision into a new learning paradigm.\n\nFinally, a trained NC can be used to understand relations between variables, for instance by using a\ncomplete test vector x and querying different available and requested masks. 
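The special mask settings discussed above can be written down directly. The `nc` below is a stand-in (a trained network would replace it), chosen only so that the three modes are visible in one sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

def nc(x_a, a, r, z):
    # Stand-in for a trained NC: copy available coordinates, and fill the
    # requested-but-unavailable ones from the noise source.
    return x_a + z * (1.0 - a) * r

x = rng.normal(size=d)
z = rng.normal(size=d)
ones, zeros = np.ones(d), np.zeros(d)

# a = 1, r = 1: compress-and-reconstruct (auto-encoding) mode.
reconstruction = nc(x * ones, ones, ones, z)

# a = 0, r = 1: nothing is observed, so the output plays the role of a
# sample from the joint distribution P(X), a mask pair possibly never
# seen during training.
joint_sample = nc(x * zeros, zeros, ones, z)

# Arbitrary (a, r): multimodal prediction of x_r given x_a; re-sampling z
# gives fresh draws from the conditional P(X_r | X_a = x_a).
a = np.array([1.0, 0.0, 0.0])
r = np.array([0.0, 1.0, 1.0])
conditional_draw = nc(x * a, a, r, z)
```

Only the masks change between the three calls; this is what lets a single trained network cover all the use cases listed above.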
The strongest relations\nbetween variables can also be analyzed in terms of gradients with respect to (a, r).\n\n4 Understanding NCs\n\nTo better understand how NCs work, this section describes i) what NCs look like in the Gaussian case,\nii) what the optimal discriminator minimizes, iii) the relationship between NC training and the usual\nreconstruction error minimized by auto-encoders, and iv) some regularization techniques.\n\n4.1 The Gaussian case\n\nLet us consider the case where the data joint distribution is a Gaussian P (X) = N (\u00b5, \u03a3). Then, the\nclosed-form expression of the conditional distribution implied by any triplet (x, a, r) is P (Xr|Xa = xa) = N (\u00b5r|a, \u03a3r|a), where \u00b5r|a = \u00b5r + \u03a3ra \u03a3aa^{-1} (xa \u2212 \u00b5a), and \u03a3r|a = \u03a3rr \u2212 \u03a3ra \u03a3aa^{-1} \u03a3ar.\n\nThe previous expressions highlight an interesting fact: even in the case of Gaussian distributions,\ncomputing the conditional moments implied by (x, a, r) is a non-linear operation. When fixing\n(a, r) = (a0, r0), learning the conditional distribution implied by triplets (x, a0, r0) can be understood\nas linear heteroencoding [54].\n\nThe motivation behind self-supervised learning is that learning about a conditional distribution is an\neffective way to learn about the joint distribution. In part, this is because learning conditional distributions allows us to deploy the powerful machinery of supervised learning. To formalize this, we consider\nthe amount of information contained in a probability distribution in terms of its differential entropy.\nThen, we show that learning conditional distributions is easier than learning joint distributions, where\n\u201cdifficulty\u201d is measured in terms of how much information is to be learned. This argument can be made\nby considering the chain rule of the differential entropy [9]: h(X) = \u2211_{i=1}^{d} h(Xi|X1, . . . , Xi\u22121),\nwhere, in the case of partitioning X = (Xa, Xr), we have: h(X) = h(Xr|Xa) + h(Xa). The\nprevious shows that h(Xr|Xa) \u2264 h(Xr), where equality is achieved if and only if Xa and Xr are\nindependent. This reveals a \u201cblessing of structure\u201d of sorts: to reduce the difficulty of learning about\na joint distribution, we should construct self-supervised learning problems associated with conditional\ndistributions between highly coupled blocks of input and output features. Indeed, if all of our variables\nare independent, self-supervised learning is hopeless. For the case of a d-dimensional Gaussian\nwith covariance matrix \u03a3, the differential entropy can be stated in terms of the covariance function:\nh(\u03a3) = (d/2)(1 + log(2\u03c0)) + (1/2) log(|\u03a3|), which allows choosing good self-supervised learning problems\nbased on the log-determinant of empirical covariances.\n\nA successful evolution from single self-supervised learning problems to NCs rests on the existence of\nrelationships between different conditional distributions. More formally, the success of NCs relies\non assuming a smooth landscape of conditionals. If smoothness across conditional distributions is\nsatisfied, learning about some conditional distribution should inform us about other, perhaps never\nseen, conditionals. This is akin to supervised learning algorithms relying on smoothness properties\nof the function to be learned. For NCs we do not consider the smoothness of a single function,\nbut the smoothness of the \u201cconditioning operator\u201d Cx(a, r) = NC(x \u00b7 a, a, r). The smoothness of\nthis conditioning operator is related to the smoothness of the covariance operator studied in kernel\nembeddings of distributions [45].\n\n4.2 Training objective, NC\u2019s point of view\n\nThis section considers the following question: what is the objective function minimized by NC? 
In\nparticular, we are interested in the intriguing fact that the NC is able to complete and reconstruct\nsamples even though the discriminator is never presented with pairs of real and generated requested\nvariables. First, consider the \u201caugmented\u201d data and model (\u03b8) joint distributions\n\nP (Xa, Xr, A, R) = q(Xr|Xa, A, R) p(Xa, A, R),\nP\u03b8(Xa, Xr, A, R) = q\u03b8(Xr|Xa, A, R) p(Xa, A, R).\n\nNext, consider the negative log-likelihood L(xa, a, r) = \u2212 E_q log q\u03b8 and its expectation L = \u2212 E_P log q\u03b8. Then,\n\nL(Xa, A, R) = \u2212 E_q log(q\u03b8 \u00b7 q/q) = \u2212 E_q {log(q\u03b8/q) + log q} = \u222b \u2212q log q \u2212 \u222b q log(q\u03b8/q).\n\nIntegrating with respect to p(Xa, A, R), we see that NCs minimize:\n\nL = DKL(P \u2016 P\u03b8) + H(XR|XA) = DKL(P \u2016 P\u03b8) \u2212 I(XA, XR) + H(XR),\n\nwhere H stands for (conditional) entropy and I for mutual information. Following [18], assuming an\noptimal discriminator and a NC globally minimizing (1), we have that P = P\u03b8, DKL(P \u2016 P\u03b8) = 0,\nand thus L = H(XR|XA).\n\nWe summarize the previous results as follows. If a NC is able to match the distributions (P, P\u03b8),\nthere will be a residual reconstruction error of H(XR | XA). Thus, if XA and XR are independent,\nsuch residual reconstruction error reduces to H(XR). This can happen if A = 0, or if XA holds no\ninformation about XR. Moreover, the reconstruction error is a decreasing function of the amount of\ninformation that XA holds about XR.\n\n4.3 Regularization\n\nWe found, during our experiments, gradient-based regularization on the discriminator to be\ncrucial. 
Following [53], we augment the discriminator\u2019s loss with the expected squared gradient norms with\nrespect to the inputs for both the positive and negative examples; less succinctly, we add\n(1/2)(E[\u2016\u2207D(XA, XR, A, R)\u2016\u00b2] + E[\u2016\u2207D(XA, \u02c6XR, A, R)\u2016\u00b2]) to the discriminator\u2019s loss.\n\nFor NC to generalize to unobserved conditional distributions and prevent memorizing the observed\nones, we have found regularization of the latent space to be essential. In information-theoretic\nterms, we would like to control the mutual information between XA and Z := enc(XA, \u01eb). One\ncould use a variational approximation of the conditional entropy [1] or an adversarial approach [3].\nThe former requires an encoder with tractable conditional density (e.g. Gaussian); the latter, while\nallowing general encoders, introduces an additional training loop in the algorithm. We opt for another\napproach by controlling the encoder\u2019s Lipschitz constant using one-sided spectral normalization [42].\n\n5 Experiments on Gaussian data\n\nWe train a single NC to model all the conditionals of a three-dimensional Gaussian distribution.\nGiven that in this example we know that the data generating process is fully determined by the first\ntwo moments, we train two versions of NCs: one that uses moment-matching, and one that uses\nour full adversarial training pipeline. Both strategies train the NC given minibatches of triplets (x, a, r)\nobserved from the same Gaussian distribution. This allows us to better understand the impact of\nadversarial training when dealing with NCs. For these experiments, both the discriminator and the\nNC have 2 hidden layers of 64 units each, and ReLU non-linearities. We regularize the latent space\nof the NC using one-sided spectral normalization [43]. We train the networks for 10,000 updates,\nwith a batch-size of 512, and the Adam optimizer with a learning rate of 10^{-4}, \u03b21 = 0.5, and\n\u03b22 = 0.999. The training set contains 10^4 fixed samples sampled from a Gaussian with mean (2, 4, 6)\nand covariance ((1, 0.5, 0.25), (0.5, 1, 0), (0.25, 0, 1)).\n\nFigure 3: Illustration of the NC on a three-dimensional Gaussian dataset. We show a) one-dimensional\nconditional estimation, b) two-dimensional conditional estimation, and c,d) the representation of the\nconditional distributions in the hidden space.\n\nTable 1: Average error norms \u2016\u03b8r|a \u2212 \u02c6\u03b8r|a\u2016 in the task of estimating the conditional moments\n\u03b8r|a = (\u00b5r|a, \u03a3r|a) of Gaussian data. We show results for Moment-Matching (MM) and the full\nAdversarial Training (AT). VAEAC only supports complementary masks (some results are NA).\n\na / r: NC (MM) / NC (AT) / VAEAC\n(1, 0, 0) / (0, 0, 1): .09 \u00b1 .06 / .10 \u00b1 .05 / NA\n(1, 0, 0) / (0, 1, 0): .10 \u00b1 .04 / .07 \u00b1 .03 / NA\n(1, 0, 0) / (0, 1, 1): .67 \u00b1 .05 / .13 \u00b1 .04 / .68 \u00b1 .03\n(0, 1, 0) / (0, 0, 1): .16 \u00b1 .03 / .08 \u00b1 .05 / NA\n(0, 1, 0) / (1, 0, 0): .20 \u00b1 .05 / .05 \u00b1 .03 / NA\n(0, 1, 0) / (1, 0, 1): .28 \u00b1 .07 / .14 \u00b1 .06 / .73 \u00b1 .03\n(0, 0, 1) / (0, 1, 0): .13 \u00b1 .06 / .11 \u00b1 .07 / NA\n(0, 0, 1) / (1, 0, 0): .08 \u00b1 .05 / .09 \u00b1 .05 / NA\n(0, 0, 1) / (1, 1, 0): .29 \u00b1 .03 / .17 \u00b1 .03 / .71 \u00b1 .06\n(1, 0, 1) / (0, 1, 0): .22 \u00b1 .07 / .11 \u00b1 .07 / .50 \u00b1 .04\n(1, 1, 0) / (0, 0, 1): .15 \u00b1 .08 / .08 \u00b1 .05 / .43 \u00b1 .03\n(0, 1, 1) / (1, 0, 0): .27 \u00b1 .09 / .15 \u00b1 .07 / .35 \u00b1 .05\n\nNC conditioning\n(a, r)\n\n\u2205\n\ndiscriminator \u2205\nconditioning\n\n(a, r)\n\n0.12\n0.15\n\n0.17\n0.07\n\nFigure 3 illustrates the capabilities of NC to perform one-dimensional and two-dimensional conditional distribution estimation. We also show the embeddings of the conditional distributions as\ngiven by the bottleneck of NC. 
These show a higher dependence for variables that are more tightly\ncoupled. Table 1 shows the error on the conditional parameter estimation for the NC (both using\nmoment matching and adversarial training) as well as the VAEAC [26], a VAE-based analog to the\nNC. Finally, Table 1 (right) shows the importance of conditioning both the discriminator and NC on\nboth available and requested masks.\n\n6 Missing data imputation\n\nIn order to quantitatively evaluate NC\u2019s ability to construct representations of the joint distributions\nfrom data with missing observations, we consider data imputation tasks on three UCI datasets [37].\nWe compare to GAIN [66] and use the same empirical setup for the sake of consistency. Note that\nwhile [66] augments the adversarial loss with a Euclidean reconstruction error, the NC does not. Table 2\nshows the normalized root mean squared error of the imputed missing data on the test set.\n\nTable 2: RMSE of missing data imputations on the test set. Experiments were repeated five times.\n\nAlgorithm: Spam / Letter / Credit\nMICE [55]: .0699 \u00b1 .0010 / .1537 \u00b1 .0006 / .2585 \u00b1 .0011\nMissForest [60]: .0553 \u00b1 .0013 / .1605 \u00b1 .0004 / .1976 \u00b1 .0015\nMatrix [40]: .0542 \u00b1 .0006 / .1442 \u00b1 .0006 / .2602 \u00b1 .0073\nAuto-encoder [17]: .0670 \u00b1 .0030 / .1351 \u00b1 .0009 / .2388 \u00b1 .0005\nEM [14]: .0712 \u00b1 .0012 / .1563 \u00b1 .0012 / .2604 \u00b1 .0015\nGAIN w/o l2 [66]: .0672 \u00b1 .0036 / .1586 \u00b1 .0024 / .2533 \u00b1 .048\nGAIN [66]: .0513 \u00b1 .0016 / .1198 \u00b1 .005 / .1858 \u00b1 .0010\nVAEAC [26]: .0552 \u00b1 .0020 / .1115 \u00b1 .0010 / .1523 \u00b1 .0020\nNC: .0486 \u00b1 .0010 / .0851 \u00b1 .0020 / .1276 \u00b1 .0020\n\n7 Experiments on image data\n\nWe train NCs on SVHN and CelebA. We use rectangular a, r masks spanning between 10% and 50%\nof the images. We evaluate our setup in several ways. First, qualitatively: generating full samples\n(using the never-seen mask configuration a = 0, r = 1, Figure 4) and reconstructing samples (Figures 5\nfor denoising and 6 for inpainting). These experiments share the goal of showing that our model is\nable to generalize to conditional distributions not observed during training. Second, we evaluate our\nmodels quantitatively: that is, their ability to provide useful features for downstream classification\ntasks (see Table 3). Our results show that NC-based features systematically outperform state-of-the-art\nhand-crafted features, while being competitive with deep unsupervised features.\n\nFigure 4: SVHN and CelebA samples. The model never observed a complete sample in training.\n\nFigures 5 and 6 show samples and in-paintings using mask configurations unobserved during\ntraining to illustrate that our model is able to generalize to conditional distributions and construct\nrepresentations of the data solely through partial observation. Figure 4 shows samples from the joint\ndistribution (a = 0, r = 1), even though these masks were never observed during training.\n\n7.0.1 Feature extraction\n\nSVHN As a feature extraction procedure, we retrieve the latent code created by the PAE while\nfeeding an image in compress and reconstruct mode (a = r = 1). 
Then, we use a linear SVM to\nassess the quality of the extracted encoding, and show in Table 3 that our approach is competitive\nwith deep unsupervised feature extractors.\n\nCelebA The multimodality presented by the CelebA attributes provides an ideal test bed to\nquantify our model\u2019s ability to construct a global understanding out of local and partial observations.\nFollowing [6, 39], we train 40 linear SVMs on learned representations extracted from the encoder\nusing full available and requested masks (a = r = 1) on the CelebA validation set. We measure the\nperformance on the test set. As in [6, 25, 29], we report the balanced accuracy in order to evaluate\nthe attribute prediction performance. Please note that our model was trained on entirely\nunsupervised data and mask configurations unobserved during training. Attribute labels were only\nused to train the linear SVM classifiers.\n\nTable 3: Test errors on SVHN (left), and test accuracies on CelebA (right).\n\nSVHN. Model: Test error\nKNN: 77.93\nTSVM: 66.55\nVAE (M1 + M2) [30]: 36.02\nDCGAN + L2-SVM [51]: 22.18\nALI + L2-SVM [13]: 19.14 \u00b1 0.50\nVAEAC [26]: 57.89 \u00b1 1.0\nNC (L2-SVM) (ours): 17.12 \u00b1 0.59\n\nCelebA. Model: Mean / Stdv\nTriplet-kNN [57]: 71.55 / 12.61\nPANDA [67]: 76.95 / 13.33\nAnet [39]: 79.56 / 12.17\nLMLE-kNN [25]: 83.83 / 12.33\nVAE [31]: 73.30 / 9.65\nALI [13]: 73.88 / 10.16\nHALI [4]: 83.75 / 8.96\nVAEAC [27]: 66.06 / 6.98\nNC (Ours): 82.21 / 7.63\n\n8 Related work\n\nSelf-supervised learning is an emerging technique for unsupervised learning. Perhaps the earliest\nexample of self-supervised learning is auto-encoding [2, 24], which in the language of NCs amounts\nto full available and requested masks. Auto-encoders evolved into more sophisticated variants\nsuch as denoising auto-encoders [64], a family of models including NC. Recent trends in generative\nadversarial networks [18] are yet another example of self-supervised training. 
The connection between\nauto-encoders and generative adversarial training was first instantiated by [34]. Auto-regressive\nmodels [5] such as the masked autoencoder [15], neural autoregressive distribution estimators\n[33, 62], and Pixel RNNs [47] are other examples of casting unsupervised learning using a simple\nself-supervision strategy: order the variables, and then predict each of them using the previous ones.\n\nMoving further, the task of unsupervised learning with partially observed data was also considered\nby others, often in terms of estimating transition operators [20, 7, 58]. Generative adversarial\nimputation nets [66] considered the case of learning missing feature predictions using adversarial\ntraining. In a different thread of research, the literature on kernel mean embeddings [59, 36, 45]\nis an early consideration of the problem of learning distributions. Concerning applications, self-supervised learning was pioneered by word embeddings [41]. In the image domain, self-supervised\nsetups include image in-painting [50], colorization [68], clustering [8], de-rotation [16], and patch\nreordering [10, 46]. In the video domain, common self-supervised strategies include enforcing similar\nfeature representations for nearby frames [44, 19, 65], or predicting ambient sound statistics from\nvideo frames [48]. These applications yield representations useful for downstream tasks, including\nclassification [8], multi-task learning [11], and RL [49]. Finally, the most similar piece of literature\nto our research is the concurrent work on VAE with Arbitrary Conditioning, or VAEAC [26]. The\nVAEAC is proposed as a fast alternative to the also related universal marginalizer [12]. Similarly to\nour setup, the VAEAC augments a VAE with a mask of requested variables; the complementary set\nof variables is provided as the available information for prediction. 
Our work extends VAEAC by\nemploying adversarial training to obtain better sample quality and features for downstream tasks.\nTo sustain these claims, a comparison between NC and VAEAC was performed in Section 7. As\nis common in VAE-like architectures, the conditional encoding and decoding distributions of the VAEAC\nare assumed Gaussian, which may not be a good fit for complex multimodal data such as natural\nimages. The VAEAC work was mainly applied to the problem of feature imputation. Here we\nhope to provide a more holistic perspective on the uses of NCs, including feature extraction and\nsemi-supervised learning.\n\n9 Conclusion\n\nWe presented the Neural Conditioner (NC), an adversarially-learned neural network able to learn\nabout the exponentially many conditional distributions describing some partially observed unlabeled\ndata. Once trained, one NC serves many purposes: sampling from (unseen) conditional distributions\nto perform multimodal prediction, sampling from the (unseen) joint distribution, and auto-encoding\n(partially observed) data to extract data representations useful for (semi-supervised) downstream\ntasks. The Neural Conditioner blurs the lines that often separate unsupervised, semi-supervised, and\nsupervised learning, integrating all types of data and supervision into a holistic learning paradigm.\n\nReferences\n\n[1] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational\ninformation bottleneck. arXiv preprint arXiv:1612.00410, 2016.\n\n[2] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning\nfrom examples without local minima. Neural networks, 2(1):53\u201358, 1989.\n\n[3] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio,\nAaron Courville, and Devon Hjelm. Mutual information neural estimation. 
In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 531–540, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[4] Mohamed Ishmael Belghazi, Sai Rajeswar, Olivier Mastropietro, Negar Rostamzadeh, Jovana Mitrovic, and Aaron Courville. Hierarchical adversarially learned inference. International Conference on Machine Learning Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.

[5] Samy Bengio and Yoshua Bengio. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks, 11(3):550–557, 2000.

[6] Thomas Berg and Peter N Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, pages 955–962, 2013.

[7] Florian Bordes, Sina Honari, and Pascal Vincent. Learning to generate samples from noise through infusion training. arXiv, 2017.

[8] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. ECCV, 2018.

[9] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[10] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. ICCV, 2015.

[11] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. ICCV, 2017.

[12] Laura Douglas, Iliyan Zarov, Konstantinos Gourgoulias, Chris Lucas, Chris Hart, Adam Baker, Maneesh Sahani, Yura Perov, and Saurabh Johri. A universal marginalizer for amortized inference in generative models. arXiv, 2017.

[13] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference.
arXiv, 2016.

[14] Pedro J García-Laencina, José-Luis Sancho-Gómez, and Aníbal R Figueiras-Vidal. Pattern classification with missing data: a review. Neural Computing and Applications, 19(2):263–282, 2010.

[15] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In ICML, pages 881–889, 2015.

[16] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. ICLR, 2018.

[17] Lovedeep Gondara and Ke Wang. Multiple imputation using deep denoising autoencoders. arXiv preprint arXiv:1705.02737, 2017.

[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[19] Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE International Conference on Computer Vision, pages 4086–4093, 2015.

[20] Anirudh Goyal Alias Parth Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In NeurIPS, pages 4392–4402, 2017.

[21] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 13(1):723–773, 2012.

[22] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NeurIPS, pages 5767–5777, 2017.

[23] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised learning. In The Elements of Statistical Learning, pages 485–585. Springer, 2009.

[24] Geoffrey E Hinton and Ruslan R Salakhutdinov.
Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[25] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In CVPR, pages 5375–5384, 2016.

[26] Oleg Ivanov, Michael Figurnov, and Dmitry Vetrov. Variational autoencoder with arbitrary conditioning. ICLR, 2019.

[27] Oleg Ivanov, Michael Figurnov, and Dmitry Vetrov. Variational autoencoder with arbitrary conditioning. In ICLR, 2019.

[28] Ian Jolliffe. Principal component analysis. In International Encyclopedia of Statistical Science, pages 1094–1096. Springer, 2011.

[29] Mahdi M Kalayeh, Boqing Gong, and Mubarak Shah. Improving facial attribute prediction using semantic segmentation. arXiv, 2017.

[30] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NeurIPS, pages 3581–3589, 2014.

[31] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv, 2013.

[32] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. arXiv, 2019.

[33] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.

[34] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv, 2015.

[35] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[36] Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, Massimiliano Pontil, and Steffen Grünewälder. Conditional mean embeddings as regressors. In International Conference on Machine Learning (ICML), volume 5, 2012.

[37] Moshe Lichman et al.
UCI Machine Learning Repository, 2013.

[38] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. ECCV, 2018.

[39] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.

[40] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322, 2010.

[41] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[42] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.

[43] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. arXiv, 2018.

[44] Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In ICML, pages 737–744. ACM, 2009.

[45] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

[46] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84. Springer, 2016.

[47] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv, 2016.

[48] Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In ECCV, pages 801–816.
Springer, 2016.

[49] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. ICML, 2017.

[50] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.

[51] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 2015.

[52] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[53] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2018–2028, 2017.

[54] Sam Roweis and Carlos Brody. Linear heteroencoders, 1999.

[55] Patrick Royston, Ian R White, et al. Multiple imputation by chained equations (MICE): implementation in Stata. Journal of Statistical Software, 45(4):1–20, 2011.

[56] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[57] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.

[58] Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv, 2015.

[59] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In ICML, pages 961–968. ACM, 2009.

[60] Daniel J Stekhoven and Peter Bühlmann.
MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2011.

[61] Gábor J Székely, Maria L Rizzo, Nail K Bakirov, et al. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.

[62] Benigno Uria, Iain Murray, and Hugo Larochelle. A deep and tractable density estimator. In ICML, pages 467–475, 2014.

[63] Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[64] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(Dec):3371–3408, 2010.

[65] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, pages 2794–2802, 2015.

[66] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GAIN: Missing data imputation using generative adversarial nets. arXiv, 2018.

[67] Ning Zhang, Manohar Paluri, Marc'Aurelio Ranzato, Trevor Darrell, and Lubomir Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In CVPR, pages 1637–1644, 2014.

[68] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, pages 649–666. Springer, 2016.