{"title": "Iterative Neural Autoregressive Distribution Estimator (NADE-k)", "book": "Advances in Neural Information Processing Systems", "page_first": 325, "page_last": 333, "abstract": "Training of the neural autoregressive density estimator (NADE) can be viewed as doing one step of probabilistic inference on missing values in data. We propose a new model that extends this inference scheme to multiple steps, arguing that it is easier to learn to improve a reconstruction in $k$ steps rather than to learn to reconstruct in a single inference step. The proposed model is an unsupervised building block for deep learning that combines the desirable properties of NADE and multi-prediction training: (1) Its test likelihood can be computed analytically, (2) it is easy to generate independent samples from it, and (3) it uses an inference engine that is a superset of variational inference for Boltzmann machines. The proposed NADE-k is competitive with the state-of-the-art in density estimation on the two datasets tested.", "full_text": "Iterative Neural Autoregressive Distribution Estimator (NADE-k)\n\nTapani Raiko\nAalto University\n\nLi Yao\nUniversit\u00e9 de Montr\u00e9al\n\nKyungHyun Cho\nUniversit\u00e9 de Montr\u00e9al\n\nYoshua Bengio\nUniversit\u00e9 de Montr\u00e9al, CIFAR Senior Fellow\n\nAbstract\n\nTraining of the neural autoregressive density estimator (NADE) can be viewed as doing one step of probabilistic inference on missing values in data. We propose a new model that extends this inference scheme to multiple steps, arguing that it is easier to learn to improve a reconstruction in k steps rather than to learn to reconstruct in a single inference step. 
The proposed model is an unsupervised building block for deep learning that combines the desirable properties of NADE and multi-prediction training: (1) Its test likelihood can be computed analytically, (2) it is easy to generate independent samples from it, and (3) it uses an inference engine that is a superset of variational inference for Boltzmann machines. The proposed NADE-k is competitive with the state-of-the-art in density estimation on the two datasets tested.\n\n1 Introduction\n\nTraditional building blocks for deep learning have some unsatisfactory properties. Boltzmann machines are, for instance, difficult to train due to the intractability of computing the statistics of the model distribution, which leads to potentially high-variance MCMC estimators during training (if there are many well-separated modes (Bengio et al., 2013)) and a computationally intractable objective function. Autoencoders have a simpler objective function (e.g., denoising reconstruction error (Vincent et al., 2010)), which can be used for model selection but not for the important choice of the corruption function. On the other hand, this paper follows up on the Neural Autoregressive Distribution Estimator (NADE, Larochelle and Murray, 2011), which specializes previous neural auto-regressive density estimators (Bengio and Bengio, 2000) and was recently extended (Uria et al., 2014) to deeper architectures. It is appealing because both the training criterion (just log-likelihood) and its gradient can be computed tractably and used for model selection, and the model can be trained by stochastic gradient descent with backpropagation. However, it has been observed that the performance of NADE still has room for improvement.\n\nThe idea of using missing value imputation as a training criterion has appeared in three recent papers. 
This approach can be seen either as training an energy-based model to impute missing values well (Brakel et al., 2013), as training a generative probabilistic model to maximize a generalized pseudo-log-likelihood (Goodfellow et al., 2013), or as training a denoising autoencoder with a masking corruption function (Uria et al., 2014). Recent work on generative stochastic networks (GSNs), which include denoising auto-encoders as special cases, justifies dependency networks (Heckerman et al., 2000) as well as generalized pseudo-log-likelihood (Goodfellow et al., 2013), but has the disadvantage that sampling from the trained \u201cstochastic fill-in\u201d model requires a Markov chain (repeatedly resampling some subset of the values given the others). In all these cases, learning progresses by back-propagating the imputation (reconstruction) error through inference steps of the model. This allows the model to better cope with a potentially imperfect inference algorithm. This learning-to-cope was introduced recently by Stoyanov et al. (2011) and Domke (2011).\n\nFigure 1: The choice of a structure for NADE-k is very flexible. The dark filled halves indicate that a part of the input is observed and fixed to the observed values during the iterations. Left: Basic structure corresponding to Equations (6\u20137) with n = 2 and k = 2. Middle: Depth added as in NADE by Uria et al. (2014) with n = 3 and k = 2. Right: Depth added as in the Multi-Prediction Deep Boltzmann Machine by Goodfellow et al. (2013) with n = 2 and k = 3. The first two structures are used in the experiments.\n\nThe NADE model involves an ordering over the components of the data vector. The core of the model is the reconstruction of the next component given all the previous ones. 
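The k-step fill-in loop of the basic structure in Figure 1 (left; Equations (6\u20137)) can be written compactly. The following NumPy sketch is illustrative only, not the authors' implementation: the nonlinearity \u03c6 is assumed to be tanh, missing entries are assumed to start at 0.5, and all names and shapes are hypothetical.

```python
# Minimal sketch of the k-step fill-in loop from Figure 1 (left),
# corresponding to Equations (6)-(7): apply encoder and decoder k times,
# clamping the observed inputs after every step.
# Assumptions (not from the paper): phi = tanh, missing values start at 0.5.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_k_reconstruct(x, m, W, c, V, b, k):
    """x: (D,) binary data; m: (D,) mask, 1 = missing, 0 = observed;
    W: (H, D), c: (H,) encoder; V: (D, H), b: (D,) decoder; k: steps."""
    v = m * 0.5 + (1.0 - m) * x                      # v<0>: observed values clamped
    for _ in range(k):
        h = np.tanh(W @ v + c)                       # Eq. (6): h<t> = phi(W v<t-1> + c)
        v = m * sigmoid(V @ h + b) + (1.0 - m) * x   # Eq. (7): clamp observed entries
    return v                                         # v<k>_i estimates p(x_i = 1 | x_obs)
```

Training would back-propagate the imputation error through all k iterations, which is what lets the model learn to improve its own reconstructions.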
In this paper we reinterpret the reconstruction procedure as a single iteration in a variational inference algorithm, and we propose a version where we use k iterations instead, inspired by (Goodfellow et al., 2013; Brakel et al., 2013). We evaluate the proposed model on two datasets and show that it outperforms the original NADE (Larochelle and Murray, 2011) as well as NADE trained with the order-agnostic training algorithm (Uria et al., 2014).\n\n2 Proposed Method: NADE-k\n\nWe propose a probabilistic model called NADE-k for D-dimensional binary data vectors x. We start by defining p\u03b8 for imputing missing values using a fully factorial conditional distribution:\n\np\u03b8(x_mis | x_obs) = \u220f_{i\u2208mis} p\u03b8(x_i | x_obs),  (1)\n\nwhere the subscripts mis and obs denote missing and observed components of x. From the conditional distribution p\u03b8 we compute the joint probability distribution over x given an ordering o (a permutation of the integers from 1 to D) by\n\np\u03b8(x | o) = \u220f_{d=1}^{D} p\u03b8(x_{o_d} | x_{o_{<d}}),  (2)\n\nwhere o_{<d} denotes the indices preceding the d-th in the ordering o.\n\nFigure 2: The inner working mechanism of NADE-k. The left-most column shows the data vectors x, the second column shows their masked version, and the subsequent columns show the reconstructions v\u27e80\u27e9 . . . v\u27e810\u27e9 (see Eq. (7)).\n\nBelow we give the equations for a simple structure with n = 2. See Fig. 1 (left) for an illustration of this simple structure. In this case, the activations of the layers at the t-th step are\n\nh\u27e8t\u27e9 = \u03c6(W v\u27e8t\u22121\u27e9 + c)  (6)\nv\u27e8t\u27e9 = m \u2299 \u03c3(V h\u27e8t\u27e9 + b) + (1 \u2212 m) \u2299 x  (7)\n\nwhere \u03c6 is an element-wise nonlinearity, \u03c3 is the logistic sigmoid function, and the iteration index t runs from 1 to k. The conditional probabilities of the variables (see Eq. 
(1)) are read from the output v\u27e8k\u27e9 as\n\np\u03b8(x_i = 1 | x_obs) = v_i\u27e8k\u27e9.  (8)\n\nFig. 2 shows examples of how v\u27e8t\u27e9 evolves over iterations, with the trained model. The parameters \u03b8 = {W, V, c, b} can be learned by stochastic gradient descent to minimize \u2212L(\u03b8) in Eq. (3), or its stochastic approximation \u2212\u02c6L(\u03b8) in Eq. (4), with the stochastic gradient computed by back-propagation.\n\nOnce the parameters \u03b8 are learned, we can define a mixture model by using a uniform probability over a set of orderings O. We can compute the probability of a given vector x as a mixture model\n\npmixt(x | \u03b8, O) = (1/|O|) \u2211_{o\u2208O} p\u03b8(x | o)  (9)\n\nwith Eq. (2). We can draw independent samples from the mixture by first drawing an ordering o and then sampling the components one by one as xod \u223c p\u03b8(xod | xo