{"title": "Gaussian Process Prior Variational Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 10369, "page_last": 10380, "abstract": "Variational autoencoders (VAE) are a powerful and widely-used class of models to learn complex data distributions in an unsupervised fashion. One important limitation of VAEs is the prior assumption that latent sample representations are independent and identically distributed. However, for many important datasets, such as time-series of images, this assumption is too strong: accounting for covariances between samples, such as those in time, can yield to a more appropriate model specification and improve performance in downstream tasks. In this work, we introduce a new model, the Gaussian Process (GP) Prior Variational Autoencoder (GPPVAE), to specifically address this issue. The GPPVAE aims to combine the power of VAEs with the ability to model correlations afforded by GP priors. To achieve efficient inference in this new class of models, we leverage structure in the covariance matrix, and introduce a new stochastic backpropagation strategy that allows for computing stochastic gradients in a distributed and low-memory fashion. 
We show that our method outperforms conditional VAEs (CVAEs) and an adaptation of standard VAEs in two image data applications.", "full_text": "Gaussian Process Prior Variational Autoencoders\n\nFrancesco Paolo Casale\u2020\u2217, Adrian V Dalca\u2021\u00a7, Luca Saglietti\u2020\u00b6, Jennifer Listgarten\u266e, Nicolo Fusi\u2020\n\n\u2020 Microsoft Research New England, Cambridge (MA), USA\n\u2021 Computer Science and Artificial Intelligence Lab, MIT, Cambridge (MA), USA\n\u00a7 Martinos Center for Biomedical Imaging, MGH, HMS, Boston (MA), USA\n\u00b6 Italian Institute for Genomic Medicine, Torino, Italy\n\u266e EECS Department, University of California, Berkeley (CA), USA\n\u2217 frcasale@microsoft.com\n\nAbstract\n\nVariational autoencoders (VAEs) are a powerful and widely-used class of models to learn complex data distributions in an unsupervised fashion. One important limitation of VAEs is the prior assumption that latent sample representations are independent and identically distributed. However, for many important datasets, such as time-series of images, this assumption is too strong: accounting for covariances between samples, such as those in time, can yield a more appropriate model specification and improve performance in downstream tasks. In this work, we introduce a new model, the Gaussian Process (GP) Prior Variational Autoencoder (GPPVAE), to specifically address this issue. The GPPVAE aims to combine the power of VAEs with the ability to model correlations afforded by GP priors. To achieve efficient inference in this new class of models, we leverage structure in the covariance matrix, and introduce a new stochastic backpropagation strategy that allows for computing stochastic gradients in a distributed and low-memory fashion. 
We show that our method outperforms conditional VAEs (CVAEs) and an adaptation of standard VAEs in two image data applications.\n\n1 Introduction\n\nDimensionality reduction is a fundamental approach to compression of complex, large-scale data sets, either for visualization or for pre-processing before application of supervised approaches. Historically, dimensionality reduction has been framed in one of two modeling camps: the simple and rich capacity language of neural networks; or the probabilistic formalism of generative models, which enables Bayesian capacity control and provides uncertainty over latent encodings. Recently, these two formulations have been combined through the Variational Autoencoder (VAE) (Kingma and Welling, 2013), wherein the expressiveness of neural networks is used to model both the mean and the variance of a simple likelihood. In these models, latent encodings are assumed to be independently and identically distributed (iid) across both latent dimensions and samples. Despite this simple prior, the model lacks conjugacy; exact inference is therefore intractable, and variational inference is used instead. In fact, the main contribution of the Kingma and Welling paper is an improved, general approach for variational inference (also developed in Rezende et al. (2014)).\n\nOne important limitation of the VAE model is the prior assumption that latent representations of samples are iid, whereas in many important problems, accounting for sample structure is crucial for correct model specification and, consequently, for optimal results. For example, in autonomous driving or medical imaging (Dalca et al., 2015; Lonsdale et al., 2013), high-dimensional images are correlated in time: an iid prior for these would not be sensible because, a priori, two images taken closer in time should have more similar latent representations than images taken further apart. 
More generally, one can have multiple sequences of images from different cars, or medical image sequences from multiple patients. Therefore, the VAE prior should be able to capture multiple levels of correlations at once, including time, object identities, etc. A natural solution to this problem is to replace the VAE iid prior over the latent space with a Gaussian Process (GP) prior (Rasmussen, 2004), which enables the specification of sample correlations through a kernel function (Durrande et al., 2011; G\u00f6nen and Alpayd\u0131n, 2011; Wilson and Adams, 2013; Wilson et al., 2016; Rakitsch et al., 2013; Bonilla et al., 2007). GPs are often amenable to exact inference, and a large body of work on making computationally challenging GP-based models tractable can be leveraged (GPs naively scale cubically in the number of samples) (Gal et al., 2014; Bauer et al., 2016; Hensman et al., 2013; Csat\u00f3 and Opper, 2002; Qui\u00f1onero-Candela and Rasmussen, 2005; Titsias, 2009).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work, we introduce the Gaussian Process Prior Variational Autoencoder (GPPVAE), an extension of the VAE latent variable model where correlation between samples is modeled through a GP prior on the latent encodings. The GP prior, however, introduces two main computational challenges. First, naive computations with the GP prior have cubic complexity in the number of samples, which is impractical in most applications. To mitigate this problem one can leverage several tactics commonly used in the GP literature, including the use of pseudo-inputs (Csat\u00f3 and Opper, 2002; Gal et al., 2014; Hensman et al., 2013; Qui\u00f1onero-Candela and Rasmussen, 2005; Titsias, 2009), Kronecker-factorized covariances (Casale et al., 2017; Stegle et al., 2011; Rakitsch et al., 2013), and low-rank structures (Casale et al., 2015; Lawrence, 2005). 
Specifically, in the instantiations of GPPVAE considered in this paper, we focus on low-rank factorizations of the covariance matrix. A second challenge is that the iid assumption which guarantees unbiasedness of mini-batch gradient estimates (used to train standard VAEs) no longer holds due to the GP prior; thus mini-batch gradient descent is no longer applicable. However, for the applications we are interested in, comprising sequences of large-scale images, it is critical from a practical standpoint to avoid processing all samples simultaneously; we require a procedure that is both low in memory use and yields fast inference. Thus, we propose a new scheme for gradient descent that enables Monte Carlo gradient estimates in a distributable and memory-efficient fashion. This is achieved by exploiting the fact that sample correlations are only modeled in the latent (low-dimensional) space, whereas high-dimensional representations are independent when conditioning on the latent ones.\n\nIn the next sections we (i) discuss our model in the context of related work, (ii) formally develop the model and the associated inference procedure, and (iii) compare GPPVAE with alternative models in empirical settings, demonstrating the advantages of our approach.\n\n2 Related work\n\nOur method is related to several extensions of the standard VAE that aim at improving the latent representation by leveraging auxiliary data, such as time annotations, pose information or lighting. An ad hoc attempt to induce structure on the latent space by grouping samples with specific properties in mini-batches was introduced in Kulkarni et al. (2015). A more principled approach proposed a semi-supervised model using a continuous-discrete mixture model that concatenates the input with auxiliary information (Kingma et al., 2014). 
Similarly, the conditional VAE (Sohn et al., 2015) incorporates auxiliary information in both the encoder and the decoder, and has been used successfully for sample generation with specific categorical attributes. Building on this approach, several models use the auxiliary information in an unconditional way (Suzuki et al., 2016; Pandey and Dukkipati, 2017; Vedantam et al., 2017; Wang et al., 2016; Wu and Goodman, 2018).\n\nA separate body of related work aims at designing more flexible variational posterior distributions, either by considering a dependence on auxiliary variables (Maal\u00f8e et al., 2016), by allowing structured encoder models (Siddharth et al., 2016), or by considering chains of invertible transformations that can produce arbitrarily complex posteriors (Kingma et al., 2016; Nalisnick et al., 2016; Rezende and Mohamed, 2015). In other work, a dependency between latent variables is induced by way of hierarchical structures at the level of the parameters of the variational family (Ranganath et al., 2016; Tran et al., 2015).\n\nThe extensions of VAEs most related to GPPVAE are those that move away from the assumption of an iid Gaussian prior on the latent representations to consider richer prior distributions (Jiang et al., 2016; Shu et al., 2016; Tomczak and Welling, 2017). These build on the observation that overly-simple priors can induce excessive regularization, limiting the success of such models (Chen et al., 2016; Hoffman and Johnson, 2016; Siddharth et al., 2017). For example, Johnson et al. proposed composing latent graphical models with deep observational likelihoods. 
Within their framework, more flexible priors over latent encodings are designed based on conditional independence assumptions, and a conditional random field variational family is used to enable efficient inference by way of message-passing algorithms (Johnson et al., 2016).\n\nIn contrast to existing methods, we propose to model the relationship between the latent space and the auxiliary information using a GP prior, leaving the encoder and decoder as in a standard VAE (independent of the auxiliary information). Importantly, the proposed approach allows for modeling arbitrarily complex sample structure in the data. In this work, we specifically focus on disentangling sample correlations induced by different aspects of the data. Additionally, GPPVAE enables estimation of latent auxiliary information when such information is unobserved, by leveraging previous work (Lawrence, 2005). Finally, using the encoder and decoder networks together with the GP predictive posterior, our model provides a natural framework for out-of-sample predictions of high-dimensional data, for virtually any configuration of the auxiliary data.\n\n3 Gaussian Process Prior Variational Autoencoder\n\nAssume we are given a set of samples (e.g., images), each coupled with different types of auxiliary data (e.g., time, lighting, pose, person identity). In this work, we focus on the case of two types of auxiliary data: object and view entities. Specifically, we consider datasets with images of objects in different views, for example images of faces in different poses or images of hand-written digits at different rotation angles. In these problems, we know both which object (person or hand-written digit) is represented in each image in the dataset, and in which view (pose or rotation angle). Finally, each unique object and view is attached to a feature vector, which we refer to as an object feature vector and a view feature vector, respectively. 
In the face dataset example, object feature vectors might contain face features such as skin color or hair style, while view feature vectors may contain pose features such as polar and azimuthal angles with respect to a reference position. Importantly, as described and shown below, we can learn these feature vectors if they are not observed.\n\n3.1 Formal description of the model\n\nLet N denote the number of samples, P the number of unique objects and Q the number of unique views. Additionally, let {y_n}_{n=1}^N denote the K-dimensional representations of the N samples; let {x_p}_{p=1}^P denote the M-dimensional object feature vectors for the P objects; and let {w_q}_{q=1}^Q denote the R-dimensional view feature vectors for the Q views. Finally, let {z_n}_{n=1}^N denote the L-dimensional latent representations. We consider the following generative process for the observed samples (Fig 1a):\n\n\u2022 the latent representation of object p_n in view q_n is generated from object feature vector x_{p_n} and view feature vector w_{q_n} as\n\nz_n = f(x_{p_n}, w_{q_n}) + \eta_n, where \eta_n \sim N(0, \alpha I_L);   (1)\n\n\u2022 image y_n is generated from its latent representation z_n as\n\ny_n = g(z_n) + \epsilon_n, where \epsilon_n \sim N(0, \sigma_y^2 I_K).   (2)\n\nThe function f : R^M \times R^R \to R^L defines how sample latent representations are obtained in terms of object and view feature vectors, while g : R^L \to R^K maps latent representations to the high-dimensional sample space. We use a Gaussian process (GP) prior on f, which allows us to model sample covariances in the latent space as a function of object and view feature vectors. Herein, we use a convolutional neural network for g, which is a natural choice for image data (LeCun et al., 1995). The resulting marginal likelihood of the GPPVAE is\n\np(Y | X, W, \phi, \sigma_y^2, \theta, \alpha) = \int p(Y | Z, \phi, \sigma_y^2) p(Z | X, W, \theta, \alpha) dZ,   (3)\n\nwhere Y = [y_1, \ldots, y_N]^T \in R^{N \times K}, Z = [z_1, \ldots, z_N]^T \in R^{N \times L}, W = [w_1, \ldots
, w_Q]^T \in R^{Q \times R}, X = [x_1, \ldots, x_P]^T \in R^{P \times M}. Additionally, \phi denotes the parameters of g and \theta the GP kernel parameters.\n\nFigure 1: (a) Generative model underlying the proposed GPPVAE. (b) Pictorial representation of the inference procedure in GPPVAE. Each sample (here an image) is encoded in a low-dimensional space and then decoded to the original space. Covariances between samples are modeled through a GP prior on each column of the latent representation matrix Z.\n\nGaussian Process Model. The GP prior defines the following multivariate normal distribution on latent representations:\n\np(Z | X, W, \theta, \alpha) = \prod_{l=1}^{L} N(z_l | 0, K_\theta(X, W) + \alpha I_N),   (4)\n\nwhere z_l denotes the l-th column of Z. In the setting considered in this paper, the covariance function K_\theta is composed of a view kernel that models covariances between views, and an object kernel that models covariances between objects. Specifically, the covariance between sample n (with corresponding feature vectors x_{p_n} and w_{q_n}) and sample m (with corresponding feature vectors x_{p_m} and w_{q_m}) is given by the factorized form (Bonilla et al., 2007; Rakitsch et al., 2013):\n\nK_\theta(X, W)_{nm} = K^{(view)}_\theta(w_{q_n}, w_{q_m}) K^{(object)}_\theta(x_{p_n}, x_{p_m}).   (5)\n\nObserved versus unobserved feature vectors. Our model can be used when either one, or both, of the view/object feature vectors are unobserved. In this setting, we regard the unobserved features as latent variables and obtain a point estimate for them, similar to Gaussian process latent variable models (Lawrence, 2005). We have done so in our experiments.\n\n3.2 Inference\n\nAs with a standard VAE, we make use of variational inference for our model. 
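As an aside, the factorized covariance of Eq. (5) is straightforward to assemble from per-sample object and view indices. The following NumPy sketch is ours, not the paper's implementation; the kernel choices (a squared-exponential view kernel and a linear object kernel, as used in the experiments below) and all names are illustrative:

```python
import numpy as np

def factorized_kernel(X, W, p_idx, q_idx):
    """Covariance of Eq. (5): entry (n, m) is the product of a view kernel
    (here squared-exponential on view features) and a linear object kernel.
    X: (P, M) object features; W: (Q, R) view features;
    p_idx, q_idx: length-N arrays mapping each sample to its object/view."""
    Wn = W[q_idx]                  # (N, R) view features, one row per sample
    Xn = X[p_idx]                  # (N, M) object features, one row per sample
    sq = ((Wn[:, None, :] - Wn[None, :, :]) ** 2).sum(-1)
    K_view = np.exp(-0.5 * sq)    # squared-exponential view kernel
    K_obj = Xn @ Xn.T             # linear object kernel
    return K_view * K_obj         # elementwise (Hadamard) product

# toy check: the factorized covariance is symmetric and PSD (Schur product
# of two PSD matrices), up to numerical jitter
rng = np.random.default_rng(0)
P, Q, M, R, N = 3, 4, 2, 2, 10
X, W = rng.normal(size=(P, M)), rng.normal(size=(Q, R))
p_idx, q_idx = rng.integers(0, P, N), rng.integers(0, Q, N)
K = factorized_kernel(X, W, p_idx, q_idx)
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-9
```

Note that positive semi-definiteness of the product follows from the Schur product theorem, since both factor kernels are themselves valid covariances.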
Specifically, we consider the following variational distribution over the latent variables:\n\nq_\psi(Z | Y) = \prod_n N(z_n | \mu^z_\psi(y_n), diag(\sigma^{z\,2}_\psi(y_n))),   (6)\n\nwhich approximates the true posterior on Z. In Eq. (6), \mu^z_\psi and \sigma^z_\psi are the parameters of the variational distribution and are neural network functions of the observed data, while \psi denotes the weights of such neural networks. We obtain the following evidence lower bound (ELBO):\n\nlog p(Y | X, W, \phi, \sigma_y^2, \theta) \geq E_{Z \sim q_\psi}[ \sum_n log N(y_n | g_\phi(z_n), \sigma_y^2 I_K) + log p(Z | X, W, \theta, \alpha) ] + (1/2) \sum_{nl} log(\sigma^{z\,2}_\psi(y_n)_l) + const.   (7)\n\nStochastic backpropagation. We use stochastic backpropagation to maximize the ELBO (Kingma and Welling, 2013; Rezende et al., 2014). Specifically, we approximate the expectation by sampling from a reparameterized variational posterior over the latent representations, obtaining the following loss function:\n\nl(\phi, \psi, \theta, \alpha, \sigma_y^2) = (NK/2) log \sigma_y^2 + \sum_n ||y_n - g_\phi(z_{\psi n})||^2 / (2\sigma_y^2)  [reconstruction term]  - log p(Z_\psi | X, W, \theta, \alpha)  [latent-space GP term]  - (1/2) \sum_{nl} log(\sigma^{z\,2}_\psi(y_n)_l)  [regularization term],   (8)\n\nwhich we optimize with respect to \phi, \psi, \theta, \alpha, \sigma_y^2. Latent representations Z_\psi = [z_{\psi 1}, \ldots, z_{\psi N}]^T \in R^{N \times L} are sampled using the re-parameterization trick (Kingma and Welling, 2013):\n\nz_{\psi n} = \mu^z_\psi(y_n) + \epsilon_n \odot \sigma^z_\psi(y_n), \epsilon_n \sim N(0, I_{L \times L}), n = 1, \ldots
, N,   (9)\n\nwhere \odot denotes the Hadamard product. Full details on the derivation of the loss can be found in Supplementary Information.\n\nEfficient GP computations. Naive computations in Gaussian processes scale cubically with the number of samples (Rasmussen, 2004). In this work, we achieve linear computations in the number of samples by assuming that the overall GP kernel is low-rank. In order to meet this assumption, we (i) exploit that in our setting the number of views, Q, is much lower than the number of samples, N, and (ii) impose a low-rank form for the object kernel (M \ll N). Briefly, as a result of these assumptions, the total covariance is the sum of a low-rank matrix and the identity matrix, K = V V^T + \alpha I, where V \in R^{N \times H} and H \ll N.\u00b9 For this covariance, computation of the inverse and the log determinant, which have cubic complexity for general covariances, can be recast to have complexity O(N H^2 + H^3 + H N K) and O(N H^2 + H^3), respectively, using the Woodbury identity (Henderson and Searle, 1981) and the determinant lemma (Harville, 1997):\n\nK^{-1} M = (1/\alpha) (I - V (\alpha I + V^T V)^{-1} V^T) M,   (10)\n\nlog |K| = N log \alpha + log |I + (1/\alpha) V^T V|,   (11)\n\nwhere M \in R^{N \times K}. Note that a low-rank approximation of an arbitrary kernel can be obtained through the fully independent training conditional approximation (Snelson and Ghahramani, 2006), which makes the proposed inference scheme applicable in a general setting.\n\nLow-memory stochastic backpropagation. Owing to the coupling between samples from the GP prior, mini-batch gradient descent is no longer applicable. 
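The identities in Eqs. (10) and (11) are easy to verify numerically. The NumPy sketch below (variable names are ours, not the paper's) computes K^{-1}M and log|K| using only H-by-H linear algebra, then checks both against the naive dense computation:

```python
import numpy as np

# For K = V V^T + alpha*I, compute K^{-1} M and log|K| without ever
# inverting the N x N matrix K directly.
rng = np.random.default_rng(1)
N, H, Kdim, alpha = 200, 5, 3, 0.7
V = rng.normal(size=(N, H))
M = rng.normal(size=(N, Kdim))

# Woodbury (Eq. 10): K^{-1} M = (1/alpha) (M - V (alpha*I_H + V^T V)^{-1} V^T M)
inner = alpha * np.eye(H) + V.T @ V              # small H x H system
KinvM = (M - V @ np.linalg.solve(inner, V.T @ M)) / alpha

# Determinant lemma (Eq. 11): log|K| = N log(alpha) + log|I_H + (1/alpha) V^T V|
logdetK = N * np.log(alpha) + np.linalg.slogdet(np.eye(H) + V.T @ V / alpha)[1]

# Verify against the naive O(N^3) computation
K = V @ V.T + alpha * np.eye(N)
assert np.allclose(KinvM, np.linalg.solve(K, M))
assert np.isclose(logdetK, np.linalg.slogdet(K)[1])
```

The dominant cost is forming V^T V (O(N H^2)) and solving the H-by-H system, matching the complexities quoted above.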
However, a naive implementation of full gradient descent is impractical as it requires loading the entire dataset into memory, which is infeasible with most image datasets. To overcome this limitation, we propose a new strategy to compute gradients on the whole dataset in a low-memory fashion. We do so by computing the first-order Taylor series expansion of the GP term of the loss with respect to both the latent encodings and the prior parameters, at each step of gradient descent. In doing so, we are able to use the following procedure:\n\n1. Compute latent encodings from the high-dimensional data using the encoder. This step can be performed in data mini-batches, thereby imposing only low-memory requirements.\n\n2. Compute the coefficients of the GP-term Taylor series expansion using the latent encodings. Although this step involves computations across all samples, these have low-memory requirements as they only involve the low-dimensional representations.\n\n3. Compute a proxy loss by replacing the GP term by its first-order Taylor series expansion, which locally has the same gradient as the original loss. Since the Taylor series expansion is linear in the latent representations, gradients can be easily accumulated across data mini-batches, making this step also memory-efficient.\n\n4. Update the parameters using these accumulated gradients.\n\n\u00b9 For example, if both the view and the object kernels are linear, we have V = [X_{:,1} \odot W_{:,1}, X_{:,1} \odot W_{:,2}, \ldots, X_{:,M} \odot W_{:,Q}] \in R^{N \times H}.\n\nFull details on this procedure are given in Supplementary Information.\n\n3.3 Predictive posterior\n\nWe derive an approximate predictive posterior for GPPVAE that enables out-of-sample predictions of high-dimensional samples. 
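Before turning to predictions, the proxy-loss construction of Section 3.2 can be made concrete on a toy example. In the sketch below (all names are ours; the GP term is a simplified stand-in, f(Z) = 0.5 tr(Z^T K^{-1} Z), whose exact gradient K^{-1}Z couples all samples), we verify that the first-order Taylor expansion has the same gradient as the original term at the expansion point, and that per-mini-batch gradients of the linear proxy accumulate to the full-data gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, B = 60, 4, 15                   # samples, latent dim, mini-batch size
A = rng.normal(size=(N, N))
K = A @ A.T + N * np.eye(N)           # a toy full-rank GP covariance

def gp_term(Z):
    """Simplified GP term of the loss: 0.5 * tr(Z^T K^{-1} Z)."""
    return 0.5 * np.sum(Z * np.linalg.solve(K, Z))

Z0 = rng.normal(size=(N, L))          # current latent encodings (step 1)
G = np.linalg.solve(K, Z0)            # Taylor coefficients at Z0 (step 2)

# The proxy loss f~(Z) = f(Z0) + <G, Z - Z0> is linear in Z, so its gradient
# w.r.t. a mini-batch of encodings is just the matching rows of G, which can
# be accumulated batch by batch (step 3) without holding all data in memory.
grad_accum = np.zeros_like(Z0)
for start in range(0, N, B):
    batch = slice(start, start + B)
    grad_accum[batch] = G[batch]      # per-batch gradient of the proxy loss

assert np.allclose(grad_accum, G)     # batches accumulate to the full gradient

# Finite-difference check that G is indeed the gradient of gp_term at Z0
eps = 1e-5
E = np.zeros_like(Z0); E[0, 0] = eps
fd = (gp_term(Z0 + E) - gp_term(Z0 - E)) / (2 * eps)
assert np.isclose(fd, G[0, 0], rtol=1e-4)
```

In GPPVAE the accumulated gradients are further backpropagated through the encoder; here we only illustrate the key property that linearity of the proxy makes mini-batch accumulation exact.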
Specifically, given training samples Y, object feature vectors X, and view feature vectors W, the predictive posterior for the image representation y* of object p* in view q* is given by\n\np(y* | x*, w*, Y, X, W) \approx \int p(y* | z*) p(z* | x*, w*, Z, X, W) q(Z | Y) dz* dZ,   (12)\n\nwhere the three factors in the integrand decode the GP prediction, form the latent-space GP predictive posterior, and encode the training data, respectively; x* and w* are the object and view feature vectors of object p* and view q*, and we dropped the dependency on parameters for notational compactness. The approximation in Eq. (12) is obtained by replacing the exact posterior on Z with the variational distribution q(Z | Y) (see Supplementary Information for full details). From Eq. (12), the mean of the GPPVAE predictive posterior can be obtained by the following procedure: (i) encode the training image data in the latent space through the encoder, (ii) predict the latent representation z* of image y* using the GP predictive posterior, and (iii) decode the latent representation z* to the high-dimensional image space through the decoder.\n\n4 Experiments\n\nWe focus on the task of making predictions of unseen images, given specified auxiliary information. Specifically, we want to predict the image representation of object p in view q when that object was never observed in that view, but assuming that object p was observed in at least one other view, and that view had been observed for at least one other object. For example, we may want to predict the pose of a person appearing in our training data set without having seen that person in that pose. To do so, we need to have observed that pose for other people. 
This prediction task gets at the heart of what we want our model to achieve, and therefore serves as a good evaluation metric.\n\n4.1 Methods considered\n\nIn addition to the GPPVAE presented above (GPPVAE-joint), we also considered a version with a simpler optimization scheme (GPPVAE-dis), as well as two extensions of the VAE that can be used for the task at hand. Specifically, we considered:\n\n\u2022 GPPVAE with joint optimization (GPPVAE-joint), where autoencoder and GP parameters were optimized jointly. We found that convergence was improved by first training the encoder and the decoder through a standard VAE, then optimizing the GP parameters with fixed encoder and decoder for 100 epochs, and finally optimizing all parameters jointly. Out-of-sample predictions from GPPVAE-joint were obtained by using the predictive posterior in Eq. (12);\n\n\u2022 GPPVAE with disjoint optimization (GPPVAE-dis), where we first learned the encoder and decoder parameters through a standard VAE, and then optimized the GP parameters with fixed encoder and decoder. Again, out-of-sample predictions were obtained by using the predictive posterior in Eq. (12);\n\n\u2022 Conditional VAE (CVAE) (Sohn et al., 2015), where view auxiliary information was provided as input to both the encoder and decoder networks (Figures S1, S2). After training, we considered the following procedure to generate an image of object p in view q. First, we computed latent representations of all the images of object p across all the views in the training data (in this setting, CVAE latent representations are supposedly independent from the view). Second, we averaged all the obtained latent representations to obtain a unique representation of object p. Finally, we fed the latent representation of object p together with the out-of-sample view q to the CVAE decoder. 
As an alternative implementation, we also tried using the latent representation of a random image of object p instead of averaging, but the performance was worse; these results are not included;\n\n\u2022 Linear Interpolation in VAE latent space (LIVAE), which uses linear interpolation between observed views of an object in the latent space learned through a standard VAE in order to predict unobserved views of the same object. Specifically, denoting z1 and z2 the latent representations of images of a given object in views r1 and r2, a prediction for the image of that same object in an intermediate view, r*, is obtained by first linearly interpolating between z1 and z2, and then projecting the interpolated latent representation to the high-dimensional image space.\n\nConsistent with the L2 reconstruction error appearing in the loss of all the aforementioned VAEs (e.g., Eq. (8)), we considered pixel-wise mean squared error (MSE) as the evaluation metric. We used the same architecture for the encoder and decoder neural networks in all methods compared (see Figures S1, S2 in Supplementary Information). The architecture and \sigma_y^2 were chosen to minimize the ELBO loss for the standard VAE on a validation set (Figure S3, Supplementary Information). For CVAE and LIVAE, we also considered the alternative strategy of selecting the value of \sigma_y^2 that maximizes out-of-sample prediction performance on the validation set (the results for these two methods are in Figures S4 and S5). All models were trained using the Adam optimizer (Kingma and Ba, 2014) with standard parameters and a learning rate of 0.001. When optimizing GP parameters with fixed encoder and decoder, we observed that higher learning rates led to faster convergence without any loss in performance, and thus we used a higher learning rate of 0.01 in this setting.\n\n4.2 Rotated MNIST\n\nSetup. 
We considered a variation of the MNIST dataset, consisting of rotated images of hand-written \"3\" digits with different rotation angles. In this setup, objects correspond to different draws of the digit \"3\", while views correspond to different rotation states. View features are observed scalars, corresponding to the attached rotation angles. Conversely, object feature vectors are unobserved and learned from data; no draw-specific features are available.\n\nDataset generation. We generated a dataset from 400 handwritten versions of the digit three by rotating through Q = 16 evenly separated rotation angles in [0, 2\pi), for a total of N = 6,400 samples. We then kept 90% of the data for training and test, and the rest for validation. From the training and test sets, we then randomly removed 25% of the images to consider the scenario of incomplete data. Finally, the set that we used for out-of-sample predictions (test set) was created by removing one of the views (i.e., rotation angles) from the remaining images. This procedure resulted in 4,050 training images spanning 15 rotation angles and 270 test images spanning one rotation angle.\n\nAutoencoder and GP model. We set the dimension of the latent space to L = 16. For the encoder and decoder neural networks we considered the convolutional architecture in Figure S1. As view kernel, we considered a periodic squared exponential kernel taking rotation angles as inputs. As object kernel, we considered a linear kernel taking the object feature vectors as inputs. As object feature vectors are unobserved, we learned them from data; their dimensionality was set to M = 8. 
The resulting composite kernel K expresses the covariance between images n and m in terms of the corresponding rotation angles w_{q_n} and w_{q_m} and object feature vectors x_{p_n} and x_{p_m} as\n\nK_\theta(X, W)_{nm} = \beta exp(-2 sin^2(|w_{q_n} - w_{q_m}|) / \nu^2) \cdot x_{p_n}^T x_{p_m},   (13)\n\nwhere the first factor is the rotation (view) kernel, the second is the digit draw (object) kernel, \beta \geq 0 and \nu \geq 0 are kernel hyper-parameters learned during training of the model (Rasmussen, 2004), and we set \theta = {\beta, \nu}.\n\nResults. GPPVAE-joint and GPPVAE-dis yielded lower MSE than CVAE and LIVAE in the interpolation task, with GPPVAE-joint performing significantly better than GPPVAE-dis (0.0280 \u00b1 0.0008 for GPPVAE-joint vs 0.0306 \u00b1 0.0009 for GPPVAE-dis, p < 0.02, Fig. 2a,b). Importantly, GPPVAE-joint learns different variational parameters than a standard VAE (Fig. 2c,d), whose parameters are also used by GPPVAE-dis, consistent with the fact that GPPVAE-joint performs better by adapting the VAE latent space using guidance from the prior.\n\nFigure 2: Results from experiments on rotated MNIST. (a) Mean squared error on the test set. Error bars represent the standard error of per-sample MSE. (b) Empirical density of estimated means of q_\psi, aggregated over all latent dimensions. (c) Empirical density of estimated log variances of q_\psi. (d) Out-of-sample predictions for ten random draws of digit \"3\" at the out-of-sample rotation state. (e, f) Object and view covariances learned through GPPVAE-joint.\n\n4.3 Face dataset\n\nSetup. As a second application, we considered the Face-Place Database (3.0) (Righi et al., 2012), which contains images of people's faces in different poses. In this setting, objects correspond to person identities while views correspond to different poses. Both view and object feature vectors are unobserved and learned from data. 
The task is to predict images of a person's face in orientations that remained unobserved.\n\nData. We considered 4,835 images from the Face-Place Database (3.0), which includes images of faces for 542 people shown across nine different poses (frontal, and 90, 60, 45, 30 degrees left and right\u00b2). We randomly selected 80% of the data for training (n = 3,868), 10% for validation (n = 484) and 10% for testing (n = 483). All images were rescaled to 128 \u00d7 128.\n\nAutoencoder and GP model. We set the dimension of the latent space to L = 256. For the encoder and decoder neural networks we considered the convolutional architecture in Figure S2. We considered a full-rank covariance for the view covariance (only nine poses are present in the dataset) and a linear covariance for the object covariance (M = 64).\n\nResults. GPPVAE-joint and GPPVAE-dis yielded lower MSE than CVAE and LIVAE (Fig. 3a). In contrast to the MNIST problem, the difference between GPPVAE-joint and GPPVAE-dis was not significant (0.0281 \u00b1 0.0008 for GPPVAE-joint vs 0.0298 \u00b1 0.0008 for GPPVAE-dis). Importantly, GPPVAE-joint was able to dissect people (object) and pose (view) covariances by learning the people and pose kernels jointly (Fig. 3c,d).\n\n5 Discussion\n\nWe introduced GPPVAE, a generative model that incorporates a GP prior over the latent space. We also presented a low-memory and computationally efficient inference strategy for this model, which makes the model applicable to large high-dimensional datasets. GPPVAE outperforms natural baselines (CVAE and linear interpolations in the VAE latent space) when predicting out-of-sample test images of objects in specified views (e.g., pose of a face, rotation of a digit). 
[2] We could have used the pose angles as a scalar view feature, similar to the rotated MNIST application, but we purposely ignored these features to consider a more challenging setting where neither object nor view features are observed.

Figure 3: Results from experiments on the face dataset. (a) Mean squared error on the test set. (b) Out-of-sample predictions of people's faces in out-of-sample poses. (c, d) Object and view covariances learned through GPPVAE-joint.

Possible future work includes augmenting the GPPVAE loss with a discriminator function, similar in spirit to a GAN (Goodfellow et al., 2014), or changing the loss to be perception-aware (Hou et al., 2017) (see results from preliminary experiments in Figure S6). Another extension is to consider approximations of the GP likelihood that fully factorize over data points (Hensman et al., 2013); this could further improve the scalability of our method.

Code availability

An implementation of GPPVAE is available at https://github.com/fpcasale/GPPVAE.

Acknowledgments

Stimulus images courtesy of Michael J. Tarr, Center for the Neural Basis of Cognition and Department of Psychology, Carnegie Mellon University, http://www.tarrlab.org. Funding provided by NSF award 0339122.

References

Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse gaussian process approximations. In Advances in neural information processing systems, pages 1533–1541, 2016.

Edwin V Bonilla, Felix V Agakov, and Christopher KI Williams. Kernel multi-task learning using task-specific features. In Artificial Intelligence and Statistics, pages 43–50, 2007.

Francesco Paolo Casale, Barbara Rakitsch, Christoph Lippert, and Oliver Stegle. Efficient set tests for the genetic analysis of correlated traits. Nature methods, 12(8):755, 2015.

Francesco Paolo Casale, Danilo Horta, Barbara Rakitsch, and Oliver Stegle. 
Joint genetic analysis using variant sets reveals polygenic gene-context interactions. PLoS genetics, 13(4):e1006693, 2017.

Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

Lehel Csató and Manfred Opper. Sparse on-line gaussian processes. Neural computation, 14(3):641–668, 2002.

Adrian V Dalca, Ramesh Sridharan, Mert R Sabuncu, and Polina Golland. Predictive modeling of anatomy with genetic and clinical data. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 519–526. Springer, 2015.

Nicolas Durrande, David Ginsbourger, Olivier Roustant, and Laurent Carraro. Additive covariance kernels for high-dimensional gaussian process modeling. arXiv preprint arXiv:1111.6233, 2011.

Yarin Gal, Mark Van Der Wilk, and Carl Edward Rasmussen. Distributed variational inference in sparse gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems, pages 3257–3265, 2014.

Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of machine learning research, 12(Jul):2211–2268, 2011.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

David A Harville. Matrix algebra from a statistician's perspective, volume 1. Springer, 1997.

Harold V Henderson and Shayle R Searle. On deriving the inverse of a sum of matrices. Siam Review, 23(1):53–60, 1981.

James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, page 282. Citeseer, 2013.

Matthew D Hoffman and Matthew J Johnson. 
Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.

Xianxu Hou, Linlin Shen, Ke Sun, and Guoping Qiu. Deep feature consistent variational autoencoder. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 1133–1141. IEEE, 2017.

Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.

Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pages 2946–2954, 2016.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.

Neil Lawrence. Probabilistic non-linear principal component analysis with gaussian process latent variable models. Journal of machine learning research, 6(Nov):1783–1816, 2005.

Yann LeCun, Yoshua Bengio, et al. 
Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.

John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. The genotype-tissue expression (gtex) project. Nature genetics, 45(6):580, 2013.

Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

Eric Nalisnick, Lars Hertel, and Padhraic Smyth. Approximate inference for deep latent gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, volume 2, 2016.

Gaurav Pandey and Ambedkar Dukkipati. Variational methods for conditional multimodal deep learning. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 308–315. IEEE, 2017.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, and Oliver Stegle. It is all in the noise: Efficient multi-task gaussian process inference with structured residuals. In Advances in neural information processing systems, pages 1466–1474, 2013.

Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical variational models. In International Conference on Machine Learning, pages 324–333, 2016.

Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. 
arXiv preprint arXiv:1401.4082, 2014.

Giulia Righi, Jessie J Peissig, and Michael J Tarr. Recognizing disguised faces. Visual Cognition, 20(2):143–169, 2012.

Rui Shu, James Brofos, Frank Zhang, Hung Hai Bui, Mohammad Ghavamzadeh, and Mykel Kochenderfer. Stochastic video prediction with conditional density estimation. In ECCV Workshop on Action and Anticipation for Visual Learning, volume 2, 2016.

N Siddharth, Brooks Paige, Alban Desmaison, Van de Meent, Frank Wood, Noah D Goodman, Pushmeet Kohli, Philip HS Torr, et al. Inducing interpretable representations with variational autoencoders. arXiv preprint arXiv:1611.07492, 2016.

N Siddharth, Brooks Paige, Jan-Willem Van de Meent, Alban Desmaison, Frank Wood, Noah D Goodman, Pushmeet Kohli, and Philip HS Torr. Learning disentangled representations with semi-supervised deep generative models. ArXiv e-prints (Jun 2017), 2017.

Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pages 1257–1264, 2006.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.

Oliver Stegle, Christoph Lippert, Joris M Mooij, Neil D Lawrence, and Karsten M Borgwardt. Efficient inference in matrix-variate gaussian models with iid observation noise. In Advances in neural information processing systems, pages 630–638, 2011.

Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.

Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.

Jakub M Tomczak and Max Welling. Vae with a vampprior. 
arXiv preprint arXiv:1705.07120, 2017.

Dustin Tran, Rajesh Ranganath, and David M Blei. The variational gaussian process. arXiv preprint arXiv:1511.06499, 2015.

Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative models of visually grounded imagination. arXiv preprint arXiv:1705.10762, 2017.

Weiran Wang, Xinchen Yan, Honglak Lee, and Karen Livescu. Deep variational canonical correlation analysis. arXiv preprint arXiv:1610.03454, 2016.

Andrew Wilson and Ryan Adams. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pages 1067–1075, 2013.

Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.

Mike Wu and Noah Goodman. Multimodal generative models for scalable weakly-supervised learning. arXiv preprint arXiv:1802.05335, 2018.