{"title": "Explicitly disentangling image content from translation and rotation with spatial-VAE", "book": "Advances in Neural Information Processing Systems", "page_first": 15435, "page_last": 15445, "abstract": "Given an image dataset, we are often interested in finding data generative factors that encode semantic content independently from pose variables such as rotation and translation. However, current disentanglement approaches do not impose any specific structure on the learned latent representations.  We propose a method for explicitly disentangling image rotation and translation from other unstructured latent factors in a variational autoencoder (VAE) framework. By formulating the generative model as a function of the spatial coordinate, we make the reconstruction error differentiable with respect to latent translation and rotation parameters. This formulation allows us to train a neural network to perform approximate inference on these latent variables while explicitly constraining them to only represent rotation and translation. We demonstrate that this framework, termed spatial-VAE, effectively learns latent representations that disentangle image rotation and translation from content and improves reconstruction over standard VAEs on several benchmark datasets, including applications to modeling continuous 2-D views of proteins from single particle electron microscopy and galaxies in astronomical images.", "full_text": "Explicitly disentangling image content from\ntranslation and rotation with spatial-VAE\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nTristan Bepler\n\nCambridge, MA\n\ntbepler@mit.edu\n\nKotaro Kelley\n\nNew York, NY\n\nkkelley@nysbc.org\n\nEllen D. Zhong\n\nCambridge, MA\nzhonge@mit.edu\n\nEdward Brignole\n\nCambridge, MA\n\nbrignole@mit.edu\n\nNew York Structural Biology Center\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nBonnie Berger\u2217\n\nCambridge, MA\nbab@mit.edu\n\nAbstract\n\nGiven an image dataset, we are often interested in \ufb01nding data generative factors\nthat encode semantic content independently from pose variables such as rotation\nand translation. However, current disentanglement approaches do not impose any\nspeci\ufb01c structure on the learned latent representations. We propose a method for\nexplicitly disentangling image rotation and translation from other unstructured\nlatent factors in a variational autoencoder (VAE) framework. By formulating the\ngenerative model as a function of the spatial coordinate, we make the reconstruction\nerror differentiable with respect to latent translation and rotation parameters. This\nformulation allows us to train a neural network to perform approximate inference\non these latent variables while explicitly constraining them to only represent\nrotation and translation. We demonstrate that this framework, termed spatial-\nVAE, effectively learns latent representations that disentangle image rotation and\ntranslation from content and improves reconstruction over standard VAEs on several\nbenchmark datasets, including applications to modeling continuous 2-D views of\nproteins from single particle electron microscopy and galaxies in astronomical\nimages. 2\n\n1\n\nIntroduction\n\nA central problem in computer vision is unsupervised learning on image datasets. 
Often, this takes the\nform of latent variable models in which we seek to encode semantic information about images into\ndiscrete (as in mixture models) or continuous (as in recent generative neural network models) vector\nrepresentations. However, in many imaging domains, image content is confounded by variability from\ngeneral image transformations, such as rotation and translation. In single particle electron microscopy,\nthis emerges from particles being randomly oriented in the microscope images. In astronomy, objects\n\n\u2217To whom correspondences should be addressed.\n2Source code and data are available at: https://github.com/tbepler/spatial-VAE\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fappear randomly oriented in telescope images, such as galaxies in the Sloan Digital Sky Survey.\nThere have been signi\ufb01cant efforts to develop rotation and translation invariant clustering methods for\nimages in various domains [1\u20133]. However, learning continuous latent variable models that capture\ncontent separately from nuisance transformations remains an open problem.\nWe motivate the importance of this problem in particular for single particle electron microscopy\n(EM), where the goal is to determine the 3-D electron density of a protein from many noisy and\nrandomly oriented 2-D projections. The \ufb01rst step in this process is to model the variety of 2-D views\nand protein conformational states. Existing methods for this step use Gaussian mixture models to\ngroup these 2-D views into distinct clusters where the orientation of each image relative to the cluster\nmean is inferred by maximum likelihood. This assumes that these projections arise from a discrete\nset of views and conformational states. However, protein conformations are continuous and may be\npoorly approximated with discrete clusters. Despite interest in continuous latent representations, no\ngeneral methods exist.\nIn this work, we focus speci\ufb01cally on the problem of learning continuous generative factors of image\ndatasets in the presence of random rotation and translation of the observed images. Our goal is to\nlearn a deep generative model and corresponding latent representations that separate content from\nrotation and translation and to perform ef\ufb01cient approximate inference on these latent variables in\na fully unsupervised manner. To this end, we propose a novel variational autoencoder framework,\ntermed spatial-VAE. By parameterizing the distribution over pixel intensities at a spatial coordinate\nexplicitly as a function of the coordinate, we make the image reconstruction term differentiable with\nrespect to rotation and translation variables. This insight enables us to perform approximate inference\non these latent variables jointly with unstructured latent variables representing the image content and\ntrain the VAE end-to-end. Unlike in previously proposed unconstrained disentangled representation\nlearning methods (e.g. \u03b2-VAE [4]), the rotation and translation variables are structurally constrained\nto represent only those image transformations.\nIn experiments, we demonstrate the ability of spatial-VAE to disentangle image content from rotation\nand translation. We \ufb01nd that rotation and translation inference allows us to learn improved generative\nmodels when images are perturbed with random transformations. 
In application to single particle EM and astronomical images, spatial-VAE learns latent representations of image content and generative models of proteins and galaxies that are disentangled from confounding transformations naturally present in these datasets. Going forward, we expect this general framework to enable better, spatially-aware object models across a variety of image domains.

2 Methods

2.1 Spatial generator network

The defining component of our spatial-VAE framework is the parameterization of the deep image generative model as a function of the spatial coordinates of the image. That is, given a signal with n observations indexed by i, we learn a single function that describes the probability of observing the signal value, yi, at coordinate, xi, as a function of xi and the unstructured latent variables, z. For images, xi is the 2-dimensional spatial coordinate of pixel i, but this concept generalizes to signals of arbitrary dimension. Following Kingma and Welling [5], we define this distribution, pg(yi|xi, z), to be Gaussian in the case of real valued yi and Bernoulli in the case of binary yi, with distribution parameters computed from xi and z using a multilayer perceptron (MLP) with parameters g (Figure 1). Under this model, the log probability of an image, y, represented as a vector of size n with corresponding spatial coordinates, xi, given the current MLP and unstructured latent variables, z, is

log p(y|z) = Σ_{i=1}^{n} log pg(yi|xi, z).    (1)

Although the coordinate space of an image can be represented using several different systems, we use Cartesian coordinates with the origin being the center of the image to naturally represent rotations around the image center. Therefore, to get the predicted distribution over the pixel at spatial coordinate xi, z and xi are concatenated and fed as input to the MLP. We contrast our spatial generator network with the usual approach to neural image generative models in which each pixel is decoded independently, conditioned on the latent variables. In these standard models, which include both fully connected and transpose convolutional models, one function is learned per pixel index, whereas we explicitly condition on the pixel spatial coordinate instead.

Figure 1: Diagram of the spatial-VAE framework. a) The generative model is an MLP mapping spatial coordinates and latent variables to the distribution parameters of the pixel intensity at that coordinate. This model is applied to each coordinate in the pixel grid to generate a complete image. Coordinate transformations are applied directly to the spatial coordinates before being decoded by the generator network. b) Approximate inference is performed on the rotation, translation, and unstructured latent variables using an inference network, depicted here as an MLP. We use this architecture in our experiments, but our framework generalizes to other inference network architectures.

Modeling rotation and translation. This model structure allows us to represent rotations and translations directly as transformations of the coordinate space. A rotation by θ of the image y corresponds to rotating the underlying coordinates by θ. Furthermore, shifting the image by some ∆x corresponds to shifting the spatial coordinates by ∆x. 
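To make the coordinate-based parameterization concrete, below is a minimal sketch of such a generator with the coordinate transformations applied before decoding, assuming PyTorch; the class name, layer sizes, and Bernoulli (logit) output are our own illustrative choices and not the released spatial-VAE implementation. The rotation applied to the coordinates anticipates the formal definition in the next paragraph.

import torch
import torch.nn as nn

class SpatialGenerator(nn.Module):
    def __init__(self, z_dim, hidden_dim=500):
        super().__init__()
        # a single MLP shared across all pixels: input is a 2-D coordinate plus z
        self.mlp = nn.Sequential(
            nn.Linear(2 + z_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),  # Bernoulli logit for a binary pixel
        )

    def forward(self, coords, z, theta=None, dx=None):
        # coords: (n, 2) pixel coordinates; z: (batch, z_dim)
        # theta: (batch,) rotation angles; dx: (batch, 2) translations
        batch = z.size(0)
        coords = coords.unsqueeze(0).expand(batch, -1, -1)          # (batch, n, 2)
        if theta is not None:
            c, s = torch.cos(theta), torch.sin(theta)
            R = torch.stack([torch.stack([c, s], dim=1),
                             torch.stack([-s, c], dim=1)], dim=1)   # (batch, 2, 2)
            coords = torch.matmul(coords, R)                        # rotate the coordinates
        if dx is not None:
            coords = coords + dx.unsqueeze(1)                       # translate the coordinates
        z = z.unsqueeze(1).expand(-1, coords.size(1), -1)
        logits = self.mlp(torch.cat([coords, z], dim=2))            # (batch, n, 1)
        return logits.squeeze(2)                                    # per-pixel Bernoulli logits

Applying a Bernoulli log-likelihood to these logits and summing over pixels recovers equation 1, or equation 2 when the coordinates are transformed first.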
Let R(θ) = [[cos(θ), sin(θ)], [−sin(θ), cos(θ)]] be the rotation matrix for angle θ; then the probability of observing image y with rotation θ and shift ∆x is given by

log pg(y|z, θ, ∆x) = Σ_{i=1}^{n} log pg(yi|xiR(θ) + ∆x, z).    (2)

This formulation of the conditional log probability of the image is differentiable with respect to θ and ∆x, which enables us to train an approximate inference network for these parameters. Furthermore, although we consider only rotation and translation in this work, this framework extends to general coordinate transformations.

2.2 Approximate inference of rotation, translation, and unstructured latent variables

We perform approximate inference on the unstructured latent variables, the rotation, and the translation using a neural network following the standard VAE procedure. For all parameters, we make the usual simplifying choice and represent the approximate posterior as log q(z, θ, ∆x|y) = log N(z, θ, ∆x; µ(y), σ(y)²I). For z, we use the usual N(0, I) prior. Choosing appropriate priors on ∆x and θ, however, requires more consideration. We set the prior on ∆x to be Gaussian with µ = 0, but the standard deviation of this prior controls how tolerant our model should be to large image translations. Priors on θ are particularly tricky because angles are bounded; ideally, we would like the option to use a uniform prior over θ between 0 and 2π and an approximate posterior distribution with matching support. In particular, the wrapped normal or von Mises-Fisher distributions would be ideal. However, these distributions introduce significant extra challenges to sampling and computing the Kullback–Leibler (KL) divergence in the VAE setting [6]. Therefore, we instead model the prior and approximate posterior of θ using the Gaussian distribution (θ is wrapped when we calculate the rotation matrix). For the prior, we use mean zero, and the standard deviation of this distribution can be set large to approximate a uniform prior over θ. Furthermore, by observing that the KL divergence does not penalize the mean of the approximate posterior when the prior over θ is uniform, we can make the following adjustment to the standard KL divergence for θ:

D_KL^θ(y) = −log σθ(y) + log sθ + σ²θ(y) / (2s²θ) − 0.5    (3)

where σθ(y) is given by the inference network for image y and sθ is the standard deviation of the prior on θ. In our experiments, we use this variant of the KL divergence for θ in cases where we would like to make no assumptions about its prior mean. In the future, this structure could be replaced with better prior and approximate posterior choices for the translation and rotation parameters.

2.3 Variational lower-bound for this model

Let the approximate posterior given by the inference network to the unconstrained latent variables for an image y be q(z|y) = N(µz(y), σ²z(y)), the rotation be q(θ|y) = N(µθ(y), σ²θ(y)), and the translation be q(∆x|y) = N(µ∆x(y), σ²∆x(y)). For convenience, we denote these variables collectively as φ and the joint approximate posterior as q(φ|y). 
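Each of these factors is a diagonal Gaussian, so the KL terms that appear in the lower bound below have simple closed forms. The following is a minimal sketch, assuming PyTorch, of the standard Gaussian KL used for z and ∆x and of the modified rotation KL from equation 3; the function names and the log-σ parameterization are our own and not taken from the released implementation.

import math
import torch

def kl_gaussian(mu, log_sigma, prior_sigma):
    # KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), summed over the last dimension;
    # used for the unstructured latent variables z and the translation delta x.
    sigma2 = torch.exp(2 * log_sigma)
    kl = -log_sigma + math.log(prior_sigma) + (sigma2 + mu ** 2) / (2 * prior_sigma ** 2) - 0.5
    return kl.sum(dim=-1)

def kl_rotation_uniform_mean(log_sigma_theta, prior_sigma_theta):
    # Modified divergence of equation 3: the mean of q(theta|y) is not penalized,
    # which approximates a uniform prior over the rotation angle.
    sigma2 = torch.exp(2 * log_sigma_theta)
    return -log_sigma_theta + math.log(prior_sigma_theta) + sigma2 / (2 * prior_sigma_theta ** 2) - 0.5

These terms are added to the expected reconstruction log probability to form the objective stated next.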
The full variational lower bound for our model with inference of the rotation and translation parameters on a single image is

E_{φ∼q(φ|y)}[log pg(y|z, θ, ∆x)] − DKL(q(φ|y)||p(φ)),    (4)

where DKL(q(φ|y)||p(φ)) = DKL(q(z|y)||p(z)) + DKL(q(θ|y)||p(θ)) + DKL(q(∆x|y)||p(∆x)), p(z) = N(0, I), p(θ) = N(0, s²θ), and p(∆x) = N(0, s²∆x) with problem-specific sθ and s∆x. In the case that we do not wish to perform inference on θ or ∆x, these variables are fixed to zero when calculating the expected log probability and their KL divergence terms are ignored:

E_{z∼q(z|y)}[log pg(y|z, θ = 0, ∆x = 0)] − DKL(q(z|y)||p(z)).    (5)

When we wish to impose no prior constraints on the mean of θ, the DKL(q(θ|y)||p(θ)) term is replaced with the modified KL divergence in equation 3. We estimate the expectation of the log probability with a single sample during model training.

3 Related Work

Extensive research on developing deep generative models of images has led to a diversity of frameworks, including Generative Adversarial Networks [7–9], Variational Autoencoders [5, 10], hybrid VAE/GANs [11], and flow-based models [12]. These models have been broadly successful for unsupervised representation learning of images and/or the synthesis of realistic images. These approaches, however, do not impose any semantic structure on their inferred latent space. In contrast, we are interested in separating latent variables into latent translation/rotation and unstructured latent variables encoding image content. Also in this category are models for unsupervised scene understanding, such as AIR [13], which seeks to break scenes down into their constituent objects. The individual object representations are unstructured. Instead, we seek to learn object representations and corresponding generative models that are invariant to rotation and translation by explicitly structuring the latent variables to remove these sources of variation from the object representations. In this way, our work is related to transformation-invariant image clustering, where it is well understood that in the presence of random rotation and translation, discrete clustering methods tend to find cluster centers that explain these transformations [1]. We extend this to learning continuous, rather than discrete, latent representations of images in the presence of random rotation and translation.

Recent literature on learning disentangled or factorized representations includes β-VAE [4] and InfoGAN [14]. These approaches tackle the problem of disentangling latent variables in an unconstrained setting, whereas we explicitly disentangle latent pose from image content variables by constraining their effect to transformations of input pixel coordinates. This approach has the added benefit of introducing no additional hyperparameters or optimization complexity that would be required to impose this disentanglement through modifications to the loss function. 
Other efforts to capture\nuseful latent representations include constraining the manifold of latent values to be homeomorphic\nto some known underlying data manifold [15, 16].\nThe general framework of modeling images as functions mapping spatial coordinates to values has\nnot been extensively explored in the neural network literature. The \ufb01rst example to our knowledge\nis with compositional pattern producing networks (CPPNs) [17], although their focus was on using\nCPPNs to model complex functions (i.e. images) as an analogy to development in nature, with the\nform of the image model being incidental. The only other example, to our knowledge, is given\nby CocoNet [18], a deep neural network which maps 2D pixel coordinates to RGB color values.\n\n4\n\n\fCocoNet learns an image model from single images, using the capacity of the network to memorize\nthe image. The trained network can then be used on various low-level image processing tasks, such\nas image denoising and upsampling. While we use a similar coordinate-based image model, we are\ninstead interested in learning latent generative factors of the dataset from many rotated and translated\nimages. In the EM setting, for example, we would like to learn the distribution over protein structures\nfrom many randomly oriented images. Spatial coordinates have also been used as additional features\nin convolutional neural networks for supervised learning [19] and generative modeling [20]. However,\nthe latter uses these only to augment standard convolutional decoder architectures. In contrast, we\ncondition our generative model on the spatial coordinates to enable structured inference of pose\nparameters.\nIn EM, methods for addressing protein structural heterogeneity can broadly be characterized into ones\nthat operate on 2D images [2] and ones that operate on 3D volumes [3, 21]. These methods assume\nthat images arise from a discrete set of latent conformations and often require signi\ufb01cant manual data\ncuration to group similar conformations. To understand continuous \ufb02exibility, Dashti et al. [22] use\nstatistical manifold embedding of prealigned protein images. More recently, multi-body re\ufb01nement\n[23], in which electron densities of protein substructures are re\ufb01ned separately, has been used to\nmodel independently moving rigid domains, but requires manual de\ufb01nition of these components. Our\nwork is the \ufb01rst attempt to use deep generative models and approximate inference to automatically\nlearn continuous representations of proteins from electron microscopy data.\n\n4 Results\n\n4.1 Experiment setup, implementation, and training details\n\nWe represent the coordinate space of\nan image using Cartesian coordinates\nwith the origin at the image center.\nThe coordinates are normalized such\nthat the left-most and bottom-most\npixels occur at -1 and the right-most\nand top-most pixels occur at +1 along\neach axis. The generator network is\nimplemented as an MLP with tanh ac-\ntivations. The input dimension is 2 +\nthe dimensions of the unstructured la-\ntent variables (Z-D). The output is ei-\nther a single output probability, in the\ncase of binary valued pixels, or a mean\nand standard deviation in the case of\nreal valued pixels. In the following\nexperiments, the binary valued pixel\nlog probability is used for MNIST and\nthe galaxy zoo dataset and the real val-\nued pixel log probability is used for\nthe EM datasets. 
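For reference, the normalized pixel coordinate grid described above can be constructed as in the sketch below, which is what the coordinate-conditioned generator consumes; this assumes a recent PyTorch, and the function name and the convention that the top image row maps to y = +1 are our own choices rather than details of the released code.

import torch

def pixel_grid(height, width):
    # Cartesian coordinates with the origin at the image center, normalized so
    # the left-most/bottom-most pixels sit at -1 and the right-most/top-most at +1.
    ys = torch.linspace(1, -1, height)             # row 0 (top of the image) -> y = +1
    xs = torch.linspace(-1, 1, width)              # column 0 (left of the image) -> x = -1
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([xx.reshape(-1), yy.reshape(-1)], dim=1)   # (height*width, 2)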
In all cases, the in-\nference network uses the same MLP\narchitecture as the generator, except\nthat the inputs are \ufb02attened images\nand the outputs are mean and standard\ndeviations of the latent variables. For\ncomparison, we de\ufb01ne a vanilla VAE\nusing a standard MLP generator net-\nwork in which the Z-D latent variables are mapped directly to the distribution parameters for each\npixel (i.e. the model outputs an n-dimensional vector for binary data and 2n-dimensional vector for\nreal valued data where n is the number of pixels in the image). Parameters are \ufb01t to maximize the\nevidence lower bound (ELBO). All models are trained using ADAM [24] with a learning rate of\n0.0001 and minibatch size of 100. Models were implemented using PyTorch [25].\n\nFigure 2: Comparison of VAEs in terms of the ELBO for\nvarying dimensions of the unstructured latent variables (Z-\nD) on MNIST and transformed MNIST datasets. (Green)\nspatial-VAE, (orange) spatial-VAE trained with \ufb01xed \u03b8 = 0\nand \u2206x = 0, (blue) standard VAE baseline. Spatial-VAE\nachieves better ELBO on both transformed datasets with\nmore pronounced bene\ufb01t when the dimension of the latent\nvariables is small. Remarkably, the spatial-VAE even gives\nsome bene\ufb01t on the plain MNIST dataset with small z di-\nmension, likely due to slight variation in digit slant and\npositioning.\n\n5\n\n\f4.2 Spatial-VAE improves reconstruction when images are transformed\n\nWe \ufb01rst ask whether our spatial-VAE model can succesfully reconstruct image content when images\nhave been transformed through random rotation and translation. To this end, we generate two\nrandomly transformed variants of MNIST. In the \ufb01rst, which we refer to as \"rotated MNIST\", each\nMNIST digit is randomly rotated by an angle sampled from N (0, \u03c02\n16 ) and randomly translated by a\nsmall amount, N (0, 1.42). We then generate a second, harder dataset, where the MNIST digits are\nrandomly rotated by the same amount but we apply much greater translation, sampling the horizontal\nand vertical shift from N (0, 142). We refer to this as the rotated and translated MNIST dataset.\nWe train spatial-VAE models with rotation and translation inference on these datasets and the original\nMNIST dataset. The prior on \u03b8 is set to N (0, \u03c02\n64 ) in the case of regular MNIST where there is very\nlittle rotation of the digits and N (0, \u03c02\n16 ) for the two rotated MNIST datasets. We infer translations\nwith the prior set to N (0, 1.42) on both transformed MNIST datasets. The encoder and decoder\narchitectures are both 2-layer MLPs with 500 hidden units each and tanh activations. The encoder\ntakes a 28*28 dimension input, encoding a \ufb02attened image, and outputs the mean and standard\ndeviation of the approximate posterior distribution over each of the unconstrained latent variables,\nthe rotation, and the translation. The decoder network takes the unconstrained latent variables and the\nrotated and translated x and y coordinates of each pixel and returns the reconstructed pixel value at\nthat coordinate.\n\nFigure 3: Visualization of the learned data manifolds of MNIST and transformed MNIST for models\nwith 2-D unconstrained latent variables. We plot pg(y|z, \u03b8 = 0, \u2206x = 0) for values of the latent\nvariable z equally spaced through the inverse CDF. 
The vanilla VAE and spatial-VAE with \u03b8 and \u2206x\n\ufb01xed to 0 are forced to model digit orientation only with z, whereas the full spatial-VAE model only\nuses z to capture digit style.\n\nWe compare the performance of the spatial-VAE against two baselines. The \ufb01rst is a standard\n(vanilla) autoencoder in which each pixel value is decoded directly from the unconstrained latent\nvariables also using a 2-layer MLP with 500 hidden units and tanh activations (this model takes\nas input Z-D latent variables and outputs a 28*28 dimension vector giving the reconstructed pixel\nvalues). The second is a spatial-VAE with identical architecture to above, but without rotation and\ntranslation inference (i.e. \u03b8 = 0 and \u2206x = 0 for all images). Our spatial-VAE model outperforms\nboth baselines in terms of maximizing the ELBO of the test set images on all three datasets across a\nvariety of sizes of the unconstrained latent variables (Figure 2). As expected, spatial-VAE provides an\nenormous improvement on both transformed MNIST datasets when there are few unconstrained latent\nvariables. Even when the standard VAE is given additional unstructured latent variables to account\nfor rotation and translation, spatial-VAE still achieves identical performance (Appendix Figure 1). As\nthe dimension of z increases, the models lacking rotation and translation inference are able to account\n\n6\n\n\ffor this variability within z so the size of the improvement decreases. Remarkably, spatial-VAE\neven offers some improvement on untransformed MNIST when the dimension of z is small, perhaps\nbecause there is some small variability in digit orientation. This is evident in the manifold of digits\nlearned by these models (Figure 3, where we see that the spatial-VAE model encodes digit shape but\nnot rotation and translation in the unconstrained latent variables). Spatial-VAE learns to generate\nrecognizable digits even from the transformed MNIST images whereas the baseline models do not.\nAlthough we set the rotation and translation priors to match the true distribution above, we observe\nthat spatial-VAE is not sensitive to this setting and achieves the same results with wide priors on these\nparameters (Appendix Figures 2 & 3).\n\nFigure 4: Visualization of the ground truth conformations of 5HDB (top), simulated particle samples\nwith random rotation and added noise (second from top), visualization of the 1-D data manifold\nlearned by spatial-VAE (second from bottom), and visualization of the 1-D data manifold learned\nby the vanilla VAE (bottom). The spatial-VAE model captures the protein\u2019s conformation in the\nunconstrained latent variable separately from protein orientation.\n\n4.3 Spatial-VAE recovers true variation in image content\n\nWe next ask if the spatial-VAE can recover true semantic variability in images that have been observed\nwith random transformations. Speci\ufb01cally, we consider the ability of our spatial-VAE framework to\ncapture protein conformation separately from in-plane rotation in simulated single particle cryo-EM\ndata. To this end, we generated 20,000 simulated projections of integrin \u03b1-IIb in complex with\nintegrin \u03b2-3 (5HDB) [26]) with varying conformations given by a single degree of freedom (see\nappendix for details). These images are 40 by 40 pixels and were randomly split into 16,000 train and\n4,000 test images. We then \ufb01t a spatial-VAE with a 1-D unconstrained latent variable and rotation\ninference. 
This is compared with vanilla VAEs trained with one- and two-dimensional latent variables. All models are 2-layer MLPs with 500 hidden units in each layer and tanh activations. Models were trained for 500 epochs. Because the simulated particles have a uniform prior on the orientation angle, we maximize the variational lower bound with the KL divergence variant presented in equation 3 with the prior on θ set to have σ = π. In order to avoid bad local optima that could arise early on during spatial-VAE training, we freeze the unconstrained latent variable to zero for the first two training epochs. Furthermore, we apply data augmentation to the inference network by randomly rotating images by γ and then removing γ from the predicted rotation angle, where γ is randomly sampled from [0, 2π] for each image at each iteration.

Our spatial-VAE dramatically outperforms the 1-D vanilla VAE and slightly outperforms the 2-D vanilla VAE in terms of ELBO, despite being constrained to only represent rotation with its second latent variable (Appendix Figure 4). Furthermore, in visualizations of the learned data manifold (Figure 4), we see that spatial-VAE correctly reproduces the ground truth protein conformations with orientation removed. The vanilla VAE, on the other hand, does not separate rotation and conformation in the latent space (Appendix Figure 5). We confirm this finding quantitatively by measuring the correlation between the latent variables inferred by these models and the ground truth rotation and conformation factors (Table 1). For the conformation factor we calculate the Pearson correlation and for the rotation factor we calculate the circular correlation measure [27]. The latent space learned by spatial-VAE correlates strongly with the ground truth conformation whereas the latent spaces learned by the standard VAEs do not. The same is true for the inferred rotation.

Table 1: Correlation coefficients of the inferred latent variables with the ground truth factors in the 5HDB dataset.

Model | Variable | Conformation | Rotation
vanilla-VAE (1-d) | z1 | 0.00 | 0.18
vanilla-VAE (2-d) | z1 | 0.09 | 0.02
vanilla-VAE (2-d) | z2 | 0.07 | 0.04
spatial-VAE | z1 | 0.95 | 0.01
spatial-VAE | θ | 0.01 | 0.92

Figure 5: Visualization of samples from the galaxy zoo spatial-VAEs. (Left) Random examples from the training images showing the diversity of galaxy shapes, sizes, and colors. These differences are further confounded by random rotation and position of galaxies within the image frame. (Left-middle, right-middle, right) Samples from spatial-VAE models with 20-, 50-, and 100-D unconstrained latent variables. z is first sampled from N(0, 1), then pg(y|z, θ = 0, ∆x = 0) is plotted for each RGB value. These samples demonstrate the diversity of shapes, sizes, and colors while being invariant to rotation and translation. Best viewed zoomed in.

Figure 6: Visualization of interpolation between observed antibody conformations (Left) and observed CODH/ACS conformations (Right) in the latent space of the 20-D spatial-VAEs trained on each dataset. We plot the mean of the pixel distribution given by the generator network for z interpolated along equally sized steps between the mean of the approximate posterior distributions given by the inference network to randomly selected test image pairs. We show the sampled images in the far left and far right columns respectively. 
Note that the visualizations are generated with θ = 0 and ∆x = 0, which removes the orientation from the observed images.

4.4 Learning transformation-invariant protein and galaxy models

We demonstrate that our spatial-VAE can capture conformational variability separately from rotation and translation in astronomical images from the galaxy zoo dataset and in real single particle electron microscopy images.

Galaxy zoo. The galaxy zoo dataset contains 61,578 training color images of galaxies from the Sloan Digital Sky Survey. We crop each image with random translation and downsample to 64x64 pixels following common practice [28]. We train spatial-VAEs with 20-, 50-, and 100-dimension unconstrained latent variables for 300 epochs following the description above, except that our inference network has 5,000 units in each hidden layer. Furthermore, because these are RGB images, our generator network outputs three values given the spatial coordinate rather than one. We use the KL divergence variant in eq. 3 and set the translation prior standard deviation to 8 pixels. We find that spatial-VAE captures galaxy size, shape, and color independently from rotation and translation (Figure 5).

Single particle electron microscopy. We also train spatial-VAEs on two negative stain EM datasets. The first is a dataset containing the StrepMAB-Classic antibody and the second contains the CODH/ACS protein complex (see appendix for details). We split the antibody dataset into 10,821 training and 2,705 testing images and the CODH/ACS dataset into 11,473 training and 2,868 testing images. Each image is 40 by 40 pixels. Models are 2-layer MLPs with 1,000 hidden units in each layer and tanh activations and are trained for 1,000 epochs. Again, we use the KL divergence variant in equation 3 with prior σ = π. We also perform inference on ∆x with a N(0, 4) pixel prior and apply random rotation augmentation to the inference network training.

Consistent with other work on VAEs and our results above, we do not observe that adding additional latent variables causes overfitting of spatial-VAE (Appendix Figure 6). Furthermore, spatial-VAE recovers continuous conformations of the proteins in these datasets. Figure 6 visualizes interpolation between images using the spatial-VAE model trained with 20-D z. EM images have low signal-to-noise ratios, but the random orientations of the proteins can be seen in the real images in the left and right columns. In the CODH/ACS dataset, spatial-VAE learns a variety of configurations of the \"arms\" of the complex. We note that spatial-VAE models also capture structure in the image background (Appendix Figures 7 & 8), which can be mitigated by constraining the mean of the pixel value distribution given by the generator network to be non-negative.

5 Conclusion

We have presented spatial-VAE, a method for learning latent image representations disentangled from rotation and translation. We showed that formulating the image generative model as a function of the spatial coordinates not only enables efficient inference of the pose parameters but also that this framework leads to improved image modeling. Furthermore, our general approach is highly extensible. 
The\nspatial generator network can operate on signals of any dimension, suggesting that this could be a\npromising approach to 3-D object modeling, although this application will require additional work in\nhigh dimensional pose inference and image formation processes. As a second direction, decoupling\nrotation, translation, and semantic content in the inference network could lead to improvements in the\ninference process. We are hopeful that these ideas will enable new directions in generative models of\nimages and objects.\n\nAcknowledgements\n\nThis work was supported by NIH R01-GM081871.\nWe would like to thank Bridget Carragher and Clint Potter at NYSBC for their support in providing the\nantibody dataset. The NYSBC portion of this work was supported by Simons Foundation SF349247,\nNYSTAR, and NIH GM103310 with additional support from Agouron Institute F00316 and NIH\nOD019994-01.\nWe would also like to thank the laboratory of HHMI investigator Catherine L. Drennan, MIT, for\nproviding the CODH/ACS dataset that was collected with support from the National Institutes of\nHealth (R35 GM126982).\n\nReferences\n[1] B. J. Frey and N. Jojic. Transformation-invariant clustering using the em algorithm. IEEE Transactions on\nPattern Analysis and Machine Intelligence, 25(1):1\u201317, Jan 2003. ISSN 0162-8828. doi: 10.1109/TPAMI.\n2003.1159942.\n\n[2] Sjors HW Scheres, Mikel Valle, Rafael Nu\u00f1ez, Carlos OS Sorzano, Roberto Marabini, Gabor T Herman,\nand Jose-Maria Carazo. Maximum-likelihood multi-reference re\ufb01nement for electron microscopy images.\nJournal of molecular biology, 348(1):139\u2013149, 2005.\n\n[3] Sjors HW Scheres. Maximum-likelihood methods in cryo-em. part ii: Application to experimental data.\n\nMethods in enzymology, 482:295, 2010.\n\n9\n\n\f[4] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and\n\nAlexander Lerchner. Understanding disentangling in \u03b2-VAE. arXiv.org, April 2018.\n\n[5] Diederik P Kingma and Max Welling. Auto-encoding variational bayes.\n\nConference on Learning Representations (ICLR), 2013.\n\nIn The 2nd International\n\n[6] Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyperspherical\n\nvariational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.\n\n[7] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron\n\nCourville, and Yoshua Bengio. Generative Adversarial Networks. arXiv.org, June 2014.\n\n[8] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep Generative Image Models using a\n\nLaplacian Pyramid of Adversarial Networks. arXiv.org, June 2015.\n\n[9] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep\n\nConvolutional Generative Adversarial Networks. arXiv.org, November 2015.\n\n[10] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and\n\nAaron Courville. PixelVAE: A Latent Variable Model for Natural Images. arXiv.org, November 2016.\n\n[11] Larsen, Anders Boesen Lindbo, S\u00f8nderby, S\u00f8ren Kaae, Larochelle, Hugo, and Winther, Ole. Autoencoding\n\nbeyond pixels using a learned similarity metric. arXiv.org, December 2015.\n\n[12] Diederik P Kingma and Prafulla Dhariwal. 
Glow: Generative Flow with Invertible 1x1 Convolutions.\n\narXiv.org, July 2018.\n\n[13] SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al.\nAttend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information\nProcessing Systems, pages 3225\u20133233, 2016.\n\n[14] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN:\nInterpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv.org,\nJune 2016.\n\n[15] Luca Falorsi, Pim de Haan, Tim R Davidson, Nicola De Cao, Maurice Weiler, Patrick Forr\u00e9, and Taco S\n\nCohen. Explorations in Homeomorphic Variational Auto-Encoding. arXiv.org, July 2018.\n\n[16] Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyperspherical\n\nVariational Auto-Encoders. arXiv.org, April 2018.\n\n[17] Kenneth O Stanley. Compositional pattern producing networks: A novel abstraction of development.\n\nGenetic Programming and Evolvable Machines, 8(2):131\u2013162, May 2007.\n\n[18] Paul Andrei Bricman and Radu Tudor Ionescu. Coconet: A deep neural network for mapping pixel\ncoordinates to color values.\nIn Long Cheng, Andrew Chi Sing Leung, and Seiichi Ozawa, editors,\nNeural Information Processing, pages 64\u201376, Cham, 2018. Springer International Publishing. ISBN\n978-3-030-04179-3.\n\n[19] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason\nYosinski. An intriguing failing of convolutional neural networks and the coordconv solution. In Advances\nin Neural Information Processing Systems, pages 9605\u20139616, 2018.\n\n[20] Nicholas Watters, Loic Matthey, Christopher P Burgess, and Alexander Lerchner. Spatial broadcast decoder:\nA simple architecture for learning disentangled representations in vaes. arXiv preprint arXiv:1901.07017,\n2019.\n\n[21] Lyumkis, Dmitry, Brilot, Axel F, Theobald, Douglas L, and Grigorieff, Nikolaus. Likelihood-based\nclassi\ufb01cation of cryo-EM images using FREALIGN. Journal of structural biology, 183(3):377\u2013388,\nSeptember 2013.\n\n[22] Ali Dashti, Peter Schwander, Robert Langlois, Russell Fung, Wen Li, Ahmad Hosseinizadeh, Hstau Y\nLiao, Jesper Pallesen, Gyanesh Sharma, Vera A Stupina, Anne E Simon, Jonathan D Dinman, Joachim\nFrank, and Abbas Ourmazd. Trajectories of the ribosome as a Brownian nanomachine. Proceedings of the\nNational Academy of Sciences of the United States of America, 111(49):17492\u201317497, December 2014.\n\n[23] Takanori Nakane, Dari Kimanius, Erik Lindahl, and Sjors Hw Scheres. Characterisation of molecular\nmotions in cryo-EM single-particle data by multi-body re\ufb01nement in RELION. eLife, 7:e36861, June\n2018.\n\n10\n\n\f[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming\nLin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS\nAutodiff Workshop, 2017.\n\n[26] Fu-Yang Lin, Jianghai Zhu, Edward T Eng, Nathan E Hudson, and Timothy A Springer. \u03b2-subunit binding\nis suf\ufb01cient for ligands to open the integrin \u03b1iib\u03b23 headpiece. Journal of Biological Chemistry, 291(9):\n4537\u20134546, 2016.\n\n[27] S.R. Jammalamadaka and A. Sengupta. Topics in Circular Statistics. Series on multivariate analy-\nsis. World Scienti\ufb01c, 2001. 
ISBN 9789812779267. URL https://books.google.com/books?id=sKqWMGqQXQkC.

[28] Sander Dieleman, Kyle W. Willett, and Joni Dambre. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2):1441–1459, April 2015. ISSN 0035-8711. doi: 10.1093/mnras/stv632. URL https://doi.org/10.1093/mnras/stv632.", "award": [], "sourceid": 8930, "authors": [{"given_name": "Tristan", "family_name": "Bepler", "institution": "MIT"}, {"given_name": "Ellen", "family_name": "Zhong", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Kotaro", "family_name": "Kelley", "institution": "New York Structural Biology Center"}, {"given_name": "Edward", "family_name": "Brignole", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Bonnie", "family_name": "Berger", "institution": "MIT"}]}