{"title": "MaCow: Masked Convolutional Generative Flow", "book": "Advances in Neural Information Processing Systems", "page_first": 5893, "page_last": 5902, "abstract": "Flow-based generative models, conceptually attractive due to the tractability of both exact log-likelihood computation and latent-variable inference, and the efficiency of both training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations.\nDespite their computational efficiency, the density estimation performance of flow-based generative models falls significantly behind that of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MaCow), a simple yet effective architecture of generative flow using masked convolution. By restricting the local connectivity to a small kernel, MaCow enjoys fast and stable training and efficient sampling, while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap to autoregressive models.", "full_text": "MaCow: Masked Convolutional Generative Flow\n\nXuezhe Ma, Xiang Kong, Shanghang Zhang, Eduard Hovy\n\nxuezhem,xiangk@cs.cmu.edu, shanghaz@andrew.cmu.edu, hovy@cmu.edu\n\nCarnegie Mellon University\n\nPittsburgh, PA, USA\n\nAbstract\n\nFlow-based generative models, conceptually attractive due to the tractability of exact log-likelihood computation and latent-variable inference as well as their efficiency in training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. 
Despite their computational efficiency, the density estimation performance of flow-based generative models falls significantly behind that of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MACOW), a simple yet effective architecture for generative flow using masked convolution. By restricting the local connectivity to a small kernel, MACOW features fast and stable training along with efficient sampling while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap with autoregressive models.\n\n1 Introduction\n\nUnsupervised learning of probabilistic models is a central yet challenging problem. Deep generative models have shown promising results in modeling complex distributions such as natural images (Radford et al., 2015), audio (Van Den Oord et al., 2016) and text (Bowman et al., 2015). Multiple approaches have emerged in recent years, including Variational Autoencoders (VAEs) (Kingma and Welling, 2014), Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), autoregressive neural networks (Larochelle and Murray, 2011; Oord et al., 2016), and flow-based generative models (Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018). Among these, flow-based generative models have gained popularity for their capability of estimating densities of complex distributions, efficiently generating high-fidelity syntheses, and automatically learning useful latent spaces.\n\nFlow-based generative models typically warp a simple distribution into a complex one by mapping points from the simple distribution to the complex data distribution through a chain of invertible transformations with Jacobian determinants that are efficient to compute. This design guarantees that the density of the transformed distribution can be computed analytically, making maximum likelihood learning feasible. 
Flow-based generative models have attracted significant interest in improving and analyzing their algorithms, both theoretically and practically, and in applying them to a wide range of tasks and domains.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn their pioneering work, Dinh et al. (2014) first proposed Non-linear Independent Component Estimation (NICE) to apply flow-based models to modeling complex high-dimensional densities. RealNVP (Dinh et al., 2016) extended NICE with a more flexible invertible transformation to experiment with natural images. However, these flow-based generative models resulted in worse density estimation performance compared to state-of-the-art autoregressive models, and are incapable of realistic synthesis of large images compared to GANs (Karras et al., 2018; Brock et al., 2019). Recently, Kingma and Dhariwal (2018) proposed Glow as a generative flow with invertible 1 × 1 convolutions, which significantly improved the density estimation performance on natural images. Importantly, they demonstrated that flow-based generative models optimized towards the plain likelihood-based objective are capable of generating realistic high-resolution natural images efficiently. Prenger et al. (2018) investigated applying flow-based generative models to speech synthesis by combining Glow with WaveNet (Van Den Oord et al., 2016). Ziegler and Rush (2019) adopted variational inference to apply generative flows to discrete sequential data. Unfortunately, the density estimation performance of Glow on natural images remains behind autoregressive models, such as PixelRNN/CNN (Oord et al., 2016; Salimans et al., 2017), Image Transformer (Parmar et al., 2018), PixelSNAIL (Chen et al., 2017) and SPN (Menick and Kalchbrenner, 2019). 
There is also some work (Rezende and Mohamed, 2015; Kingma et al., 2016; Zheng et al., 2017) that applies flows to variational inference.\n\nIn this paper, we propose a novel architecture of generative flow, masked convolutional generative flow (MACOW), which leverages masked convolutional neural networks (Oord et al., 2016). The bijective mapping between input and output variables is easily established while the computation of the determinant of the Jacobian remains efficient. Compared to inverse autoregressive flow (IAF) (Kingma et al., 2016), MACOW offers stable training and efficient inference and synthesis by restricting the local connectivity to a small “masked” kernel, and obtains large receptive fields by stacking multiple layers of convolutional flows and using rotational ordering masks (§3.1). We also propose a fine-grained version of the multi-scale architecture adopted in previous flow-based generative models to further improve the performance (§3.2). Experimenting with four benchmark image datasets, CIFAR-10, ImageNet, LSUN, and CelebA-HQ, we demonstrate the effectiveness of MACOW as a density estimator by consistently achieving significant improvements over Glow on all four datasets. When equipped with the variational dequantization mechanism (Ho et al., 2019), MACOW considerably narrows the gap in density estimation with autoregressive models (§4).\n\n2 Flow-based Generative Models\n\nIn this section, we first set up notation, describe flow-based generative models, and review Glow (Kingma and Dhariwal, 2018), as it is the foundation for MACOW.\n\n2.1 Notations\n\nThroughout the paper, uppercase letters represent random variables and lowercase letters represent realizations of their corresponding random variables. 
Let X ∈ X be the random variables of the observed data, e.g., X is an image or a sentence for image and text generation, respectively.\n\nLet P denote the true distribution of the data, i.e., X ∼ P, and D = {x1, . . . , xN} be our training sample, where xi, i = 1, . . . , N, are usually i.i.d. samples of X. Let P = {Pθ : θ ∈ Θ} denote a parametric statistical model indexed by the parameter θ ∈ Θ, where Θ is the parameter space. p denotes the density of the corresponding distribution P. In the deep generative model literature, deep neural networks are the most widely used parametric models. The goal of generative models is to learn the parameter θ such that Pθ can best approximate the true distribution P. In the context of maximum likelihood estimation, we minimize the negative log-likelihood of the parameters with:\n\n$$\\min_{\\theta \\in \\Theta} \\frac{1}{N} \\sum_{i=1}^{N} -\\log p_\\theta(x_i) = \\min_{\\theta \\in \\Theta} \\mathbb{E}_{\\tilde{P}(X)}[-\\log p_\\theta(X)], \\quad (1)$$\n\nwhere P̃(X) is the empirical distribution derived from training data D.\n\n2.2 Flow-based Models\n\nIn the framework of flow-based generative models, a set of latent variables Z ∈ Z is introduced with a prior distribution pZ(z), which is typically a simple distribution like a multivariate Gaussian. For a bijection function f : X → Z (with g = f^{-1}), the change of variables formula defines the model distribution on X by\n\n$$p_\\theta(x) = p_Z(f_\\theta(x)) \\left| \\det\\left( \\frac{\\partial f_\\theta(x)}{\\partial x} \\right) \\right|, \\quad (2)$$\n\nwhere ∂fθ(x)/∂x is the Jacobian of fθ at x.\n\nThe generative process is defined straightforwardly as follows:\n\n$$z \\sim p_Z(z), \\qquad x = g_\\theta(z). \\quad (3)$$\n\nFlow-based generative models focus on certain types of transformations fθ that allow the inverse functions gθ 
and Jacobian determinants to be tractable to compute. By stacking multiple such invertible transformations in a sequence, which is also called a (normalizing) flow (Rezende and Mohamed, 2015), the flow is then capable of warping a simple distribution (pZ(z)) into a complex one (p(x)) through:\n\n$$X \\underset{g_1}{\\overset{f_1}{\\longleftrightarrow}} H_1 \\underset{g_2}{\\overset{f_2}{\\longleftrightarrow}} H_2 \\underset{g_3}{\\overset{f_3}{\\longleftrightarrow}} \\cdots \\underset{g_K}{\\overset{f_K}{\\longleftrightarrow}} Z,$$\n\nwhere f = f1 ◦ f2 ◦ ··· ◦ fK is a flow of K transformations. For brevity, we omit the parameter θ from fθ and gθ.\n\n2.3 Glow\n\nRecently, several types of invertible transformations emerged to enhance the expressiveness of flows, among which Glow (Kingma and Dhariwal, 2018) has stood out for its simplicity and effectiveness on both density estimation and high-fidelity synthesis. The following briefly describes the three types of transformations that comprise Glow.\n\nActnorm. Kingma and Dhariwal (2018) proposed an activation normalization layer (Actnorm) as an alternative to batch normalization (Ioffe and Szegedy, 2015) to alleviate the challenges in model training. Similar to batch normalization, Actnorm performs an affine transformation of the activations using a scale and bias parameter per channel for 2D images, such that\n\n$$y_{i,j} = s \\odot x_{i,j} + b,$$\n\nwhere both x and y are tensors of shape [h × w × c] with spatial dimensions (h, w) and channel dimension c.\n\nInvertible 1 × 1 convolution. To incorporate a permutation along the channel dimension, Glow includes a trainable invertible 1 × 1 convolution layer to generalize the permutation operation as:\n\n$$y_{i,j} = W x_{i,j},$$\n\nwhere W is the weight matrix with shape c × c.\n\nAffine Coupling Layers. Following Dinh et al. 
(2016), Glow includes affine coupling layers in its architecture:\n\n$$\\begin{aligned} x_a, x_b &= \\mathrm{split}(x) \\\\\\\\ y_a &= x_a \\\\\\\\ y_b &= s(x_a) \\odot x_b + b(x_a) \\\\\\\\ y &= \\mathrm{concat}(y_a, y_b), \\end{aligned}$$\n\nwhere s(xa) and b(xa) are outputs of two neural networks with xa as input. The split() and concat() functions perform operations along the channel dimension.\n\nFrom this architecture of Glow, we see that interactions between spatial dimensions are incorporated only in the coupling layers. The coupling layer, however, is typically costly in memory, making it infeasible to stack a significant number of coupling layers into a single model, especially when processing high-resolution images. The main goal of this work is to design a new type of transformation that simultaneously models the dependencies in both the spatial and channel dimensions while maintaining a relatively small memory footprint to improve the capacity of the generative flow.\n\n3 Masked Convolutional Generative Flows\n\nIn this section, we describe the architectural components of the masked convolutional generative flow (MACOW). First, we introduce the proposed flow transformation using masked convolutions in §3.1. Then, we present a fine-grained version of the multi-scale architecture adopted by previous generative flows (Dinh et al., 2016; Kingma and Dhariwal, 2018) in §3.2.\n\nFigure 1: Visualization of the receptive field of four masked convolutions with rotational ordering.\n\n3.1 Flow with Masked Convolutions\n\nApplying autoregressive models to normalizing flows has been previously explored (Kingma et al., 2016; Papamakarios et al., 2017), with the idea of sequentially modeling the input random variables in an autoregressive order to ensure the model cannot read input variables behind the current one:\n\n$$y_t = s(x_{<t}) \\odot x_t + b(x_{<t}), \\quad (4)$$\n\nwhere x<t denotes the input variables in x positioned ahead of xt in the autoregressive order. 
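As a rough illustration of the autoregressive affine transformation in Eq. (4), the following NumPy sketch applies it to a short 1D vector. The functions toy_s and toy_b are hypothetical stand-ins for the networks s() and b() and are not taken from the paper; they are chosen only so that the scale stays positive. The sketch shows why the forward pass can be evaluated for all positions at once, while the inverse has to proceed one variable at a time.

```python
import numpy as np

def toy_s(prefix):
    # Hypothetical positive scale that reads only the earlier variables x_<t.
    return 1.0 + 0.1 * np.tanh(np.sum(prefix))

def toy_b(prefix):
    # Hypothetical shift that reads only the earlier variables x_<t.
    return 0.5 * np.sum(prefix)

def forward(x):
    # y_t = s(x_<t) * x_t + b(x_<t). Every y_t depends only on the known x,
    # so the iterations are independent and could run in parallel.
    y = np.empty_like(x)
    log_det = 0.0
    for t in range(len(x)):
        s = toy_s(x[:t])
        y[t] = s * x[t] + toy_b(x[:t])
        # The Jacobian is triangular, so its log-determinant is the sum of log s.
        log_det += np.log(s)
    return y, log_det

def inverse(y):
    # x_t needs the already recovered x_<t, so this loop is inherently sequential.
    x = np.empty_like(y)
    for t in range(len(y)):
        x[t] = (y[t] - toy_b(x[:t])) / toy_s(x[:t])
    return x

x = np.array([0.3, -1.2, 0.8, 2.0])
y, log_det = forward(x)
x_rec = inverse(y)
```

Under these assumptions, inverse(forward(x)) recovers x up to floating-point error, and the sequential loop in inverse is the cost that the masked kernels described next are meant to reduce.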
s() and b() are two autoregressive neural networks typically implemented using spatial masks (Germain et al., 2015; Oord et al., 2016).\n\nDespite their effectiveness in high-dimensional spaces, autoregressive flows suffer from two crucial problems: (1) The training procedure is unstable when stacking multiple layers to increase the flow capacity for complex distributions. (2) Inference and synthesis are inefficient, due to the non-parallelizable inverse function.\n\nWe propose to use masked convolutions to restrict the local connectivity to a small “masked” kernel to address these two problems. The two autoregressive neural networks, s() and b(), are implemented with one-layer masked convolutional networks with small kernels (e.g. 2 × 5 in Figure 1) to ensure they only read contexts in a small neighborhood:\n\n$$s(x_{<t}) = s(x_{t\\star}), \\quad b(x_{<t}) = b(x_{t\\star}), \\quad (5)$$\n\nwhere xt⋆ denotes the input variables, restricted to a small kernel, on which xt depends. By using masks in rotational ordering and stacking multiple layers of flows, the model captures a large receptive field (see Figure 1), and models dependencies in both the spatial and channel dimensions.\n\nEfficient Synthesis. As discussed above, synthesis from autoregressive flows is inefficient since the inverse must be computed by sequentially traversing through the autoregressive order. In the context of 2D images with shape [h × w × c], the time complexity of synthesis is quadratic, i.e. O(h × w × NN(h, w, c)), where NN(h, w, c) is the time of computing the outputs from the neural networks s() and b() with input shape [h × w × c]. In our proposed flow with masked convolutions, computation of xi,j begins as soon as all xt⋆ are available, contrary to the autoregressive requirement that all x<i,j must have been already computed. 
Moreover, at each step we only need to feed a slice of the image (with shape [h × kw × c] or [kh × w × c] depending on the direction of the mask) into s() and b(). Here [kh × kw × c] is the shape of the kernel in the convolution. Thus, the time complexity reduces significantly from quadratic to linear, which is O(h × NN(kh, w, c)) or O(w × NN(kw, h, c)) for horizontal and vertical masks, respectively.\n\nDiscussion. The previous work most closely related to MACOW is the Emerging Convolutions proposed in Hoogeboom et al. (2019). There are two main differences. i) The pattern of the mask is different. Emerging Convolutions use “causal masks” (Oord et al., 2016) whose inverse function falls into a complete autoregressive transformation. In contrast, MACOW achieves significantly more efficient inference and sampling (§4.3), due to the carefully designed masks (Figure 1). ii) The Emerging Convolutional Flow, proposed as an alternative to the invertible 1 × 1 convolution in Glow, is basically a linear transformation with masked convolutions, which does not introduce “nonlinearity” to the random variables. MACOW, however, introduces such nonlinearity similar to the coupling layers.\n\nFigure 2: The architecture of the proposed MACOW model, where each step (a) consists of T units of ActNorm followed by two masked convolutions with rotational ordering, and a Glow step. This flow is combined with either an original multi-scale (b) or a fine-grained architecture (c). Panels: (a) One step of MACOW; (b) Original multi-scale architecture; (c) Fine-grained multi-scale architecture.\n\n3.2 Fine-grained Multi-Scale Architecture\n\nDinh et al. (2016) proposed a multi-scale architecture using a squeezing operation, which has been demonstrated to be helpful for training very deep flows. 
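For readers unfamiliar with the squeezing operation, the sketch below shows one common way to implement it in NumPy: each 2 × 2 spatial block is moved into the channel dimension, so a tensor of shape [h × w × c] becomes [h/2 × w/2 × 4c]. The function names and the exact ordering of the new channels are assumptions made for illustration and are not taken from the paper or its code.

```python
import numpy as np

def squeeze(x, factor=2):
    # x has shape [h, w, c]; the result has shape
    # [h / factor, w / factor, c * factor * factor].
    h, w, c = x.shape
    assert h % factor == 0 and w % factor == 0
    x = x.reshape(h // factor, factor, w // factor, factor, c)
    # Bring the two within-block offsets next to the channel axis.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // factor, w // factor, factor * factor * c)

def unsqueeze(y, factor=2):
    # Exact inverse of squeeze: channels are spread back over space.
    hs, ws, cs = y.shape
    c = cs // (factor * factor)
    y = y.reshape(hs, ws, factor, factor, c)
    y = y.transpose(0, 2, 1, 3, 4)
    return y.reshape(hs * factor, ws * factor, c)

x = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
y = squeeze(x)
x_back = unsqueeze(y)
```

Because squeeze only rearranges entries, it is trivially invertible and contributes nothing to the log-determinant. A multi-scale flow can then split off part of the enlarged channel dimension after each scale.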
In the original multi-scale architecture, the model factors out half of the dimensions at each scale to reduce computational and memory costs. In this paper, inspired by the size upscaling in subscale ordering (Menick and Kalchbrenner, 2019) that generates an image as a sequence of sub-images with equal size, we propose a fine-grained multi-scale architecture to improve model performance further. In this fine-grained multi-scale architecture, each scale consists of M/2 blocks, and after each block, the model splits out 1/M dimensions of the input (footnote 1). Figure 2 illustrates the graphical specification of the two architecture versions. Note that the fine-grained architecture reduces the number of parameters compared with the original architecture with the same number of flow layers. Experimental improvements demonstrate the effectiveness of the fine-grained multi-scale architecture (§4).\n\n4 Experiments\n\nWe evaluate our MACOW model on both low-resolution and high-resolution datasets. For a step of MACOW, we use T = 2 masked convolution units, and the Glow step is the same as that described in Kingma and Dhariwal (2018) where an ActNorm is followed by an invertible 1 × 1 convolution, which is followed by a coupling layer. Each coupling layer includes three convolution layers where the first and last convolutions are 3 × 3, while the center convolution is 1 × 1. For low-resolution images, we use affine coupling layers with 512 hidden channels, while for high-resolution images we use additive layers with 256 hidden channels to reduce memory cost. ELU (Clevert et al., 2015) is used as the activation function throughout the flow architecture. For variational dequantization, the dequantization noise distribution qφ(u|x) is modeled with a conditional MACOW with a shallow architecture. 
Additional details on architectures, results, and analysis of the conducted experiments are provided in Appendix B.\n\n4.1 Low-Resolution Images\n\nWe begin our experiments with an evaluation of the density estimation performance of MACOW on two low-resolution image datasets that are commonly used to evaluate deep generative models: CIFAR-10 with images of size 32 × 32 (Krizhevsky and Hinton, 2009) and the 64 × 64 downsampled version of ImageNet (Oord et al., 2016).\n\nWe perform ablation studies to dissect the effectiveness of each component of our MACOW model. The Org model utilizes the original multi-scale architecture, while the +fine-grained model augments the original one with the fine-grained multi-scale architecture proposed in §3.2. The\n\nFootnote 1: In our experiments, we set M = 4. Note that the original multi-scale architecture is a special case of the fine-grained version with M = 2.\n\nTable 1: Density estimation performance on CIFAR-10 32 × 32 and ImageNet 64 × 64. 
Results are reported in bits/dim.\n\nCategory | Model | CIFAR-10 | ImageNet-64\nAutoregressive | IAF VAE (Kingma et al., 2016) | 3.11 | –\nAutoregressive | Parallel Multiscale (Reed et al., 2017) | – | 3.70\nAutoregressive | PixelRNN (Oord et al., 2016) | 3.00 | 3.63\nAutoregressive | Gated PixelCNN (van den Oord et al., 2016) | 3.03 | 3.57\nAutoregressive | MAE (Ma et al., 2019) | 2.95 | –\nAutoregressive | PixelCNN++ (Salimans et al., 2017) | 2.92 | –\nAutoregressive | PixelSNAIL (Chen et al., 2017) | 2.85 | 3.52\nAutoregressive | SPN (Menick and Kalchbrenner, 2019) | – | 3.52\nFlow-based | Real NVP (Dinh et al., 2016) | 3.49 | 3.98\nFlow-based | Glow (Kingma and Dhariwal, 2018) | 3.35 | 3.81\nFlow-based | Flow++: Unif (Ho et al., 2019) | 3.29 | –\nFlow-based | Flow++: Var (Ho et al., 2019) | 3.09 | 3.69\nFlow-based | MACOW: Org | 3.31 | 3.78\nFlow-based | MACOW: +fine-grained | 3.28 | 3.75\nFlow-based | MACOW: +var | 3.16 | 3.69\n\nTable 2: Negative log-likelihood scores for 5-bit LSUN and CelebA-HQ datasets in bits/dim.\n\nModel | CelebA-HQ | LSUN bedroom | LSUN tower | LSUN church\nGlow (Kingma and Dhariwal, 2018) | 1.03 | 1.20 | – | –\nSPN (Menick and Kalchbrenner, 2019) | 0.61 | – | – | –\nMACOW: Unif | 0.95 | 1.16 | 1.22 | 1.36\nMACOW: Var | 0.67 | 0.98 | 1.02 | 1.09\n\n+var model further implements the variational dequantization on top of +fine-grained to replace the uniform dequantization (see Appendix A for details). For each ablation, we slightly adjust the number of steps in each level so that all the models have a similar number of parameters.\n\nTable 1 provides the density estimation performance for different variations of our MACOW model along with the top-performing autoregressive models (first section) and flow-based generative models (second section). First, on both datasets, fine-grained models outperform Org ones, demonstrating the effectiveness of the fine-grained multi-scale architecture. 
Second, with uniform dequantization, MACOW combined with the fine-grained multi-scale architecture significantly improves the performance over Glow on both datasets, and obtains slightly better results than Flow++ on CIFAR-10. In addition, with variational dequantization, MACOW achieves a comparable result in bits/dim with Flow++ on ImageNet 64 × 64. On CIFAR-10, however, the performance of MACOW is around 0.07 bits/dim behind Flow++.\n\nCompared with the state-of-the-art autoregressive generative models PixelSNAIL (Chen et al., 2017) and SPN (Menick and Kalchbrenner, 2019), the performance of MACOW is approximately 0.31 bits/dim worse on CIFAR-10 and 0.14 worse on ImageNet 64 × 64. Further improving the density estimation performance of MACOW on natural images is left to future work.\n\n4.2 High-Resolution Images\n\nWe next demonstrate experimentally that our MACOW model is capable of generating high-fidelity samples at high resolution. Following Kingma and Dhariwal (2018), we choose the CelebA-HQ dataset (Karras et al., 2018), which consists of 30,000 high-resolution images from the CelebA dataset (Liu et al., 2015), and the LSUN (Yu et al., 2015) datasets including categories bedroom, tower and church. We train our models on 5-bit images with the fine-grained multi-scale architecture and both the uniform and variational dequantization. 
For each model, we adjust the number of steps in each level so that all the models have a similar number of parameters to Glow for a fair comparison.\n\nFigure 3: (a) 5-bit 256 × 256 CelebA-HQ samples with temperature 0.7; (b)(c)(d) 5-bit 128 × 128 LSUN church, tower and bedroom samples, with temperature 0.9, respectively.\n\n4.2.1 Density Estimation\n\nTable 2 reports the negative log-likelihood scores in bits/dim of two versions of MACOW on the 5-bit 128 × 128 LSUN and 256 × 256 CelebA-HQ datasets. With uniform dequantization, MACOW improves the negative log-likelihood over Glow from 1.03 bits/dim to 0.95 bits/dim on the CelebA-HQ dataset. Equipped with variational dequantization, MACOW obtains 0.67 bits/dim, which is 0.06 bits/dim behind the state-of-the-art autoregressive generative model SPN (Menick and Kalchbrenner, 2019) and significantly narrows the gap. On the LSUN datasets, MACOW with uniform dequantization outperforms Glow by 0.04 bits/dim on the bedroom category. With variational dequantization, the model achieves further improvements on all three categories of the LSUN datasets.\n\n4.2.2 Image Generation\n\nConsistent with previous work on likelihood-based generative models (Parmar et al., 2018; Kingma and Dhariwal, 2018), we found that sampling from a reduced-temperature model often results in higher-quality samples. Figure 3 showcases some random samples for 5-bit CelebA-HQ 256 × 256 with temperature 0.7 and LSUN 128 × 128 with temperature 0.9. The images are extremely high\n\nTable 3: (a) Image synthesis speed on CIFAR10. Glow re-implemented in PyTorch is marked with †. ‡ denotes results shown in Hoogeboom et al. (2019). (b) Image synthesis speed of MACOW on datasets with different image sizes. 
The time is measured in milliseconds to sample a datapoint when computed in mini-batches of size 100.\n\n(a)\nCIFAR10 | time (ms) | Slow-down\nGlow‡ | 5 | 1.0\nMAF‡ | 3000 | 600.0\nEmerging‡ | 1800 | 360.0\nGlow† | 5.3 | 1.0\nMACOW | 38.7 | 7.3\n\n(b)\nDataset | image size | time (ms)\nCIFAR10 | 32 × 32 | 38.7\nImageNet | 64 × 64 | 104.7\nLSUN | 128 × 128 | 267.9\nCelebA-HQ | 256 × 256 | 434.2\n\nquality for non-autoregressive likelihood models, even though maximum likelihood is a principle that values diversity over sample quality in a limited capacity setting (Theis et al., 2016). More samples of images, including samples of low-resolution ones, are provided in Appendix C (footnote 2).\n\n4.3 Comparison on Synthesis Speed\n\nIn this section, we compare the synthesis speed of MACOW at test time with that of Glow (Kingma and Dhariwal, 2018), Masked Autoregressive Flows (MAF) (Papamakarios et al., 2017) and Emerging Convolutions (Hoogeboom et al., 2019). Following Hoogeboom et al. (2019), we measure the time to sample a datapoint when computed in mini-batches of size 100. For a fair comparison, we re-implemented Glow using PyTorch (Paszke et al., 2017), and all experiments are conducted on a single NVIDIA TITAN X GPU.\n\nTable 3a shows the sampling speed of MACOW on CIFAR-10, together with that of the baselines. MACOW is 7.3 times slower than Glow, but much faster than Emerging Convolutions and MAF, whose slow-down factors are 360 and 600, respectively. The sampling speed of MACOW on datasets with different image sizes is shown in Table 3b. We see that the time of synthesis increases approximately linearly with the increase of image resolution.\n\n5 Conclusion\n\nIn this paper, we propose a new type of generative flow, coined MACOW, which exploits masked convolutional neural networks. By restricting the local dependencies to a small masked kernel, MACOW boasts fast and stable training as well as efficient sampling. 
Experiments on both low- and high-resolution benchmark datasets of images show the capability of MACOW on both density estimation and high-fidelity generation, achieving state-of-the-art or comparable likelihood as well as superior sample quality compared to previous top-performing models (footnote 3).\n\nA potential direction for future work is to extend MACOW to other forms of data, in particular text, on which no attempt (to the best of our knowledge) has been made to apply flow-based generative models. Another exciting direction is to combine MACOW with variational inference to automatically learn meaningful (low-dimensional) representations from raw data.\n\nFootnote 2: The reduced-temperature sampling is only applied to LSUN and CelebA-HQ 5-bit images, where MACOW adopts additive coupling layers. For CIFAR-10 and ImageNet 8-bit images, we sample with temperature 1.0.\n\nFootnote 3: https://github.com/XuezheMax/macow\n\nReferences\n\nSamuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.\n\nAndrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.\n\nXi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763, 2017.\n\nDjork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.\n\nLaurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.\n\nLaurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. 
arXiv preprint arXiv:1605.08803, 2016.\n\nMathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.\n\nIan Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems (NIPS-2014), pages 2672–2680, 2014.\n\nJonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pages 2722–2730, 2019.\n\nEmiel Hoogeboom, Rianne Van Den Berg, and Max Welling. Emerging convolutions for generative normalizing flows. In International Conference on Machine Learning, pages 2771–2780, 2019.\n\nSergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.\n\nTero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.\n\nDiederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\nDiederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR-2014), Banff, Canada, April 2014.\n\nDiederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.\n\nDurk P Kingma and Prafulla Dhariwal. 
Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.\n\nAlex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.\n\nHugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-2011), pages 29–37, 2011.\n\nZiwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), pages 3730–3738, 2015.\n\nXuezhe Ma, Chunting Zhou, and Eduard Hovy. MAE: Mutual posterior-divergence regularization for variational autoencoders. In International Conference on Learning Representations (ICLR), 2019.\n\nJacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling, 2019.\n\nAaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of International Conference on Machine Learning (ICML-2016), 2016.\n\nGeorge Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.\n\nNiki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. arXiv preprint arXiv:1802.05751, 2018.\n\nAdam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.\n\nRyan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A flow-based generative network for speech synthesis. 
arXiv preprint arXiv:1811.00002, 2018.\n\nAlec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n\nScott Reed, Aäron Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando Freitas. Parallel multiscale autoregressive density estimation. In International Conference on Machine Learning, pages 2912–2921, 2017.\n\nDanilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.\n\nTim Salimans, Andrej Karpathy, Xi Chen, Diederik P Kingma, and Yaroslav Bulatov. Pixelcnn++: A pixelcnn implementation with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations (ICLR), 2017.\n\nL Theis, A van den Oord, and M Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pages 1–10, 2016.\n\nBenigno Uria, Iain Murray, and Hugo Larochelle. Rnade: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pages 2175–2183, 2013.\n\nAäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. 2016.\n\nAaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.\n\nFisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.\n\nGuoqing Zheng, Yiming Yang, and Jaime G. Carbonell. 
Convolutional normalizing flows. CoRR, abs/1711.02255, 2017.\n\nZachary M Ziegler and Alexander M Rush. Latent normalizing flows for discrete sequences. In Proceedings of International Conference on Machine Learning (ICML-2019), 2019.", "award": [], "sourceid": 3167, "authors": [{"given_name": "Xuezhe", "family_name": "Ma", "institution": "Carnegie Mellon University"}, {"given_name": "Xiang", "family_name": "Kong", "institution": "Carnegie Mellon University"}, {"given_name": "Shanghang", "family_name": "Zhang", "institution": "Carnegie Mellon University"}, {"given_name": "Eduard", "family_name": "Hovy", "institution": "Carnegie Mellon University"}]}