{"title": "MintNet: Building Invertible Neural Networks with Masked Convolutions", "book": "Advances in Neural Information Processing Systems", "page_first": 11004, "page_last": 11014, "abstract": "We propose a new way of constructing invertible neural networks by combining simple building blocks with a novel set of composition rules. This leads to a rich set of invertible architectures, including those similar to ResNets. Inversion is achieved with a locally convergent iterative procedure that is parallelizable and very fast in practice. Additionally, the determinant of the Jacobian can be computed analytically and efficiently, enabling their generative use as flow models. To demonstrate their flexibility, we show that our invertible neural networks are competitive with ResNets on MNIST and CIFAR-10 classification. When trained as generative models, our invertible networks achieve competitive likelihoods on MNIST, CIFAR-10 and ImageNet 32x32, with bits per dimension of 0.98, 3.32 and 4.06 respectively.", "full_text": "MintNet: Building Invertible Neural Networks with\n\nMasked Convolutions\n\nYang Song\u2217\n\nStanford University\n\nyangsong@cs.stanford.edu\n\nChenlin Meng\u2217\nStanford University\n\nchenlin@cs.stanford.edu\n\nAbstract\n\nStefano Ermon\n\nStanford University\n\nermon@cs.stanford.edu\n\nWe propose a new way of constructing invertible neural networks by combining\nsimple building blocks with a novel set of composition rules. This leads to a\nrich set of invertible architectures, including those similar to ResNets. Inversion\nis achieved with a locally convergent iterative procedure that is parallelizable\nand very fast in practice. Additionally, the determinant of the Jacobian can be\ncomputed analytically and ef\ufb01ciently, enabling their generative use as \ufb02ow models.\nTo demonstrate their \ufb02exibility, we show that our invertible neural networks are\ncompetitive with ResNets on MNIST and CIFAR-10 classi\ufb01cation. 
When trained\nas generative models, our invertible networks achieve competitive likelihoods on\nMNIST, CIFAR-10 and ImageNet 32\u00d732, with bits per dimension of 0.98, 3.32\nand 4.06 respectively.\n\n1\n\nIntroduction\n\nInvertible neural networks have many applications in machine learning. They have been employed to\ninvestigate representations of deep classi\ufb01ers [15], understand the cause of adversarial examples [14],\nlearn transition operators for MCMC [28, 18], create generative models that are directly trainable by\nmaximum likelihood [6, 5, 24, 16, 9, 1], and perform approximate inference [27, 17].\nMany applications of invertible neural networks require that both inverting the network and computing\nthe Jacobian determinant be ef\ufb01cient. While typical neural networks are not invertible, achieving these\nproperties often imposes restrictive constraints to the architecture. For example, planar \ufb02ows [27]\nand Sylvester \ufb02ow [2] constrain the number of hidden units to be smaller than the input dimension.\nNICE [5] and Real NVP [6] rely on dimension partitioning heuristics and speci\ufb01c architectures\nsuch as coupling layers, which could make training more dif\ufb01cult [1]. Methods like FFJORD [9],\ni-ResNets [1] have fewer architectural constraints. However, their Jacobian determinants have to be\napproximated, which is problematic if repeatedly performed at training time as in \ufb02ow models.\nIn this paper, we propose a new method of constructing invertible neural networks which are \ufb02exible,\nef\ufb01cient to invert, and whose Jacobian can be computed exactly and ef\ufb01ciently. We use triangular\nmatrices as our basic module. Then, we provide a set of composition rules to recursively build\nmore complex non-linear modules from the basic module, and show that the composed modules are\ninvertible as long as their Jacobians are non-singular. 
As in previous work [6, 24], the Jacobians\nof our modules are triangular, allowing ef\ufb01cient determinant computation. The inverse of these\nmodules can be obtained by an ef\ufb01ciently parallelizable \ufb01xed-point iteration method, making the cost\nof inversion comparable to that of an i-ResNet [1] block.\nUsing our composition rules and masked convolutions as the basic triangular building block, we\nconstruct a rich set of invertible modules to form a deep invertible neural network. The architecture of\nour proposed invertible network closely follows that of ResNet [10]\u2014the state-of-the-art architecture\n\n\u2217Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fof discriminative learning. We call our model Masked Invertible Network (MintNet). To demonstrate\nthe capacity of MintNets, we \ufb01rst test them on image classi\ufb01cation. We found that a MintNet\nclassi\ufb01er achieves 99.6% accuracy on MNIST, matching the performance of a ResNet with a similar\narchitecture. On CIFAR-10, it achieves 91.2% accuracy, comparable to the 92.6% accuracy of ResNet.\nWhen using MintNets as generative models, they achieve the new state-of-the-art results of bits per\ndimension (bpd) on uniformly dequantized images. Speci\ufb01cally, MintNet achieves bpd values of 0.98,\n3.32, and 4.06 on MNIST, CIFAR-10 and ImageNet 32\u00d732, while former best published results are\n0.99 (FFJORD [9]), 3.35 (Glow [16]) and 4.09 (Glow) respectively. Moreover, MintNet uses fewer\nparameters and less computational resources. Our MNIST model uses 30% fewer parameters than\nFFJORD [9]. For CIFAR-10 and ImageNet 32\u00d732, MintNet uses 60% and 74% fewer parameters\nthan the corresponding Glow [16] models. 
When training on dataset such as CIFAR-10, MintNet\nrequired 2 GPUs for approximately 5 days, while FFJORD [9] used 6 GPUs for approximately 5\ndays, and Glow [16] used 8 GPUs for approximately 7 days.\n\n2 Background\nConsider a neural network f : RD \u2192 RL that maps a data point x \u2208 RD to a latent representation\nz \u2208 RL. When for every z \u2208 RL there exists a unique x \u2208 RD such that f (x) = z, we call f an\ninvertible neural network. There are several basic properties of invertible networks. First, when f (x)\nis continuous, a necessary condition for f to be invertible is D = L. Second, if f1 : RD \u2192 RD\nand f2 : RD \u2192 RD are both invertible, f = f2 \u25e6 f1 will also be invertible. In this work, we mainly\nconsider applications of invertible neural networks to classi\ufb01cation and generative modeling.\n\n2.1 Classi\ufb01cation with invertible neural networks\n\nNeural networks for classi\ufb01cation are usually not invertible because the number of classes L is usually\ndifferent from the input dimension D. Therefore, when discussing invertible neural networks for\nclassi\ufb01cation, we separate the classi\ufb01er into two parts f = f2 \u25e6 f1: feature extraction z = f1(x) and\nclassi\ufb01cation y = f2(z), where f2 is usually the softmax function. We say the classi\ufb01er is invertible\nwhen f1 is invertible. Invertible classi\ufb01ers are arguably more interpretable, because a prediction can\nbe traced down by inverting latent representations [15, 14].\n\n2.2 Generative modeling with invertible neural networks\nAn invertible network f : x \u2208 RD (cid:55)\u2192 z \u2208 RD can be used to warp a complex probability density\np(x) to a simple base distribution \u03c0(z) (e.g., a multivariate standard Gaussian) [5, 6]. 
Under the condition that both f and f^{-1} are differentiable, the densities of p(x) and π(z) are related by the following change of variables formula\n\nlog p(x) = log π(z) + log |det(J_f(x))|,    (1)\n\nwhere J_f(x) denotes the Jacobian of f(x), and we require J_f(x) to be non-singular so that log |det(J_f(x))| is well-defined. Using this formula, p(x) can be easily computed if the Jacobian determinant det(J_f(x)) is cheaply computable and π(z) is known.\nTherefore, an invertible neural network f_θ(x) implicitly defines a normalized density model p_θ(x), which can be directly trained by maximum likelihood. The invertibility of f_θ is critical for fast sample generation. Specifically, in order to generate a sample x from p_θ(x), we can first draw z ∼ π(z), and warp it back through the inverse of f_θ to obtain x = f_θ^{-1}(z).\nNote that multiple invertible models f_1, f_2, ..., f_K can be stacked together to form a deeper invertible model f = f_K ◦ ··· ◦ f_2 ◦ f_1, without much impact on the inverse and determinant computation. This is because we can sequentially invert each component, i.e., f^{-1} = f_1^{-1} ◦ f_2^{-1} ◦ ··· ◦ f_K^{-1}, and the total Jacobian determinant equals the product of the individual Jacobian determinants, i.e., |det(J_f)| = |det(J_{f_1})| |det(J_{f_2})| ··· |det(J_{f_K})|.\n\n3 Building invertible modules compositionally\n\nIn this section, we discuss how simple blocks like masked convolutions can be composed to build invertible modules that allow efficient, parallelizable inversion and determinant computation. To this\n\nFigure 1: Illustration of a masked convolution with 3 filters and kernel size 3 × 3. 
Solid checkerboard\ncubes inside each \ufb01lter represent unmasked weights, while the transparent blue blocks represent the\nweights that have been masked out. The receptive \ufb01eld of each \ufb01lter on the input feature maps is\nindicated by regions shaded with the pattern (the colored square) below the corresponding \ufb01lter.\n\nend, we \ufb01rst introduce the basic building block of our models. Then, we propose a set of composition\nrules to recursively build up complex non-linear modules with triangular Jacobians. Next, we prove\nthat these composed modules are invertible as long as their Jacobians are non-singular. Finally, we\ndiscuss how these modules can be inverted ef\ufb01ciently using numerical methods.\n\n3.1 The basic module\nWe start from considering linear transformations f (x) = Wx + b, with W \u2208 RD\u00d7D, and b \u2208 RD.\nFor a general W, computing its Jacobian determinant requires O(D3) operations. We therefore\nchoose W to be a triangular matrix. In this case, the Jacobian determinant det(Jf (x)) = det(W) is\nthe product of all diagonal entries of W, and the computational complexity is reduced to O(D). The\nlinear function f (x) = Wx + b with W being triangular is our basic module.\n\nMasked convolutions. Convolution is a special type of linear transformation that is very effective\nfor image data. The triangular structure of the basic module can be achieved using masked con-\nvolutions (e.g., causal convolutions in PixelCNN [22]). We provide the formula of our masks in\nAppendix B and an illustration of a 3 \u00d7 3 masked convolution with 3 \ufb01lters in Fig. 1. 
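To see concretely why a causal mask yields a triangular Jacobian, one can write a causal convolution as an explicit matrix: under a fixed raster-scan ordering, each output depends only on the current and earlier inputs, so the matrix is lower triangular and its determinant is the product of its diagonal entries. The paper's exact 2-D mask formula is in its Appendix B; the snippet below is only a simplified 1-D sketch.

```python
import numpy as np

# A 1-D causal convolution written as an explicit D x D matrix.
# Output position i depends only on inputs i, i-1, ..., i-(k-1),
# so the matrix (the Jacobian of this linear map) is lower triangular.
def causal_conv_matrix(w, D):
    # w[k] is the kernel tap applied to input position i - k
    M = np.zeros((D, D))
    for i in range(D):
        for k, wk in enumerate(w):
            if i - k >= 0:
                M[i, i - k] = wk
    return M

M = causal_conv_matrix(np.array([0.9, 0.3, -0.2]), D=6)
assert np.allclose(M, np.tril(M))              # lower triangular
assert np.isclose(np.linalg.det(M), 0.9 ** 6)  # det = product of diagonal
```

The same reasoning extends to 2-D feature maps with multiple channels once pixels and channels are given a fixed ordering.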
Intuitively, the\ncausal structure of the \ufb01lters (ordering of the pixels) enforces a triangular structure.\n\n3.2 The calculus of building invertible modules\n\nComplex non-linear invertible functions can be constructed from our basic modules in two steps.\nFirst, we follow several composition rules so that the composed module has a triangular Jacobian.\nNext, we impose appropriate constraints so that the module is invertible. To simplify the discussion,\nwe only consider modules with lower triangular Jacobians here, and we note that it is straightforward\nto extend the analysis to modules with upper triangular Jacobians.\nThe following proposition summarizes several rules to compositionally build new modules with\ntriangular Jacobians using existing ones.\nProposition 1. De\ufb01ne F as the set of all continuously differentiable functions whose Jacobian is\nlower triangular. Then F contains the basic module in Section 3.1, and is closed under the following\ncomposition rules.\n\n\u2022 Rule of addition. f1 \u2208 F \u2227 f2 \u2208 F \u21d2 \u03bbf1 + \u00b5f2 \u2208 F, where \u03bb, \u00b5 \u2208 R.\n\u2022 Rule of composition. f1 \u2208 F \u2227 f2 \u2208 F \u21d2 f2 \u25e6 f1 \u2208 F. A special case is f \u2208 F \u21d2 h\u25e6 f \u2208\nF, where h(\u00b7) is a continuously differentiable non-linear activation function that is applied\nelement-wise.\n\nThe proof of this proposition is straightforward and deferred to Appendix A. By repetitively applying\nthe rules in Proposition 1, our basic linear module can be composed to construct complex non-linear\nmodules having continuous and triangular Jacobians. 
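These closure properties can be checked numerically. The sketch below verifies with finite differences that sums and compositions of toy members of F again have lower-triangular Jacobians; the particular maps are illustrative assumptions, not layers from the paper.

```python
import numpy as np

# Numerical check of the composition rules (Proposition 1): sums and
# compositions of maps with lower-triangular Jacobians again have
# lower-triangular Jacobians.
def num_jacobian(f, x, eps=1e-6):
    # central finite-difference Jacobian of f at x
    D = x.size
    J = np.zeros((D, D))
    for j in range(D):
        e = np.zeros(D)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
D = 5
W1 = np.tril(rng.normal(size=(D, D)))
W2 = np.tril(rng.normal(size=(D, D)))
f1 = lambda x: W1 @ x                      # basic module (bias omitted)
f2 = lambda x: np.tanh(W2 @ x)             # rule of composition: h applied to a basic module
f3 = lambda x: 0.7 * f1(x) + 1.3 * f2(x)   # rule of addition
f4 = lambda x: f2(f3(x))                   # composition of members of F

x = rng.normal(size=D)
for f in (f1, f2, f3, f4):
    J = num_jacobian(f, x)
    assert np.allclose(J, np.tril(J), atol=1e-5)  # still lower triangular
```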
Note that besides our linear basic modules,\n\n3\n\n\fFigure 2: Venn Diagram relationships between invertible functions (I), the function sets of F and\nM, functions that meet the conditions of Theorem 1 (det(Jf ) (cid:54)= 0), functions whose Jacobian is\ntriangular and Jacobian diagonals are strictly positive (diag(Jf ) > 0), functions whose Jacobian is\ntriangular and Jacobian diagonals are all 1s (diag(Jf ) = 1).\n\nother functions with triangular and continuous Jacobians can also be made more expressive using the\ncomposition rules. For example, the layers of dimension partitioning models (e.g., NICE [5], Real\nNVP [6], Glow [16]) and autoregressive \ufb02ows (e.g., MAF [24]) all have continuous and triangular\nJacobians and therefore belong to F. Note that the rule of addition in Proposition 1 preserves\ntriangular Jacobians but not invertibility. Therefore, we need additional constraints if we want the\ncomposed functions to be invertible.\nNext, we state the condition for f \u2208 F to be invertible, and denote the invertible subset of F as M.\nTheorem 1. If f \u2208 F and Jf (x) is non-singular for all x in the domain, then f is invertible.\n\nProof. A proof can be found in Appendix A.\n\nThe non-singularity of Jf (x) constraint in Theorem 1 is natural in the context of generative modeling.\nThis is because in order for Eq. (1) to make sense, log | det(Jf )| has to be well-de\ufb01ned, which\nrequires Jf (x) to be non-singular.\nIn many cases, Theorem 1 can be easily used to check and enforce the invertibility of f \u2208 F. 
For\nexample, the layers of autoregressive \ufb02ow models and dimension partitioning models can all be\nviewed as elements of F because they are continuously differentiable and have triangular Jacobians.\nSince the diagonal entries of their Jacobians are always strictly positive and hence non-singular, we\ncan immediately conclude that they are invertible with Theorem 1, thus generalizing their model-\nspeci\ufb01c proofs of invertibility.\nIn Fig. 2, we provide a Venn Diagram to illustrate the set of functions that satisfy the condition of\nTheorem 1. As depicted by the orange set labeled by det(Jf ) (cid:54)= 0, Theorem 1 captures a subset of\nM where the Jacobians of functions are non-singular so that the change of variable formula is usable.\nNote the condition in Theorem 1 is suf\ufb01cient but not necessary. For example, f (x) = x3 \u2208 M is\ninvertible, but Jf (x = 0) = 3x2|x=0 = 0 is singular. Many previous invertible models with special\narchitectures, such as NICE, Real NVP, and MAF, can be viewed as elements belonging to subsets of\ndet(Jf ) (cid:54)= 0.\n\n3.3 Ef\ufb01cient inversion of the invertible modules\n\nIn this section, we show that when the conditions in Theorem 1 hold, not only do we know that f is\ninvertible (f \u2208 M), but also we have a \ufb01xed-point iteration method to invert f with strong theoretical\nguarantees and good performance in practice.\nThe pseudo-code of our proposed inversion algorithm is described in Algorithm 1. Theoretically, we\ncan prove that this method is locally convergent\u2014as long as the initial value is close to the true value,\nthe method is guaranteed to \ufb01nd the correct inverse. We formally summarize this result in Theorem 2.\nTheorem 2. 
The iterative method of Algorithm 1 is locally convergent whenever 0 < α < 2.\n\nAlgorithm 1 Fixed-point iteration method for computing f^{-1}(z).\nRequire: T, α ▷ T is the number of iterations; 0 < α < 2 is the step size.\n1: Initialize x_0\n2: for t ← 1 to T do\n3: Compute f(x_{t-1})\n4: Compute diag(J_f(x_{t-1}))\n5: x_t ← x_{t-1} − α diag(J_f(x_{t-1}))^{-1} (f(x_{t-1}) − z)\n6: end for\nreturn x_T\n\nProof. We provide a more rigorous proof in Appendix A.\n\nIn practice, the method is also easily parallelizable on GPUs, making the cost of inverting f ∈ M similar to that of an i-ResNet [1] layer. Within each iteration, the computation is mostly matrix operations that can be vectorized and run efficiently in parallel. Therefore, the time cost will be roughly proportional to the number of iterations, i.e., O(T). As will be shown in our experiments, Algorithm 1 converges fast and usually the error quickly becomes negligible when T ≪ D. This is in stark contrast to existing methods of inverting autoregressive flow models such as MAF [24], where D univariate equations need to be solved sequentially, requiring at least O(D) iterations. There are also other approaches for inverting f. For example, the bisection method is guaranteed to converge globally, but its computational cost is O(D), and is usually much more expensive than Algorithm 1. Note that as discussed earlier, autoregressive flow models can also be viewed as special cases of our framework. Therefore, Algorithm 1 is also applicable to inverting autoregressive flow models and could potentially result in large improvements of sampling speed.\n\n4 Masked Invertible Networks\n\nWe show that techniques developed in Section 3 can be used to build our Masked Invertible Network (MintNet). First, we discuss how we compose several masked convolutions to form the Masked Invertible Layer (Mint layer). 
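The fixed-point update of Algorithm 1 can be sketched in NumPy on a toy map f(x) = x + tanh(Lx), where L is lower triangular with positive diagonal so that diag(J_f(x)) > 0 and Theorem 1 applies; the map, sizes, and step size are illustrative assumptions, not the paper's Mint layer.

```python
import numpy as np

# Algorithm 1: x_t = x_{t-1} - alpha * diag(J_f(x_{t-1}))^{-1} (f(x_{t-1}) - z),
# applied to a toy member of M (not the paper's Mint layer).
def f(x, L):
    return x + np.tanh(L @ x)

def jac_diag(x, L):
    # diagonal of J_f(x) = I + diag(1 - tanh(Lx)^2) L (lower triangular)
    return 1.0 + (1.0 - np.tanh(L @ x) ** 2) * np.diag(L)

def invert(z, L, alpha=1.0, T=200):
    x = z.copy()  # initialization
    for _ in range(T):
        x = x - alpha * (f(x, L) - z) / jac_diag(x, L)
    return x

rng = np.random.default_rng(0)
D = 8
L = 0.3 * np.tril(rng.normal(size=(D, D)))
np.fill_diagonal(L, np.abs(np.diag(L)) + 0.1)  # positive diagonal => diag(J_f) > 0
x_true = rng.normal(size=D)
z = f(x_true, L)
x_rec = invert(z, L)
assert np.max(np.abs(x_rec - x_true)) < 1e-6
```

Because diag(J_f) matches the diagonal of the triangular Jacobian, the iteration converges in far fewer steps than the dimension on this toy example, consistent with the T ≪ D behavior described above.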
Next, we stack multiple Mint layers to form a deep neural network, i.e., the MintNet. Finally, we compare MintNets with several existing invertible architectures.\n\n4.1 Building the Masked Invertible Layer\nWe construct an invertible module in M that serves as the basic layer of our MintNet. This invertible module, named the Mint layer, is defined as\n\nL(x) = t ⊙ x + Σ_{i=1}^{K} [ W^3_i h( Σ_{j=1}^{K} W^2_{ij} h(W^1_j x + b^1_j) + b^2_{ij} ) + b^3_i ],    (2)\n\nwhere ⊙ denotes elementwise multiplication, {W^1_i}_{i=1}^{K}, {W^2_{ij}}_{1≤i,j≤K}, and {W^3_i}_{i=1}^{K} are all lower triangular matrices with additional constraints to be specified later, and t > 0. Additionally, Mint layers use a monotonic activation function h, so that h′ ≥ 0. Common choices of h include ELU [4], tanh and sigmoid. Note that every individual weight matrix has the same size, and the 3 groups of weights {W^1_i}_{i=1}^{K}, {W^2_{ij}}_{1≤i,j≤K} and {W^3_i}_{i=1}^{K} can be implemented with 3 masked convolutions (see Appendix B). We design the form of L(x) so that it resembles a ResNet / i-ResNet block that also has 3 convolutions with K × C filters, with C being the number of channels of x. When using Algorithm 1 to invert Mint layers, we initialize x_0 = z ⊙ (1/t).\nFrom Proposition 1 in Section 3.2, we can easily conclude that L ∈ F. Now, we consider additional constraints on the weights so that L ∈ M, i.e., it is invertible. Note that the analytic form of its Jacobian is\n\nJ_L(x) = Σ_{i=1}^{K} W^3_i A_i ( Σ_{j=1}^{K} W^2_{ij} B_j W^1_j ) + diag(t),    (3)\n\nwith A_i = diag( h′( Σ_{j=1}^{K} W^2_{ij} h(W^1_j x + b^1_j) + b^2_{ij} ) ) ≥ 0 and B_j = diag( h′(W^1_j x + b^1_j) ) ≥ 0 for all 1 ≤ i, j ≤ K, and t > 0. 
Therefore, once we impose the following constraint\n\ndiag(W^3_i) diag(W^2_{ij}) diag(W^1_j) ≥ 0, ∀ 1 ≤ i, j ≤ K,    (4)\n\nwe have diag(J_L(x)) > 0, which satisfies the condition of Theorem 1, and as a consequence we know L ∈ M. In practice, the constraint in Eq. (4) can be easily implemented. For all 1 ≤ i, j ≤ K, we impose no constraint on W^3_i and W^1_j, but replace W^2_{ij} with V^2_{ij} = W^2_{ij} sign(diag(W^2_{ij})) sign(diag(W^3_i W^1_j)). Note that diag(V^2_{ij}) has the same signs as diag(W^3_i W^1_j), and therefore diag(W^3_i) diag(V^2_{ij}) diag(W^1_j) ≥ 0. Moreover, V^2_{ij} is almost everywhere differentiable w.r.t. W^2_{ij}, which allows gradients to backpropagate through.\n\n4.2 Constructing the Masked Invertible Network\n\nIn this section, we introduce design choices that help stack multiple Mint layers together to form an expressive invertible neural network, namely the MintNet. The full MintNet is constructed by stacking the following paired Mint layers and squeezing layers.\n\nPaired Mint layers. As discussed above, our Mint layer L(x) always has a triangular Jacobian. To maximize the expressive power of our invertible neural network, it is undesirable to constrain the Jacobian of the network to be triangular, since this limits capacity and will cause blind spots in the receptive field of masked convolutions. We thus always pair two Mint layers together, one with a lower triangular Jacobian and the other with an upper triangular Jacobian, so that the Jacobian of the paired layers is not triangular, and blind spots can be eliminated.\n\nSqueezing layers. Subsampling is important for enlarging the receptive field of convolutions. However, common subsampling operations such as pooling and strided convolutions are usually not invertible. 
Following [6] and [1], we use a “squeezing” operation to reshape the feature maps so that they have smaller resolution but more channels. After a squeezing operation, the height and width will decrease by a factor of k, but the number of channels will increase by a factor of k². This procedure is invertible and the Jacobian is an identity matrix. Throughout the paper, we use k = 2.\n\n4.3 Comparison to other approaches\n\nIn what follows we compare MintNets to several existing methods for developing invertible architectures. We will focus on architectures with a tractable Jacobian determinant. However, we note that there are models (cf. [7, 21, 8]) that allow fast inverse computation but do not have tractable Jacobian determinants. Following [1], we also provide some comparison in Tab. 5 (see Appendix E).\n\n4.3.1 Models based on identities of determinants\n\nSome identities can be used to speed up the computation of determinants if the Jacobians have special structures. For example, in Sylvester flow [2], the invertible transformation has the form f(x) ≜ x + A h(Bx + b), where h(·) is a nonlinear activation function, A ∈ R^{D×M}, B ∈ R^{M×D}, b ∈ R^M and M ≤ D. By Sylvester's determinant identity, det(J_f(x)) can be computed in O(M³), which is much less than O(D³) if M ≪ D. However, the requirement that M is small becomes a bottleneck of the architecture and limits its expressive power. Similarly, planar flow [27] uses the matrix determinant lemma, but has an even narrower bottleneck.\nThe form of L(x) bears some resemblance to Sylvester flow. However, we improve the capacity of Sylvester flow in two ways. First, we add one extra non-linear convolutional layer. 
Second, we avoid the bottleneck that limits the maximum dimension of latent representations in Sylvester flow.\n\n4.3.2 Models based on dimension partitioning\n\nNICE [5], Real NVP [6], and Glow [16] all depend on an affine coupling layer. Given d < D, x is first partitioned into two parts x = [x_{1:d}; x_{d+1:D}]. The coupling layer is an invertible transformation, defined as f : x ↦ z, with z_{1:d} = x_{1:d} and z_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}), where s(·) and t(·) are two arbitrary functions. However, the partitioning of x relies on heuristics, and the performance is sensitive to this choice (cf. [16, 1]). In addition, the Jacobian of f is a triangular matrix with diagonal [1_d; exp(s(x_{1:d}))]. In contrast, the Jacobian of MintNets has more flexible diagonals, without being partially restricted to 1's.\n\n4.3.3 Models based on autoregressive transformations\n\nBy leveraging autoregressive transformations, the Jacobian can be made triangular. For example, MAF [24] defines the invertible transformation as f : x ↦ z, z_i = µ(x_{1:i−1}) + σ(x_{1:i−1}) x_i, where µ(·) ∈ R and σ(·) ∈ R⁺. Note that f^{-1}(z) can be obtained by sequentially solving for x_i based on the previous solutions x_{1:i−1}. Therefore, a naïve approach requires Ω(D) computations for inverting autoregressive models. Moreover, the architecture of f is only an affine combination of autoregressive functions with x. In contrast, MintNets are inverted with faster fixed-point iteration methods, and the architecture of MintNets is arguably more flexible.\n\n4.3.4 Free-form invertible models\n\nSome work proposes invertible transformations whose Jacobians are not limited by special structures. For example, FFJORD [9] uses a continuous version of the change of variables formula [3] where the determinant is replaced by a trace. 
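Trace terms of this kind are typically approximated with a stochastic estimator such as Hutchinson's, which needs only matrix-vector products: tr(A) = E[vᵀAv] for random v with E[vvᵀ] = I. The sketch below is a generic illustration of that estimator, not FFJORD's implementation.

```python
import numpy as np

# Hutchinson stochastic trace estimator: tr(A) = E[v^T A v] for random v with
# E[v v^T] = I (Rademacher noise here). A generic sketch of the kind of
# estimator used for trace terms, not FFJORD's implementation.
def hutchinson_trace(matvec, dim, num_samples=2000, rng=None):
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ matvec(v)
    return total / num_samples

A = np.diag(np.arange(1.0, 6.0)) + 0.1  # 5x5 test matrix
est = hutchinson_trace(lambda v: A @ v, dim=5)
assert abs(est - np.trace(A)) < 0.5  # unbiased, with sampling noise
```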
Unlike MintNets, FFJORD needs an ODE solver to compute\nits value and inverse, and uses a stochastic estimator to approximate the trace. Another work is\ni-ResNet [1] which constrains the Lipschitz-ness of ResNet layers to make it invertible. Both i-\nResNet and MintNet use ResNet blocks with 3 convolutions. The inverse of i-ResNet can be obtained\nef\ufb01ciently by a parallelizable \ufb01xed-point iteration method, which has comparable computational\ncost as our Algorithm 1. However, unlike MintNets whose Jacobian determinants are exact, the\nlog-determinant of Jacobian of an i-ResNet must be approximated by truncating a power series and\nestimating each term with stochastic estimators.\n\n4.3.5 Other models using masked convolutions\nEmerging convolutions [13] and MaCow [20] improve the Glow architecture by replacing 1 \u00d7 1\nconvolutions in the original Glow model with masked convolutions similar to those employed in\nMintNets. Emerging convolutions and MaCow are both inverted using forward/back substitutions\ndesigned for inverting triangular matrices, which requires the same number of iterations as the input\ndimension. In stark contrast, MintNets use a \ufb01xed-point iteration method (Algorithm 1) for inversion,\nwhich is similar to i-ResNet and requires substantially fewer iterations than the input dimension. For\nexample, our method of inversion takes 120 iterations to converge on CIFAR-10, while inverting\nemerging convolutions will need 3072 iterations. In other words, our inversion can be 25 times faster\non powerful GPUs. Additionally, the architecture of MintNet is very different. The architectures of\n[13] and [20] are both built upon Glow. In contrast, MintNet is a ResNet architecture where normal\nconvolutions are replaced by causal convolutions.\n\n5 Experiments\n\nIn this section, we evaluate our MintNet architectures on both image classi\ufb01cation and density\nestimation. 
We focus on three common image datasets, namely MNIST, CIFAR-10 and ImageNet 32×32. We also empirically verify that Algorithm 1 can provide accurate solutions within a small number of iterations. We provide more details about settings and model architectures in Appendix D.\n\n5.1 Classification\n\nTo check the capacity of MintNet and understand the trade-off of invertibility, we test its classification performance on MNIST and CIFAR-10, and compare it to a ResNet with a similar architecture. On MNIST, MintNet achieves a test accuracy of 99.6%, which is the same as that of the ResNet. On CIFAR-10, MintNet reaches 91.2% test accuracy while ResNet reaches 92.6%. Both MintNet and ResNet achieve 100% training accuracy on the MNIST and CIFAR-10 datasets. This indicates that MintNet has enough capacity to fit all data labels in the training dataset, and that the invertible representations learned by MintNet are comparable to representations learned by non-invertible networks in terms of generalizability. Note that a small degradation in classification accuracy is also observed in other invertible networks; for example, depending on the Lipschitz constant, the gap between test accuracies of i-ResNet and ResNet can be as large as 1.92% on CIFAR-10.\n\nTable 1: MNIST, CIFAR-10, ImageNet 32×32 bits per dimension (bpd) results. Smaller values are better. †Result not directly comparable because ZCA preprocessing was used.\n\nMethod | MNIST | CIFAR-10 | ImageNet 32×32\nNICE [5] | 4.36 | 4.48† | -\nMAF [24] | 1.89 | 4.31 | -\nReal NVP [6] | 1.06 | 3.49 | 4.28\nGlow [16] | 1.05 | 3.35 | 4.09\nFFJORD [9] | 0.99 | 3.40 | -\ni-ResNet [1] | 1.06 | 3.45 | -\nMintNet (ours) | 0.98 | 3.32 | 4.06\n\n5.2 Density estimation and verification of invertibility\n\nIn this section, we demonstrate the superior performance of MintNet on density estimation by training it as a flow generative model. 
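For reference, bits per dimension is the standard per-dimension rescaling of the average negative log-likelihood; a generic conversion helper is sketched below, assuming the likelihood already accounts for uniform dequantization and any preprocessing (this is a common convention, not code from the paper).

```python
import numpy as np

# Convert an average negative log-likelihood (in nats, per example) into
# bits per dimension: bpd = NLL / (D * ln 2). Assumes the NLL already
# includes dequantization/preprocessing terms.
def bits_per_dim(nll_nats, num_dims):
    return nll_nats / (num_dims * np.log(2.0))

# e.g., a CIFAR-10 image has D = 3 * 32 * 32 = 3072 dimensions
D = 3 * 32 * 32
assert np.isclose(bits_per_dim(D * np.log(2.0), D), 1.0)
```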
In addition, we empirically verify that Algorithm 1 can accurately produce the inverse using a small number of iterations. We show that samples can be efficiently generated from MintNet by inverting each Mint layer with Algorithm 1.\n\nDensity estimation. In Tab. 1, we report bits per dimension (bpd) on MNIST, CIFAR-10, and ImageNet 32×32 datasets. It is notable that MintNet sets new bpd records on all three datasets. Moreover, when compared to the previous best models, our MNIST model uses 30% fewer parameters than FFJORD, and our CIFAR-10 and ImageNet 32×32 models respectively use 60% and 74% fewer parameters than Glow. When trained on datasets such as CIFAR-10, MintNet requires 2 GPUs for approximately five days, while FFJORD is trained on 6 GPUs for five days, and Glow on 8 GPUs for seven days. Note that all values in Tab. 1 are with respect to the continuous distribution of uniformly dequantized images, and results of models that view images as discrete distributions are not directly comparable (e.g., PixelCNN [22], IAF-VAE [17], and Flow++ [12]). To show that MintNet learns semantically meaningful representations of images, we also perform latent space interpolation similar to the interpolation experiments in Real NVP (see Appendix C).\n\nVerification of invertibility. We first examine the performance of Algorithm 1 by measuring the reconstruction error of MintNets. We compute the inverse of MintNet by sequentially inverting each Mint layer with Algorithm 1. We used grid search to select the step size α in Algorithm 1, choosing α = 3.5, 1.1, and 1.15 for MNIST, CIFAR-10 and ImageNet 32×32 respectively. Interestingly, for MNIST, α = 3.5 works better than values of α within (0, 2), even though it lacks the theoretical guarantee of local convergence. As Fig. 
4a shows, the normalized L2 reconstruction error converges within 120 iterations for all datasets considered. Additionally, Fig. 4b demonstrates that the reconstructed images look visually indistinguishable from the true images.\n\nSamples. Using Algorithm 1, we can generate samples efficiently by computing the inverse of MintNets. We use the same step sizes as in the reconstruction error analysis, and run Algorithm 1 for 120 iterations for all three datasets. We provide uncurated samples in Fig. 3, and more samples can be found in Appendix F. In addition, we compare our sampling time to that of the other models (see Tab. 6 in Appendix E). Our sampling method has speed comparable to that of i-ResNet. It is approximately 5 times faster than autoregressive sampling on MNIST, and is roughly 25 times faster on CIFAR-10 and ImageNet 32×32.\n\n(a) MNIST (b) CIFAR-10 (c) ImageNet 32×32\nFigure 3: Uncurated samples on MNIST, CIFAR-10, and ImageNet 32×32 datasets.\n\n(a) Reconstruction error analysis. (b) Reconstructed images.\nFigure 4: Accuracy analysis of Algorithm 1 on MNIST, CIFAR-10, and ImageNet 32×32 datasets. Each curve in (a) represents the mean value of normalized reconstruction errors for 128 images. The 2nd, 4th and 6th rows in (b) are reconstructions, while the other rows are original images.\n\n6 Conclusion\n\nWe propose a new method to compositionally construct invertible modules that are flexible, efficient to invert, and with a tractable Jacobian. Starting from linear transformations with triangular matrices, we apply a set of composition rules to recursively build new modules that are non-linear and more expressive (Proposition 1). We then show that the composed modules are invertible as long as their Jacobians are non-singular (Theorem 1), and propose an efficiently parallelizable numerical method (Algorithm 1) with theoretical guarantees (Theorem 2) to compute the inverse. 
The Jacobians of our modules are all triangular, which allows efficient and exact determinant computation.

As an application of this idea, we use masked convolutions as our basic module. Using our composition rules, we compose multiple masked convolutions to form a module named the Mint layer, following the architecture of a ResNet block. To enforce its invertibility, we constrain the masked convolutions to satisfy the condition of Theorem 1. We show that multiple Mint layers can be stacked together to form a deep invertible network, which we call MintNet. The architecture can be efficiently inverted using a fixed-point iteration algorithm (Algorithm 1). Experimentally, we show that MintNet performs well on MNIST and CIFAR-10 classification. Moreover, when trained as a generative model, MintNet achieves new state-of-the-art performance on MNIST, CIFAR-10, and ImageNet 32×32.

Acknowledgements

This research was supported by Intel Corporation, Amazon AWS, TRI, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), and AFOSR (FA9550-19-1-0024).

References

[1] J. Behrmann, W. Grathwohl, R. T. Q. Chen, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2019.

[2] R. v. d. Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

[3] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc., 2018.

[4] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[5] L. Dinh, D. Krueger, and Y. Bengio.
NICE: Non-linear independent components estimation. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, 2015.

[6] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

[7] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2214–2224. Curran Associates, Inc., 2017.

[8] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2214–2224. Curran Associates, Inc., 2017.

[9] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[12] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design, 2019.

[13] E. Hoogeboom, R. Van Den Berg, and M. Welling. Emerging convolutions for generative normalizing flows. In International Conference on Machine Learning, pages 2771–2780, 2019.

[14] J.-H.
Jacobsen, J. Behrmann, R. Zemel, and M. Bethge. Excessive invariance causes adversarial vulnerability. In International Conference on Learning Representations, 2019.

[15] J.-H. Jacobsen, A. W. Smeulders, and E. Oyallon. i-RevNet: Deep invertible networks. In International Conference on Learning Representations, 2018.

[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

[17] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4743–4751. Curran Associates, Inc., 2016.

[18] D. Levy, M. D. Hoffman, and J. Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. In International Conference on Learning Representations, 2018.

[19] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[20] X. Ma and E. Hovy. MaCow: Masked convolutional generative flow. arXiv preprint arXiv:1902.04208, 2019.

[21] M. MacKay, P. Vicol, J. Ba, and R. B. Grosse. Reversible recurrent neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9029–9040. Curran Associates, Inc., 2018.

[22] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1747–1756, New York, New York, USA, 20–22 Jun 2016. PMLR.

[23] J. M. Ortega and W. C. Rheinboldt.
Iterative Solution of Nonlinear Equations in Several Variables, volume 30. SIAM, 1970.

[24] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

[25] T. T. Phuong and L. T. Phong. On the convergence proof of AMSGrad and a new version. arXiv preprint arXiv:1904.03590, 2019.

[26] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

[27] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France, 07–09 Jul 2015. PMLR.

[28] J. Song, S. Zhao, and S. Ermon. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.