{"title": "Joint Autoregressive and Hierarchical Priors for Learned Image Compression", "book": "Advances in Neural Information Processing Systems", "page_first": 10771, "page_last": 10780, "abstract": "Recent models for learned image compression are based on autoencoders that learn approximately invertible mappings from pixels to a quantized latent representation. The transforms are combined with an entropy model, which is a prior on the latent representation that can be used with standard arithmetic coding algorithms to generate a compressed bitstream. Recently, hierarchical entropy models were introduced as a way to exploit more structure in the latents than previous fully factorized priors, improving compression performance while maintaining end-to-end optimization. Inspired by the success of autoregressive priors in probabilistic generative models, we examine autoregressive, hierarchical, and combined priors as alternatives, weighing their costs and benefits in the context of image compression. While it is well known that autoregressive models can incur a significant computational penalty, we find that in terms of compression performance, autoregressive and hierarchical priors are complementary and can be combined to exploit the probabilistic structure in the latents better than all previous learned models. The combined model yields state-of-the-art rate-distortion performance and generates smaller files than existing methods: 15.8% rate reductions over the baseline hierarchical model and 59.8%, 35%, and 8.4% savings over JPEG, JPEG2000, and BPG, respectively. To the best of our knowledge, our model is the first learning-based method to outperform the top standard image codec (BPG) on both the PSNR and MS-SSIM distortion metrics.", "full_text": "Joint Autoregressive and Hierarchical Priors for\n\nLearned Image Compression\n\nDavid Minnen, Johannes Ball\u00e9, George Toderici\n\nGoogle Research\n\n{dminnen, jballe, gtoderici}@google.com\n\nAbstract\n\nRecent models for learned image compression are based on autoencoders that learn\napproximately invertible mappings from pixels to a quantized latent representation.\nThe transforms are combined with an entropy model, which is a prior on the latent\nrepresentation that can be used with standard arithmetic coding algorithms to gener-\nate a compressed bitstream. Recently, hierarchical entropy models were introduced\nas a way to exploit more structure in the latents than previous fully factorized priors,\nimproving compression performance while maintaining end-to-end optimization.\nInspired by the success of autoregressive priors in probabilistic generative mod-\nels, we examine autoregressive, hierarchical, and combined priors as alternatives,\nweighing their costs and bene\ufb01ts in the context of image compression. While it\nis well known that autoregressive models can incur a signi\ufb01cant computational\npenalty, we \ufb01nd that in terms of compression performance, autoregressive and hier-\narchical priors are complementary and can be combined to exploit the probabilistic\nstructure in the latents better than all previous learned models. The combined\nmodel yields state-of-the-art rate\u2013distortion performance and generates smaller\n\ufb01les than existing methods: 15.8% rate reductions over the baseline hierarchical\nmodel and 59.8%, 35%, and 8.4% savings over JPEG, JPEG2000, and BPG, re-\nspectively. 
To the best of our knowledge, our model is the first learning-based method to outperform the top standard image codec (BPG) on both the PSNR and MS-SSIM distortion metrics.

1 Introduction

Most recent methods for learning-based, lossy image compression adopt an approach based on transform coding [1]. In this approach, image compression is achieved by first mapping pixel data into a quantized latent representation and then losslessly compressing the latents. Within the deep learning research community, the transforms typically take the form of convolutional neural networks (CNNs), which learn nonlinear functions with the potential to map pixels into a more compressible latent space than the linear transforms used by traditional image codecs. This nonlinear transform coding method resembles an autoencoder [2], [3], which consists of an encoder transform from the data (in this case, pixels) to a reduced-dimensionality latent space, and a decoder, an approximate inverse function that maps latents back to pixels. While dimensionality reduction can be seen as a simple form of compression, it is not equivalent since the goal of compression is to reduce the entropy of the representation under a prior probability model shared between the sender and the receiver (the entropy model), not just to reduce the dimensionality. To improve compression performance, recent methods have focused on better encoder/decoder transforms and on more sophisticated entropy models [4]–[14]. Finally, the entropy model is used in conjunction with standard entropy coding algorithms such as arithmetic, range, or Huffman coding [15]–[17] to generate a compressed bitstream.

The training goal is to minimize the expected length of the bitstream as well as the expected distortion of the reconstructed image with respect to the original, giving rise to a rate–distortion optimization problem:

    R + λ·D = Ex∼px[−log2 pŷ(⌊f(x)⌉)] + λ·Ex∼px[d(x, g(⌊f(x)⌉))],    (1)

where the first expectation is the rate and the second is the distortion, λ is the Lagrange multiplier that determines the desired rate–distortion trade-off, px is the unknown distribution of natural images, ⌊·⌉ represents rounding to the nearest integer (quantization), y = f(x) is the encoder, ŷ = ⌊y⌉ are the quantized latents, pŷ is a discrete entropy model, and x̂ = g(ŷ) is the decoder with x̂ representing the reconstructed image. The rate term corresponds to the cross entropy between the marginal distribution of the latents and the learned entropy model, which is minimized when the two distributions are identical. The distortion term may correspond to a closed-form likelihood, such as when d(x, x̂) represents mean squared error (MSE), which allows the model to be interpreted as a variational autoencoder [5], [6]. When optimizing the model for other distortion metrics (e.g., SSIM or MS-SSIM), it is simply minimized as an energy function.

The models we analyze in this paper build on the work of Ballé et al. [13], which uses a noise-based relaxation to apply gradient descent methods to the loss function in Eq.
(1) and which introduces a\nhierarchical prior to improve the entropy model. While most previous research uses a \ufb01xed, though\npotentially complex, entropy model, Ball\u00e9 et al. use a Gaussian scale mixture (GSM) [18] where\nthe scale parameters are conditioned on a hyperprior. Their model allows for end-to-end training,\nwhich includes joint optimization of a quantized representation of the hyperprior, the conditional\nentropy model, and the base autoencoder. The key insight of their work is that the compressed\nhyperprior can be added to the generated bitstream as side information, which allows the decoder\nto use the conditional entropy model. In this way, the entropy model itself is image-dependent and\nspatially adaptive, which allows for a richer and more accurate model. Ball\u00e9 et al. show that standard\noptimization methods for deep neural networks are suf\ufb01cient to learn a useful balance between\nthe size of the side information and the savings gained from a more accurate entropy model. The\nresulting compression model provides state-of-the-art image compression results compared to earlier\nlearning-based methods.\nWe extend this GSM-based entropy model in two ways: \ufb01rst, by generalizing the hierarchical GSM\nmodel to a Gaussian mixture model (GMM), and, inspired by recent work on generative models, by\nadding an autoregressive component. We assess the compression performance of both approaches,\nincluding variations in the network architectures, and discuss bene\ufb01ts and potential drawbacks of\nboth extensions. For the results in this paper, we did not investigate the effect of reducing the capacity\n(i.e., the number of layers and number of channels) of the deep networks to optimize computational\ncomplexity, since we are interested in determining the potential of different forms of priors rather\nthan trading off complexity against performance. Note that increasing capacity alone is not suf\ufb01cient\nto obtain arbitrarily good compression performance [13, appendix 6.3].\n\n2 Architecture Details\n\nFigure 1 provides a high-level overview of our generalized compression model, which contains two\nmain sub-networks1. The \ufb01rst is the core autoencoder, which learns a quantized latent representation\nof images (Encoder and Decoder blocks). The second sub-network is responsible for learning a\nprobabilistic model over quantized latents used for entropy coding. It combines the Context Model,\nan autoregressive model over latents, with the hyper-network (Hyper Encoder and Hyper Decoder\nblocks), which learns to represent information useful for correcting the context-based predictions.\nThe data from these two sources is combined by the Entropy Parameters network, which generates\nthe mean and scale parameters for a conditional Gaussian entropy model.\nOnce training is complete, a valid compression model must prevent any information from passing\nbetween the encoder to the decoder unless that information is available in the compressed \ufb01le. In\nFigure 1, the arithmetic encoding (AE) blocks produce the compressed representation of the symbols\ncoming from the quantizer, which is stored in a \ufb01le. Therefore at decoding time, any information that\ndepends on the quantized latents may be used by the decoder once it has been decoded. 
In order for the context model to work, at any point it can only access the latents that have already been decoded. When starting to decode an image, we assume that the previously decoded latents have all been set to zero.

¹ See Section 4 in the supplemental materials for an in-depth visual comparison between our architecture variants and previous learning-based methods.

Figure 1: Our combined model jointly optimizes an autoregressive component that predicts latents from their causal context (Context Model) along with a hyperprior and the underlying autoencoder. Real-valued latent representations are quantized (Q) to create integer-valued latents (ŷ) and hyper-latents (ẑ), which are compressed into a bitstream using an arithmetic encoder (AE) and decompressed by an arithmetic decoder (AD). The highlighted region corresponds to the components that are executed by the receiver to recover an image from a compressed bitstream. Notation used in the figure:

    Input Image: x
    Encoder: f(x; θe)
    Latents: y
    Latents (quantized): ŷ
    Decoder: g(ŷ; θd)
    Hyper Encoder: fh(y; θhe)
    Hyper-latents: z
    Hyper-latents (quant.): ẑ
    Hyper Decoder: gh(ẑ; θhd)
    Context Model: gcm(ŷ<i; θcm)
    Entropy Parameters: gep(·; θep)
    Reconstruction: x̂

The learning problem is to minimize the expected rate–distortion loss defined in Eq. 1 over the model parameters. Following the work of Ballé et al. [13], we model each latent, ŷi, as a Gaussian convolved with a unit uniform distribution. This ensures a good match between encoder and decoder distributions of both the quantized latents, and continuous-valued latents subjected to additive uniform noise during training. While [13] predicted the scale of each Gaussian conditioned on the hyperprior, ẑ, we extend the model by predicting the mean and scale parameters conditioned on both the hyperprior as well as the causal context of each latent ŷi, which we denote by ŷ<i. The predicted Gaussian parameters are functions of the learned parameters of the hyper-decoder, context model, and entropy parameters networks (θhd, θcm, and θep, respectively):

    pŷ(ŷ | ẑ, θhd, θcm, θep) = ∏i ( N(μi, σi²) ∗ U(−1/2, 1/2) )(ŷi),
    with μi, σi = gep(ψ, φi; θep), ψ = gh(ẑ; θhd), and φi = gcm(ŷ<i; θcm).    (2)

The entropy model for the hyperprior is the same as in [13], although we expect the hyper-encoder and hyper-decoder to learn significantly different functions in our combined model, since they now work in conjunction with an autoregressive network to predict the parameters of the entropy model. Since we do not make any assumptions about the distribution of the hyper-latents, a non-parametric, fully factorized density model is used. A more powerful entropy model for the hyper-latents may improve compression rates, e.g., we could stack multiple instances of our contextual model, but we expect the net effect to be minimal since, empirically, ẑ comprises only a very small percentage of the total file size.
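To make Eq. (2) concrete, the following is a minimal sketch (ours, not the paper's code) of the per-element likelihood and rate for one channel, assuming the Entropy Parameters network has already produced a mean and scale for every latent:

    # Minimal sketch of the conditional Gaussian entropy model in Eq. (2);
    # mu and sigma are assumed to come from the Entropy Parameters network.
    import numpy as np
    from scipy.stats import norm

    def bin_probability(y_hat, mu, sigma):
        # N(mu, sigma^2) convolved with U(-1/2, 1/2): the Gaussian mass that
        # falls inside the unit-width quantization bin centered at y_hat.
        upper = norm.cdf(y_hat + 0.5, loc=mu, scale=sigma)
        lower = norm.cdf(y_hat - 0.5, loc=mu, scale=sigma)
        return np.maximum(upper - lower, 1e-9)  # clamp for numerical stability

    def rate_bits(y_hat, mu, sigma):
        # Cross entropy (in bits) between the latents and the entropy model;
        # this is the differentiable rate term of the training loss.
        return float(np.sum(-np.log2(bin_probability(y_hat, mu, sigma))))

    # During training, rounding is replaced by additive uniform noise, so the
    # same functions are evaluated at y + u with u ~ U(-1/2, 1/2); at test time
    # the rounded latents and the same probabilities drive the arithmetic coder.

Adding the hyper-latent rate and the λ-weighted distortion to this term gives the full training loss of Eq. (3) below.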
Because both the compressed latents and the compressed hyper-latents are part of the generated bitstream, the rate–distortion loss from Equation 1 must be expanded to include the cost of transmitting ẑ. Coupled with a squared error distortion metric, the full loss function becomes:

    R + λ·D = Ex∼px[−log2 pŷ(ŷ)] + Ex∼px[−log2 pẑ(ẑ)] + λ·Ex∼px ‖x − x̂‖₂²,    (3)

where the first term is the rate of the latents, the second is the rate of the hyper-latents, and the third is the distortion.

2.1 Layer Details and Constraints

Details about the individual network layers in each component of our models are outlined in Table 1. While the internal structure of the components is fairly unrestricted, e.g., one could exchange the convolutional layers for residual blocks or dilated convolution without fundamentally changing the model, certain components must be constrained to ensure that the bitstream alone is sufficient for the receiver to reconstruct the image.

    Encoder: Conv 5×5 c192 s2, GDN; Conv 5×5 c192 s2, GDN; Conv 5×5 c192 s2, GDN; Conv 5×5 c192 s2
    Decoder: Deconv 5×5 c192 s2, IGDN; Deconv 5×5 c192 s2, IGDN; Deconv 5×5 c192 s2, IGDN; Deconv 5×5 c3 s2
    Hyper Encoder: Conv 3×3 c192 s1, Leaky ReLU; Conv 5×5 c192 s2, Leaky ReLU; Conv 5×5 c192 s2
    Hyper Decoder: Deconv 5×5 c192 s2, Leaky ReLU; Deconv 5×5 c288 s2, Leaky ReLU; Deconv 3×3 c384 s1
    Context Prediction: Masked 5×5 c384 s1
    Entropy Parameters: Conv 1×1 c640 s1, Leaky ReLU; Conv 1×1 c512 s1, Leaky ReLU; Conv 1×1 c384 s1

Table 1: The layers of our generalized model, listed above per component in the order they are applied. Convolutional layers are specified with the "Conv" prefix followed by the kernel size, number of channels and downsampling stride (e.g., the first layer of the encoder uses 5×5 kernels with 192 channels and a stride of two). The "Deconv" prefix corresponds to upsampled convolutions (i.e., in TensorFlow, tf.conv2d_transpose), while "Masked" corresponds to masked convolution as in [19]. GDN stands for generalized divisive normalization, and IGDN is inverse GDN [20].
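Because the Context Prediction component in Table 1 is a single masked convolution, its causality constraint is simple to state. The sketch below is our illustration (not the paper's implementation) of the PixelCNN-style mask [19]: in raster-scan order, the kernel may only see positions strictly before the current one.

    # Minimal sketch of a causal (PixelCNN-style) mask for a 5x5 masked convolution.
    import numpy as np

    def causal_mask(kernel_size=5):
        mask = np.ones((kernel_size, kernel_size), dtype=np.float32)
        center = kernel_size // 2
        mask[center, center:] = 0.0   # the current position and everything to its right
        mask[center + 1:, :] = 0.0    # every row below the current one
        return mask

    # The mask is multiplied into the kernel before each forward pass, e.g.
    #   masked_kernel = kernel * causal_mask(5)[:, :, None, None]
    # so each latent is predicted only from latents the decoder has already reconstructed.

The severely restricted contexts evaluated in Section 3 (a single left neighbor, or three codes from the previous row) correspond to zeroing out additional entries of this mask.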
The last layer of the encoder corresponds to the bottleneck of the base autoencoder. The number of output channels determines the number of elements that must be compressed and stored. Depending on the rate–distortion trade-off, our models learn to ignore certain channels by deterministically generating the same latent value and assigning it a probability of 1, which wastes computation but generates no additional entropy. This modeling flexibility allows us to set the bottleneck larger than necessary, and then let the model determine the number of channels that yields the best performance. Similar to reports in other work, we found that too few channels can impede rate–distortion performance when training models that target high bit rates, but having too many does not harm the compression performance [9], [13].

The final layer of the decoder must have three channels to generate RGB images, and the final layer of the Entropy Parameters sub-network must have exactly twice as many channels as the bottleneck. This constraint arises because the Entropy Parameters network predicts two values, the mean and scale of a Gaussian distribution, for each latent. The number of output channels of the Context Model and Hyper Decoder components is not constrained, but we also set them to twice the bottleneck size in all of our experiments.

Although the formal definition of our model allows the autoregressive component to condition its predictions φi = gcm(ŷ<i; θcm) on all previous latents, in practice we use a limited context (5×5 convolution kernels) with masked convolution similar to the approach used by PixelCNN [19]. The Entropy Parameters network is also constrained, since it cannot access predictions from the Context Model beyond the current latent element. For simplicity, we use 1×1 convolution in the Entropy Parameters network, although masked convolution is also permissible. Section 3 provides an empirical evaluation of the model variants we assessed, exploring the effects of different context sizes and more complex autoregressive networks.

3 Experimental Results

We evaluate our generalized models by calculating the rate–distortion (RD) performance averaged over the publicly available Kodak image set [21]². Figure 2 shows RD curves using peak signal-to-noise ratio (PSNR) as the image quality metric. While PSNR is known to be a relatively poor perceptual metric [22], it is still a standard metric used to evaluate image compression algorithms and is the primary metric used for tuning conventional codecs.

² Please see the supplemental material for additional evaluation results including full-page RD curves, example images, and results on the larger Tecnick image set (100 images with resolution 1200×1200).
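For reference, PSNR here is computed per image from the mean squared error over 8-bit RGB values, and rate is reported as bits per pixel; the exact averaging details below are our assumption of the standard definitions, not something specified in the paper.

    # Minimal sketch of the evaluation metrics (our assumption of the standard definitions).
    import numpy as np

    def psnr_rgb(original, reconstruction, max_val=255.0):
        # Peak signal-to-noise ratio over all RGB values of one image.
        mse = np.mean((original.astype(np.float64) -
                       reconstruction.astype(np.float64)) ** 2)
        return 10.0 * np.log10(max_val ** 2 / mse)

    def bits_per_pixel(file_size_bytes, width, height):
        # Rate: total compressed bits divided by the number of pixels.
        return 8.0 * file_size_bytes / (width * height)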
The RD graph on the left of Figure 2 compares our combined context + hyperprior model to existing image codecs (standard codecs and learned models) and shows that this model outperforms all of the existing methods including BPG [23], a state-of-the-art codec based on the intra-frame coding algorithm from HEVC [24]. To the best of our knowledge, this is the first learning-based compression model to outperform BPG on PSNR. The right RD graph compares different versions of our models and shows that the combined model performs the best, while the context-only model performs slightly worse than either hierarchical version.

Figure 2: Our combined approach (context + hyperprior) has better rate–distortion performance on the Kodak image set as measured by PSNR (RGB) compared to all of the baseline methods (left): BPG (4:4:4), Ballé (2018) opt. for MSE [13], Minnen (2018) [14], JPEG2000 (OpenJPEG), and JPEG (4:2:0). To our knowledge, this is the first learning-based method to outperform BPG on PSNR. The right graph compares the relative performance of different versions of our method (context + hyperprior, mean & scale hyperprior, scale-only hyperprior [13], and context-only, i.e., no hyperprior). It shows that using a hyperprior is better than a purely autoregressive (context-only) approach and that combining both (context + hyperprior) yields the best RD performance.

Figure 3 shows RD curves for Kodak using multiscale structural similarity (MS-SSIM) [25] as the image quality metric. The graph includes two versions of our combined model: one optimized for MSE and one optimized for MS-SSIM. The latter outperforms all existing methods including all standard codecs and other learning-based methods that were also optimized for MS-SSIM ([6], [9], [13]). As expected, when our model is optimized for MSE, performance according to MS-SSIM falls. Nonetheless, the MS-SSIM scores for this model still exceed all standard codecs and all learning-based methods that were not specifically optimized for MS-SSIM.

As outlined in Table 1, our baseline architecture for the combined model uses 5×5 masked convolution in a single linear layer for the context model, and it uses a conditional Gaussian distribution for the entropy model. Figure 4 compares this baseline to several variants by showing the relative increase in file size at a single rate-point. The green bars show that exchanging the Gaussian distribution for a logistic distribution has almost no effect (the 0.3% increase is smaller than the training variance), while switching to a Laplacian distribution decreases performance more substantially. The blue bars compare different context configurations. Masked 3×3 and 7×7 convolution both perform slightly worse, which is surprising since we expected the additional context provided by the 7×7 kernels to improve prediction accuracy. Similarly, a 3-layer, nonlinear context model using 5×5 masked convolution also performed slightly worse than the linear baseline. Finally, the purple bars show the effect of using a severely restricted context such as only a single neighbor or three neighbors from the previous row. The primary benefit of these models is increased parallelization when calculating context-based predictions since the dependence is reduced from two dimensions down to one. While both cases show a non-negligible rate increase (2.1% and 3.1%, respectively), the increase may be worthwhile in a practical implementation where runtime speed is a major concern.

Finally, Figure 5 provides a visual comparison for one of the Kodak images. Creating accurate comparisons is difficult since most compression methods do not have the ability to target a precise bit rate. We therefore selected comparison images with sizes that are as close as possible, but always larger than our encoding (up to 9.4% larger in the case of BPG). Nonetheless, our compression model provides clearly better visual quality compared to the scale hyperprior baseline [13] and JPEG. The perceptual quality relative to BPG is much closer. For example, BPG preserves more detail in the sky and parts of the fence, but at the expense of introducing geometric artifacts in the sky, mild ringing near the building/sky boundaries, and some boundary artifacts where neighboring blocks have widely different levels of detail (e.g., in the grass and lighthouse).
Figure 3: When evaluated using MS-SSIM (RGB) on Kodak, our combined approach has better RD performance than all previous methods when optimized for MS-SSIM. When optimized for MSE, our method still provides better MS-SSIM scores than all of the standard codecs. (Methods compared: our method opt. for MS-SSIM, Ballé (2018) opt. for MS-SSIM [13], Ballé (2017) opt. for MS-SSIM [6], Rippel (2017) opt. for MS-SSIM [9], our method opt. for MSE, BPG (4:4:4), Johnston (2018) [8], BPG (4:2:0), WebP, and JPEG (4:2:0).)

Figure 4: The baseline implementation of our model uses a hyperprior and a linear context model with 5×5 masked convolution. Optimized with λ = 0.025 (bpp ≈ 0.61 on Kodak), the baseline outperforms the other variants we tested (see text for details). (Variants shown, by percent size increase, lower is better: logistic with 5×5 context, Laplacian with 5×5 context, 3×3 linear context model, 7×7 linear context model, 5×5 3-layer context model, context = left neighbor, and 3 codes from the previous row.)

Figure 5: At similar bit rates, our combined method provides the highest visual quality: (a) ours (0.2149 bpp), (b) scale-only (0.2205 bpp), (c) BPG (0.2352 bpp), (d) JPEG (0.2309 bpp). Note the aliasing in the fence in the scale-only version as well as a slight global color cast and blurriness in the yellow rope. BPG shows more "classical" compression artifacts, e.g., ringing around the top of the lighthouse and the roof of the middle building. BPG also introduces a few geometric artifacts in the sky, though it does preserve more detail in the sky and fence compared to our model, albeit with 9.4% more bits. JPEG shows severe blocking artifacts at this bit rate.

4 Related Work

The earliest research that used neural networks to compress images dates back to the 1980s and relies on an autoencoder with a small bottleneck using either uniform quantization [26] or vector quantization [27], [28]. These approaches sought equal utilization of the codes and thus did not learn an explicit entropy model. Considerable research followed these initial models, and Jiang provides a comprehensive survey covering methods published through the late 1990s [29].

More recently, image compression with deep neural networks became a popular research topic starting with the work of Toderici et al. [30], who used a recurrent architecture based on LSTMs to learn multi-rate, progressive models. Their approach was improved by exploring other recurrent architectures for the autoencoder, training an LSTM-based entropy model, and adding a post-process that spatially adapts the bit rate based on the complexity of the local image content [4], [8]. Related research followed a more traditional image coding approach and explicitly divided images into patches instead of using a fully convolutional model [10], [31]. Inspired by modern image codecs and learned inpainting algorithms, these methods trained a neural network to predict each image patch from its causal context (in the image, not the latent space) before encoding the residual.
Similarly,\nmost modern image compression standards use context to predict pixel values combined with a\ncontext-adaptive entropy model [23], [32], [33].\nMany learning-based methods take the form of an autoencoder, and separate models are trained to\ntarget different bit rates instead of training a single recurrent model [5]\u2013[7], [9], [11], [12], [14], [34],\n[35]. Some use a fully factorized entropy model [5], [6], while others make use of context in code\nspace to improve compression rates [4], [7]\u2013[9], [12], [35]. Other methods do not make use of context\nvia an autoregressive model and instead rely on side information that is either predicted by a neural\nnetwork [13] or composed of indices into a (shared) dictionary of non-parametric code distributions\nused locally by the entropy coder [14]. In concurrent research, Klopp et al. also explore an approach\nthat jointly optimizes a context model and a hierarchical prior [35]. They introduce a sparse variant\nof GDN to improve the encoder and decoder networks and use a multimodal entropy distribution.\nTheir integration method between the context model and hyperprior is somewhat simpler than our\napproach, which leads to a \ufb01nal model with slightly worse rate\u2013distortion performance (~10% higher\nbit rates for equivalent MS-SSIM).\nLearned image compression is also related to Bayesian generative models such as PixelCNN [19],\nvariational autoencoders [36], PixelVAE [37], \u03b2-VAE [38], and VLAE [39]. In general, Bayesian\nimage models seek to maximize the evidence Ex\u223cpx log p(x), which is typically intractable, and\nuse the joint likelihood, as in Eq. (1), as a lower bound, while compression models seek to directly\noptimize Eq. (1). It has been noted that under certain conditions, compression models are formally\nequivalent to VAEs [5], [6]. \u03b2-VAEs have a particularly strong connection since \u03b2 controls the\ntrade-off between the data log-likelihood (distortion) and prior (rate), as does \u03bb in our formulation,\nwhich is derived from classical rate\u2013distortion theory.\nAnother signi\ufb01cant difference are the constraints imposed on compression models by the need to\nquantize and arithmetically encode the latents, which require certain choices regarding the parametric\nform of the densities and a transition from continuous (differential) to discrete (Shannon) entropies.\nWe can draw strong conceptual parallels between our models and PixelCNN autoencoders [19], and\nespecially PixelVAE [37] and VLAE [39], when applied to discrete latents. These models are often\nevaluated by comparing average likelihoods (which correspond to differential entropies), whereas\ncompression models are typically evaluated by comparing several bit rates (corresponding to Shannon\nentropies) and distortion values across the rate\u2013distortion frontier, which makes direct comparison\nmore complex.\n\n5 Discussion\n\nOur approach extends the work of Ball\u00e9 et al. [13] in two ways. First, we generalize the GSM\nmodel to a conditional Gaussian mixture model (GMM). Supporting this model is simply a matter of\ngenerating both a mean and a scale parameter conditioned on the hyperprior. Intuitively, the average\nlikelihood of the observed latents increases when the center of the conditional Gaussian is closer\nto the true value and a smaller scale is predicted, i.e., more structure can be exploited by modeling\nconditional means. 
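As a small illustrative calculation (the numbers are ours, not from the paper): for a single latent coded with the discretized Gaussian of Eq. (2), an exact mean prediction with σ = 4 gives the latent's bin a probability of Φ(0.125) − Φ(−0.125) ≈ 0.10, or about 3.3 bits; tightening the scale to σ = 1 raises it to Φ(0.5) − Φ(−0.5) ≈ 0.38, or about 1.4 bits; a mean that is off by two with σ = 1 costs roughly 4 bits. Accurate means and small scales therefore translate directly into shorter codes.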
The core question is whether or not the benefits of this more sophisticated model outweigh the cost of the associated side information. We showed in Figure 2 (right) that a GMM-based entropy model provides a net benefit and outperforms the simpler GSM-based model in terms of rate–distortion performance without increasing the asymptotic complexity of the model.

The second extension is the idea of combining an autoregressive model with the hyperprior. Intuitively, we can see how these components are complementary in two ways. First, starting from the perspective of the hyperprior, we see that for identical hyper-network architectures, improvements to the entropy model require more side information. The side information increases the total compressed file size, which limits its benefit. In contrast, introducing an autoregressive component into the prior does not incur a rate penalty since the predictions are based only on the causal context, i.e., on latents that have already been decoded. Similarly, from the perspective of the autoregressive model, we expect some amount of uncertainty that cannot be eliminated solely from the causal context. The hyperprior, however, can "look into the future" since it is part of the compressed bitstream and is fully known by the decoder. The hyperprior can thus learn to store information needed to reduce the uncertainty in the autoregressive model while avoiding information that can be accurately predicted from context.

Figure 6: Each row corresponds to a different model variant and shows information for the channel with the highest entropy. The visualizations show that more powerful models reduce the prediction error, require smaller scale parameters, and remove structure from the normalized latents, which directly translates into a more accurate entropy model and thus higher compression rates.

Figure 6 visualizes some of the internal mechanisms of our models. We show three of the variants: one Gaussian scale mixture equivalent to [13], another strictly hierarchical prior extended to a Gaussian mixture model, and one combined model using an autoregressive component and a hyperprior. After encoding the lighthouse image shown in Figure 5, we extracted the latents for the channel with the highest entropy. These latents are visualized in the first column of Figure 6. The second column holds the conditional means and clearly shows the added detail attained with an autoregressive component, which is reminiscent of the observation that VAE-based models tend to produce blurrier images than autoregressive models [37]. This improvement leads to a lower prediction error (third column) and smaller predicted scales, i.e., smaller uncertainty (fourth column). Our entropy model assumes that latents are conditionally independent given the hyperprior, which implies that the normalized latents, i.e., values with the predicted mean and scale removed, should be closer to i.i.d. Gaussian noise. The fifth column of Figure 6 shows that the combined model is closest to this ideal and that both the mean prediction and autoregressive model help significantly.
Finally, the last two columns show how the\nentropy is distributed across the image for the latents and hyper-latents.\nFrom a practical standpoint, autoregressive models are less desirable than hierarchical models since\nthey are inherently serial, and therefore can not be sped up using techniques such as parallelization.\nTo report the rate\u2013distortion performance of the compression models which contain an autoregressive\ncomponent, we refrained from implementing a full decoder for this paper, and instead compare\nShannon entropies. We have empirically veri\ufb01ed that these measurements are within a fraction of a\npercent of the size of the actual bitstream generated by arithmetic coding.\nProbability density distillation has been successfully used to get around the serial nature of autoregres-\nsive models for the task of speech synthesis [40], but unfortunately the same type of method cannot\nbe applied in the domain of compression due to the coupling between the prior and the arithmetic\ndecoder. To address these computational concerns, we have begun to explore very lightweight context\nmodels as described in Section 3 and Figure 4, and are considering further techniques to reduce the\ncomputational requirements of the Context Model and Entropy Parameters networks, such as engi-\nneering a tight integration of the arithmetic decoder with a differentiable autoregressive model. An\nalternative direction for future research may be to avoid the causality issue altogether by introducing\nyet more complexity into strictly hierarchical priors or adopt an interleaved decomposition for context\nprediction that allows partial parallelization [34], [41].\n\n8\n\nScale Hyperprior [13]Mean & ScaleContext + Hyperprior\fReferences\n\n[1] V. K. Goyal, \u201cTheoretical foundations of transform coding,\u201d IEEE Signal Processing Magazine,\n\nvol. 18, no. 5, 2001. DOI: 10.1109/79.952802.\n\n[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, \u201cParallel distributed processing: Explo-\nrations in the microstructure of cognition, vol. 1,\u201d in, D. E. Rumelhart, J. L. McClelland,\nand C. PDP Research Group, Eds., Cambridge, MA, USA: MIT Press, 1986, ch. Learning\nInternal Representations by Error Propagation, pp. 318\u2013362, ISBN: 0-262-68053-X. [Online].\nAvailable: http://dl.acm.org/citation.cfm?id=104279.104293.\n\n[3] G. E. Hinton and R. R. Salakhutdinov, \u201cReducing the dimensionality of data with neural\nnetworks,\u201d Science, vol. 313, no. 5786, pp. 504\u2013507, Jul. 2006. DOI: 10.1126/science.\n1127647.\n\n[4] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell,\n\u201cFull resolution image compression with recurrent neural networks,\u201d in 2017 IEEE Conf. on\nComputer Vision and Pattern Recognition (CVPR), 2017. DOI: 10.1109/CVPR.2017.577.\narXiv: 1608.05148.\n\n[5] L. Theis, W. Shi, A. Cunningham, and F. Husz\u00e1r, \u201cLossy image compression with compressive\n\nautoencoders,\u201d 2017, presented at the 5th Int. Conf. on Learning Representations.\nJ. Ball\u00e9, V. Laparra, and E. P. Simoncelli, \u201cEnd-to-end optimized image compression,\u201d arXiv e-\nprints, 2017, presented at the 5th Int. Conf. on Learning Representations. arXiv: 1611.01704.\n[7] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, \u201cLearning convolutional networks for content-\n\n[6]\n\nweighted image compression,\u201d arXiv e-prints, 2017. arXiv: 1703.10553.\n\n[8] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. 
Shor, and\nG. Toderici, \u201cImproved lossy image compression with priming and spatially adaptive bit rates\nfor recurrent networks,\u201d in 2018 IEEE Conf. on Computer Vision and Pattern Recognition\n(CVPR), 2018.\n\n[9] O. Rippel and L. Bourdev, \u201cReal-time adaptive image compression,\u201d in Proc. of Machine\n\nLearning Research, vol. 70, 2017, pp. 2922\u20132930.\n\n[10] D. Minnen, G. Toderici, M. Covell, T. Chinen, N. Johnston, J. Shor, S. J. Hwang, D. Vincent,\nand S. Singh, \u201cSpatially adaptive image compression using a tiled deep network,\u201d International\nConference on Image Processing, 2017.\n\n[11] E. \u00de. \u00c1g\u00fastsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool,\n\u201cSoft-to-hard vector quantization for end-to-end learning compressible representations,\u201d in\nAdvances in Neural Information Processing Systems 30, 2017, pp. 1141\u20131151.\n\n[12] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, \u201cConditional probability\nmodels for deep image compression,\u201d in 2018 IEEE Conf. on Computer Vision and Pattern\nRecognition (CVPR), 2018.\nJ. Ball\u00e9, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, \u201cVariational image compression\nwith a scale hyperprior,\u201d 6th Int. Conf. on Learning Representations, 2018. [Online]. Available:\nhttps://openreview.net/forum?id=rkcQFMZRb.\n\n[13]\n\n[14] D. Minnen, G. Toderici, S. Singh, S. J. Hwang, and M. Covell, \u201cImage-dependent local entropy\nmodels for image compression with deep networks,\u201d International Conference on Image\nProcessing, 2018.\nJ. Rissanen and G. G. Langdon Jr., \u201cUniversal modeling and coding,\u201d IEEE Transactions on\nInformation Theory, vol. 27, no. 1, 1981. DOI: 10.1109/TIT.1981.1056282.\n\n[15]\n\n[16] G. Martin, \u201cRange encoding: An algorithm for removing redundancy from a digitized message,\u201d\n\nin Video & Data Recording Conference, Jul. 1979.\nJ. van Leeuwen, \u201cOn the construction of huffman trees,\u201d in ICALP, 1976, pp. 382\u2013410.\n\n[17]\n[18] M. J. Wainwright and E. P. Simoncelli, \u201cScale mixtures of gaussians and the statistics of\nnatural images,\u201d in Proceedings of the 12th International Conference on Neural Information\nProcessing Systems, ser. NIPS\u201999, Denver, CO: MIT Press, 1999, pp. 855\u2013861.\n\n[19] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves,\n\u201cConditional image generation with pixelcnn decoders,\u201d in Advances in Neural Information\nProcessing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,\nEds., Curran Associates, Inc., 2016, pp. 4790\u20134798.\n\n9\n\n\f[20]\n\nJ. Ball\u00e9, V. Laparra, and E. P. Simoncelli, \u201cDensity modeling of images using a generalized\nnormalization transformation,\u201d arXiv e-prints, 2016, presented at the 4th Int. Conf. on Learning\nRepresentations. arXiv: 1511.06281.\n\n[21] E. Kodak, Kodak lossless true color image suite (PhotoCD PCD0992). [Online]. Available:\n\nhttp://r0k.us/graphics/kodak/.\n\n[22] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi,\nM. Carli, F. Battisti, and C.-C. Jay Kuo, \u201cImage database TID2013,\u201d Image Commun., vol. 30,\nno. C, pp. 57\u201377, Jan. 2015, ISSN: 0923-5965. DOI: 10.1016/j.image.2014.10.009.\n\n[23] F. Bellard, BPG image format (http://bellard.org/bpg/), Accessed: 2017-01-30. [Online].\n\nAvailable: http://bellard.org/bpg/.\nITU-R rec. 
H.265 & ISO/IEC 23008-2: High ef\ufb01ciency video coding, 2013.\n\n[24]\n[25] Z. Wang, E. P. Simoncelli, and A. C. Bovik, \u201cMultiscale structural similarity for image\nquality assessment,\u201d in Signals, Systems and Computers, 2004. Conference Record of the\nThirty-Seventh Asilomar Conference on, IEEE, vol. 2, 2003, pp. 1398\u20131402.\n\n[26] G. W. Cottrell, P. Munro, and D. Zipser, \u201cImage compression by back propagation: An example\nof extensional programming,\u201d in Models of Cognition: A Review of Cognitive Science, N. E.\nSharkey, Ed., Also presented at the Ninth Ann Meeting of the Cognitive Science Society, 1987,\npp. 461-473, vol. 1, Norwood, NJ, 1989.\n\n[27] S. Luttrell, \u201cImage compression using a neural network,\u201d in Pattern Recognition Letters,\n\nvol. 10, Oct. 1988, pp. 1231\u20131238.\n\n[28] E. Watkins Bruce, Data compression using arti\ufb01cial neural networks. 1991. [Online]. Available:\n\nhttps://calhoun.nps.edu/handle/10945/25801.\nJ. Jiang, \u201cImage compression with neural networks\u2013a survey,\u201d Signal Processing: Image\nCommunication, vol. 14, pp. 737\u2013760, 1999.\n\n[29]\n\n[30] G. Toderici, S. M. O\u2019Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R.\nSukthankar, \u201cVariable rate image compression with recurrent neural networks,\u201d arXiv e-prints,\n2016, presented at the 4th Int. Conf. on Learning Representations. arXiv: 1511.06085.\n\n[31] M. H. Baig, V. Koltun, and L. Torresani, \u201cLearning to inpaint for image compression,\u201d in\nAdvances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio,\nH. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., Curran Associates, Inc., 2017,\npp. 1246\u20131255.\n\u201cInformation technology\u2013JPEG 2000 image coding system,\u201d International Organization for\nStandardization, Geneva, CH, Standard, Dec. 2000.\n\n[32]\n\n[33] Google, WebP: Compression techniques, Accessed: 2017-01-30. [Online]. Available: http:\n\n//developers.google.com/speed/webp/docs/compression.\n\n[35]\n\n[34] K. Nakanishi, S.-i. Maeda, T. Miyato, and D. Okanohara, \u201cNeural multi-scale image compres-\n\nsion,\u201d arXiv preprint arXiv:1805.06386, 2018.\nJ. P. Klopp, Y.-C. F. Wang, S.-Y. Chien, and L.-G. Chen, \u201cLearning a code-space predictor by\nexploiting intra-image-dependencies,\u201d in British Machine Vision Conference (BMVC), 2018.\n[36] D. P. Kingma and M. Welling, \u201cAuto-encoding variational bayes,\u201d arXiv e-prints, 2014,\n\n[37]\n\n[38]\n\nPresented at the 2nd Int. Conf. on Learning Representations. arXiv: 1312.6114.\nI. Gulrajani, K. Kumar, F. Ahmed, A. Ali Taiga, F. Visin, D. Vazquez, and A. Courville,\n\u201cPixelVAE: A latent variable model for natural images,\u201d 2017, presented at the 5th Int. Conf.\non Learning Representations.\nI. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A.\nLerchner, \u201c\u03b2-VAE: Learning basic visual concepts with a constrained variational framework,\u201d\n2017, presented at the 5th Int. Conf. on Learning Representations.\n\n[39] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and\nP. Abbeel, \u201cVariational lossy autoencoder,\u201d 2017, presented at the 5th Int. Conf. on Learning\nRepresentations.\n\n[40] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d.\nDriessche, E. Lockhart, L. C. Cobo, F. 
Stimberg, et al., \u201cParallel wavenet: Fast high-\ufb01delity\nspeech synthesis,\u201d arXiv preprint arXiv:1711.10433, 2017.\n\n[41] S. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov,\nand N. de Freitas, \u201cParallel multiscale autoregressive density estimation,\u201d in Int. Conf. on\nMachine Learning (ICML), Sydney, Australia, 2017.\n\n10\n\n\f", "award": [], "sourceid": 6863, "authors": [{"given_name": "David", "family_name": "Minnen", "institution": "Google"}, {"given_name": "Johannes", "family_name": "Ball\u00e9", "institution": "Google"}, {"given_name": "George", "family_name": "Toderici", "institution": "Google"}]}