{"title": "Deep Generative Models for Distribution-Preserving Lossy Compression", "book": "Advances in Neural Information Processing Systems", "page_first": 5929, "page_last": 5940, "abstract": "We propose and study the problem of distribution-preserving lossy compression. Motivated by recent advances in extreme image compression which make it possible to maintain artifact-free reconstructions even at very low bitrates, we propose to optimize the rate-distortion tradeoff under the constraint that the reconstructed samples follow the distribution of the training data. The resulting compression system recovers both ends of the spectrum: On one hand, at zero bitrate it learns a generative model of the data, and at high enough bitrates it achieves perfect reconstruction. Furthermore, for intermediate bitrates it smoothly interpolates between learning a generative model of the training data and perfectly reconstructing the training samples. We study several methods to approximately solve the proposed optimization problem, including a novel combination of Wasserstein GAN and Wasserstein Autoencoder, and present an extensive theoretical and empirical characterization of the proposed compression systems.", "full_text": "Deep Generative Models for\n\nDistribution-Preserving Lossy Compression\n\nMichael Tschannen\n\nETH Z\u00fcrich\n\nmichaelt@nari.ee.ethz.ch\n\nEirikur Agustsson\n\nGoogle AI Perception\neirikur@google.com\n\nMario Lucic\nGoogle Brain\n\nlucic@google.com\n\nAbstract\n\nWe propose and study the problem of distribution-preserving lossy compression.\nMotivated by recent advances in extreme image compression which make it possible to maintain artifact-free reconstructions even at very low bitrates, we propose to optimize\nthe rate-distortion tradeoff under the constraint that the reconstructed samples follow the distribution of the training data. 
The resulting compression system recovers\nboth ends of the spectrum: On one hand, at zero bitrate it learns a generative\nmodel of the data, and at high enough bitrates it achieves perfect reconstruction.\nFurthermore, for intermediate bitrates it smoothly interpolates between learning\na generative model of the training data and perfectly reconstructing the training\nsamples. We study several methods to approximately solve the proposed optimiza-\ntion problem, including a novel combination of Wasserstein GAN and Wasserstein\nAutoencoder, and present an extensive theoretical and empirical characterization of\nthe proposed compression systems.\n\n1\n\nIntroduction\n\nData compression methods based on deep neural networks (DNNs) have recently received a great\ndeal of attention. These methods were shown to outperform traditional compression codecs in image\ncompression [1\u201310], speech compression [11], and video compression [12] under several distortion\nmeasures. In addition, DNN-based compression methods are \ufb02exible and can be adapted to speci\ufb01c\ndomains leading to further reductions in bitrate, and promise fast processing thanks to their internal\nrepresentations that are amenable to modern data processing pipelines [13].\n\nIn the context of image compression, learning-based methods arguably excel at low bitrates by\nlearning to realistically synthesize local image content, such as texture. While learning-based methods\ncan lead to larger distortions w.r.t. measures optimized by traditional compression algorithms, such as\npeak signal-to-noise ratio (PSNR), they avoid artifacts such as blur and blocking, producing visually\nmore pleasing results [1\u201310]. In particular, visual quality can be improved by incorporating generative\nadversarial networks (GANs) [14] into the learning process [4, 15]. 
Work [4] leveraged GANs for\nartifact suppression, whereas [15] used them to learn to synthesize image content beyond local texture,\nsuch as facades of buildings, obtaining visually pleasing results at very low bitrates.\n\nIn this paper, we propose a formalization of this line of work: A compression system that respects\nthe distribution of the original data at all rates\u2014a system whose decoder generates i.i.d. samples\nfrom the data distribution at zero bitrate, then gradually produces reconstructions containing more\ncontent of the original image as the bitrate increases, and eventually achieves perfect reconstruction\nat a high enough bitrate (see Figure 1 for examples). Such a system can be learned from data in a fully\nunsupervised fashion by solving what we call the distribution-preserving lossy compression (DPLC)\nproblem: Optimizing the rate-distortion tradeoff under the constraint that the reconstruction follows\nthe distribution of the training data. Enforcing this constraint promotes artifact-free reconstructions\nat all rates.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n[Figure 1: two image grids with columns at bitrates 0.000, 0.008, 0.031, 0.125, and 0.500 bpp, plus the original image.]\n\nFigure 1: Example (testing) reconstructions for CelebA (left) and LSUN bedrooms (right) obtained\nby our DPLC method based on Wasserstein++ (rows 1\u20134), global generative compression GC [15]\n(rows 5\u20136), and a compressive autoencoder (CAE) baseline (row 7), as a function of the bitrate\n(in bits per pixel). 
We stress that as the bitrate decreases, DPLC manages to generate diverse and\nrealistic-looking images, whereas GC struggles to produce diverse reconstructions and the CAE\nreconstructions become increasingly blurry.\n\nWe then show that the algorithm proposed in [15] is solving a special case of the DPLC problem,\nand demonstrate that it fails to produce stochastic decoders as the rate tends to zero in practice, i.e.,\nit is not effective in enforcing the distribution constraint at very low bitrates. This is not surprising\nas it was designed with a different goal in mind. We then propose and study different alternative\napproaches based on deep generative models that overcome the issues inherent with [15]. In a\nnutshell, one \ufb01rst learns a generative model and then applies it to learn a stochastic decoder, obeying\nthe distribution constraint on the reconstruction, along with a corresponding encoder. To quantify the\ndistribution mismatch of the reconstructed samples and the training data in the learning process we\nrely on the Wasserstein distance. One distinct advantage of our approach is that we can theoretically\ncharacterize the distribution of the reconstruction and bound the distortion as a function of the bitrate.\n\nOn the practical side, to learn the generative model, we rely on Wasserstein GAN (WGAN) [16] and\nWasserstein autoencoder (WAE) [17], as well as a novel combination thereof termed Wasserstein++.\nThe latter attains high sample quality comparable to WGAN (when measured in terms of the Fr\u00e9chet\ninception distance (FID) [18]) and yields a generator with good mode coverage as well as a structured\nlatent space suited to be combined with an encoder, like WAE. We present an extensive empirical\nevaluation of the proposed approach on two standard GAN data sets, CelebA [19] and LSUN\nbedrooms [20], realizing the \ufb01rst system that effectively solves the DPLC problem.\n\nOutline. 
We formally define and motivate the DPLC problem in Section 2. We then present several\napproaches to solve the DPLC problem in Section 3. Practical aspects are discussed in Section 4 and\nan extensive evaluation is presented in Section 5. Finally, we discuss related work in Section 6.\n\n2 Problem formulation\n\nNotation. We use uppercase letters to denote random variables, lowercase letters to designate their\nvalues, and calligraphic letters to denote sets. We use the notation PX for the distribution of the\nrandom variable X and EX [X] for its expectation. The relation W \u223c PX designates that W follows\nthe distribution PX , and X \u223c Y indicates that X and Y are identically distributed.\n\nSetup. Consider a random variable X \u2208 X with distribution PX . The latter could be modeling, for\nexample, natural images, text documents, or audio signals. In standard lossy compression, the goal is\nto create a rate-constrained encoder E : X \u2192 W := {1, . . . , 2^R}, mapping the input to a code of\nR bits, and a decoder D : W \u2192 X , mapping the code back to the input space, so as to minimize\nsome distortion measure d : X \u00d7 X \u2192 R+. Formally, one aims at solving\n\nmin_{E,D} EX [d(X, D(E(X)))]. (1)\n\nIn the classic lossy compression setting, both E and D are typically deterministic. As a result, the\nnumber of distinct reconstructed inputs \u02c6X := D(E(X)) is bounded by 2^R. The main drawback is that,\nas R decreases, the reconstruction \u02c6X will incur increasing degradations (such as blur or blocking in\nthe case of natural images), and will be constant for R = 0. Note that simply allowing E, D in (1) to\nbe stochastic does not resolve this problem, as discussed in Section 3.\n\nDistribution-preserving lossy compression. 
Motivated by recent advances in extreme image compression [15], we propose and study a novel compression problem: Solve (1) under the constraint that\nthe distribution of reconstructed instances \u02c6X follows the distribution of the training data X. Formally,\nwe want to solve the problem\n\nmin_{E,D} EX,D[d(X, D(E(X)))] s.t. D(E(X)) \u223c X, (2)\n\nwhere the decoder is allowed to be stochastic.1 The goal of the distribution matching constraint is\nto enforce artifact-free reconstructions at all rates. Furthermore, as the rate R \u2192 0, the solution\nconverges to a generative model of X, while for sufficiently large rates R the solution guarantees\nperfect reconstruction and trivially satisfies the distribution constraint.\n\n3 Deep generative models for distribution-preserving lossy compression\n\nThe distribution constraint makes solving the problem (2) extremely challenging, as it amounts to\nlearning an exact generative model of the generally unknown distribution PX for R = 0. As a remedy,\none can relax the problem and consider the regularized formulation,\n\nmin_{E,D} EX,D[d(X, D(E(X)))] + \u03bb df (P \u02c6X , PX ), (3)\n\nwhere \u02c6X = D(E(X)), and df is a (statistical) divergence that can be estimated from samples using,\ne.g., the GAN framework [14].\n\nChallenges of the extreme compression regime. At any finite rate R, the distortion term and the\ndivergence term in (3) have strikingly opposing effects. In particular, for distortion measures for\nwhich min_y d(x, y) has a unique minimizer for every x, the decoder minimizing the distortion term\nis constant, conditioned on the code w. 
For example, if d(x, y) = ||x \u2212 y||^2, the optimal decoder\nD for a fixed encoder E obeys D(w) = EX [X|E(X) = w], i.e., it is biased to output the mean.\nFor many popular distortion measures, D, E minimizing the distortion term therefore produce\nreconstructions \u02c6X that follow a discrete distribution, which is at odds with the often continuous\nnature of the data distribution. In contrast, the distribution divergence term encourages D \u25e6 E to\ngenerate outputs that are as close as possible to the data distribution PX , i.e., it encourages D \u25e6 E to\nfollow a continuous distribution if PX is continuous. While in practice the distortion term can have a\nstabilizing effect on the optimization of the divergence term (see [15]), it discourages the decoder\nfrom being stochastic\u2014the decoder learns to ignore the noise fed as an input to provide stochasticity,\nand does so even when \u03bb is adjusted to compensate for the increase in distortion as R decreases\n(see the experiments in Section 5). This is in line with recent results for deep generative models\nin conditional settings: As soon as they are provided with context information, they tend to ignore\nstochasticity, as discussed in [21, 22], and in particular [23] and references therein.\n\nProposed method. We propose and study different generative model-based approaches to approximately solve the DPLC problem. These approaches overcome the aforementioned problems and\ncan be applied at all bitrates R, enabling a gentle tradeoff between matching the distribution of the\ntraining data and perfectly reconstructing the training samples. 
Figure 2 provides an overview of the\nproposed method.\n\n1Note that a stochastic decoder is necessary if PX is continuous.\n\nIn order to mitigate the bias-to-the-mean issues with relaxations of the form (3), we decompose D as\nD = G \u25e6 B, where G is a generative model taking samples from a fixed prior distribution PZ as an\ninput, trained to minimize a divergence between PG(Z) and PX , and B is a stochastic function that is\ntrained together with E to minimize distortion for a fixed G.\n\nOut of the plethora of divergences commonly used for learning generative models G [14, 24], the\nWasserstein distance between P \u02c6X and PX is particularly well suited for DPLC. In fact, it has a\ndistinct advantage as it can be defined for an arbitrary transportation cost function, in particular\nfor the distortion measure d quantifying the quality of the reconstruction in (2). For this choice\nof transportation cost, we can analytically quantify the distortion as a function of the rate and the\nWasserstein distance between PG(Z) and PX .\n\nLearning the generative model G. The Wasserstein distance between two distributions PX and PY\nw.r.t. the measurable cost function c : X \u00d7 X \u2192 R+ is defined as\n\nWc(PX , PY ) := inf_{\u03a0 \u2208 P(PX ,PY )} E(X,Y )\u223c\u03a0[c(X, Y )], (4)\n\nwhere P(PX , PY ) is the set of all joint distributions of (X, Y ) with marginals PX and PY , respectively.\nWhen (X , d\u2032) is a metric space and we set c(x, y) = d\u2032(x, y), we have by Kantorovich-Rubinstein\nduality [25] that\n\nWd\u2032 (PX , PY ) = sup_{f \u2208 F1} EX [f (X)] \u2212 EY [f (Y )], (5)\n\nwhere F1 is the class of bounded 1-Lipschitz functions f : X \u2192 R. Let G : Z \u2192 X and set\nY = G(Z) in (5), where Z is distributed according to the prior distribution PZ . Minimizing the\nlatter over the parameters of the mapping G, one recovers the Wasserstein GAN (WGAN) proposed\nin [16]. 
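To build intuition for the primal form (4), consider the one-dimensional case with cost c(x, y) = |x - y|: there, the infimum over couplings is attained by the monotone (sorted) matching of the two samples, so the empirical Wasserstein distance has a closed form. A minimal NumPy sketch (the function name `w1_empirical` is ours, not from the paper's code):

```python
import numpy as np

def w1_empirical(x, y):
    # Empirical 1-Wasserstein distance between two equal-size 1-D samples
    # with cost c(x, y) = |x - y|. In 1-D the optimal coupling in (4) is
    # the monotone (sorted) matching, so the infimum has a closed form.
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape
    return float(np.mean(np.abs(x - y)))

# Translating a sample by a constant shifts the distance by exactly that amount.
print(w1_empirical([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # -> 1.0
```

The dual form (5) is what WGAN estimates in high dimensions, by parametrizing the 1-Lipschitz function f with a critic network instead of matching sorted samples.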
On the other hand, for Y = G(Z) with deterministic G, (4) is equivalent to factorizing\nthe couplings P(PX , PG(Z)) through Z using a conditional distribution function Q(Z|X) (with\nZ-marginal QZ(Z)) and minimizing over Q(Z|X) [17], i.e.,\n\ninf_{\u03a0 \u2208 P(PX ,PG(Z))} E(X,Y )\u223c\u03a0[c(X, Y )] = inf_{Q : QZ = PZ} EX EQ(Z|X)[c(X, G(Z))]. (6)\n\nIn this model, called the Wasserstein Autoencoder (WAE), Q(Z|X) is parametrized as the push-forward of PX through some possibly stochastic function F : X \u2192 Z, and (6) becomes\n\ninf_{F : F (X)\u223cPZ} EX EF [c(X, G(F (X)))], (7)\n\nwhich is then minimized over G.\n\nNote that, in order to solve (2), one cannot simply set c(x, y) = d(x, y) and replace F in (7) with a\nrate-constrained version \u02c6F = B \u25e6 E, where E is a rate-constrained encoder as introduced in Section 2\nand B : W \u2192 Z is a stochastic function. Indeed, the tuple (X, G(F (X))) in (7) parametrizes the\ncouplings P(PX , PG(Z)), and G \u25e6 F should therefore be of high model capacity. Using \u02c6F instead of\nF severely constrains the model capacity of G \u25e6 \u02c6F (for small R) compared to G \u25e6 F , and minimizing\n(7) over G \u25e6 \u02c6F would hence not compute a G(Z) which approximately minimizes Wc(PX , PG(Z)).\n\nLearning the function B \u25e6 E. To circumvent this issue, instead of replacing F in (7) by \u02c6F , we\npropose to first learn G\u22c6 by either minimizing the primal form (6) via WAE or the dual form (5) via\nWGAN (if d is a metric) for c(x, y) = d(x, y), and to subsequently minimize the distortion as\n\nmin_{B,E : B(E(X))\u223cPZ} EX,B[d(X, G\u22c6(B(E(X))))] (8)\n\nw.r.t. the fixed generator G\u22c6. We then recover the stochastic decoder D in (2) as D = G\u22c6 \u25e6 B.\nClearly, the distribution constraint in (8) ensures that G\u22c6(B(E(X))) \u223c G\u22c6(Z) since G was trained\nto map PZ to PX .\n\nReconstructing the Wasserstein distance. 
The proposed method has the following guarantees.\n\nTheorem 1. Suppose Z = R^m and || \u00b7 || is a norm on R^m. Further, assume that E[||Z||^{1+\u03b4}] < \u221e for\nsome \u03b4 > 0, let d be a metric, and let G\u22c6 be K-Lipschitz, i.e., d(G\u22c6(x), G\u22c6(y)) \u2264 K||x \u2212 y||. Then,\n\nWd(PX , PG\u22c6(Z)) \u2264 min_{B,E : B(E(X))\u223cPZ} EX,B[d(X, G\u22c6(B(E(X))))] \u2264 Wd(PX , PG\u22c6(Z)) + 2^{\u2212R/m} KC, (9)\n\nwhere C > 0 is an absolute constant that depends on \u03b4, m, E[||Z||^{1+\u03b4}], and || \u00b7 ||. Furthermore, for\nan arbitrary distortion measure d and arbitrary G\u22c6 it holds for all R \u2265 0 that\n\nWd(PX , PG\u22c6(B(E(X)))) = Wd(PX , PG\u22c6(Z)). (10)\n\nFigure 2: Left: A generative model G of the data distribution is commonly learned by minimizing\nthe Wasserstein distance between PX and PG(Z) either (i) via Wasserstein Autoencoder (WAE) [17],\nwhere G \u25e6 F parametrizes the couplings between PX and PG(Z), or (ii) via Wasserstein GAN\n(WGAN) [16], which relies on the critic f . We propose Wasserstein++, a novel approach subsuming\nboth WAE and WGAN. Right: Combining the trained generative model G\u22c6 with a rate-constrained\nencoder E (quantization denoted by \u2666-symbol), and a stochastic function B (stochasticity is provided\nthrough the noise vector N ) to realize a distribution-preserving compression (DPLC) system which\nminimizes the distortion between X and \u02c6X, while ensuring that PX and P \u02c6X are similar at all rates.\n\nThe proof is presented in Appendix A. Theorem 1 states that the distortion incurred by the proposed\nprocedure is equal to Wd(PX , PG\u22c6(Z)) up to an additive error term that decays exponentially in R,\nhence converging to Wd(PX , PG\u22c6(Z)) as R \u2192 \u221e. 
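The gap between the upper and lower bounds in (9) is 2^(-R/m) KC, which halves every time the rate R grows by m bits. A small numeric sketch of this decay; the constants m, K, and C below are purely hypothetical illustration values, not quantities derived from a trained model:

```python
def excess_distortion_bound(R, m, K, C):
    # The gap between the two sides of the bound (9): 2^(-R/m) * K * C.
    # It decays exponentially in the rate R, halving whenever R grows by m.
    return 2.0 ** (-R / m) * K * C

# Hypothetical constants for illustration only (m: latent dimension,
# K: Lipschitz constant of the generator, C: the constant from Theorem 1).
m, K, C = 128, 5.0, 3.0
gaps = [excess_distortion_bound(R, m, K, C) for R in (0, 128, 256)]
print(gaps)  # -> [15.0, 7.5, 3.75]: each additional m = 128 bits halves the gap
```

Note that for realistic latent dimensions m this decay is slow in terms of raw bits, which is consistent with the bound being an asymptotic statement rather than a tight finite-rate guarantee.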
Intuitively, as E is no longer rate-constrained\nasymptotically, we can replace F in (6) by B \u25e6 E and our two-step procedure is equivalent to\nminimizing (7) w.r.t. G, which amounts to minimizing Wd(PX , PG(Z)) w.r.t. G by (6).\nFurthermore, according to Theorem 1, the distribution mismatch between G\u22c6(B(E(X))) and PX is\ndetermined by the quality of the generative model G\u22c6, and is independent of R. This is natural given\nthat we learn G\u22c6 independently.\n\nWe note that the proof of (9) in Theorem 1 hinges upon the fact that Wd is de\ufb01ned w.r.t. the distortion\nmeasure d. The bound can also be applied to a generator G\u2032 obtained by minimizing, e.g., some f -\ndivergence [26] between PX and PG(Z). However, if Wd(PX , PG\u2032(Z)) > Wd(PX , PG\u22c6(Z)) (which\nwill generally be the case in practice) then the distortion obtained by using G\u2032 will asymptotically be\nlarger than that obtained for G\u22c6. This suggests using Wd rather than f -divergences to learn G.\n\n4 Unsupervised training via Wasserstein++\n\nTo learn G, B, and E from data, we parametrize each component as a DNN and solve the correspond-\ning optimization problems via stochastic gradient descent (SGD). We embed the code W as vectors\n(henceforth referred to as \u201ccenters\u201d) in Euclidean space. Note that the centers can also be learned\nfrom the data [6]. Here, we simply \ufb01x them to the set of vectors {\u22121, 1}R and use the differentiable\napproximation from [9] to backpropagate gradients through this non-differentiable embedding. To\nensure that the mapping B is stochastic, we feed noise together with the (embedded) code E(X).\n\nThe distribution constraint in (8), i.e., ensuring that B(E(X)) \u223c PZ , can be implemented using a\nmaximum mean discrepancy (MMD) [27] or GAN-based [17] regularizer. 
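For concreteness, the MMD regularizer can be estimated from samples via the unbiased U-statistic. A minimal NumPy sketch using the inverse multiquadratics kernel mentioned in Section 5 (the function name and the kernel scale `c` are our choices for illustration, not taken from the paper's code):

```python
import numpy as np

def mmd2_imq(x, y, c=1.0):
    # Unbiased U-statistic estimate of the squared MMD with the inverse
    # multiquadratics kernel k(a, b) = c / (c + ||a - b||^2).
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return c / (c + d2)
    n, m = len(x), len(y)
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
# Near zero for matching distributions, clearly positive for mismatched ones.
same = mmd2_imq(rng.normal(size=(256, 2)), rng.normal(size=(256, 2)))
diff = mmd2_imq(rng.normal(size=(256, 2)), rng.normal(3.0, 1.0, size=(256, 2)))
```

In the DPLC system such a penalty would be applied to batches of B(E(X)) against samples from the prior PZ; the toy check above only verifies that the estimator separates matching from mismatched Gaussians.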
Firstly, we note that both\nMMD and GAN-based regularizers can be learned from the samples\u2014for MMD via the corresponding\nU-estimator, and for GAN via the adversarial framework. Secondly, matching the (simple) prior\ndistribution PZ is much easier than matching the likely complex distribution PX as in (3). Intuitively,\nat high rates, B should learn to ignore the noise at its input and map the code to PZ . On the other hand,\nas R \u2192 0, the code becomes low-dimensional and B is forced to combine it with the stochasticity of\nthe noise at its input to match PZ . In practice, we observe that MMD is robust and allows enforcing\nPZ at all rates R, while GAN-based regularizers are prone to mode collapse at low rates.\n\nWasserstein++. As previously discussed, G\u22c6 can be learned via WGAN [16] or WAE [17]. As the\nWAE framework naturally includes an encoder, it ensures that the latent space Z has a structure that is\namenable to encoding into. On the other hand, there is no reason that such a structure should emerge\nin the latent space of G trained via WGAN (in particular when Z is high-dimensional).2 In our\nexperiments we observed that WAE tends to produce somewhat less sharp samples than WGAN. On\nthe other hand, WAE is arguably less prone to mode dropping than WGAN, as the WAE objective\nseverely penalizes mode dropping due to the reconstruction error term. To combine the best of both\napproaches, we propose the following novel combination of the primal and the dual form of Wd, via\ntheir convex combination\n\nWd(PX , PG(Z)) = \u03b3 (sup_{f \u2208 F1} EX [f (X)] \u2212 EZ [f (G(Z))]) + (1 \u2212 \u03b3) (inf_{F : F (X)\u223cPZ} EX EF [d(X, G(F (X)))]), (11)\n\nwith \u03b3 \u2208 [0, 1]. There are two practical questions remaining. Firstly, minimizing this expression\nw.r.t. G can be done by alternating between performing gradient updates for the critic f and gradient\nupdates for G, F . 
In other words, we combine the steps of the WGAN algorithm [16, Algorithm 1]\nand WAE-MMD algorithm [17, Algorithm 2], and call this combined algorithm Wasserstein++.\nSecondly, one can train the critic f on fake samples from G(Z) or from G(F (X)), which will not\nfollow the same distribution in general due to a mismatch between F (X) and PZ , which is more\npronounced in the beginning of the optimization process. Preliminary experiments suggest that the\nfollowing setup yields samples of best quality (in terms of FID score):\n\n(i) Train f on samples from G( \u02dcZ), where \u02dcZ = U Z + (1 \u2212 U )F (X) with U \u223c Uniform(0, 1).\n(ii) Train G only on samples from F (X), for both the WGAN and the WAE loss term.\n\nWe note that training f on samples from G( \u02dcZ) instead of G(Z) arguably introduces robustness\nto distribution mismatch in Z-space. A more detailed description of Wasserstein++ can be found\nin Appendix C, and the relation of Wasserstein++ to existing approaches combining GANs and\nautoencoders is discussed in Section 6. We proceed to present the empirical evaluation of the\nproposed approach.\n\n5 Empirical evaluation3\n\nSetup. We empirically evaluate the proposed DPLC framework for G\u22c6 trained via WAE-MMD\n(with an inverse multiquadratics kernel, see [17]), WGAN with gradient penalty (WGAN-GP) [28],\nand Wasserstein++ (implementing the 1-Lipschitz constraint in (11) via the gradient penalty from\n[28]), on two standard generative modeling benchmark image datasets, CelebA [19] and LSUN\nbedrooms [20], both downscaled to 64 \u00d7 64 resolution. 
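Returning to the Wasserstein++ training scheme above: step (i)'s critic-input mixing, \u02dcZ = U Z + (1 \u2212 U )F (X), and the convex combination (11) are straightforward to implement. A schematic NumPy sketch (function names are ours, and the encoder outputs are random stand-ins rather than a trained F):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_codes(z_prior, z_encoded, rng):
    # Step (i) of Wasserstein++: the critic sees fake samples G(z_tilde),
    # where z_tilde = u*z + (1-u)*F(x) with u ~ Uniform(0, 1) drawn per
    # example, interpolating prior samples and encoder outputs.
    u = rng.uniform(size=(len(z_prior), 1))
    return u * z_prior + (1.0 - u) * z_encoded

def wassersteinpp_objective(gamma, wgan_term, wae_term):
    # Convex combination (11) of the dual (WGAN) and primal (WAE) estimates.
    return gamma * wgan_term + (1.0 - gamma) * wae_term

z = rng.normal(size=(4, 8))    # samples from the prior P_Z
fz = rng.normal(size=(4, 8))   # stand-ins for encoder outputs F(x)
z_tilde = mix_codes(z, fz, rng)
# Each mixed coordinate lies between its two endpoints.
assert np.all(z_tilde <= np.maximum(z, fz)) and np.all(z_tilde >= np.minimum(z, fz))
```

Setting gamma to 1 or 0 recovers a pure WGAN or pure WAE objective, respectively, which matches the interpretation of (11) as interpolating the two estimators.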
We focus on these data sets at relatively low\nresolution as current state-of-the-art generative models can handle them reasonably well, and we do\nnot want to limit ourselves by the difficulties arising with generative models at higher resolutions.\nThe Euclidean distance is used as the distortion measure (training objective) d in all experiments.\n\nWe measure the quality of the reconstructions of our DPLC systems via mean squared error (MSE)\nand we assess how well the distribution of the testing reconstructions matches that of the original data\nusing the FID score, which is the recommended measure for image data [18, 29]. To quantify the\nvariability of the reconstructions conditionally on the code w (i.e., conditionally on the encoder input),\nwe estimate the mean conditional pixel variance PV[ \u02c6X|w] = (1/N) \u2211_{i,j} EB[( \u02c6Xi,j \u2212 EB[ \u02c6Xi,j|w])^2|w],\nwhere N is the number of pixels of X. In other words, PV is a proxy for how well G \u25e6 B picks\nup the noise at its input at low rates. All performance measures are computed on a testing set of\n10k samples held out from the respective training set, except PV, which is computed on a subset of 256\ntesting samples, averaged over 100 reconstructions per testing sample (i.e., code w).\n\nArchitectures, hyperparameters, and optimizer. The prior PZ is an m-dimensional multivariate\nstandard normal, and the noise vector providing stochasticity to B has m i.i.d. entries distributed\nuniformly on [0, 1]. We use the DCGAN [30] generator and discriminator architecture for G and\nf , respectively. For F and E we follow [17] and apply an architecture similar to the DCGAN
For F and E we follow [17] and apply the architecture similar to the DCGAN\n\n2In principle, this is not an issue if B has enough model capacity, but it might lead to differences in practice\n\nas the distortion (8) should be easier to minimize if the Z-space is suitably structured, see Section 5.\n\n3Code is available at https://github.com/mitscha/dplc.\n\n6\n\n\f10\u22121\n\n10\u22122\n\n10\u22121\n\n10\u22122\n\nMSE\n\nrFID\n\nPV\n\n102\n\n101\n\n10\u22121\n\n10\u22122\n\n10\u22123\n\n10\u22124\n\n0\n\n0\n\n10\u22122\n\n10\u22121\n\n100\n\n0\n\n10\u22122\n\n10\u22121\n\n100\n\n0\n\n10\u22122\n\n10\u22121\n\n100\n\n102\n\n0\n\n10\u22122\n\n10\u22121\nBits per pixel\n\n101\n\n0\n\n100\n\n10\u22121\n\n10\u22122\n\n10\u22123\n\n0\n\n10\u22122\n\n10\u22121\nBits per pixel\n\n100\n\n0\n\n10\u22122\n\n10\u22121\nBits per pixel\n\n100\n\nWAE\n\nWGAN-GP\n\nWasserstein++\n\nBPG\n\nCAE\n\nGC\n\nFigure 3: Testing MSE (smaller is better), reconstruction FID (smaller is better), conditional pixel\nvariance (PV, larger is better) obtained by our DPLC model, for different generators G\u22c6, CAE, BPG,\nas well as GC [15], as function of the bitrate. The results for CelebA are shown in the top row,\nthose for LSUN bedrooms in the bottom row. The PV of our DPLC models steadily increases with\ndecreasing rate, i.e., they generate gradually more image content, as opposed to GC.\n\ndiscriminator. B is realized as a stack of n residual blocks [31]. We set m = 128, n = 2 for CelebA,\nand m = 512, n = 4 for the LSUN bedrooms data set. 
We chose m to be larger than the standard\nlatent space dimension for GANs as we observed that lower m may lead to blurry reconstructions.\n\nAs baselines, we consider compressive autoencoders (CAEs) with the same architecture G \u25e6 B \u25e6 E\nbut without feeding noise to B, training G, B, E jointly to minimize distortion, and BPG [32], a\nstate-of-the-art engineered codec.4 In addition, to corroborate the claims made about the disadvantages\nof (3) in Section 3, we train G \u25e6 B \u25e6 E to minimize (3) as done in the generative compression (GC)\napproach from [15], but replacing df by Wd.\n\nThroughout, we rely on the Adam optimizer [33]. To train G by means of WAE-MMD and WGAN-GP we use the training parameters from [17] and [28], respectively. For Wasserstein++, we set \u03b3 in\n(11) to 2.5 \u00b7 10^{-5} for CelebA and to 10^{-4} for LSUN. Further, we use the same training parameters to\nsolve (8) as for WAE-MMD. To compensate for the increase in the reconstruction loss with\ndecreasing rate, we adjust the coefficient of the MMD penalty, \u03bbMMD (see Appendix C), proportionally\nas a function of the reconstruction loss of the CAE baseline, i.e., \u03bbMMD(R) = const. \u00b7 MSECAE(R).\nWe adjust the coefficient \u03bb of the divergence term df in (3) analogously. This ensures that the\nregularization strength is roughly the same at all rates. Appendix B provides a detailed description of\nall architectures and hyperparameters.\n\nResults. Table 1 shows the sample FID of G\u22c6 for WAE, WGAN-GP, and Wasserstein++, as well as\nthe reconstruction FID and MSE for WAE and Wasserstein++.5 In Figure 3 we plot the MSE, the\nreconstruction FID, and PV obtained by our DPLC models as a function of the bitrate, for different\nG\u22c6, along with the values obtained for the baselines. Figure 1 presents visual examples produced by\nour DPLC model with G\u22c6 trained using Wasserstein++, along with examples obtained for GC and\nCAE. 
More visual examples can be found in Appendix D.\n\n4The implementation from [32] used in this paper cannot compress to rates below \u2248 0.2 bpp on average for\n\nthe data sets considered here.\n\n5The reconstruction FID and MSE in Table 1 are obtained as G\u22c6(F (X)), without rate constraint. We do\nnot report reconstruction FID and MSE for WGAN-GP as its formulation (5) does not naturally include an\nunconstrained encoder.\n\n7\n\n\fDiscussion. We \ufb01rst discuss the performance of the trained generators G\u22c6, shown in Table 1. For\nboth CelebA and LSUN bedrooms, the sample FID obtained by Wasserstein++ is considerably\nsmaller than that of WAE, but slightly larger than that of WGAN-GP. Further, Wasserstein++ yields a\nsigni\ufb01cantly smaller reconstruction FID than WAE, but a larger reconstruction MSE. Note that the\ndecrease in sample and reconstruction FID achieved by Wasserstein++ compared to WAE should be\nexpected to come at the cost of an increased reconstruction MSE, as the Wasserstein++ objective is\nobtained by adding a WGAN term to the WAE objective (which minimizes distortion).\n\nWe now turn to the DPLC results obtained for CelebA shown in Figure 3, top row. It can be seen that\namong our DPLC models, the one combined with G\u22c6 from WAE yields the lowest MSE, followed\nby those based on Wasserstein++, and WGAN-GP. This is not surprising as the optimization of\nWGAN-GP does not include a distortion term. CAE obtains a lower MSE than all DPLC models\nwhich is again intuitive as G, B, E are trained jointly and to minimize distortion exclusively (in\nparticular there is no constraint on the distribution in Z-space). Finally, BPG obtains the overall\nlowest MSE. 
Note, however, that BPG relies on several advanced techniques such as entropy coding\nbased on context models (see, e.g., [4, 8\u201310]), which we did not implement here (but which could be\nincorporated into our DPLC framework).\n\nAmong our DPLC methods, DPLC based on Wasserstein++ attains the lowest reconstruction FID (i.e.,\nits distribution most faithfully reproduces the data distribution), followed by WGAN-GP and WAE.\nFor all three models, the FID decreases as the rate increases, meaning that the models manage not\nonly to reduce distortion as the rate increases, but also to better reproduce the original distribution.\nThe FID of CAE increases drastically as the rate falls below 0.03 bpp. Arguably, this can be attributed\nto significant blur incurred at these low rates (see Figure 9 in Appendix D). BPG yields a very high\nFID as soon as the rate falls below 0.5 bpp due to compression artifacts.\n\nThe PV can be seen to increase steadily for all DPLC models as the rate decreases, as expected.\nThis is also reflected by the visual examples in Figure 1, left: At 0.5 bpp no variability is visible, at\n0.125 bpp the facial expression starts to vary, and decreasing the rate further leads to the decoder\nproducing different persons, deviating more and more from the original image, until the system\ngenerates random faces.\n\nIn contrast, the PV obtained by solving (3) as in GC [15] is essentially 0, except at 0 bpp, where it is\ncomparable to that of our DPLC models. The noise injected into D = G \u25e6 B is hence ignored unless\nit is the only source of randomness at 0 bpp. We emphasize that this is the case even though we adjust\nthe coefficient \u03bb of the df term as \u03bb(R) = const. \u00b7 MSECAE(R) to compensate for the increase in\ndistortion with decreasing rate. 
The performance of GC in terms of MSE and reconstruction FID is\ncomparable to that of the DPLC model with Wasserstein++ G\u22c6.\n\nWe now turn to the DPLC results obtained for LSUN bedrooms. The qualitative behavior of DPLC\nbased on WAE and Wasserstein++ in terms of MSE, reconstruction FID, and PV is essentially the\nsame as observed for CelebA. Wasserstein++ provides the lowest FID by a large margin, for all\npositive rates. The reconstruction FID for WAE is high at all rates, which is not surprising as the\nsample FID obtained by WAE is large (cf. Table 1), i.e., WAE struggles to model the distribution of\nthe LSUN bedrooms data set.\n\nFor DPLC based on WGAN-GP, in contrast, while the MSE and PV follow the same trend as\nfor CelebA, the reconstruction FID increases notably as the bitrate decreases. By inspecting the\ncorresponding reconstructions (cf. Figure 12 in Appendix D) one can see that the model manages to\napproximate the data distribution well at zero bitrate, but yields increasingly blurry reconstructions\nas the bitrate increases. This indicates that either the (trained) function B \u25e6 E is not mapping the\noriginal images to Z space in a way suitable for G\u22c6 to produce crisp reconstructions, or the range of\nG\u22c6 does not cover the support of PX well. We tried to address the former issue by increasing the\ndepth of B (to increase model capacity) and by increasing \u03bbMMD (to reduce the mismatch between\nthe distribution of B(E(X)) and PZ ), but we did not observe improvements in reconstruction quality.\nWe therefore suspect mode coverage issues to cause the blur in the reconstructions.\n\nFinally, GC [15] largely ignores the noise injected into D at high bitrates, while using it to produce\nstochastic decoders at low bitrates. However, at low rates, the rFID of GC is considerably higher\nthan that of DPLC based on Wasserstein++, meaning that it does not faithfully reproduce the data\ndistribution despite using stochasticity. 
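The \u03bbMMD coefficient mentioned above weights the maximum mean discrepancy penalty [27] that the WAE-based models use to match the distribution of B(E(X)) to the prior PZ [17]. As a minimal sketch, an unbiased estimator of the squared MMD with a Gaussian RBF kernel (the bandwidth `sigma` and the helper name `mmd2_rbf` are our choices, not the paper's):

```python
import numpy as np

def mmd2_rbf(x, y, sigma=1.0):
    """Unbiased estimate of squared MMD between samples x and y (rows) under an RBF kernel."""
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        return np.exp(-d / (2.0 * sigma ** 2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))  # exclude diagonal (unbiased)
    term_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_xx + term_yy - 2.0 * kxy.mean())
```

Samples from the same distribution give an estimate near zero (it can be slightly negative, being unbiased), while a distribution mismatch in Z-space, as suspected above for LSUN bedrooms, drives the penalty up.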
Indeed, GC suffers from mode collapse at low rates, as can be seen in Figure 14 in Appendix D.\n\nTable 1: Reconstruction FID and MSE (without the rate constraint, cf. footnote 5), and sample FID for the trained generators G\u22c6, on CelebA and LSUN bedrooms (smaller is better for all three metrics). Wasserstein++ obtains lower rFID and sFID than WAE, but a (slightly) higher sFID than WGAN-GP.\n\n               |        CelebA             |     LSUN bedrooms\n               | rFID    MSE      sFID     | rFID    MSE      sFID\nWAE            | 38.55   0.0165   51.82    | 42.59   0.0099   153.57\nWGAN-GP        | /       /        22.70    | /       /        45.52\nWasserstein++  | 10.93   0.0277   23.36    | 27.52   0.0321   60.97\n\n6 Related work\n\nDNN-based methods for compression have become an active area of research over the past few years. Most authors focus on image compression [1\u201310, 13], while others consider audio [11] and video [12] data. Compressive autoencoders [3, 5, 6, 8, 10, 13] and recurrent neural networks (RNNs) [1, 2, 7] have emerged as the most popular DNN architectures for compression.\n\nGANs have been used in the context of learned image compression before [4, 15, 34, 35]. The work in [4] applies a GAN loss to image patches for artifact suppression, whereas [15] applies the GAN loss to the entire image to encourage the decoder to generate image content (but does not demonstrate a properly working stochastic decoder). GANs are leveraged by [36] and [35] to improve the image quality of super-resolution and engineered compression methods, respectively.\n\nSanturkar et al. [34] use a generator trained with a GAN as a decoder in a compression system. However, they rely only on the vanilla GAN [14] rather than considering different Wd-based generative models, and they do not provide an analytical characterization of their model. Most importantly, they optimize their model using conventional distortion minimization with a deterministic decoder, rather than solving the DPLC problem.\n\nGregor et al.
[37] propose a variational autoencoder (VAE)-type generative model that learns a hierarchy of progressively more abstract representations. By storing the high-level part of the representation and generating the low-level one, they manage to partially preserve and partially generate image content. However, their framework lacks a notion of rate and distortion and does not quantize the representations into a code (apart from using finite-precision data types).\n\nProbably most closely related to Wasserstein++ is VAE-GAN [38], which combines VAE [24] with the vanilla GAN [14]. However, whereas the VAE part and the GAN part minimize different divergences (Kullback-Leibler and Jensen-Shannon, respectively), WAE and WGAN minimize the same cost function, so Wasserstein++ is conceptually somewhat more principled. More generally, learning generative models jointly with an inference mechanism for the latent variables has attracted significant attention; see, e.g., [38\u201341], and [42] for an overview.\n\nOutside the domain of machine learning, distribution-preserving (scalar) quantization has been studied as well. Specifically, [43] studies moment-preserving quantization, i.e., quantization designed so that certain moments of the data distribution are preserved. Further, [44] proposes an engineered dither-based quantization method that preserves the distribution of the variable to be quantized.
These systems allowed us to obtain essentially artifact-free reconstructions at all rates, covering the full spectrum from learning a generative model of the data at zero bitrate on one hand, to learning a compression system with almost perfect reconstruction at high bitrate on the other hand. Most importantly, our framework improves over previous methods by producing stochastic decoders at low bitrates, thereby effectively solving the DPLC problem for the first time. Future work includes scaling the proposed approach up to full-resolution images and applying it to data types other than images.\n\nAcknowledgments. The authors would like to thank Fabian Mentzer for insightful discussions and for providing code to generate BPG images for the empirical evaluation in Section 5.\n\nReferences\n\n[1] G. Toderici, S. M. O\u2019Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, \u201cVariable rate image compression with recurrent neural networks,\u201d in International Conference on Learning Representations (ICLR), 2015.\n\n[2] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, \u201cFull resolution image compression with recurrent neural networks,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5435\u20135443, 2017.\n\n[3] L. Theis, W. Shi, A. Cunningham, and F. Huszar, \u201cLossy image compression with compressive autoencoders,\u201d in International Conference on Learning Representations (ICLR), 2017.\n\n[4] O. Rippel and L. Bourdev, \u201cReal-time adaptive image compression,\u201d in Proceedings of the International Conference on Machine Learning (ICML), pp. 2922\u20132930, 2017.\n\n[5] J. Ball\u00e9, V. Laparra, and E. P. Simoncelli, \u201cEnd-to-end optimized image compression,\u201d in International Conference on Learning Representations (ICLR), 2016.\n\n[6] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R.
Timofte, L. Benini, and L. V. Gool, \u201cSoft-to-hard vector quantization for end-to-end learning compressible representations,\u201d in Advances in Neural Information Processing Systems (NIPS), pp. 1141\u20131151, 2017.\n\n[7] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, \u201cImproved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,\u201d arXiv:1703.10114, 2017.\n\n[8] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, \u201cLearning convolutional networks for content-weighted image compression,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3214\u20133223, 2018.\n\n[9] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, \u201cConditional probability models for deep image compression,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4394\u20134402, 2018.\n\n[10] J. Ball\u00e9, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, \u201cVariational image compression with a scale hyperprior,\u201d in International Conference on Learning Representations (ICLR), 2018.\n\n[11] S. Kankanahalli, \u201cEnd-to-end optimized speech coding with deep neural networks,\u201d in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2521\u20132525, 2018.\n\n[12] C.-Y. Wu, N. Singhal, and P. Kr\u00e4henb\u00fchl, \u201cVideo compression through image interpolation,\u201d in European Conference on Computer Vision (ECCV), 2018.\n\n[13] R. Torfason, F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, \u201cTowards image understanding from deep compression without decoding,\u201d in International Conference on Learning Representations (ICLR), 2018.\n\n[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y.
Bengio, \u201cGenerative adversarial nets,\u201d in Advances in Neural Information Processing Systems (NIPS), pp. 2672\u20132680, 2014.\n\n[15] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, \u201cGenerative adversarial networks for extreme learned image compression,\u201d arXiv:1804.02958, 2018.\n\n[16] M. Arjovsky, S. Chintala, and L. Bottou, \u201cWasserstein generative adversarial networks,\u201d in Proceedings of the International Conference on Machine Learning (ICML), pp. 214\u2013223, 2017.\n\n[17] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, \u201cWasserstein auto-encoders,\u201d in International Conference on Learning Representations (ICLR), 2018.\n\n[18] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, \u201cGANs trained by a two time-scale update rule converge to a local Nash equilibrium,\u201d in Advances in Neural Information Processing Systems (NIPS), pp. 6629\u20136640, 2017.\n\n[19] Z. Liu, P. Luo, X. Wang, and X. Tang, \u201cDeep learning face attributes in the wild,\u201d in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3730\u20133738, 2015.\n\n[20] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, \u201cLSUN: Construction of a large-scale image dataset using deep learning with humans in the loop,\u201d arXiv:1506.03365, 2015.\n\n[21] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, \u201cUnpaired image-to-image translation using cycle-consistent adversarial networks,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2223\u20132232, 2017.\n\n[22] M. Mathieu, C. Couprie, and Y. LeCun, \u201cDeep multi-scale video prediction beyond mean square error,\u201d in International Conference on Learning Representations (ICLR), 2016.\n\n[23] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E.
Shechtman, \u201cToward multimodal image-to-image translation,\u201d in Advances in Neural Information Processing Systems (NIPS), pp. 465\u2013476, 2017.\n\n[24] D. P. Kingma and M. Welling, \u201cAuto-encoding variational Bayes,\u201d in International Conference on Learning Representations (ICLR), 2014.\n\n[25] C. Villani, Optimal transport: Old and new, vol. 338. Springer Science & Business Media, 2008.\n\n[26] F. Liese and K.-J. Miescke, \u201cStatistical decision theory,\u201d in Statistical Decision Theory, pp. 1\u201352, Springer, 2007.\n\n[27] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch\u00f6lkopf, and A. Smola, \u201cA kernel two-sample test,\u201d Journal of Machine Learning Research, vol. 13, pp. 723\u2013773, 2012.\n\n[28] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, \u201cImproved training of Wasserstein GANs,\u201d in Advances in Neural Information Processing Systems (NIPS), pp. 5769\u20135779, 2017.\n\n[29] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, \u201cAre GANs Created Equal? A Large-Scale Study,\u201d in Advances in Neural Information Processing Systems (NeurIPS), 2018.\n\n[30] A. Radford, L. Metz, and S. Chintala, \u201cUnsupervised representation learning with deep convolutional generative adversarial networks,\u201d arXiv:1511.06434, 2015.\n\n[31] K. He, X. Zhang, S. Ren, and J. Sun, \u201cDeep residual learning for image recognition,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770\u2013778, 2016.\n\n[32] F. Bellard, \u201cBPG Image format.\u201d https://bellard.org/bpg/, 2018. Accessed 26 June 2018.\n\n[33] D. P. Kingma and J. Ba, \u201cAdam: A method for stochastic optimization,\u201d in International Conference on Learning Representations (ICLR), 2015.\n\n[34] S. Santurkar, D. Budden, and N. Shavit, \u201cGenerative compression,\u201d in Picture Coding Symposium (PCS), pp. 258\u2013262, 2018.\n\n[35] L.
Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, \u201cDeep generative adversarial compression artifact removal,\u201d in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4826\u20134835, 2017.\n\n[36] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, \u201cPhoto-realistic single image super-resolution using a generative adversarial network,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4681\u20134690, 2017.\n\n[37] K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra, \u201cTowards conceptual compression,\u201d in Advances in Neural Information Processing Systems (NIPS), pp. 3549\u20133557, 2016.\n\n[38] A. B. L. Larsen, S. K. S\u00f8nderby, H. Larochelle, and O. Winther, \u201cAutoencoding beyond pixels using a learned similarity metric,\u201d arXiv:1512.09300, 2015.\n\n[39] J. Donahue, P. Kr\u00e4henb\u00fchl, and T. Darrell, \u201cAdversarial feature learning,\u201d in International Conference on Learning Representations (ICLR), 2017.\n\n[40] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, \u201cAdversarially learned inference,\u201d in International Conference on Learning Representations (ICLR), 2017.\n\n[41] A. Dosovitskiy and T. Brox, \u201cGenerating images with perceptual similarity metrics based on deep networks,\u201d in Advances in Neural Information Processing Systems (NIPS), pp. 658\u2013666, 2016.\n\n[42] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed, \u201cVariational approaches for auto-encoding generative adversarial networks,\u201d arXiv:1706.04987, 2017.\n\n[43] E. J. Delp and O. R. Mitchell, \u201cMoment preserving quantization (signal processing),\u201d IEEE Transactions on Communications, vol. 39, no. 11, pp. 1549\u20131558, 1991.\n\n[44] M. Li, J. Klejsa, and W. B.
Kleijn, \u201cDistribution preserving quantization with dithering and transformation,\u201d IEEE Signal Processing Letters, vol. 17, no. 12, pp. 1014\u20131017, 2010.\n\n[45] H. Luschgy and G. Pag\u00e8s, \u201cFunctional quantization of Gaussian processes,\u201d Journal of Functional Analysis, vol. 196, no. 2, pp. 486\u2013531, 2002.\n\n[46] S. Graf and H. Luschgy, Foundations of quantization for probability distributions. Springer, 2007.", "award": [], "sourceid": 2866, "authors": [{"given_name": "Michael", "family_name": "Tschannen", "institution": "ETH Zurich"}, {"given_name": "Eirikur", "family_name": "Agustsson", "institution": "ETH Zurich"}, {"given_name": "Mario", "family_name": "Lucic", "institution": "Google Brain"}]}