{"title": "Neural Discrete Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6306, "page_last": 6315, "abstract": "Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of ``posterior collapse'' (where the latents are ignored when they are paired with a powerful autoregressive decoder) typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech, as well as performing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.", "full_text": "Neural Discrete Representation Learning\n\nAaron van den Oord\n\nDeepMind\n\navdnoord@google.com\n\nOriol Vinyals\n\nDeepMind\n\nvinyals@google.com\n\nKoray Kavukcuoglu\n\nDeepMind\n\nkorayk@google.com\n\nAbstract\n\nLearning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). 
Using the VQ method allows the model to circumvent issues of “posterior collapse” (where the latents are ignored when they are paired with a powerful autoregressive decoder) typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech, as well as performing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.\n\n1 Introduction\n\nRecent advances in generative modelling of images [38, 12, 13, 22, 10], audio [37, 26] and videos [20, 11] have yielded impressive samples and applications [24, 18]. At the same time, challenging tasks such as few-shot learning [34], domain adaptation [17], or reinforcement learning [35] heavily rely on learnt representations from raw data, but generic representations trained in an unsupervised fashion are still far from being the dominant approach.\nMaximum likelihood and reconstruction error are two common objectives used to train unsupervised models in the pixel domain; however, their usefulness depends on the particular application the features are used in. Our goal is to train a model that conserves the important features of the data in its latent space while optimising for maximum likelihood. As the work in [7] suggests, the best generative models (as measured by log-likelihood) will be those without latents but with a powerful decoder (such as PixelCNN). However, in this paper, we argue for learning discrete and useful latent variables, which we demonstrate on a variety of domains.\nLearning representations with continuous features has been the focus of much previous work [16, 39, 6, 9]; however, we concentrate on discrete representations [27, 33, 8, 28], which are potentially a more natural fit for many of the modalities we are interested in. 
Language is inherently discrete, and speech is typically represented as a sequence of symbols. Images can often be described concisely by language [40]. Furthermore, discrete representations are a natural fit for complex reasoning, planning and predictive learning (e.g., if it rains, I will use an umbrella). While using discrete latent variables in deep learning has proven challenging, powerful autoregressive models have been developed for modelling distributions over discrete variables [37].\nIn our work, we introduce a new family of generative models successfully combining the variational autoencoder (VAE) framework with discrete latent representations through a novel parameterisation of the posterior distribution of (discrete) latents given an observation. Our model, which relies on vector quantisation (VQ), is simple to train, does not suffer from large variance, and avoids the\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n“posterior collapse” issue, often caused by latents being ignored, which has been problematic with many VAE models that have a powerful decoder. Additionally, it is the first discrete latent VAE model that achieves performance similar to its continuous counterparts, while offering the flexibility of discrete distributions. We term our model the VQ-VAE.\nSince VQ-VAE can make effective use of the latent space, it can successfully model important features that usually span many dimensions in data space (for example, objects span many pixels in images, phonemes span many samples in speech, a message spans a text fragment, etc.) as opposed to spending capacity on noise and imperceptible details, which are often local.\nLastly, once a good discrete latent structure of a modality is discovered by the VQ-VAE, we train a powerful prior over these discrete random variables, yielding interesting samples and useful applications. 
For instance, when trained on speech we discover the latent structure of language without any supervision or prior knowledge about phonemes or words. Furthermore, we can equip our decoder with the speaker identity, which allows for speaker conversion, i.e., transferring the voice from one speaker to another without changing the contents. We also show promising results on learning the long-term structure of environments for RL.\nOur contributions can thus be summarised as:\n\n• Introducing the VQ-VAE model, which is simple, uses discrete latents, does not suffer from “posterior collapse” and has no variance issues.\n• We show that a discrete latent model (VQ-VAE) performs as well as its continuous model counterparts in log-likelihood.\n• When paired with a powerful prior, our samples are coherent and high quality on a wide variety of applications such as speech and video generation.\n• We show evidence of learning language through raw speech, without any supervision, and show applications of unsupervised speaker conversion.\n\n2 Related Work\n\nIn this work we present a new way of training variational autoencoders [23, 32] with discrete latent variables [27]. Using discrete variables in deep learning has proven challenging, as suggested by the dominance of continuous latent variables in most current work – even when the underlying modality is inherently discrete.\nThere exist many alternatives for training discrete VAEs. The NVIL [27] estimator uses a single-sample objective to optimise the variational lower bound, and uses various variance-reduction techniques to speed up training. 
VIMCO [28] optimises a multi-sample objective [5], which speeds up convergence further by using multiple samples from the inference network.\nRecently, a few authors have suggested the use of a new continuous reparameterisation based on the so-called Concrete [25] or Gumbel-softmax [19] distribution, a continuous distribution with a temperature constant that can be annealed during training to converge to a discrete distribution in the limit. Early in training the gradients have low variance but are biased, and towards the end of training they become unbiased but high-variance.\nNone of the above methods, however, close the performance gap of VAEs with continuous latent variables, where one can use the Gaussian reparameterisation trick, which benefits from much lower variance in the gradients. Furthermore, most of these techniques are typically evaluated on relatively small datasets such as MNIST, and the dimensionality of the latent distributions is small (e.g., below 8). In our work, we use three complex image datasets (CIFAR10, ImageNet, and DeepMind Lab) and a raw speech dataset (VCTK).\nOur work also extends the line of research where autoregressive distributions are used in the decoder of VAEs and/or in the prior [14]. This has been done for language modelling with LSTM decoders [4], and more recently with dilated convolutional decoders [42]. PixelCNNs [29, 38] are convolutional autoregressive models which have also been used as the distribution in the decoder of VAEs [15, 7].\nFinally, our approach also relates to work in image compression with neural networks. Theis et al. [36] use scalar quantisation to compress activations for lossy image compression before arithmetic encoding. Other authors [1] propose a similar compression model with vector quantisation. The authors propose a continuous relaxation of vector quantisation which is annealed over time to obtain a hard clustering. 
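Both the Gumbel-softmax relaxation and the soft-to-hard annealing above rely on a temperature schedule. As a rough illustration only (a minimal numpy sketch of a Gumbel-softmax/Concrete sample; the function name and shapes are our own, not code from any of the cited papers):

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Continuous relaxation of a categorical sample: as the temperature
    is annealed towards 0, the output approaches a one-hot vector."""
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / temperature
    y = y - y.max()          # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum()       # a point on the simplex, near-one-hot at low T
```

Sampling twice with the same noise but different temperatures shows the annealing effect: the low-temperature sample is much closer to one-hot than the high-temperature one.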
In their experiments they first train an autoencoder, then apply vector quantisation to the activations of the encoder, and finally fine-tune the whole network using the soft-to-hard relaxation with a small learning rate. In our experiments we were unable to train using the soft-to-hard relaxation approach from scratch, as the decoder was always able to invert the continuous relaxation during training, so that no actual quantisation took place.\n\n3 VQ-VAE\n\nPerhaps the work most related to our approach is that on VAEs. VAEs consist of the following parts: an encoder network which parameterises a posterior distribution q(z|x) of discrete latent random variables z given the input data x, a prior distribution p(z), and a decoder with a distribution p(x|z) over input data.\nTypically, the posteriors and priors in VAEs are assumed normally distributed with diagonal covariance, which allows the Gaussian reparameterisation trick to be used [32, 23]. Extensions include autoregressive prior and posterior models [14], normalising flows [31, 10], and inverse autoregressive posteriors [22].\nIn this work we introduce the VQ-VAE, in which we use discrete latent variables with a new way of training, inspired by vector quantisation (VQ). The posterior and prior distributions are categorical, and the samples drawn from these distributions index an embedding table. These embeddings are then used as input to the decoder network.\n\n3.1 Discrete Latent Variables\nWe define a latent embedding space e ∈ R^(K×D), where K is the size of the discrete latent space (i.e., a K-way categorical) and D is the dimensionality of each latent embedding vector e_i. Note that there are K embedding vectors e_i ∈ R^D, i ∈ {1, 2, ..., K}. As shown in Figure 1, the model takes an input x, which is passed through an encoder producing output z_e(x). 
The discrete latent variables z are then calculated by a nearest-neighbour look-up using the shared embedding space e, as shown in equation 1. The input to the decoder is the corresponding embedding vector e_k, as given in equation 2. One can see this forward computation pipeline as a regular autoencoder with a particular non-linearity that maps the latents to 1-of-K embedding vectors. The complete set of parameters for the model is the union of the parameters of the encoder, the decoder, and the embedding space e. For the sake of simplicity we use a single random variable z to represent the discrete latent variables in this section; however, for speech, images and videos we actually extract 1D, 2D and 3D latent feature spaces, respectively.\nThe posterior categorical distribution q(z|x) probabilities are defined as one-hot as follows:\n\nq(z = k|x) = 1 for k = argmin_j ||z_e(x) - e_j||_2, and 0 otherwise,   (1)\n\nwhere z_e(x) is the output of the encoder network. We view this model as a VAE in which we can bound log p(x) with the ELBO. Our proposal distribution q(z = k|x) is deterministic, and by defining a simple uniform prior over z we obtain a KL divergence that is constant and equal to log K.\nThe representation z_e(x) is passed through the discretisation bottleneck, i.e., mapped onto the nearest element of the embedding space e, as given in equations 1 and 2:\n\nz_q(x) = e_k, where k = argmin_j ||z_e(x) - e_j||_2.   (2)\n\n3.2 Learning\n\nNote that there is no real gradient defined for equation 2; however, we approximate the gradient similarly to the straight-through estimator [3] and just copy gradients from the decoder input z_q(x) to the encoder output z_e(x). One could also use the subgradient through the quantisation operation, but this simple estimator worked well for the initial experiments in this paper.\n\nFigure 1: Left: A figure describing the VQ-VAE. Right: Visualisation of the embedding space. 
The output of the encoder z_e(x) is mapped to the nearest point e_2. The gradient ∇_z L (in red) will push the encoder to change its output, which could alter the configuration in the next forward pass.\n\nDuring forward computation the nearest embedding z_q(x) (equation 2) is passed to the decoder, and during the backward pass the gradient ∇_z L is passed unaltered to the encoder. Since the output representation of the encoder and the input to the decoder share the same D-dimensional space, the gradients contain useful information for how the encoder has to change its output to lower the reconstruction loss.\nAs seen in Figure 1 (right), the gradient can push the encoder's output to be discretised differently in the next forward pass, because the assignment in equation 1 will be different.\nEquation 3 specifies the overall loss function. It has three components that are used to train different parts of the VQ-VAE. The first term is the reconstruction loss (or the data term), which optimises the decoder and the encoder (through the estimator explained above). Due to the straight-through gradient estimation of the mapping from z_e(x) to z_q(x), the embeddings e_i receive no gradients from the reconstruction loss log p(x|z_q(x)). Therefore, in order to learn the embedding space, we use one of the simplest dictionary learning algorithms, Vector Quantisation (VQ). The VQ objective uses the l2 error to move the embedding vectors e_i towards the encoder outputs z_e(x), as shown in the second term of equation 3. Because this loss term is only used for updating the dictionary, one can alternatively also update the dictionary items as a function of moving averages of z_e(x) (not used for the experiments in this work).\nFinally, since the volume of the embedding space is dimensionless, it can grow arbitrarily if the embeddings e_i do not train as fast as the encoder parameters. 
To make sure the encoder commits to an embedding and its output does not grow, we add a commitment loss, the third term in equation 3. Thus, the total training objective becomes:\n\nL = log p(x|z_q(x)) + ||sg[z_e(x)] - e||_2^2 + β||z_e(x) - sg[e]||_2^2,   (3)\n\nwhere sg stands for the stop-gradient operator that is defined as the identity at forward computation time and has zero partial derivatives, thus effectively constraining its operand to be a non-updated constant. The decoder optimises the first loss term only, the encoder optimises the first and the last loss terms, and the embeddings are optimised by the middle loss term. We found the resulting algorithm to be quite robust to β, as the results did not vary for values of β ranging from 0.1 to 2.0. We use β = 0.25 in all our experiments, although in general this would depend on the scale of the reconstruction loss. Since we assume a uniform prior for z, the KL term that usually appears in the ELBO is constant w.r.t. the encoder parameters and can thus be ignored for training.\nIn our experiments we define N discrete latents (e.g., we use a field of 32 x 32 latents for ImageNet, or 8 x 8 x 10 for CIFAR10). The resulting loss L is identical, except that we get an average over N terms for the k-means and commitment losses – one for each latent.\nThe log-likelihood of the complete model log p(x) can be evaluated as follows:\n\nlog p(x) = log Σ_k p(x|z_k) p(z_k).\n\nBecause the decoder p(x|z) is trained with z = z_q(x) from MAP-inference, the decoder should not allocate any probability mass to p(x|z) for z ≠ z_q(x) once it has fully converged. Thus, we can write log p(x) ≈ log p(x|z_q(x))p(z_q(x)). We empirically evaluate this approximation in section 4. 
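The nearest-neighbour lookup of equations 1 and 2, the straight-through gradient copy, and the three terms of equation 3 can be put together in a minimal numpy sketch (function names and the identity decoder are our own; with numpy there is no autograd, so sg[.] is just a marker for which operand would be held constant, and the codebook and commitment terms therefore share the same forward value):

```python
import numpy as np

def vq_forward_and_loss(z_e, codebook, x, decode, beta=0.25):
    """z_e: (N, D) encoder outputs; codebook: (K, D) embedding table e.

    Returns the chosen code indices and the total loss of equation 3,
    with MSE standing in for -log p(x | z_q(x)).
    """
    sg = lambda t: t  # stop-gradient: identity in the forward pass

    # Equations 1 and 2: k = argmin_j ||z_e(x) - e_j||_2, z_q(x) = e_k.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    k = dists.argmin(axis=1)
    z_q = codebook[k]

    # Straight-through trick: forward value equals z_q, but written so
    # that an autodiff framework would copy the decoder gradient onto z_e.
    z_q_st = z_e + sg(z_q - z_e)

    recon = ((x - decode(z_q_st)) ** 2).mean()              # data term
    codebook_loss = ((sg(z_e) - z_q) ** 2).sum(-1).mean()   # ||sg[z_e(x)] - e||^2
    commit_loss = ((z_e - sg(z_q)) ** 2).sum(-1).mean()     # ||z_e(x) - sg[e]||^2
    return k, recon + codebook_loss + beta * commit_loss
```

With a two-entry codebook and an identity decoder, each encoder output snaps to its nearest embedding and the loss is a plain scalar.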
From Jensen's inequality, we can also write log p(x) ≥ log p(x|z_q(x))p(z_q(x)).\n\n3.3 Prior\n\nThe prior distribution over the discrete latents p(z) is a categorical distribution, and can be made autoregressive by depending on other z in the feature map. Whilst training the VQ-VAE, the prior is kept constant and uniform. After training, we fit an autoregressive distribution over z, p(z), so that we can generate x via ancestral sampling. We use a PixelCNN over the discrete latents for images, and a WaveNet for raw audio. Training the prior and the VQ-VAE jointly, which could strengthen our results, is left as future research.\n\n4 Experiments\n\n4.1 Comparison with continuous variables\n\nAs a first experiment we compare VQ-VAE with normal VAEs (with continuous variables), as well as VIMCO [28] with independent Gaussian or categorical priors. We train these models using the same standard VAE architecture on CIFAR10, while varying the latent capacity (the number of continuous or discrete latent variables, as well as the dimensionality of the discrete space K). The encoder consists of 2 strided convolutional layers with stride 2 and window size 4 × 4, followed by two residual 3 × 3 blocks (implemented as ReLU, 3x3 conv, ReLU, 1x1 conv), all having 256 hidden units. The decoder similarly has two residual 3 × 3 blocks, followed by two transposed convolutions with stride 2 and window size 4 × 4. We use the ADAM optimiser [21] with learning rate 2e-4 and evaluate the performance after 250,000 steps with batch-size 128. For VIMCO we use 50 samples in the multi-sample training objective.\nThe VAE, VQ-VAE and VIMCO models obtain 4.51 bits/dim, 4.67 bits/dim and 5.14 bits/dim, respectively. All reported likelihoods are lower bounds. 
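For reference, bits/dim is the negative log-likelihood expressed in base-2 logarithms and averaged over the dimensions of the image; a one-line helper of our own (assuming the NLL is given in nats per image, with 32x32x3 dimensions for CIFAR10):

```python
import numpy as np

def bits_per_dim(nll_nats, dims=32 * 32 * 3):
    """Convert a per-image negative log-likelihood in nats into the
    bits/dim figure used to report CIFAR10 results."""
    return nll_nats / (np.log(2.0) * dims)
```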
Our numbers for the continuous VAE are comparable to those reported for a deep convolutional VAE: 4.54 bits/dim [13] on this dataset.\nOur model is the first among those using discrete latent variables to challenge the performance of continuous VAEs. Thus, we get the very good reconstructions that regular VAEs provide, together with the compactness of a symbolic representation. A few interesting characteristics, implications and applications of the VQ-VAEs that we train are shown in the next subsections.\n\n4.2 Images\n\nImages contain a lot of redundant information, as most of the pixels are correlated and noisy; therefore, learning models at the pixel level could be wasteful.\nIn this experiment we show that we can model x = 128 × 128 × 3 images by compressing them to a z = 32 × 32 × 1 discrete space (with K=512) via a purely deconvolutional p(x|z) – a reduction of (128×128×3×8)/(32×32×9) ≈ 42.6 in bits. We model images by learning a powerful prior (PixelCNN) over z. This not only greatly speeds up training and sampling, but also uses the PixelCNN's capacity to capture the global structure instead of the low-level statistics of images.\n\nFigure 2: Left: ImageNet 128x128x3 images, right: reconstructions from a VQ-VAE with a 32x32x1 latent space, with K=512.\n\nReconstructions from the 32x32x1 space with discrete latents are shown in Figure 2. Even considering that we greatly reduce the dimensionality with discrete encoding, the reconstructions look only slightly blurrier than the originals. It would be possible to use a more perceptual loss function than MSE over pixels here (e.g., a GAN [12]), but we leave that as future work.\nNext, we train a PixelCNN prior on the discretised 32x32x1 latent space. As we only have 1 channel (not 3 as with colours), we only have to use spatial masking in the PixelCNN. 
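The compression factor quoted above follows from counting bits: each 128x128x3 image at 8 bits per subpixel, against a 32x32x1 grid of K=512-way (i.e. log2(512) = 9-bit) latents. A quick check (our own arithmetic, not code from the paper):

```python
import math

# Bits in a raw 128x128x3 image at 8 bits per subpixel,
# versus a 32x32x1 grid of 9-bit discrete latents.
pixel_bits = 128 * 128 * 3 * 8
latent_bits = 32 * 32 * int(math.log2(512))
reduction = pixel_bits / latent_bits  # the roughly 42.6x reduction quoted above
```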
The capacity of the PixelCNN we used was similar to that used by the authors of the PixelCNN paper [38].\n\nFigure 3: Samples (128x128) from a VQ-VAE with a PixelCNN prior trained on ImageNet images. From left to right: kit fox, gray whale, brown bear, admiral (butterfly), coral reef, alp, microwave, pickup.\n\nSamples drawn from the PixelCNN were mapped to pixel-space with the decoder of the VQ-VAE and can be seen in Figure 3.\n\nFigure 4: Samples (128x128) from a VQ-VAE with a PixelCNN prior trained on frames captured from DeepMind Lab.\n\nWe also repeat the same experiment for 84x84x3 frames drawn from the DeepMind Lab environment [2]. The reconstructions looked nearly identical to their originals. Samples drawn from the PixelCNN prior trained on the 21x21x1 latent space and decoded to the pixel space using a deconvolutional model decoder can be seen in Figure 4.\nFinally, we train a second VQ-VAE with a PixelCNN decoder on top of the 21x21x1 latent space from the first VQ-VAE on DM-LAB frames. This setup typically breaks VAEs, as they suffer from “posterior collapse”, i.e., the latents are ignored as the decoder is powerful enough to model x perfectly. Our model, however, does not suffer from this, and the latents are meaningfully used. We use only three latent variables (each with K=512 and their own embedding space e) at the second stage for modelling the whole image, and as such the model cannot reconstruct the image perfectly – a consequence of compressing the image onto 3 x 9 bits, i.e., less than a float32. Reconstructions sampled from the discretised global code can be seen in Figure 5.\n\nFigure 5: Top: original images. Bottom: reconstructions from a 2-stage VQ-VAE, with 3 latents to model the whole image (27 bits), and as such the model cannot reconstruct the images perfectly. 
The reconstructions are generated by sampling from the second PixelCNN prior in the 21x21 latent domain of the first VQ-VAE, and then decoding with the standard VQ-VAE decoder to 84x84. A lot of the original scene, including textures, room layout and nearby walls, remains, but the model does not try to store the pixel values themselves, which means the textures are generated procedurally by the PixelCNN.\n\nFigure 6: Left: original waveform, middle: reconstructed with the same speaker-id, right: reconstructed with a different speaker-id. The contents of the three waveforms are the same.\n\n4.3 Audio\n\nIn this set of experiments we evaluate the behaviour of discrete latent variables on models of raw audio. In all our audio experiments, we train a VQ-VAE that has a dilated convolutional architecture similar to the WaveNet decoder. All samples for this section can be played from the following URL: https://avdnoord.github.io/homepage/vqvae/.\nWe first consider the VCTK dataset, which has speech recordings of 109 different speakers [41]. We train a VQ-VAE where the encoder has 6 strided convolutions with stride 2 and window-size 4. This yields a latent space 64x smaller than the original waveform. The latents consist of one feature map and the discrete space is 512-dimensional. The decoder is conditioned on both the latents and a one-hot embedding for the speaker.\nFirst, we ran an experiment to show that VQ-VAE can extract a latent space that only conserves long-term relevant information. After training the model, given an audio example, we can encode it to the discrete latent representation, and reconstruct by sampling from the decoder. Because the dimensionality of the discrete representation is 64 times smaller, the original sample cannot be perfectly reconstructed sample by sample. 
As can be heard from the provided samples, and as shown in Figure 6, the reconstruction has the same content (the same text contents), but the waveform is quite different and the prosody of the voice is altered. This means that the VQ-VAE has, without any form of linguistic supervision, learned a high-level abstract space that is invariant to low-level features and only encodes the content of the speech. This experiment confirms our observation from before that important features are often those that span many dimensions in the input data space (in this case, phonemes and other high-level content in the waveform).\nWe have then analysed the unconditional samples from the model to understand its capabilities. Given the compact and abstract latent representation extracted from the audio, we trained the prior on top of this representation to model the long-term dependencies in the data. For this task we used a larger dataset of 460 speakers [30] and trained a VQ-VAE model where the resolution of the discrete space is 128 times smaller. Next, we trained the prior as usual on top of this representation, on chunks of 40960 timesteps (2.56 seconds), which yields 320 latent timesteps. While samples drawn from even the best speech models like the original WaveNet [37] sound like babbling, samples from VQ-VAE contain clear words and part-sentences (see the samples linked above). We conclude that VQ-VAE was able to model a rudimentary phoneme-level language model in a completely unsupervised fashion from raw audio waveforms.\nNext, we attempted speaker conversion, where the latents are extracted from one speaker and then reconstructed through the decoder using a separate speaker id. As can be heard from the samples, the synthesised speech has the same content as the original sample, but with the voice of the second speaker. 
This experiment again demonstrates that the encoded representation has factored out speaker-specific information: the embeddings not only have the same meaning regardless of details in the waveform, but also across different voice-characteristics.\nFinally, in an attempt to better understand the content of the discrete codes, we have compared the latents one-to-one with the ground-truth phoneme sequence (which was not used in any way to train the VQ-VAE). With a 128-dimensional discrete space that runs at 25 Hz (encoder downsampling factor of 640), we mapped each of the 128 possible latent values to one of the 41 possible phoneme values1 (by taking the conditionally most likely phoneme). The accuracy of this 41-way classification was 49.3%, while a random latent space would result in an accuracy of 7.2% (prior most likely phoneme). It is clear that these discrete latent codes obtained in a fully unsupervised way are high-level speech descriptors that are closely related to phonemes.\n\n4.4 Video\n\nFor our final experiment we have used the DeepMind Lab [2] environment to train a generative model conditioned on a given action sequence. In Figure 7 we show the initial 6 frames that are input to the model, followed by 10 frames that are sampled from VQ-VAE with all actions set to forward (top row) and right (bottom row). Generation of the video sequence with the VQ-VAE model is done purely in the latent space, z_t, without the need to generate the actual images themselves. Each image in the sequence x_t is then created by mapping the latents with a deterministic decoder to the pixel space, after all the latents are generated using only the prior model p(z_1, ..., z_T). Therefore, VQ-VAE can be used to imagine long sequences purely in latent space without resorting to pixel space. 
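The latent-only rollout just described can be sketched generically; `prior_logits_fn` below is a hypothetical stand-in for a trained autoregressive prior over latents (e.g. a PixelCNN or WaveNet prior), not an API from this work:

```python
import numpy as np

def rollout_latents(prior_logits_fn, T, K, rng):
    """Ancestral sampling of z_1..z_T from an autoregressive categorical
    prior p(z_t | z_<t). Pixels are produced only afterwards, by running
    a deterministic decoder on the completed latent sequence."""
    z = []
    for _ in range(T):
        logits = prior_logits_fn(z)        # scores for the next latent given z_<t
        p = np.exp(logits - logits.max())  # softmax with max-subtraction
        p /= p.sum()
        z.append(int(rng.choice(K, p=p)))
    return z
```

With a toy uniform prior this draws a valid sequence of K-way codes; in the paper's setting the decode-to-pixels step happens once, after the whole latent sequence exists.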
It can be seen that the model has learnt to successfully generate a sequence of frames conditioned on a given action without any degradation in visual quality, whilst keeping the local geometry correct. For completeness, we trained a model without actions and obtained similar results, not shown due to space constraints.\n\nFigure 7: The first 6 frames are provided to the model; the following frames are generated conditioned on an action. Top: repeated action “move forward”, bottom: repeated action “move right”.\n\n5 Conclusion\n\nIn this work we have introduced VQ-VAE, a new family of models that combine VAEs with vector quantisation to obtain a discrete latent representation. We have shown that VQ-VAEs are capable of modelling very long-term dependencies through their compressed discrete latent space, which we have demonstrated by generating 128 × 128 colour images, sampling action-conditional video sequences, and finally using audio, where even an unconditional model can generate surprisingly meaningful chunks of speech and perform speaker conversion. All these experiments demonstrated that the discrete latent space learnt by VQ-VAEs captures important features of the data in a completely unsupervised manner. Moreover, VQ-VAEs achieve likelihoods that are almost as good as their continuous latent variable counterparts on CIFAR10 data. We believe that this is the first discrete latent variable model that can successfully model long-range sequences and, in a fully unsupervised way, learn high-level speech descriptors that are closely related to phonemes.\n\n1Note that the encoder/decoder pairs could make the meaning of every discrete latent depend on previous latents in the sequence, e.g., 
bi/tri-grams (and thus achieve a higher compression), which means a more advanced mapping to phonemes would result in higher accuracy.\n\nReferences\n[1] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc Van Gool. Soft-to-hard vector quantization for end-to-end learned compression of images and neural networks. arXiv preprint arXiv:1704.00648, 2017.\n\n[2] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016.\n\n[3] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.\n\n[4] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.\n\n[5] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.\n\n[6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.\n\n[7] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.\n\n[8] Aaron Courville, James Bergstra, and Yoshua Bengio. A spike and slab restricted boltzmann machine. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 233–241, 2011.\n\n[9] Emily Denton, Sam Gross, and Rob Fergus. Semi-supervised learning with context-conditional generative adversarial networks. 
arXiv preprint arXiv:1611.06430, 2016.\n\n[10] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.\n\n[11] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72, 2016.\n\n[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.\n\n[13] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances in Neural Information Processing Systems, pages 3549–3557, 2016.\n\n[14] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. arXiv preprint arXiv:1310.8499, 2013.\n\n[15] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vázquez, and Aaron C. Courville. Pixelvae: A latent variable model for natural images. CoRR, abs/1611.05013, 2016.\n\n[16] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.\n\n[17] Judy Hoffman, Erik Rodner, Jeff Donahue, Trevor Darrell, and Kate Saenko. Efficient learning of domain-invariant image representations. arXiv preprint arXiv:1301.3224, 2013.\n\n[18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.\n\n[19] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. 
arXiv preprint arXiv:1611.01144, 2016.

[20] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.

[21] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[22] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.

[23] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[24] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.

[25] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

[26] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.

[27] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

[28] Andriy Mnih and Danilo Jimenez Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.

[29] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[30] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. 
In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015.

[31] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[32] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[33] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 448–455, 2009.

[34] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.

[35] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[36] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.

[37] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[38] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

[39] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.

[40] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

[41] Junichi Yamagishi. English multi-speaker corpus for CSTR voice cloning toolkit. URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html, 2012.

[42] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139, 2017.