{"title": "Variational Memory Addressing in Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3920, "page_last": 3929, "abstract": "Aiming to augment generative models with external memory, we interpret the output of a memory module with stochastic addressing as a conditional mixture distribution, where a read operation corresponds to sampling a discrete memory address and retrieving the corresponding content from memory. This perspective allows us to apply variational inference to memory addressing, which enables effective training of the memory module by using the target information to guide memory lookups. Stochastic addressing is particularly well-suited for generative models as it naturally encourages multimodality which is a prominent aspect of most high-dimensional datasets. Treating the chosen address as a latent variable also allows us to quantify the amount of information gained with a memory lookup and measure the contribution of the memory module to the generative process. To illustrate the advantages of this approach we incorporate it into a variational autoencoder and apply the resulting model to the task of generative few-shot learning. The intuition behind this architecture is that the memory module can pick a relevant template from memory and the continuous part of the model can concentrate on modeling remaining variations. We demonstrate empirically that our model is able to identify and access the relevant memory contents even with hundreds of unseen Omniglot characters in memory.", "full_text": "Variational Memory Addressing\n\nin Generative Models\n\nJ\u00f6rg Bornschein Andriy Mnih Daniel Zoran Danilo J. 
Rezende

{bornschein, amnih, danielzoran, danilor}@google.com

DeepMind, London, UK

Abstract

Aiming to augment generative models with external memory, we interpret the output of a memory module with stochastic addressing as a conditional mixture distribution, where a read operation corresponds to sampling a discrete memory address and retrieving the corresponding content from memory. This perspective allows us to apply variational inference to memory addressing, which enables effective training of the memory module by using the target information to guide memory lookups. Stochastic addressing is particularly well-suited for generative models as it naturally encourages multimodality which is a prominent aspect of most high-dimensional datasets. Treating the chosen address as a latent variable also allows us to quantify the amount of information gained with a memory lookup and measure the contribution of the memory module to the generative process. To illustrate the advantages of this approach we incorporate it into a variational autoencoder and apply the resulting model to the task of generative few-shot learning. The intuition behind this architecture is that the memory module can pick a relevant template from memory and the continuous part of the model can concentrate on modeling remaining variations. We demonstrate empirically that our model is able to identify and access the relevant memory contents even with hundreds of unseen Omniglot characters in memory.

1 Introduction

Recent years have seen rapid developments in generative modelling. Much of the progress was driven by the use of powerful neural networks to parameterize conditional distributions composed to define the generative process (e.g., VAEs [1, 2], GANs [3]). In the Variational Autoencoder (VAE) framework for example, we typically define a generative model p(z), p_θ(x|z) and an approximate inference model q(z|x). 
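The VAE setup just described can be made concrete with a minimal numpy sketch that evaluates a single-sample variational lower bound (ELBO) estimate. The single affine layers and random toy parameters standing in for the MLPs are assumptions of this sketch, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def affine(x, W, b):
    # A single affine map stands in for the MLPs described in the text.
    return x @ W + b

# Toy dimensions and random parameters (illustrative only).
D, Z = 8, 2
We_mu, be_mu = rng.normal(size=(D, Z)) * 0.1, np.zeros(Z)
We_lv, be_lv = rng.normal(size=(D, Z)) * 0.1, np.zeros(Z)
Wd, bd = rng.normal(size=(Z, D)) * 0.1, np.zeros(D)

def elbo(x):
    # q(z|x): diagonal Gaussian with predicted mean and log-variance.
    mu, logvar = affine(x, We_mu, be_mu), affine(x, We_lv, be_lv)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # reparameterization
    # p(x|z): Bernoulli decoder; log-likelihood of the binary observation.
    logits = affine(z, Wd, bd)
    log_px_z = np.sum(x * logits - np.logaddexp(0.0, logits))
    # Analytic KL(q(z|x) || p(z)) against a standard-normal prior p(z).
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return log_px_z - kl

x = (rng.random(D) < 0.5).astype(float)
bound = elbo(x)
```

In practice the bound is averaged over a minibatch and maximized with stochastic gradient ascent; here it is only evaluated once to show its structure.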
All conditional distributions are parameterized by multilayered perceptrons (MLPs) which, in the simplest case, output the mean and the diagonal variance of a Normal distribution given the conditioning variables. We then optimize a variational lower bound to learn the generative model for x. Considering recent progress, we now have the theory and the tools to train powerful, potentially non-factorial parametric conditional distributions p(x|y) that generalize well with respect to x (normalizing flows [4], inverse autoregressive flows [5], etc.).

Another line of work which has been gaining popularity recently is memory-augmented neural networks [6, 7, 8]. In this family of models the network is augmented with a memory buffer which allows read and write operations and is persistent in time. Such models usually handle input and output to the memory buffer using differentiable "soft" write/read operations to allow back-propagating gradients during training.

Here we propose a memory-augmented generative model that uses a discrete latent variable a acting as an address into the memory buffer M. This stochastic perspective allows us to introduce a variational approximation over the addressing variable which takes advantage of target information

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[Figure 1 shows three graphical models: left, p(z) with p(x|z, m(z)); middle, p(m) with p(x|m, z); right, p(a) with p(x|m_a, z).]

Figure 1: Left: Sketch of typical SOTA generative latent variable model with memory. Red edges indicate approximate inference distributions q(·|·). The KL(q||p) cost to identify a specific memory entry might be substantial, even though the cost of accessing a memory entry should be in the order of log |M|. 
Middle & Right: We combine a top-level categorical distribution p(a) and a conditional\nvariational autoencoder with a Gaussian p(z|m).\n\nwhen retrieving contents from memory during training. We compute the sampling distribution over\nthe addresses based on a learned similarity measure between the memory contents at each address\nand the target. The memory contents ma at the selected address a serve as a context for a continuous\nlatent variable z, which together with ma is used to generate the target observation. We therefore\ninterpret memory as a non-parametric conditional mixture distribution. It is non-parametric in the\nsense that we can change the content and the size of the memory from one evaluation of the model\nto another without having to relearn the model parameters. And since the retrieved content ma\nis dependent on the stochastic variable a, which is part of the generative model, we can directly\nuse it downstream to generate the observation x. These two properties set our model apart from\nother work on VAEs with mixture priors [9, 10] aimed at unconditional density modelling. Another\ndistinguishing feature of our approach is that we perform sampling-based variational inference on the\nmixing variable instead of integrating it out as is done in prior work, which is essential for scaling to\na large number of memory addresses.\nMost existing memory-augmented generative models use soft attention with the weights dependent on\nthe continuous latent variable to access the memory. This does not provide clean separation between\ninferring the address to access in memory and the latent factors of variation that account for the\nvariability of the observation relative to the memory contents (see Figure 1). 
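Ancestral sampling from this kind of memory-conditioned mixture can be sketched as follows. The fixed toy transforms standing in for the learned conditioning networks, and all names and shapes, are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy memory buffer: 16 templates stored in raw "pixel" form (8-dim vectors here).
M = rng.random((16, 8))
Z = 4  # dimensionality of the continuous latent

def sample_x(M):
    # 1) Sample a discrete address a from the prior p(a|M) (flat for simplicity).
    a = rng.integers(len(M))
    m_a = M[a]  # retrieval of the template is deterministic given a
    # 2) Sample z from a Gaussian p(z|m_a); a fixed averaging projection stands
    #    in for the learned conditioning network (an assumption of this sketch).
    W = np.ones((M.shape[1], Z)) / M.shape[1]
    z = m_a @ W + rng.normal(size=Z)
    # 3) Sample x from p(x|z, m_a): a Bernoulli whose probabilities distort the
    #    retrieved template with the continuous latent.
    probs = 1.0 / (1.0 + np.exp(-(m_a + 0.1 * z.mean())))
    x = (rng.random(len(m_a)) < probs).astype(float)
    return a, z, x

a, z, x = sample_x(M)
```

Because M is passed in explicitly, its contents and size can change between calls without touching any parameters, matching the non-parametric view described above.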
Or, alternatively, when the attention weights depend deterministically on the encoder, the retrieved memory content cannot be directly used in the decoder.

Our contributions in this paper are threefold: a) We interpret memory-read operations as a conditional mixture distribution and use amortized variational inference for training; b) we demonstrate that we can combine discrete memory addressing variables with continuous latent variables to build powerful models for generative few-shot learning that scale gracefully with the number of items in memory; and c) we demonstrate that the KL divergence over the discrete variable a serves as a useful measure to monitor memory usage during inference and training.

2 Model and Training

We will now describe the proposed model along with the variational inference procedure we use to train it. The generative model has the form

p(x|M) = ∑_a p(a|M) ∫_z p(z|m_a) p(x|z, m_a) dz    (1)

where x is the observation we wish to model, a is the addressing categorical latent variable, z the continuous latent vector, M the memory buffer and m_a the memory contents at the a-th address. The generative process proceeds by first sampling an address a from the categorical distribution p(a|M), retrieving the contents m_a from the memory buffer M, and then sampling the observation x from a conditional variational autoencoder with m_a as the context it is conditioned on (Figure 1, B).

The intuition here is that if the memory buffer contains a set of templates, a trained model of this type should be able to produce observations by distorting a template retrieved from a randomly sampled memory location, using the conditional variational autoencoder to account for the remaining variability. We can write the variational lower bound for the model in (1):

log p(x|M) ≥ E_{a,z ∼ q(·|M,x)} [log p(x, z, a|M) − log q(a, z|M, x)]    (2)

where q(a, z|M, x) = q(a|M, x) q(z|m_a, x).    (3)

In the rest of the paper, we omit 
the dependence on M for brevity. We will now describe the components of the model and the variational posterior (3) in detail.

The first component of the model is the memory buffer M. We here do not implement an explicit write operation but consider two possible sources for the memory content: Learned memory: In generative experiments aimed at better understanding the model's behaviour we treat M as model parameters. That is, we initialize M randomly and update its values using the gradient of the objective. Few-shot learning: In the generative few-shot learning experiments, before processing each minibatch, we sample |M| entries from the training data and store them in their raw (pixel) form in M. We ensure that the training minibatch {x_1, ..., x_|B|} contains disjoint samples from the same character classes, so that the model can use M to find suitable templates for each target x.

The second component is the addressing variable a ∈ {1, ..., |M|} which selects a memory entry m_a from the memory buffer M. The variational posterior distribution q(a|x) is parameterized as a softmax over a similarity measure between x and each of the memory entries m_a:

q(a|x) ∝ exp S^q(m_a, x),    (4)

where S^q(x, y) is a learned similarity function described in more detail below.

Given a sample a from the posterior q(a|x), retrieving m_a from M is a purely deterministic operation. Sampling from q(a|x) is easy as it amounts to computing its value for each slot in memory and sampling from the resulting categorical distribution. Given a, we can compute the probability of drawing that address under the prior p(a). We here use a learned prior p(a) that shares some parameters with q(a|x).

Similarity functions: To obtain an efficient implementation for mini-batch training we use the same memory content M for all training examples in a mini-batch and choose a specific form for the similarity function. We parameterize S^q(m, x) with two MLPs: h that embeds the memory content into the matching space and h^q that does the same to the query x. The similarity is then computed as the inner product of the embeddings, normalized by the norm of the memory content embedding:

S^q(m_a, x) = ⟨e_a, e_q⟩ / ||e_a||_2    (5)

where e_a = h(m_a), e_q = h^q(x).    (6)

This form allows us to compute the similarities between the embeddings of a mini-batch of |B| observations and |M| memory entries at the computational cost of O(|M| |B| |e|), where |e| is the dimensionality of the embedding. We also experimented with several alternative similarity functions such as the plain inner product (⟨e_a, e_q⟩) and the cosine similarity (⟨e_a, e_q⟩ / (||e_a|| · ||e_q||)) and found that they did not outperform the above similarity function. For the unconditional prior p(a), we learn a query point e_p ∈ R^|e| to use in similarity function (5) in place of e_q. We share h between p(a) and q(a|x). Using a trainable p(a) allows the model to learn that some memory entries are more useful for generating new targets than others. Control experiments showed that there is only a very small degradation in performance when we assume a flat prior p(a) = 1/|M|.

2.1 Gradients and Training

For the continuous variable z we use the methods developed in the context of variational autoencoders [1]. We use a conditional Gaussian prior p(z|m_a) and an approximate conditional posterior q(z|x, m_a). However, since we have a discrete latent variable a in the model we cannot simply back-propagate gradients through it. Here we show how to use VIMCO [11] to estimate the gradients for this model. 
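The address posterior described above (embed memory and query, score by a normalized inner product, softmax over all slots) can be sketched in numpy. The random projections standing in for the embedding MLPs h and h^q are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    # Numerically stable softmax over the similarity scores.
    e = np.exp(s - s.max())
    return e / e.sum()

# Random projections stand in for the embedding MLPs h and h^q (assumption).
D, E = 8, 5
Wh = rng.normal(size=(D, E))
Wq = rng.normal(size=(D, E))

def q_a_given_x(M, x):
    e_a = M @ Wh                                  # embed each memory entry
    e_q = x @ Wq                                  # embed the query
    # Similarity: inner product normalized by the memory embedding norm.
    S = e_a @ e_q / np.linalg.norm(e_a, axis=1)
    return softmax(S)

M = rng.random((16, D))
x = rng.random(D)
probs = q_a_given_x(M, x)
a = rng.choice(len(M), p=probs)                   # sample an address
m_a = M[a]                                        # deterministic retrieval
```

Note that all |M| scores are computed in one matrix product, which is what keeps the per-minibatch cost at O(|M| |B| |e|).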
With VIMCO, we essentially optimize the multi-sample variational bound [12, 13, 11]:

log p(x) ≥ E_{a^(k) ∼ q(a|x), z^(k) ∼ q(z|m_a^(k), x)} [ log (1/K) ∑_{k=1}^{K} p(x, a^(k), z^(k)) / q(a^(k), z^(k)|x) ] = L    (7)

Multiple samples from the posterior enable VIMCO to estimate low-variance gradients for those parameters of the model which influence the non-differentiable discrete variable a. The corresponding gradient estimates are:

∇_θ L ≈ ∑_{a^(k), z^(k) ∼ q(·|x)} ω^(k) ( ∇_θ log p_θ(x, a^(k), z^(k)) − ∇_θ log q_θ(z^(k)|a^(k), x) )
∇L ≈ ∑_{a^(k), z^(k) ∼ q(·|x)} ω_−^(k) ∇ log q(a^(k)|x)    (8)

with ω̃^(k) = p(x, a^(k), z^(k)) / q(a^(k), z^(k)|x),  ω^(k) = ω̃^(k) / ∑_{k'} ω̃^(k'),
and ω_−^(k) = log (1/K) ∑_{k'} ω̃^(k') − log (1/(K−1)) ∑_{k'≠k} ω̃^(k') − ω^(k).

For z-related gradients this is equivalent to IWAE [13]. Alternative gradient estimators for discrete latent variable models (e.g. NVIL [14], RWS [12] or Gumbel-max relaxation-based approaches [15, 16]) might work here too, but we have not investigated their effectiveness. Notice how the gradients ∇ log p(x|z, a) provide updates for the memory contents m_a (if necessary), while the gradients ∇ log p(a) and ∇ log q(a|x) provide updates for the embedding MLPs. The former update the mixture components while the latter update their relative weights. The log-likelihood bound (2) suggests that we can decompose the overall loss into three terms: the expected reconstruction error E_{a,z∼q} [log p(x|a, z)] and the two KL terms which measure the information flow from the approximate posterior to the generative model for our latent variables: KL(q(a|x)||p(a)), and E_{a∼q} [KL(q(z|a, x)||p(z|a))].

3 Related work

Attention and external memory are two closely related techniques that have recently become important building blocks for neural models. 
Attention has been widely used for supervised learning tasks such as translation, image classification and image captioning. External memory can be seen as an input or an internal state, and attention mechanisms can either be used for selective reading or incremental updating. While most work involving memory and attention has been done in the context of supervised learning, here we are interested in using them effectively in the generative setting.

In [17] the authors use soft-attention with learned memory contents to augment models to have more parameters in the generative model. External memory as a way of implementing one-shot generalization was introduced in [18]. This was achieved by treating the exemplars conditioned on as memory entries accessed through a soft attention mechanism at each step of the incremental generative process, similar to the one in DRAW [19]. Generative Matching Networks [20] are a similar architecture which uses a single-step VAE generative process instead of an iterative DRAW-like one. In both cases, soft attention is used to access the exemplar memory, with the address weights computed based on a learned similarity function between an observation at the address and a function of the latent state of the generative model.

In contrast to this kind of deterministic soft addressing, we use hard attention, which stochastically picks a single memory entry and thus might be more appropriate in the few-shot setting. As the memory location is stochastic in our model, we perform variational inference over it, which has not been done for memory addressing in a generative model before. A similar approach has however been used for training stochastic attention for image captioning [21]. In the context of memory, hard attention has been used in RLNTM – a version of the Neural Turing Machine modified to use stochastic hard addressing [22]. However, RLNTM has been trained using REINFORCE rather than variational inference. 
A number of architectures for VAEs augmented with mixture priors have been proposed, but they do not use the mixture component indicator variable to index memory and instead integrate out the variable [9, 10], which prevents them from scaling to a large number of mixing components.

Figure 2: A: Typical learning curve when training a model to recall MNIST digits (M ∼ training data (each step); x ∼ M; |M| = 256): In the beginning the continuous latent variables model most of the variability of the data; after ≈ 100k update steps the stochastic memory component takes over and both the NLL bound and the KL(q(a|x)||p(a)) estimate approach log(256), the NLL of an optimal probabilistic lookup-table. B: Randomly selected samples from the MNIST model with learned memory: Samples within the same row use a common m_a.

An alternative approach to generative few-shot learning proposed in [23] uses a hierarchical VAE to model a large number of small related datasets jointly. The statistical structure common to observations in the same dataset is modelled by a continuous latent vector shared among all such observations. Unlike our model, this model is not memory-based and does not use any form of attention. Generative models with memory have also been proposed for sequence modelling in [24], using differentiable soft addressing. Our approach to stochastic addressing is sufficiently general to be applicable in this setting as well, and it would be interesting to see how it would perform as a plug-in replacement for soft addressing.

4 Experiments

We optimize the parameters with Adam [25] and report experiments with the best results from learning rates in {1e-4, 3e-4}. We use minibatches of size 32 and K=4 samples from the approximate posterior q(·|x) to compute the gradients, the KL estimates, and the log-likelihood bounds. 
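A minimal sketch of how such a K-sample bound is estimated from log importance weights, using a numerically stable log-mean-exp; the concrete weight values below are hypothetical:

```python
import numpy as np

def multi_sample_bound(log_w):
    # log_w[k] = log p(x, a_k, z_k) - log q(a_k, z_k | x) for K posterior samples.
    # Returns log (1/K) sum_k exp(log_w[k]), computed stably by shifting the max.
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

# Hypothetical log-weights for K=4 samples of a single example.
log_w = np.array([-90.0, -92.5, -91.0, -95.0])
bound = multi_sample_bound(log_w)
```

By Jensen's inequality the estimate always lies between the mean of the log-weights (the K=1 single-sample bound in expectation) and their maximum, and it tightens as K grows.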
We keep the architectures deliberately simple and do not use autoregressive connections or IAF [5] in our models, as we are primarily interested in the quantitative and qualitative behaviour of the memory component.

4.1 MNIST with fully connected MLPs

We first perform a series of experiments on the binarized MNIST dataset [26]. We use 2-layer encoders and decoders with 256 and 128 hidden units with ReLU nonlinearities and a 32-dimensional Gaussian latent variable z.

Train to recall: To investigate the model's capability to use its memory to its full extent, we consider the case where it is trained to maximize the likelihood for random data points x which are present in M. During inference, an optimal model would pick the template m_a that is equivalent to x with probability q(a|x) = 1. The corresponding prior probability would be p(a) ≈ 1/|M|. Because there are no further variations that need to be modeled by z, its posterior q(z|x, m) can match the prior p(z|m), yielding a KL cost of zero. The model's expected log-likelihood would be −log |M|, equal to the log-likelihood of an optimal probabilistic lookup table. Figure 2A illustrates that our model converges to the optimal solution. We observed that the time to convergence depends on the size of the memory, and with |M| > 512 the model sometimes fails to find the optimal solution. It is noteworthy that the trained model from Figure 2A can handle much larger memory sizes at test time, e.g. achieving NLL ≈ log(2048) given 2048 test set images in memory. This indicates that the matching MLPs for q(a|x) are sufficiently discriminative.

Figure 3: Approximate inference with q(a|x): Histogram and corresponding top-5 entries m_a for two randomly selected targets. 
M contains 10 examples from 8 unseen test-set character classes.

Figure 4: A: Generative one-shot sampling: Left-most column is the test-set example provided in M; remaining columns show randomly selected samples from p(x|M). The model was trained with 4 examples from 8 classes each per gradient step. B: Breakdown of the KL cost for different models trained with varying number of examples per class in memory. KL(q(a|x)||p(a)) increases from 2.0 to 4.5 nats as KL(q(z|m_a, x)||p(z|m_a)) decreases from 28.2 to 21.8 nats. As the number of examples per class increases, the model shifts the responsibility for modeling the data from the continuous variable z to the discrete a. The overall test-set NLL for the different models improves from 75.1 to 69.1 nats.

Learned memory: We train models with |M| ∈ {64, 128, 256, 512, 1024} randomly initialized mixture components (m_a ∈ R^256). After training, all models converged to an average KL(q(a|x)||p(a)) ≈ 2.5 ± 0.3 nats over both the training and the test set, suggesting that the model identified between e^2.2 ≈ 9 and e^2.8 ≈ 16 clusters in the data that are represented by a. The entropy of p(a) is significantly higher, indicating that multiple m_a are used to represent the same data clusters. A manual inspection of the q(a|x) histograms confirms this interpretation. Although our model overfits slightly more to the training set, we generally do not observe a big difference between our model and the corresponding baseline VAE (a VAE with the same architecture, but without the top-level mixture distribution) in terms of the final NLL. This is probably not surprising, because MNIST provides many training examples describing a relatively simple data manifold. Figure 2B 
Figure 2B\nshows samples from the model.\n\n4.2 Omniglot with convolutional MLPs\n\nTo apply the model to a more challenging dataset and to use it for generative few-shot learning, we\ntrain it on various versions of the Omniglot [27] dataset. For these experiments we use convolutional\nen- and decoders: The approximate posterior q(z|m, x) takes the concatenation of x and m as input\nand predicts the mean and variance for the 64 dimensional z. It consists of 6 convolutional layers\nwith 3 \u21e5 3 kernels and 48 or 64 feature maps each. Every second layer uses a stride of 2 to get an\noverall downsampling of 8 \u21e5 8. The convolutional pyramid is followed by a fully-connected MLP\nwith 1 hidden layer and 2|z| output units. The architecture of p(x|m, z) uses the same downscaling\npyramid to map m to a |z|-dimensional vector, which is concatenated with z and upscaled with\ntransposed convolutions to the full image size again. We use skip connections from the downscaling\nlayers of m to the corresponding upscaling layers to preserve a high bandwidth path from m to x.\nTo reduce over\ufb01tting, given the relatively small size of the Omniglot dataset, we tie the parameters\nof the convolutional downscaling layers in q(z|m) and p(x|m, z). The embedding MLPs for p(a)\nand q(a|x) use the same convolutional architecture and map images x and memory content ma into\n\n6\n\n\fFigure 5: Robustness to increasing memory size at test-time: A: Varying the number of confounding\nmemory entries: At test-time we vary the number of classes in M. For an optimal model of disjoint\ndata from C classes we expect L = average L per class + log C (dashed lines). The model was\ntrained with 4 examples from 8 character classes in memory per gradient step. We also show our best\nsoft-attenttion baseline model which was trained with 16 examples from two classes each gradient\nstep. 
B: Memory contains examples from all 144 test-set character classes and we vary the number\nof examples per class. At C=0 we show the LL of our best unconditioned baseline VAE. The models\nwere trained with 8 character classes and {1, 4, 8} examples per class in memory.\n\na 128-dimensional matching space for the similarity calculations. We left their parameters untied\nbecause we did not observe any improvement nor degradation of performance when tying them.\nWith learned memory: We run experiments on the 28 \u21e5 28 pixel sized version of Omniglot which\nwas introduced in [13]. The dataset contains 24,345 unlabeled examples in the training, and 8,070\nexamples in the test set from 1623 different character classes. The goal of this experiment is to show\nthat our architecture can learn to use the top-level memory to model highly multi-modal input data.\nWe run experiments with up to 2048 randomly initialized mixture components and observe that the\nmodel makes substantial use of them: The average KL(q(a|x)||p(a)) typically approaches log |M|,\nwhile KL(q(z|\u00b7)||p(z|\u00b7)) and the overall training-set NLL are signi\ufb01cantly lower compared to the\ncorresponding baseline VAE. However big models without regularization tend to over\ufb01t heavily (e.g.\ntraining-set NLL < 80 nats; testset NLL > 150 nats when using |M|=2048). By constraining the\nmodel size (|M|=256, convolutions with 32 feature maps) and adding 3e-4 L2 weight decay to all\nparameters with the exception of M, we obtain a model with a testset NLL of 103.6 nats (evaluated\nwith K=5000 samples from the posterior), which is about the same as a two-layer IWAE and slightly\nworse than the best RBMs (103.4 and \u21e1100 respectively, [13]).\nFew-shot learning: The 28 \u21e5 28 pixel version [13] of Omniglot does not contain any alphabet or\ncharacter-class labels. 
For few-shot learning we therefore start from the original dataset [27] and scale the 104 × 104 pixel sized examples with 4 × 4 max-pooling to 26 × 26 pixels. We here use the 45/5 split introduced in [18] because we are mostly interested in the quantitative behaviour of the memory component, and not so much in finding optimal regularization hyperparameters to maximize performance on small datasets. For each gradient step, we sample 8 random character-classes from random alphabets. From each character-class we sample 4 examples and use them as targets x to form a minibatch of size 32. Depending on the experiment, we select a certain number of the remaining examples from the same character classes to populate M. We chose 8 character-classes and 4 examples per class for computational convenience (to obtain reasonable minibatch and memory sizes). In control experiments with 32 character classes per minibatch we obtain almost indistinguishable learning dynamics and results.

To establish meaningful baselines, we train additional models with identical encoder and decoder architectures: 1) A simple, unconditioned VAE. 2) A memory-augmented generative model with soft-attention. Because the soft-attention weights have to depend solely on the variables in the generative model and may not take input directly from the encoder, we have to use z as the top-level latent variable: p(z), p(x|z, m(z)) and q(z|x). The overall structure of this model resembles the structure of prior work on memory-augmented generative models (see section 3 and Figure 1A), and is very similar to the one used in [20], for example.

For the unconditioned baseline VAE we obtain a NLL of 90.8, while our memory augmented model reaches up to 68.8 nats. Figure 5 shows the scaling properties of our model when varying the number of conditioning examples at test-time. 
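The episode construction described above (disjoint target and memory samples drawn from the same randomly selected classes) can be sketched as follows; the toy dataset shape and helper name are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labelled dataset: 20 classes, 20 examples each, 8-dim "images" (illustrative).
data = rng.random((20, 20, 8))     # [class, example, pixels]

def make_episode(data, n_classes=8, n_targets=4, n_memory=4):
    # Sample character classes for this gradient step.
    classes = rng.choice(data.shape[0], size=n_classes, replace=False)
    targets, memory = [], []
    for c in classes:
        idx = rng.permutation(data.shape[1])
        targets.append(data[c, idx[:n_targets]])                      # minibatch targets x
        memory.append(data[c, idx[n_targets:n_targets + n_memory]])   # disjoint entries for M
    x = np.concatenate(targets)    # minibatch of size n_classes * n_targets
    M = np.concatenate(memory)     # memory buffer with same-class, disjoint samples
    return x, M

x, M = make_episode(data)
```

Using a permutation per class guarantees that no example appears both as a target and as a memory entry, matching the disjointness requirement stated earlier.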
We observe only minimal degradation compared to a theoretically optimal model when we increase the number of concurrent character classes in memory up to 144, indicating that memory readout works reliably with |M| ≥ 2500 items in memory. The soft-attention baseline model reaches up to 73.4 nats when M contains 16 examples from 1 or 2 character-classes, but degrades rapidly with increasing number of confounding classes (see Figure 5A). Figure 3 shows histograms and samples from q(a|x), visually confirming that our model performs reliable approximate inference over the memory locations.

Model                          Ctest |    1     2     3     4     5    10    19
Generative Matching Nets           1 | 83.3  78.9  75.7  72.9  70.1  59.9  45.8
Generative Matching Nets           2 | 86.4  84.9  82.4  81.0  78.8  71.4  61.2
Generative Matching Nets           4 | 88.3  87.3  86.7  85.4  84.0  80.2  73.7
Variational Memory Addressing      1 | 86.5  83.0  79.6  79.0  76.5  76.2  73.9
Variational Memory Addressing      2 | 87.2  83.3  80.9  79.3  79.1  77.0  75.0
Variational Memory Addressing      4 | 87.5  83.3  81.2  80.7  79.5  78.6  76.7
Variational Memory Addressing     16 | 89.6  85.1  81.5  81.9  81.3  79.8  77.0

Table 1: Our model compared to Generative Matching Networks [20]: GMNs have an extra stage that computes joint statistics over the memory context which gives the model a clear advantage when multiple conditioning examples per class are available. But with increasing number of classes C it quickly degrades. LL bounds were evaluated with K=1000 posterior samples.

We also train a model on the Omniglot dataset used in [20]. This split provides a relatively small training set. We reduce the number of feature channels and hidden layers in our MLPs and add 3e-4 L2 weight decay to all parameters to reduce overfitting. 
The model in [20] has a clear advantage when many examples from very few character classes are in memory, because it was specifically designed to extract joint statistics from memory before applying the soft-attention readout. But like our own soft-attention baseline, it quickly degrades as the number of concurrent classes in memory is increased to 4 (Table 1).

Few-shot classification: Although this is not the main aim of this paper, we can use the trained model to perform discriminative few-shot classification: We can estimate p(c|x) ≈ ∑_{a: m_a has label c} E_{z∼q(z|a,x)} [p(x, z, m_a)/p(x)], or use the feed-forward approximation p(c|x) ≈ ∑_{a: m_a has label c} q(a|x). Without any further retraining or fine-tuning we obtain classification accuracies of 91%, 97%, 77% and 90% for 5-way 1-shot, 5-way 5-shot, 20-way 1-shot and 20-way 5-shot respectively with q(a|x).

5 Conclusions

In our experiments we generally observe that the proposed model is very well behaved: we never used temperature annealing for the categorical softmax or other tricks to encourage the model to use memory. The interplay between p(a) and q(a|x) maintains exploration (high entropy) during the early phase of training and decreases naturally as the sampled m_a become more informative. The KL divergences for the continuous and discrete latent variables show intuitively interpretable results for all our experiments: On the densely sampled MNIST dataset only a few distinctive mixture components are identified, while on the more disjoint and sparsely sampled Omniglot dataset the model chooses to use many more memory entries and uses the continuous latent variables less. By interpreting memory addressing as a stochastic operation, we gain the ability to apply a variational approximation which helps the model to perform precise memory lookups during inference and training. 
Compared to soft-attention approaches, we lose the ability to naively backprop through read-operations and have to use approximations like VIMCO. However, our experiments strongly suggest that this can be a worthwhile trade-off. Our experiments also show that the proposed variational approximation is robust to increasing memory sizes: A model trained with 32 items in memory performed nearly optimally with more than 2500 items in memory at test-time. Beginning with |M| ≥ 48, our hard-attention implementation becomes noticeably faster in terms of wall-clock time per parameter update than the corresponding soft-attention baseline, even though we use K=4 posterior samples during training while the soft-attention baseline requires only a single one.

Acknowledgments
We thank our colleagues at DeepMind and especially Oriol Vinyals and Sergey Bartunov for insightful discussions.

References
[1] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[2] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pages 1278–1286, 2014.

[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[4] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[5] Diederik P Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

[6] Sreerupa Das, C. Lee Giles, and Guo zheng Sun. 
Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, pages 791–795. Morgan Kaufmann Publishers, 1992.

[7] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.

[8] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

[9] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.

[10] Eric Nalisnick, Lars Hertel, and Padhraic Smyth. Approximate inference for deep latent gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, 2016.

[11] Andriy Mnih and Danilo J Rezende. Variational inference for monte carlo objectives. arXiv preprint arXiv:1602.06725, 2016.

[12] Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751, 2014.

[13] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[14] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

[15] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables.
arXiv preprint arXiv:1611.00712, 2016.

[16] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

[17] Chongxuan Li, Jun Zhu, and Bo Zhang. Learning to generate with memory. In International Conference on Machine Learning, pages 1177–1186, 2016.

[18] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016.

[19] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

[20] Sergey Bartunov and Dmitry P Vetrov. Fast adaptation in generative models with generative matching networks. arXiv preprint arXiv:1612.02192, 2016.

[21] Jimmy Ba, Ruslan R Salakhutdinov, Roger B Grosse, and Brendan J Frey. Learning wake-sleep recurrent attention models. In Advances in Neural Information Processing Systems, pages 2593–2601, 2015.

[22] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. arXiv preprint arXiv:1505.00521, 2015.

[23] Harrison Edwards and Amos Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2017.

[24] Mevlana Gemici, Chia-Chun Hung, Adam Santoro, Greg Wayne, Shakir Mohamed, Danilo J Rezende, David Amos, and Timothy Lillicrap. Generative temporal models with memory. arXiv preprint arXiv:1702.04649, 2017.

[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[26] Hugo Larochelle. Binarized MNIST dataset. http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/binarized_mnist_[train|valid|test].amat, 2011.

[27] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction.
Science, 350(6266):1332–1338, 2015.