{"title": "Variational Memory Encoder-Decoder", "book": "Advances in Neural Information Processing Systems", "page_first": 1508, "page_last": 1518, "abstract": "Introducing variability while maintaining coherence is a core task in learning to generate utterances in conversation. Standard neural encoder-decoder models and their extensions using conditional variational autoencoder often result in either trivial or digressive responses. To overcome this, we explore a novel approach that injects variability into neural encoder-decoder via the use of external memory as a mixture model, namely Variational Memory Encoder-Decoder (VMED). By associating each memory read with a mode in the latent mixture distribution at each timestep, our model can capture the variability observed in sequential data such as natural conversations. We empirically compare the proposed model against other recent approaches on various conversational datasets. The results show that VMED consistently achieves significant improvement over others in both metric-based and qualitative evaluations.", "full_text": "Variational Memory Encoder-Decoder\n\nHung Le, Truyen Tran, Thin Nguyen and Svetha Venkatesh\n\nApplied AI Institute, Deakin University, Geelong, Australia\n\n{lethai,truyen.tran,thin.nguyen,svetha.venkatesh}@deakin.edu.au\n\nAbstract\n\nIntroducing variability while maintaining coherence is a core task in learning to\ngenerate utterances in conversation. Standard neural encoder-decoder models and\ntheir extensions using conditional variational autoencoder often result in either\ntrivial or digressive responses. To overcome this, we explore a novel approach that\ninjects variability into neural encoder-decoder via the use of external memory as\na mixture model, namely Variational Memory Encoder-Decoder (VMED). 
By associating each memory read with a mode in the latent mixture distribution at each timestep, our model can capture the variability observed in sequential data such as natural conversations. We empirically compare the proposed model against other recent approaches on various conversational datasets. The results show that VMED consistently achieves significant improvement over others in both metric-based and qualitative evaluations.

1 Introduction

Recent advances in generative modeling have led to exploration of generative tasks. While generative models such as GAN [12] and VAE [19, 29] have been applied successfully for image generation, learning generative models for sequential discrete data is a long-standing problem. Early attempts to generate sequences using RNNs [13] and neural encoder-decoder models [17, 35] gave promising results, but the deterministic nature of these models proves to be inadequate in many realistic settings. Tasks such as translation, question-answering and dialog generation would benefit from stochastic models that can produce a variety of outputs for an input. For example, there are several ways to translate a sentence from one language to another, multiple answers to a question and multiple responses for an utterance in conversation.

Another line of research that has captured attention recently is memory-augmented neural networks (MANNs). Such models have larger memory capacity and thus "remember" temporally distant information in the input sequence, and provide a RAM-like mechanism to support model execution. MANNs have been successfully applied to long sequence prediction tasks [14, 33], demonstrating great improvement when compared to other recurrent models. However, the role of memory in sequence generation has not been well understood.

For tasks involving language understanding and production, handling intrinsic uncertainty and latent variations is necessary.
The choice of words and grammar may change erratically depending on speaker intentions, moods and the language used previously. The underlying RNN in neural sequential models finds it hard to capture these dynamics, and its outputs are often trivial or too generic [23]. One way to overcome these problems is to introduce variability into these models. Unfortunately, sequential data such as speech and natural language are a hard place to inject variability [30], since they require coherence of grammar and semantics yet allow freedom of word choice.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We propose a novel hybrid approach that integrates MANN and VAE, called Variational Memory Encoder-Decoder (VMED), to model the sequential properties and inject variability in sequence generation tasks. We introduce latent random variables to model the variability observed in the data and capture dependencies between the latent variables across timesteps. Our assumption is that there are latent variables governing an output at each timestep. In the conversation context, for instance, the latent space may represent the speaker's hidden intention and mood that dictate word choice and grammar. For a rich latent multimodal space, we use a Mixture of Gaussians (MoG), because a spoken word's latent intention and mood can come from different modes, e.g., whether the speaker is asking or answering, or whether she/he is happy or sad. By modeling the latent space as an MoG where each mode associates with some memory slot, we aim to capture multiple modes of the speaker's intention and mood when producing a word in the response.
Since the decoder in our model has multiple read heads, the MoG can be computed directly from the contents of the chosen memory slots. Our external memory plays the role of a mixture model distribution generating the latent variables that are used to produce the output and take part in updating the memory for future generative steps.

To train our model, we adapt the Stochastic Gradient Variational Bayes (SGVB) framework [19]. Instead of minimizing the KL divergence directly, we resort to its variational approximation [15] to accommodate the MoG in the latent space. We show that minimizing the approximation results in KL divergence minimization. We further derive an upper bound on our total timestep-wise KL divergence and demonstrate that minimizing the upper bound is equivalent to fitting a continuous function by a scaled MoG. We validate the proposed model on the task of conversational response generation. This task serves as a nice testbed for the model because an utterance in a conversation is conditioned on previous utterances and the intention and mood of the speaker. Finally, we evaluate our model on two open-domain and two closed-domain conversational datasets. The results demonstrate that our proposed VMED gains significant improvement over state-of-the-art alternatives.

2 Preliminaries

2.1 Memory-augmented Encoder-Decoder Architecture

A memory-augmented encoder-decoder (MAED) consists of two neural controllers linked via external memory. This is a natural extension of read-write MANNs to handle sequence-to-sequence problems. In MAED, the memory serves as a compressor that encodes the input sequence into its memory slots, capturing the most essential information. A decoder then attends to these memory slots, looking for the cues that help to predict the output sequence. MAED has recently demonstrated promising results in machine translation [5, 37] and healthcare [20, 21, 28].
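The content-based addressing that such read-write MANNs rely on can be sketched as cosine-similarity attention over memory slots. The following is a minimal illustration only, with names and shapes of our own choosing; the full DNC addressing also involves dynamic allocation and temporal links, which are omitted here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_read(memory, key, beta=1.0):
    """Soft content-based read: cosine similarity between a query key and
    every memory slot, sharpened by beta and normalized into read weights.
    Returns the read-weight vector and the attention-weighted read value."""
    # memory: (num_slots, slot_dim), key: (slot_dim,)
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = softmax(beta * sim)   # read-weight vector over slots
    r = w @ memory            # read value: convex combination of slot contents
    return w, r

M = np.random.randn(8, 16)    # 8 slots, 16-dim content (illustrative sizes)
w, r = content_read(M, np.random.randn(16), beta=2.0)
```

Because the weights are a soft distribution over slots, the read value mixes all slots but is dominated by the best-matching one, a property the mode weights in Section 3.1 exploit.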
In this paper, we advance a recent MAED known as DC-MANN, described in [20], where the powerful DNC [14] is chosen as the external memory. In DNC, memory accesses and updates are executed via the controller's reading and writing heads at each timestep. Given the current input x_t and a set of K previous read values from memory r_{t−1} = [r^1_{t−1}, r^2_{t−1}, ..., r^K_{t−1}], the controller computes read-weight vectors w^{i,r}_t and a write-weight vector w^w_t for addressing the memory M_t. There are two versions of decoding in DC-MANN: write-protected and writable memory. We prefer to allow writing to the memory during inference because in this work we focus on generating diverse output sequences, which requires a dynamic memory for both the encoding and decoding processes.

2.2 Conditional Variational Autoencoder (CVAE) for Conversation Generation

A dyadic conversation can be represented via three random variables: the conversation context x (all the chat before the response utterance), the response utterance y, and a latent variable z, which is used to capture the latent distribution over the reasonable responses. A variational autoencoder conditioned on x (CVAE) is trained to maximize the conditional log likelihood of y given x, which involves an intractable marginalization over the latent variable z, i.e.:

p(y | x) = ∫_z p(y, z | x) dz = ∫_z p(y | x, z) p(z | x) dz    (1)

Fortunately, CVAE can be efficiently trained with the Stochastic Gradient Variational Bayes (SGVB) framework [19] by maximizing the variational lower bound of the conditional log likelihood. In a typical CVAE, z is assumed to follow a multivariate Gaussian distribution with a diagonal covariance matrix, conditioned on x as p_φ(z | x), and a recognition network q_θ(z | x, y) is used to approximate the true posterior distribution p(z | x, y).
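Under this diagonal-Gaussian assumption, the KL term of the variational lower bound has a closed form, and sampling is done with the reparameterization trick so gradients can flow through the mean and variance. A minimal numpy sketch under those assumptions (function names and shapes are ours):

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between two diagonal Gaussians,
    summed over latent dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): a differentiable sample."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Identical distributions give zero divergence.
mu_q, lv_q = np.zeros(4), np.zeros(4)
assert np.isclose(gaussian_kl(mu_q, lv_q, mu_q, lv_q), 0.0)
```

In training, the negative of `gaussian_kl` plus the reconstruction log likelihood forms the per-example objective that SGVB maximizes.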
The variational lower bound becomes:

L(φ, θ; y, x) = −KL(q_θ(z | x, y) ‖ p_φ(z | x)) + E_{q_θ(z|x,y)}[log p(y | x, z)] ≤ log p(y | x)    (2)

With the introduction of the neural approximator q_θ(z | x, y) and the reparameterization trick [18], we can apply standard back-propagation to compute the gradient of the variational lower bound. Fig. 1(a) depicts the graphical model of this approach for the CVAE case.

Figure 1: Graphical models of the vanilla CVAE (a) and our proposed VMED (b)

3 Methods

Built upon CVAE and partly inspired by VRNN [8], we introduce a novel memory-augmented variational recurrent network dubbed Variational Memory Encoder-Decoder (VMED). With an external memory module, VMED explicitly models the dependencies between latent random variables across subsequent timesteps. However, unlike the VRNN, which uses the hidden values of an RNN to model the latent distribution as a Gaussian, our VMED uses read values r from an external memory M as a Mixture of Gaussians (MoG) to model the latent space. This choice of MoG also leads to a new formulation for the prior p_φ and the posterior q_θ mentioned in Eq. (2). The graphical representation of our model is shown in Fig. 1(b).

3.1 Generative Process

The VMED includes a CVAE at each timestep of the decoder. These CVAEs are conditioned on the context sequence via K read values r_{t−1} = [r^1_{t−1}, r^2_{t−1}, ..., r^K_{t−1}] from the external memory. Since the read values are conditioned on the previous state of the decoder h^d_{t−1}, our model takes into account the temporal structure of the output. Unlike other designs of CVAE, where there is often only one CVAE with a Gaussian prior for the whole decoding process, our model keeps reading the external memory to produce the prior as a Mixture of Gaussians at every timestep.
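A minimal sketch of one way to form such a per-timestep MoG prior from the K read vectors, following Eq. (3) and the splitting and mode-weighting described with it (all shapes, helper names, and the ancestral-sampling helper are our own illustrative choices):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def mog_prior_from_reads(reads, read_weights):
    """Build a K-mode MoG prior from K memory read vectors.
    reads: (K, 2*D) -- each read vector splits into a mean half and an s.d. half.
    read_weights: (K, num_slots) -- soft attention over slots per read head."""
    D = reads.shape[1] // 2
    mu = reads[:, :D]                # mean of each mode
    sd = softplus(reads[:, D:])      # softplus keeps each s.d. positive
    top = read_weights.max(axis=1)   # strongest slot per read head
    pi = top / top.sum()             # normalized mode weights
    return pi, mu, sd

def sample_mog(pi, mu, sd, rng):
    i = rng.choice(len(pi), p=pi)    # pick a mode, then sample from it
    return mu[i] + sd[i] * rng.standard_normal(mu.shape[1])

rng = np.random.default_rng(0)
pi, mu, sd = mog_prior_from_reads(rng.standard_normal((3, 8)), rng.random((3, 5)))
z = sample_mog(pi, mu, sd, rng)
```

Taking the maximum attention score per read head as the (unnormalized) mode weight mirrors the observation that a soft read is dominated by its best-matching slot.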
At the t-th step of generating an utterance in the output sequence, the decoder reads K read values from the memory, representing the K modes of the MoG. This multi-modal prior reflects the fact that given a context x, there are different modes of uttering the output word y_t, which a single mode cannot fully capture. The MoG prior distribution is modeled as:

g_t = p_φ(z_t | x, r_{t−1}) = Σ_{i=1}^{K} π^{i,x}_t(x, r^i_{t−1}) N(z_t; μ^{i,x}_t(x, r^i_{t−1}), σ^{i,x}_t(x, r^i_{t−1})² I)    (3)

We treat the mean μ^{i,x}_t and standard deviation (s.d.) σ^{i,x}_t of each Gaussian distribution in the prior as neural functions of the context sequence x and the read vectors from the memory. The context is encoded into the memory by an LSTM^E encoder. In decoding, the decoder LSTM^D attends to the memory and chooses K read vectors. We split each read vector into two parts r^{i,μ} and r^{i,σ}, each of which is used to compute the mean and s.d., respectively: μ^{i,x}_t = r^{i,μ}_{t−1}, σ^{i,x}_t = softplus(r^{i,σ}_{t−1}). Here we use the softplus function for computing the s.d. to ensure positiveness. The mode weight π^{i,x}_t is chosen based on the read attention weights w^{i,r}_{t−1} over memory slots. Since we use soft attention, a read value is computed from all slots, yet the main contribution comes from the one with the highest attention score. Thus, we pick the maximum attention score in each read weight and normalize to obtain the mode weights: π^{i,x}_t = max w^{i,r}_{t−1} / Σ_{i=1}^{K} max w^{i,r}_{t−1}.

Algorithm 1 VMED Generation
Require: Given p_φ, [r^1_0, r^2_0, ..., r^K_0], h^d_0, y*_0
1: for t = 1, T do
2:   Sample z_t ∼ p_φ(z_t | x, r_{t−1}) in Eq. (3)
3:   Compute: o^d_t, h^d_t = LSTM^D([y*_{t−1}, z_t], h^d_{t−1})
4:   Compute the conditional distribution: p(y_t | x, z_{≤t}) = softmax(W_out o^d_t)
5:   Update memory and read [r^1_t, r^2_t, ..., r^K_t] using h^d_t as in DNC
6:   Generate output y*_t = argmax_{y ∈ Vocab} p(y_t = y | x, z_{≤t})
7: end for

Armed with the prior, we follow a recurrent generative process by alternately using the memory to compute the MoG and using the latent variable z sampled from the MoG to update the memory and produce the output conditional distribution. The pseudo-algorithm of the generative process is given in Algorithm 1.

3.2 Neural Posterior Approximation

At each step of the decoder, the true posterior p(z_t | x, y) is approximated by a neural function of x, y and r_{t−1}, denoted as q_θ(z_t | x, y, r_{t−1}). Here, we use a Gaussian distribution to approximate the posterior. The unimodal posterior is chosen because, given a response y, it is reasonable to assume that only one mode of the latent space is responsible for this response. Also, choosing a unimodal posterior allows the reparameterization trick during training and reduces the complexity of the KL divergence computation. The approximated posterior is computed by the following equation:

f_t = q_θ(z_t | x, y_{≤t}, r_{t−1}) = N(z_t; μ^{x,y}_t(x, y_{≤t}, r_{t−1}), σ^{x,y}_t(x, y_{≤t}, r_{t−1})² I)    (4)

with mean μ^{x,y}_t and s.d. σ^{x,y}_t. We use an LSTM^U utterance encoder to model the ground-truth utterance sequence up to the t-th timestep, y_{≤t}. The t-th hidden value of the LSTM^U is used to represent the given data in the posterior: h^u_t = LSTM^U(y_t, h^u_{t−1}). The neural posterior combines the read values r_t = Σ_{i=1}^{K} π^{i,x}_t r^i_t together with the ground-truth data to produce the Gaussian posterior: μ^{x,y}_t = W_μ[r_t, h^u_t], σ^{x,y}_t = softplus(W_σ[r_t, h^u_t]). In these equations, we use learnable weight matrices W_μ and W_σ as a recognition network to compute the mean and s.d. of the posterior, ensuring that the distribution has the same dimension as the prior. We apply the reparameterization trick to calculate the random variable sampled from the posterior as z′_t = μ^{x,y}_t + σ^{x,y}_t ⊙ ε, ε ∼ N(0, I). Intuitively, the reparameterization trick bridges the gap between the generation model and the inference model during training.

3.3 Learning

In the training phase, the neural posterior is used to produce the latent variable z′_t. The read values from memory are used directly as the MoG priors, and the priors are trained to approximate the posterior by reducing the KL divergence. During testing, the decoder uses the prior for generating the latent variable z_t, from which the output is computed. The training and testing diagram is illustrated in Fig. 2. The objective function becomes a timestep-wise variational lower bound, following a derivation similar to that presented in [8]:

L(θ, φ; y, x) = E_{q*}[ Σ_{t=1}^{T} −KL(q_θ(z_t | x, y_{≤t}, r_{t−1}) ‖ p_φ(z_t | x, r_{t−1})) + log p(y_t | x, z_{≤t}) ]    (5)

Figure 2: Training and testing of VMED

where q* = q_θ(z_{≤T} | x, y_{≤T}, r