{"title": "Variational Memory Encoder-Decoder", "book": "Advances in Neural Information Processing Systems", "page_first": 1508, "page_last": 1518, "abstract": "Introducing variability while maintaining coherence is a core task in learning to generate utterances in conversation. Standard neural encoder-decoder models and their extensions using conditional variational autoencoder often result in either trivial or digressive responses. To overcome this, we explore a novel approach that injects variability into neural encoder-decoder via the use of external memory as a mixture model, namely Variational Memory Encoder-Decoder (VMED). By associating each memory read with a mode in the latent mixture distribution at each timestep, our model can capture the variability observed in sequential data such as natural conversations. We empirically compare the proposed model against other recent approaches on various conversational datasets. The results show that VMED consistently achieves significant improvement over others in both metric-based and qualitative evaluations.", "full_text": "Variational Memory Encoder-Decoder\n\nHung Le, Truyen Tran, Thin Nguyen and Svetha Venkatesh\n\nApplied AI Institute, Deakin University, Geelong, Australia\n\n{lethai,truyen.tran,thin.nguyen,svetha.venkatesh}@deakin.edu.au\n\nAbstract\n\nIntroducing variability while maintaining coherence is a core task in learning to\ngenerate utterances in conversation. Standard neural encoder-decoder models and\ntheir extensions using conditional variational autoencoder often result in either\ntrivial or digressive responses. To overcome this, we explore a novel approach that\ninjects variability into neural encoder-decoder via the use of external memory as\na mixture model, namely Variational Memory Encoder-Decoder (VMED). 
By associating\neach memory read with a mode in the latent mixture distribution at each\ntimestep, our model can capture the variability observed in sequential data such\nas natural conversations. We empirically compare the proposed model against\nother recent approaches on various conversational datasets. The results show that\nVMED consistently achieves significant improvement over others in both metric-based\nand qualitative evaluations.\n\n1 Introduction\n\nRecent advances in generative modeling have led to exploration of generative tasks. While generative\nmodels such as GAN [12] and VAE [19, 29] have been applied successfully for image generation,\nlearning generative models for sequential discrete data is a long-standing problem. Early\nattempts to generate sequences using RNNs [13] and neural encoder-decoder models [17, 35] gave\npromising results, but the deterministic nature of these models proves to be inadequate in many realistic\nsettings. Tasks such as translation, question-answering and dialog generation would benefit\nfrom stochastic models that can produce a variety of outputs for an input. For example, there are\nseveral ways to translate a sentence from one language to another, multiple answers to a question\nand multiple responses for an utterance in conversation.\nAnother line of research that has captured attention recently is memory augmented neural networks\n(MANNs). Such models have larger memory capacity and thus \u201cremember\u201d temporally distant\ninformation in the input sequence and provide a RAM-like mechanism to support model execution.\nMANNs have been successfully applied to long sequence prediction tasks [14, 33] demonstrating\ngreat improvement when compared to other recurrent models. However, the role of memory in\nsequence generation has not been well understood.\nFor tasks involving language understanding and production, handling intrinsic uncertainty and latent\nvariations is necessary. 
The choice of words and grammar may change erratically depending on\nspeaker intentions, moods and the language previously used. The underlying RNN in neural sequential\nmodels struggles to capture these dynamics, and the outputs of such models are often trivial or too generic [23].\nOne way to overcome these problems is to introduce variability into these models. Unfortunately,\nsequential data such as speech and natural language is a hard place to inject variability [30], since\nit requires coherent grammar and semantics yet allows freedom of word choice.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWe propose a novel hybrid approach that integrates MANN and VAE, called Variational Memory\nEncoder-Decoder (VMED), to model the sequential properties and inject variability in sequence\ngeneration tasks. We introduce latent random variables to model the variability observed in the data\nand capture dependencies between the latent variables across timesteps. Our assumption is that there\nare latent variables governing the output at each timestep. In the conversation context, for instance,\nthe latent space may represent the speaker\u2019s hidden intention and mood, which dictate word choice and\ngrammar. For a rich multimodal latent space, we use a Mixture of Gaussians (MoG) because the\nlatent intention and mood behind a spoken word can come from different modes, e.g., whether the speaker\nis asking or answering, or is happy or sad. By modeling the latent space as an MoG in which\neach mode is associated with some memory slot, we aim to capture multiple modes of the speaker\u2019s\nintention and mood when producing a word in the response. 
Since the decoder in our model has\nmultiple read heads, the MoG can be computed directly from the content of chosen memory slots.\nOur external memory plays a role as a mixture model distribution generating the latent variables that\nare used to produce the output and take part in updating the memory for future generative steps.\nTo train our model, we adapt the Stochastic Gradient Variational Bayes (SGVB) framework [19]. Instead\nof minimizing the KL divergence directly, we resort to using its variational approximation [15]\nto accommodate the MoG in the latent space. We show that minimizing the approximation results\nin KL divergence minimization. We further derive an upper bound on our total timestep-wise KL\ndivergence and demonstrate that minimizing the upper bound is equivalent to fitting a continuous\nfunction by a scaled MoG. We validate the proposed model on the task of conversational response\ngeneration. This task serves as a nice testbed for the model because an utterance in a conversation\nis conditioned on previous utterances, the intention and the mood of the speaker. Finally, we evaluate\nour model on two open-domain and two closed-domain conversational datasets. The results\ndemonstrate our proposed VMED gains significant improvement over state-of-the-art alternatives.\n\n2 Preliminaries\n\n2.1 Memory-augmented Encoder-Decoder Architecture\n\nA memory-augmented encoder-decoder (MAED) consists of two neural controllers linked via external\nmemory. This is a natural extension to read-write MANNs to handle sequence-to-sequence\nproblems. In MAED, the memory serves as a compressor that encodes the input sequence to its\nmemory slots, capturing the most essential information. Then, a decoder will attend to these memory\nslots looking for the cues that help to predict the output sequence. MAED has recently demonstrated\npromising results in machine translation [5, 37] and healthcare [20, 21, 28]. 
In this paper,\nwe advance a recent MAED known as DC-MANN, described in [20], in which the powerful DNC [14]\nis chosen as the external memory. In DNC, memory accesses and updates are executed via the\ncontroller\u2019s read and write heads at each timestep. Given the current input $x_t$ and a set of $K$\nprevious read values from memory $r_{t-1} = [r^1_{t-1}, r^2_{t-1}, ..., r^K_{t-1}]$, the controllers compute the\nread-weight vector $w^{i,r}_t$ and the write-weight vector $w^w_t$ for addressing the memory $M_t$. There are two versions of\ndecoding in DC-MANN: write-protected and writable memory. We prefer to allow writing to the\nmemory during inference because, in this work, we focus on generating diverse output sequences,\nwhich requires a dynamic memory for both the encoding and decoding processes.\n\n2.2 Conditional Variational Autoencoder (CVAE) for Conversation Generation\n\nA dyadic conversation can be represented via three random variables: the conversation context x\n(all the chat before the response utterance), the response utterance y and a latent variable z, which\nis used to capture the latent distribution over the reasonable responses. A variational autoencoder\nconditioned on x (CVAE) is trained to maximize the conditional log likelihood of y given x, which\ninvolves an intractable marginalization over the latent variable z, i.e.:\n\n$$p(y \\mid x) = \\int_z p(y, z \\mid x) \\, dz = \\int_z p(y \\mid x, z) \\, p(z \\mid x) \\, dz \\qquad (1)$$\n\nFortunately, CVAE can be efficiently trained with the Stochastic Gradient Variational Bayes (SGVB)\nframework [19] by maximizing the variational lower bound of the conditional log likelihood. In\na typical CVAE, z is assumed to follow a multivariate Gaussian distribution with a diagonal\ncovariance matrix, conditioned on x as $p_\\phi(z \\mid x)$, and a recognition network $q_\\theta(z \\mid x, y)$ is used to\napproximate the true posterior distribution $p(z \\mid x, y)$. 
The variational lower bound becomes:\n\n$$\\mathcal{L}(\\phi, \\theta; y, x) = -\\mathrm{KL}(q_\\theta(z \\mid x, y) \\,\\|\\, p_\\phi(z \\mid x)) + \\mathbb{E}_{q_\\theta(z \\mid x, y)}[\\log p(y \\mid x, z)] \\leq \\log p(y \\mid x) \\qquad (2)$$\n\nWith the introduction of the neural approximator $q_\\theta(z \\mid x, y)$ and the reparameterization trick [18],\nwe can apply standard back-propagation to compute the gradient of the variational lower bound.\nFig. 1(a) depicts the graphical model for this approach in the case of using CVAE.\n\nFigure 1: Graphical Models of the vanilla CVAE (a) and our proposed VMED (b)\n\n3 Methods\n\nBuilt upon CVAE and partly inspired by VRNN [8], we introduce a novel memory-augmented variational\nrecurrent network dubbed Variational Memory Encoder-Decoder (VMED). With an external\nmemory module, VMED explicitly models the dependencies between latent random variables across\nsubsequent timesteps. However, unlike the VRNN, which uses hidden values of an RNN to model the\nlatent distribution as a Gaussian, our VMED uses read values r from an external memory M as\na Mixture of Gaussians (MoG) to model the latent space. This choice of MoG also leads to a new\nformulation for the prior $p_\\phi$ and the posterior $q_\\theta$ mentioned in Eq. (2). The graphical representation\nof our model is shown in Fig. 1(b).\n\n3.1 Generative Process\n\nThe VMED includes a CVAE at each timestep of the decoder. These CVAEs are conditioned on\nthe context sequence via $K$ read values $r_{t-1} = [r^1_{t-1}, r^2_{t-1}, ..., r^K_{t-1}]$ from the external memory.\nSince the read values are conditioned on the previous state of the decoder $h^d_{t-1}$, our model takes into\naccount the temporal structure of the output. Unlike other designs of CVAE, where there is often\nonly one CVAE with a Gaussian prior for the whole decoding process, our model keeps reading\nthe external memory to produce the prior as a Mixture of Gaussians at every timestep. 
At the t-th\nstep of generating an utterance in the output sequence, the decoder reads K\nread values from the memory, representing K modes of the MoG. This multi-modal prior reflects the fact that, given a\ncontext x, there are different modes of uttering the output word $y_t$, which a single mode cannot fully\ncapture. The MoG prior distribution is modeled as:\n\n$$g_t = p_\\phi(z_t \\mid x, r_{t-1}) = \\sum_{i=1}^{K} \\pi^{i,x}_t(x, r^i_{t-1}) \\, \\mathcal{N}(z_t; \\mu^{i,x}_t(x, r^i_{t-1}), \\sigma^{i,x}_t(x, r^i_{t-1})^2 I) \\qquad (3)$$\n\nWe treat the mean $\\mu^{i,x}_t$ and standard deviation (s.d.) $\\sigma^{i,x}_t$ of each Gaussian distribution in the prior\nas neural functions of the context sequence x and read vectors from the memory. The context is\nencoded into the memory by an $LSTM^E$ encoder. In decoding, the decoder $LSTM^D$ attends to\nthe memory and chooses K read vectors. We split each read vector into two parts $r^{i,\\mu}$ and $r^{i,\\sigma}$,\nwhich are used to compute the mean and s.d., respectively: $\\mu^{i,x}_t = r^{i,\\mu}_{t-1}$, $\\sigma^{i,x}_t = \\mathrm{softplus}(r^{i,\\sigma}_{t-1})$.\nHere we use the softplus function for computing the s.d. to ensure positiveness. The mode weight\n$\\pi^{i,x}_t$ is chosen based on the read attention weights $w^{i,r}_{t-1}$ over memory slots. Since we use soft-attention,\na read value is computed from all slots, yet the main contribution comes from the one\nwith the highest attention score. Thus, we pick the maximum attention score in each read weight and\nnormalize to obtain the mode weights: $\\pi^{i,x}_t = \\max w^{i,r}_{t-1} / \\sum_{j=1}^{K} \\max w^{j,r}_{t-1}$.\n\nAlgorithm 1 VMED Generation\nRequire: Given $p_\\phi$, $[r^1_0, r^2_0, ..., r^K_0]$, $h^d_0$, $y^*_0$\n1: for t = 1, T do\n2:   Sampling $z_t \\sim p_\\phi(z_t \\mid x, r_{t-1})$ in Eq. (3)\n3:   Compute: $o^d_t, h^d_t = LSTM^D([y^*_{t-1}, z_t], h^d_{t-1})$\n4:   Compute the conditional distribution: $p(y_t \\mid x, z_{\\leq t}) = \\mathrm{softmax}(W_{out} o^d_t)$\n5:   Update memory and read $[r^1_t, r^2_t, ..., r^K_t]$ using $h^d_t$ as in DNC\n6:   Generate output $y^*_t = \\mathrm{argmax}_{y \\in Vocab} \\, p(y_t = y \\mid x, z_{\\leq t})$\n7: end for\n\nArmed with the prior, we follow a recurrent generative process by alternately using the memory\nto compute the MoG and using the latent variable z sampled from the MoG to update the memory and\nproduce the output conditional distribution. The pseudo-algorithm of the generative process is given\nin Algorithm 1.\n\n3.2 Neural Posterior Approximation\nAt each step of the decoder, the true posterior $p(z_t \\mid x, y)$ is approximated by a neural function\nof x, y and $r_{t-1}$, denoted as $q_\\theta(z_t \\mid x, y, r_{t-1})$. Here, we use a Gaussian distribution to approximate\nthe posterior. The unimodal posterior is chosen because, given a response y, it is reasonable to\nassume that only one mode of the latent space is responsible for this response. Also, choosing a unimodal distribution\nallows the reparameterization trick during training and reduces the complexity of the KL divergence\ncomputation. The approximated posterior is computed by the following equation:\n\n$$f_t = q_\\theta(z_t \\mid x, y_{\\leq t}, r_{t-1}) = \\mathcal{N}(z_t; \\mu^{x,y}_t(x, y_{\\leq t}, r_{t-1}), \\sigma^{x,y}_t(x, y_{\\leq t}, r_{t-1})^2 I) \\qquad (4)$$\n\nwith mean $\\mu^{x,y}_t$ and s.d. $\\sigma^{x,y}_t$. We use an $LSTM^U$ utterance encoder to model the ground truth\nutterance sequence up to the t-th timestep, $y_{\\leq t}$. The t-th hidden value of the $LSTM^U$ is used to represent\nthe given data in the posterior: $h^u_t = LSTM^U(y_t, h^u_{t-1})$. The neural posterior combines the\nread values $r_t = \\sum_{i=1}^{K} \\pi^{i,x}_t r^i_{t-1}$ together with the ground truth data to produce the Gaussian posterior:\n$\\mu^{x,y}_t = W_\\mu [r_t, h^u_t]$, $\\sigma^{x,y}_t = \\mathrm{softplus}(W_\\sigma [r_t, h^u_t])$. In these equations, we use learnable matrix\nweights $W_\\mu$ and $W_\\sigma$ as a recognition network to compute the mean and s.d. of the posterior, ensuring\nthat the distribution has the same dimension as the prior. We apply the reparameterization trick to\ncalculate the random variable sampled from the posterior as $z'_t = \\mu^{x,y}_t + \\sigma^{x,y}_t \\odot \\epsilon$, $\\epsilon \\sim \\mathcal{N}(0, I)$. Intuitively,\nthe reparameterization trick bridges the gap between the generation model and the inference\nmodel during training.\n\nFigure 2: Training and testing of VMED\n\n3.3 Learning\nIn the training phase, the neural posterior is used to produce the latent variable $z'_t$. The read values\nfrom memory are used directly as the MoG priors, and the priors are trained to approximate the\nposterior by reducing the KL divergence. During testing, the decoder uses the prior for generating\nthe latent variable $z_t$, from which the output is computed. The training and testing diagram is illustrated\nin Fig. 2. The objective function becomes a timestep-wise variational lower bound, following\na derivation similar to that presented in [8]:\n\n$$\\mathcal{L}(\\theta, \\phi; y, x) = \\mathbb{E}_{q^*}\\left[\\sum_{t=1}^{T} -\\mathrm{KL}(q_\\theta(z_t \\mid x, y_{\\leq t}, r_{t-1}) \\,\\|\\, p_\\phi(z_t \\mid x, r_{t-1})) + \\log p(y_t \\mid x, z_{\\leq t})\\right] \\qquad (5)$$\n\nwhere $q^* = q_\\theta(z_{\\leq T} \\mid x, y_{\\leq T}, r_{<T})$. To maximize the objective function, we have to compute the KL\ndivergence between $f_t = q_\\theta(z_t \\mid x, y_{\\leq t}, r_{t-1})$ and $g_t = p_\\phi(z_t \\mid x, r_{t-1})$. 
Since there is no closed-form\nexpression for this $\\mathrm{KL}(f_t \\,\\|\\, g_t)$ between the Gaussian $f_t$ and the Mixture of Gaussians $g_t$, we use a closed-form\napproximation named $D_{var}$ [15] to replace the KL term in the objective function. For our\ncase: $\\mathrm{KL}(f_t \\,\\|\\, g_t) \\approx D_{var}(f_t \\,\\|\\, g_t) = -\\log \\sum_{i=1}^{K} \\pi^i e^{-\\mathrm{KL}(f_t \\| g^i_t)}$. Here, $\\mathrm{KL}(f_t \\,\\|\\, g^i_t)$ is the KL\ndivergence between two Gaussians and $\\pi^i$ is the mode weight of $g_t$. The final objective function is:\n\n$$\\mathcal{L} = \\sum_{t=1}^{T} \\log \\sum_{i=1}^{K} \\left[\\pi^{i,x}_t \\exp\\left(-\\mathrm{KL}\\left(\\mathcal{N}(\\mu^{x,y}_t, (\\sigma^{x,y}_t)^2 I) \\,\\|\\, \\mathcal{N}(\\mu^{i,x}_t, (\\sigma^{i,x}_t)^2 I)\\right)\\right)\\right] + \\frac{1}{L} \\sum_{t=1}^{T} \\sum_{l=1}^{L} \\log p(y_t \\mid x, z^{(l)}_{\\leq t}) \\qquad (6)$$\n\n3.4 Theoretical Analysis\n\nWe now show that, by modeling the prior as an MoG and the posterior as a Gaussian, minimizing the\napproximation results in KL divergence minimization. Let us define the log-likelihood $L_f(g) = \\mathbb{E}_{f(x)}[\\log g(x)]$;\nwe have (see Supplementary material for full derivation):\n\n$$L_f(g) \\geq \\log \\sum_{i=1}^{K} \\pi^i e^{-\\mathrm{KL}(f \\| g^i)} + L_f(f) = -D_{var} + L_f(f)$$\n$$\\Rightarrow D_{var} \\geq L_f(f) - L_f(g) = \\mathrm{KL}(f \\,\\|\\, g)$$\n\nThus, minimizing $D_{var}$ results in KL divergence minimization. Next, we establish an upper bound\non the total timestep-wise KL divergence in Eq. (5) and show that minimizing this upper bound is\nequivalent to fitting a continuous function by a scaled MoG. The total timestep-wise KL divergence\nreads:\n\n$$\\sum_{t=1}^{T} \\mathrm{KL}(f_t \\,\\|\\, g_t) = \\int_{-\\infty}^{+\\infty} \\sum_{t=1}^{T} f_t(x) \\log[f_t(x)] \\, dx - \\int_{-\\infty}^{+\\infty} \\sum_{t=1}^{T} f_t(x) \\log[g_t(x)] \\, dx$$\n\nTable 1: BLEU-1, 4 and A-Glove on testing datasets. 
B1, B4, AG are acronyms for BLEU-1,\nBLEU-4, A-Glove metrics, respectively (higher is better).\n\nModel | Cornell Movies | | | OpenSubtitle | | | LJ users | | | Reddit comments | |\n | B1 | B4 | AG | B1 | B4 | AG | B1 | B4 | AG | B1 | B4 | AG\nSeq2Seq | 18.4 | 9.5 | 0.52 | 11.4 | 5.4 | 0.29 | 13.1 | 6.4 | 0.45 | 7.5 | 3.3 | 0.31\nSeq2Seq-att | 17.7 | 9.2 | 0.54 | 13.2 | 6.5 | 0.42 | 11.4 | 5.6 | 0.49 | 5.5 | 2.4 | 0.25\nDNC | 17.6 | 9.0 | 0.51 | 14.3 | 7.2 | 0.47 | 12.4 | 6.1 | 0.47 | 7.5 | 3.4 | 0.28\nCVAE | 16.5 | 8.5 | 0.56 | 13.5 | 6.6 | 0.45 | 12.2 | 6.0 | 0.48 | 5.3 | 2.8 | 0.39\nVLSTM | 18.6 | 9.7 | 0.59 | 16.4 | 8.1 | 0.43 | 11.5 | 5.6 | 0.46 | 6.9 | 3.1 | 0.27\nVMED (K=1) | 20.7 | 10.8 | 0.57 | 12.9 | 6.2 | 0.44 | 13.7 | 6.9 | 0.47 | 9.1 | 4.3 | 0.39\nVMED (K=2) | 22.3 | 11.9 | 0.64 | 15.3 | 8.8 | 0.49 | 15.4 | 7.9 | 0.51 | 9.2 | 4.4 | 0.38\nVMED (K=3) | 19.4 | 10.4 | 0.63 | 24.8 | 12.9 | 0.54 | 18.1 | 9.8 | 0.49 | 12.3 | 6.4 | 0.46\nVMED (K=4) | 23.1 | 12.3 | 0.61 | 17.9 | 9.3 | 0.52 | 14.4 | 7.5 | 0.47 | 8.6 | 4.6 | 0.41\n\nwhere $g_t = \\sum_{i=1}^{K} \\pi^i_t g^i_t$ and $g^i_t$ is the i-th Gaussian in the MoG at timestep t. If, at each decoding\nstep, minimizing $D_{var}$ results in adequate KL divergence such that the prior is optimized close to\nthe neural posterior, then according to Chebyshev\u2019s sum inequality, we can derive an upper bound on the\ntotal timestep-wise KL divergence as (see Supplementary material for full derivation):\n\n$$\\int_{-\\infty}^{+\\infty} \\sum_{t=1}^{T} f_t(x) \\log[f_t(x)] \\, dx - \\frac{1}{T} \\int_{-\\infty}^{+\\infty} \\sum_{t=1}^{T} f_t(x) \\log\\left[\\prod_{t=1}^{T} g_t(x)\\right] dx \\qquad (7)$$\n\nThe left term is the negative sum of the entropies of $f_t(x)$, which does not depend on the training parameter\n$\\phi$ used to compute $g_t$, so we can ignore it. Thus, given f, minimizing the upper bound of the\ntotal timestep-wise KL divergence is equivalent to maximizing the right term of Eq. (7). 
Since\n$g_t$ is an MoG and products of MoGs are proportional to an MoG, $\\prod_{t=1}^{T} g_t(x)$ is a scaled MoG (see\nSupplementary material for full proof). Maximizing the right term is equivalent to fitting the function\n$\\sum_{t=1}^{T} f_t(x)$, which is a sum of Gaussians and thus continuous, by a scaled MoG. This, in theory, is\npossible regardless of the form of $f_t$ since MoG is a universal approximator [1, 25].\n\n4 Results\n\nDatasets and pre-processing: We perform experiments on two collections. The first collection includes\nopen-domain movie transcript datasets containing casual conversations: Cornell Movies (footnote 1) and\nOpenSubtitle (footnote 2). They have been commonly used in evaluating conversational agents [24, 35]. The\nsecond collection consists of closed-domain datasets crawled from specific domains, which are question-answering\nof LiveJournal (LJ) users and Reddit comments on movie topics. For each dataset, we use 10,000\nconversations for validation and 10,000 for testing.\nBaselines, implementations and metrics: We compare our model with three deterministic baselines:\nthe encoder-decoder neural conversational model (Seq2Seq) similar to [35] and its two variants\nequipped with an attention mechanism [2] (Seq2Seq-att) and a DNC external memory [14] (DNC).\nThe vanilla CVAE is also included in the baselines. To build this CVAE, we follow an architecture similar\nto that introduced in [40], without bag-of-word loss and dialog act features (footnote 3). A variational recurrent\nmodel without memory is also included in the baselines. The model, termed VLSTM, is implemented\nbased on LSTM instead of RNN as in the VRNN framework [8]. We try our model VMED (footnote 4) with different\nnumbers of modes (K = 1, 2, 3, 4). It should be noted that, when K = 1, our model\u2019s prior\nis exactly a Gaussian and the KL term in Eq. (6) is no longer an approximation. Details of dataset\ndescriptions and model implementations are included in Supplementary material.\nWe report results using two performance metrics in order to evaluate the system from various linguistic\npoints of view: (i) Smoothed Sentence-level BLEU [6]: BLEU is a popular metric that measures\nthe geometric mean of modified n-gram precision with a length penalty. We use BLEU-1 to 4 as our\nlexical similarity. (ii) Cosine Similarity of Sentence Embedding: a simple method to obtain a sentence\nembedding is to take the average of all the word embeddings in the sentence [10]. We follow [40]\nand choose Glove [22] as the word embedding in measuring sentence similarity (A-Glove). To measure\nstochastic models, for each input, we generate output ten times. The metric between the ground\ntruth and the generated output is calculated and averaged over the ten responses.\n\nFootnote 1: http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html\nFootnote 2: http://opus.nlpl.eu/OpenSubtitles.php\nFootnote 3: Another variant of non-memory CVAE with MoG prior is also examined. We produce a set of MoG\nparameters by a feed forward network with the input as the last encoder hidden states. However, the model is\nhard to train and fails to converge with these datasets.\nFootnote 4: Source code is available at https://github.com/thaihungle/VMED\n\nTable 2: Examples of context-response pairs. /*/ denotes separations between stochastic responses.\n\nInput context (Reddit comment): What is your favorite scene in film history ? Mine is the restaurant scene in the Godfather.\nResponse:\nSeq2Seq: The scene in\nSeq2Seq-att: The final\nDNC: The scene in\nCVAE: Inception god! Not by a shark /*/ Amour great /*/ Pro thing you know 3 dead\nVLSTM: The scene in /*/ The of a dead /*/ The sky in scene\nVMED (K=3): The opening scene from history movie /*/ The scene in a shot nights! Robin movie /*/ The psycho scene in fight from\n\nInput context (Reddit comment): What actor will win an Oscar in the next 10 years ?\nResponse:\nSeq2Seq: Colin\nSeq2Seq-att: Liam Neeson\nDNC: Tom Gyllenhaal\nCVAE: Daryl and Aaron /*/ Carefully count Alfred Deniro /*/ Ponyo Joker possible\nVLSTM: Michael Bullock /*/ Michael /*/ Michael De\nVMED (K=3): Edward or Leo Dicaprio goes on /*/ Dicaprio will /*/ Dicaprio Tom has actually in jack on road\n\nInput context (LJ user): Your very pretty and your is very cute. May i add ya to my friends list ?\nResponse:\nSeq2Seq: I! Add and too you back\nSeq2Seq-att: I did may! Away back\nDNC: Sure you added\nCVAE: Sure ! I can add you /*/ Thank cool of surely /*/ Thank hi of sure!\nVLSTM: Haha of do go /*/ Haha of sure! Yes lol /*/ Haha sure! Go\nVMED (K=3): Thank cool for sure /*/ Add definitely! Sure surely friends /*/ Thank girl go ahead\n\nInput context (Cornell movies): Nobody can murder someone with their feelings . I mean that is crazy ?\nResponse:\nSeq2Seq: Yes you are\nSeq2Seq-att: Really not is it\nDNC: Managing the truth\nCVAE: Sure not to be in here /*/ Oh yes but i know! /*/ That to doing with here and\nVLSTM: I am dead . ! That is ... /*/ I did not what way . /*/ I am not . But his things with ...\nVMED (K=4): You are right! /*/ That is crazy /*/ You can\u2019t know Jimmy\n\nInput context (OpenSubtitle): I\u2019m obliged for your hospitality. I appreciate it and your husband too.\nResponse:\nSeq2Seq: That is have got coming about these\nSeq2Seq-att: May you not what nothing nobody\nDNC: Yes i am taking this\nCVAE: No . that for good! And okay /*/ All in the of two thing /*/ Sure. Is this!\nVLSTM: I ... /*/ I understand /*/ I ! . ...\nVMED (K=3): I know. I can afford /*/ I know nothing to store for you pass /*/ I know. Doing anymore you father\n\nMetric-based Analysis: We report results on four test datasets in Table 1. For BLEU scores, here\nwe only list results for BLEU-1 and 4. Other BLEUs show a similar pattern and are listed in\nSupplementary material. As clearly seen, VMED models outperform other baselines over all metrics\nacross four datasets. 
In general, the performance of Seq2Seq is comparable with other deterministic\nmethods despite its simplicity. Surprisingly, neither CVAE nor VLSTM shows much advantage over\ndeterministic models. As we shall see, although CVAE and VLSTM responses are diverse, they are\noften out of context. Among the different numbers of modes of VMED, there is often one that best fits the data and\nthus shows superior performance. The optimal number of modes in our experiments is often\nK = 3, indicating that increasing the number of modes does not necessarily improve accuracy.\nIt should be noted that there is inconsistency between BLEU scores and A-Glove metrics. This\nis because BLEU measures lexical matching while A-Glove evaluates semantic similarity in the\nembedding space. For example, two sentences having different words may share the same meaning\nand lie close in the embedding space. In either case, compared to others, our optimal VMED always\nachieves better performance.\nQualitative Analysis\nTable 2 presents responses generated by experimental models in reply to different input sentences.\nThe replies listed are chosen randomly from the 50 generated responses whose average metric scores\nover all models are highest. For stochastic models, we generate three times for each input, resulting\nin three different responses. In general, the stochastic models often yield longer and more diverse\nsequences, as expected. For closed-domain cases, all models\u2019 responses are fairly acceptable. Compared\nto the rest, our VMED\u2019s responses seem to relate more to the context and contain meaningful\ninformation. In this experiment, the open-domain input seems noisier and harder than the closed-domain\nones, and thus creates a big challenge for all models. Despite that, the quality of VMED\u2019s responses\nis superior to others. Among deterministic models, DNC\u2019s generated responses look more\nreasonable than Seq2Seq\u2019s even though its BLEU scores are not always higher. 
Perhaps, the reference\nto external memory at every timestep enhances the coherence between output and input, making\nthe response more related to the context. VMED may inherit this feature from its external memory\nand thus tends to produce reasonable responses. By contrast, although responses from CVAE and\nVLSTM are not trivial, they have more grammatical errors and are sometimes unrelated to the topic.\n\n5 Related Work\n\nWith the recent revival of recurrent neural networks (RNNs), there has been much effort spent on\nlearning generative models of sequences. Early attempts include training an RNN to generate the next\noutput given the previous sequence, demonstrating RNNs\u2019 ability to generate text and handwriting images\n[13]. Later, the encoder-decoder architecture [34] enabled generating a whole sequence in machine\ntranslation [17], text summarization [27] and conversation generation [35]. Although these models have\nachieved significant empirical successes, they fall short of capturing the complexity and variability of\nsequential processes.\nThese limitations have recently triggered a considerable effort on introducing variability into the\nencoder-decoder architecture. Most of the methods focus on conditional VAE (CVAE) by constructing\na variational lower bound conditioned on the context. The setting can be found in many\napplications including machine translation [39] and dialog generation [4, 30, 31, 40]. A common\ntrick is to place a neural net between the encoder and the decoder to compute the Gaussian prior\nand posterior of the CVAE. This design is further enhanced by the use of external memory [7] and\nreinforcement learning [38]. In contrast to this design, our VMED uses a recurrent latent variable\napproach [8], that is, our model requires a CVAE for each step of generation. 
Besides, our external\nmemory is used for producing the latent distribution, which is different from the one proposed in [7]\nwhere the memory is used only for holding long-term dependencies at sentence level. Compared to\nthe variational addressing scheme mentioned in [3], our memory uses a deterministic addressing scheme,\nyet the memory content itself is used to introduce randomness to the architecture. More relevant to\nour work is GTMM [11], where memory read-outs are involved in constructing the prior and posterior at\nevery timestep. However, this approach uses a Gaussian prior without conditional context.\nUsing a mixture of models instead of a single Gaussian in the VAE framework is not a new concept. Works\nin [9, 16] and [26] proposed replacing the Gaussian prior and posterior in VAE by MoGs for clustering\nand image generation problems. Works in [32] and [36] applied an MoG prior to model transitions\nbetween video frames and caption generation, respectively. These methods use a simple feed forward\nnetwork to produce Gaussian sub-distributions independently. In our model, on the contrary, memory\nslots are strongly correlated with each other, and thus modes in our MoG work together to\ndefine the shape of the latent distribution at a specific timestep. To the best of our knowledge, our\nwork is the first attempt to use an external memory to induce mixture models for sequence generation\nproblems.\n\n6 Conclusions\n\nWe propose a novel approach to sequence generation called Variational Memory Encoder-Decoder\n(VMED) that introduces variability into encoder-decoder architecture via the use of external memory\nas mixture model. By modeling the latent temporal dependencies across timesteps, our VMED\nproduces a MoG representing the latent distribution. Each mode of the MoG is associated with some\nmemory slot and thus captures some aspect of context supporting the generation process. 
To accommodate\nthe MoG, we employ a KL approximation and we demonstrate that minimizing this approximation\nis equivalent to minimizing the KL divergence. We derive an upper bound on our total\ntimestep-wise KL divergence and show that the optimization of this upper bound is equivalent\nto fitting a continuous function by a scaled MoG, which is in theory possible regardless of the\nfunction form. This forms a theoretical basis for our model formulation using a MoG prior for every\nstep of generation. We apply our proposed model to the conversation generation problem. The results\ndemonstrate that VMED outperforms recent advances both quantitatively and qualitatively. Future\nexplorations may involve implementing a dynamic number of modes that enables learning of the optimal\nK for each timestep. Another aspect would be the multi-person dialog setting, where our memory\nas a mixture model may be useful to capture more complex modes of speaking in the dialog.\n\nReferences\n\n[1] Athanassia Bacharoglou. Approximation of probability distributions by convex mixtures of\ngaussian measures. Proceedings of the American Mathematical Society, 138(7):2619\u20132628,\n2010.\n\n[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by\njointly learning to align and translate. Proceedings of the International Conference on Learning\nRepresentations, 2015.\n\n[3] J\u00f6rg Bornschein, Andriy Mnih, Daniel Zoran, and Danilo Jimenez Rezende. Variational memory\naddressing in generative models. In Advances in Neural Information Processing Systems,\npages 3923\u20133932, 2017.\n\n[4] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy\nBengio. Generating sentences from a continuous space. In Proceedings of The SIGNLL Conference\non Computational Natural Language Learning, pages 10\u201321, 2016.\n\n[5] Denny Britz, Melody Guan, and Minh-Thang Luong. 
Efficient attention using a fixed-size\nmemory representation. In Proceedings of the Conference on Empirical Methods in Natural\nLanguage Processing, pages 392\u2013400, 2017.\n\n[6] Boxing Chen and Colin Cherry. A systematic comparison of smoothing techniques for\nsentence-level bleu. In Proceedings of the Ninth Workshop on Statistical Machine Translation,\npages 362\u2013367, 2014.\n\n[7] Hongshen Chen, Zhaochun Ren, Jiliang Tang, Yihong Eric Zhao, and Dawei Yin. Hierarchical\nvariational memory network for dialogue generation. In Proceedings of the World Wide Web\nConference on World Wide Web, pages 1653\u20131662. International World Wide Web Conferences\nSteering Committee, 2018.\n\n[8] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua\nBengio. A recurrent latent variable model for sequential data. In Advances in Neural Information\nProcessing Systems, pages 2980\u20132988, 2015.\n\n[9] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni,\nKai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture\nvariational autoencoders. arXiv preprint arXiv:1611.02648, 2016.\n\n[10] Gabriel Forgues, Joelle Pineau, Jean-Marie Larchev\u00eaque, and R\u00e9al Tremblay. Bootstrapping\ndialog systems with word embeddings. In NIPS, Modern Machine Learning and Natural Language\nProcessing Workshop, volume 2, 2014.\n\n[11] Mevlana Gemici, Chia-Chun Hung, Adam Santoro, Greg Wayne, Shakir Mohamed, Danilo J\nRezende, David Amos, and Timothy Lillicrap. Generative temporal models with memory.\narXiv preprint arXiv:1702.04649, 2017.\n\n[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in\nNeural Information Processing Systems, pages 2672\u20132680, 2014.\n\n[13] Alex Graves. 
Generating sequences with recurrent neural networks. arXiv preprint\narXiv:1308.0850, 2013.\n\n[14] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka\nGrabska-Barwi\u0144ska, Sergio G\u00f3mez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John\nAgapiou, et al. Hybrid computing using a neural network with dynamic external memory.\nNature, 538(7626):471\u2013476, 2016.\n\n[15] John R Hershey and Peder A Olsen. Approximating the kullback leibler divergence between\ngaussian mixture models. In IEEE International Conference on Acoustics, Speech and Signal\nProcessing, 2007.\n\n[16] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep\nembedding: An unsupervised and generative approach to clustering. In Proceedings of the\nInternational Joint Conference on Artificial Intelligence, pages 1965\u20131972. International Joint\nConference on Artificial Intelligence, 2017.\n\n[17] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings\nof the Conference on Empirical Methods in Natural Language Processing, pages 1700\u20131709,\n2013.\n\n[18] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised\nlearning with deep generative models. In Advances in Neural Information Processing\nSystems, pages 3581\u20133589, 2014.\n\n[19] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the\nInternational Conference on Learning Representations, 2014.\n\n[20] Hung Le, Truyen Tran, and Svetha Venkatesh. Dual control memory augmented neural networks\nfor treatment recommendations. In Advances in Knowledge Discovery and Data Mining,\npages 273\u2013284, Cham, 2018. Springer International Publishing.\n\n[21] Hung Le, Truyen Tran, and Svetha Venkatesh. Dual memory neural computer for asynchronous\ntwo-view sequential learning. 
In Proceedings of the 24th ACM SIGKDD International Conference\non Knowledge Discovery & Data Mining, KDD \u201918, pages 1637\u20131645, New York, NY,\nUSA, 2018. ACM.\n\n[22] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In\nAdvances in Neural Information Processing Systems, pages 2177\u20132185, 2014.\n\n[23] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting\nobjective function for neural conversation models. In Proceedings of the Conference of the\nNorth American Chapter of the Association for Computational Linguistics: Human Language\nTechnologies, pages 110\u2013119, 2016.\n\n[24] Pierre Lison and Serge Bibauw. Not all dialogues are created equal: Instance weighting for\nneural conversational models. In Proceedings of the Annual SIGdial Meeting on Discourse\nand Dialogue, pages 384\u2013394, 2017.\n\n[25] Vladimir Maz\u2019ya and Gunther Schmidt. On approximate approximations using gaussian kernels.\nIMA Journal of Numerical Analysis, 16(1):13\u201329, 1996.\n\n[26] Eric Nalisnick, Lars Hertel, and Padhraic Smyth. Approximate inference for deep latent gaussian\nmixtures. In NIPS Workshop on Bayesian Deep Learning, volume 2, 2016.\n\n[27] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. Abstractive\ntext summarization using sequence-to-sequence rnns and beyond. In Proceedings of\nthe SIGNLL Conference on Computational Natural Language Learning, pages 280\u2013290, 2016.\n\n[28] Aaditya Prakash, Siyuan Zhao, Sadid A Hasan, Vivek V Datla, Kathy Lee, Ashequl Qadir, Joey\nLiu, and Oladimeji Farri. Condensed memory networks for clinical diagnostic inferencing. In\nProceedings of the AAAI Conference on Artificial Intelligence, pages 3274\u20133280, 2017.\n\n[29] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 
Stochastic backpropagation\nand approximate inference in deep generative models. In Proceedings of the International\nConference on Machine Learning, pages II\u20131278. JMLR.org,\n2014.\n\n[30] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C\nCourville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating\ndialogues. In Proceedings of the AAAI Conference on Artificial Intelligence, pages\n3295\u20133301, 2017.\n\n[31] Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, and Guoping\nLong. A conditional variational framework for dialog generation. In Proceedings of the\nAnnual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers),\nvolume 2, pages 504\u2013509, 2017.\n\n[32] Rui Shu, James Brofos, Frank Zhang, Hung Hai Bui, Mohammad Ghavamzadeh, and Mykel\nKochenderfer. Stochastic video prediction with conditional density estimation. In ECCV Workshop\non Action and Anticipation for Visual Learning, volume 2, 2016.\n\n[33] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks.\nIn C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,\nAdvances in Neural Information Processing Systems, pages 2440\u20132448. 2015.\n\n[34] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural\nnetworks. In Advances in Neural Information Processing Systems, pages 3104\u20133112, 2014.\n\n[35] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869,\n2015.\n\n[36] Liwei Wang, Alexander Schwing, and Svetlana Lazebnik. Diverse and accurate image description\nusing a variational auto-encoder with an additive gaussian encoding space. In I. Guyon,\nU. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. 
Garnett, editors,\nAdvances in Neural Information Processing Systems, pages 5756\u20135766. 2017.\n\n[37] Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu. Memory-enhanced decoder for neural\nmachine translation. In Proceedings of the Conference on Empirical Methods in Natural\nLanguage Processing, pages 278\u2013286, 2016.\n\n[38] Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve Young. Latent intention dialogue\nmodels. In Proceedings of the International Conference on Machine Learning, pages 3732\u20133741,\n2017.\n\n[39] Biao Zhang, Deyi Xiong, Hong Duan, Min Zhang, et al. Variational neural machine translation.\nIn Proceedings of the Conference on Empirical Methods in Natural Language Processing,\npages 521\u2013530, 2016.\n\n[40] Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. Learning discourse-level diversity for neural\ndialog models using conditional variational autoencoders. In Proceedings of the Annual\nMeeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1,\npages 654\u2013664, 2017.\n", "award": [], "sourceid": 772, "authors": [{"given_name": "Hung", "family_name": "Le", "institution": "Deakin University"}, {"given_name": "Truyen", "family_name": "Tran", "institution": "Deakin University"}, {"given_name": "Thin", "family_name": "Nguyen", "institution": "Deakin University"}, {"given_name": "Svetha", "family_name": "Venkatesh", "institution": "Deakin University"}]}