{"title": "Review Networks for Caption Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 2361, "page_last": 2369, "abstract": "We propose a novel extension of the encoder-decoder framework, called a review network. The review network is generic and can enhance any existing encoder- decoder model: in this paper, we consider RNN decoders with both CNN and RNN encoders. The review network performs a number of review steps with attention mechanism on the encoder hidden states, and outputs a thought vector after each review step; the thought vectors are used as the input of the attention mechanism in the decoder. We show that conventional encoder-decoders are a special case of our framework. Empirically, we show that our framework improves over state-of- the-art encoder-decoder systems on the tasks of image captioning and source code captioning.", "full_text": "Review Networks for Caption Generation\n\nZhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, William W. Cohen\n\nSchool of Computer Science\nCarnegie Mellon University\n\n{zhiliny,yey1,yuexinw,rsalakhu,wcohen}@cs.cmu.edu\n\nAbstract\n\nWe propose a novel extension of the encoder-decoder framework, called a review\nnetwork. The review network is generic and can enhance any existing encoder-\ndecoder model: in this paper, we consider RNN decoders with both CNN and RNN\nencoders. The review network performs a number of review steps with attention\nmechanism on the encoder hidden states, and outputs a thought vector after each\nreview step; the thought vectors are used as the input of the attention mechanism\nin the decoder. We show that conventional encoder-decoders are a special case of\nour framework. 
Empirically, we show that our framework improves over state-of-the-art\nencoder-decoder systems on the tasks of image captioning and source code\ncaptioning.1\n\n1 Introduction\n\nEncoder-decoder is a framework for learning a transformation from one representation to another. In\nthis framework, an encoder network first encodes the input into a context vector, and then a decoder\nnetwork decodes the context vector to generate the output. The encoder-decoder framework was\nrecently introduced for sequence-to-sequence learning based on recurrent neural networks (RNNs)\nwith applications to machine translation [3, 15], where the input is a text sequence in one language and\nthe output is a text sequence in the other language. More generally, the encoder-decoder framework\nis not restricted to RNNs and text; e.g., encoders based on convolutional neural networks (CNNs)\nare used for image captioning [18]. Since it is often difficult to encode all the necessary information\nin a single context vector, an attentive encoder-decoder introduces an attention mechanism to the\nencoder-decoder framework. An attention mechanism modifies the encoder-decoder bottleneck by\nconditioning the generative process in the decoder on the encoder hidden states, rather than on one\nsingle context vector only. Improvements due to an attention mechanism have been shown on various\ntasks, including machine translation [1], image captioning [20], and text summarization [12].\nHowever, there remain two important issues to address for attentive encoder-decoder models. First,\nthe attention mechanism proceeds in a sequential manner and thus lacks global modeling abilities.\nMore specifically, at the generation step t, the decoded token is conditioned on the attention results at\nthe current time step \u02dcht, but has no information about future attention results \u02dcht\u2032 with t\u2032 > t. 
For\nexample, when there are multiple objects in the image, the caption tokens generated at the beginning\nfocus on the first one or two objects and are unaware of the other objects, which is potentially\nsuboptimal. Second, previous works show that discriminative supervision (e.g., predicting word\noccurrences in the caption) is beneficial for generative models [5], but it is not clear how to integrate\ndiscriminative supervision into the encoder-decoder framework in an end-to-end manner.\nTo address the above issues, we propose a novel architecture, the review network, which extends\nexisting (attentive) encoder-decoder models. The review network performs a given number of review\nsteps with attention on the encoder hidden states and outputs a thought vector after each step, where\nthe thought vectors are introduced to capture the global properties in a compact vector representation\nand are usable by the attention mechanism in the decoder. The intuition behind the review network is\nto review all the information encoded by the encoder and produce vectors that are a more compact,\nabstractive, and global representation than the original encoder hidden states.\n\n1 Code and data available at https://github.com/kimiyoung/review_net.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nAnother role for the thought vectors is as a focus for multitask learning. For instance, one can use the\nthought vectors as inputs for a secondary prediction task, such as predicting discriminative signals\n(e.g., the words that occur in an image caption), in addition to the objective as a generative model. In\nthis paper we explore this multitask review network, and also explore variants with weight tying.\nWe show that conventional attentive encoder-decoders are a special case of the review networks,\nwhich indicates that our model is strictly more expressive than the attentive encoder-decoders. 
We\nexperiment with two different tasks, image captioning and source code captioning, using CNNs\nand RNNs as the encoders respectively. Our results show that the review network can consistently\nimprove the performance over attentive encoder-decoders on both datasets, and obtain state-of-the-art\nperformance.\n\n2 Related Work\n\nThe encoder-decoder framework in the context of sequence-to-sequence learning was recently\nintroduced for learning transformation between text sequences [3, 15], where RNNs were used\nfor both encoding and decoding. Encoder-decoders, in general, can refer to models that learn a\nrepresentation transformation using two network components, an encoder and a decoder. Besides\nRNNs, convolutional encoders have been developed to address multi-modal tasks such as image\ncaptioning [18]. Attention mechanisms were later introduced to the encoder-decoder framework for\nmachine translation, with attention providing an explanation of explicit token-level alignment between\ninput and output sequences [1]. In contrast to vanilla encoder-decoders, attentive encoder-decoders\ncondition the decoder on the encoder\u2019s hidden states. At each generation step, the decoder pays\nattention to a speci\ufb01c part of the encoder, and generates the next token based on both the current\nhidden state in the decoder and the attended hidden states in the encoder. Attention mechanisms\nhave had considerable success in other applications as well, including image captioning [20] and text\nsummarization [12].\nOur work is also related to memory networks [19, 14]. Memory networks take a question embedding\nas input, and perform multiple computational steps with attention on the memory, which is usually\nformed by the embeddings of a group of sentences. Dynamic memory networks extend memory\nnetworks to model sequential memories [8]. 
Memory networks are mainly used in the context of\nquestion answering; the review network, on the other hand, is a generic architecture that can be\nintegrated into existing encoder-decoder models. Moreover, the review network learns thought vectors\nusing multiple review steps, while (embedded) facts are provided as input to the memory networks.\nAnother difference is that the review network outputs a sequence of thought vectors, while memory\nnetworks only use the last hidden state to generate the answer. [17] presented a processor unit that\nruns over the encoder multiple times, but their model mainly focuses on handling non-sequential\ndata and their approach differs from ours in many ways (e.g., the encoder consists of small neural\nnetworks operating on each input element, and the process module is not directly connected to the\nencoder, etc). The model proposed in [6] performs a number of sub-steps inside a standard recurrent\nstep, while our decoder generates the output with attention to the thought vectors.\n\n3 Model\n\nGiven the input representation x and the output representation y, the goal is to learn a function\nmapping from x to y. For example, image captioning aims to learn a mapping from an image x to a\ncaption y. For notation simplicity, we use x and y to denote both a tensor and a sequence of tensors.\nFor example, x can be a 3d-tensor that represents an image with RGB channels in image captioning,\nor can be a sequence of 1d-tensors (i.e., vectors) x = (x1,\u00b7\u00b7\u00b7 , xTx ) in machine translation, where xt\ndenotes the one-of-K embedding of the t-th word in the input sequence of length Tx.\nIn contrast to conventional (attentive) encoder-decoder models, our model consists of three compo-\nnents, encoder, reviewer, and decoder. The comparison of architectures is shown in Figure 1. Now\nwe describe the three components in detail.\n\n2\n\n\f(a) Attentive Encoder-Decoder\nModel.\n\n(b) Review Network. 
Blue components denote optional discriminative\nsupervision. Tr is set to 3 in this example.\nFigure 1: Model Architectures.\n\n(a) Attentive Input Reviewer. (b) Attentive Output Reviewer. (c) Decoder.\nFigure 2: Illustrations of modules in the review network. f\u2032\u00b7 and f\u2032\u2032\u00b7 denote LSTM units.\n\n3.1 Encoder\nThe encoder encodes the input x into a context vector c and a set of hidden states H = {ht}t. We\ndiscuss two types of encoders, RNN encoders and CNN encoders.\nRNN Encoder: Let Tx = |H| be the length of the input sequence. An RNN encoder processes the\ninput sequence x = (x1,\u00b7\u00b7\u00b7 , xTx ) sequentially. At time step t, the RNN encoder updates the hidden\nstate by\n\nht = f (xt, ht\u22121).\n\nIn this work, we implement f using an LSTM unit. The context vector is defined as the final hidden\nstate c = hTx. The cell state and hidden state h0 of the first LSTM unit are initialized as zero.\nCNN Encoder: We take a widely-used CNN architecture, VGGNet [13], as an example to describe\nhow we use CNNs as encoders. Given a VGGNet, we use the output of the last fully connected layer\nfc7 as the context vector c = fc7(x), and use 14 \u00d7 14 = 196 columns of 512d convolutional output\nconv5 as hidden states H = conv5(x). In this case Tx = |H| = 196.\n\n3.2 Reviewer\n\nLet Tr be a hyperparameter that specifies the number of review steps. The intuition behind the\nreviewer module is to review all the information encoded by the encoder and learn thought vectors\nthat are a more compact, abstractive, and global representation than the original encoder hidden states.\nThe reviewer performs Tr review steps on the encoder hidden states H and outputs a thought vector\nft after each step. More specifically,\n\nft = gt(H, ft\u22121),\n\nwhere gt is a modified LSTM unit with attention mechanism at review step t. 
We study two variants\nof gt, attentive input reviewers and attentive output reviewers. The attentive input reviewer is inspired\nby visual attention [20], which is more commonly used for images; the attentive output reviewer is\ninspired by attention on text [1], which is more commonly used for sequential tokens.\nAttentive Input Reviewer At each review step t, the attentive input reviewer first applies an attention\nmechanism on H and uses the attention result as the input to an LSTM unit (Cf. Figure 2a). Let\n\u02dcft = att(H, ft\u22121) be the attention result at step t. The attentive input reviewer is formulated as\n\n$$\\tilde{f}_t = \\mathrm{att}(H, f_{t-1}) = \\sum_{i=1}^{|H|} \\frac{\\alpha(h_i, f_{t-1})}{\\sum_{i'=1}^{|H|} \\alpha(h_{i'}, f_{t-1})} \\, h_i, \\qquad g_t(H, f_{t-1}) = f'_t(\\tilde{f}_t, f_{t-1}), \\qquad (1)$$\n\nwhere \u03b1(hi, ft\u22121) is a function that determines the weight for the i-th hidden state. \u03b1(x1, x2) can be\nimplemented as a dot product between x1 and x2 or a multi-layer perceptron (MLP) that takes the\nconcatenation of x1 and x2 as input [9]. f\u2032t is an LSTM unit at step t.\nAttentive Output Reviewer In contrast to the attentive input reviewer, the attentive output reviewer\nuses a zero vector as input to the LSTM unit, and the thought vector is computed as the weighted\nsum of the attention results and the output of the LSTM unit (Cf. Figure 2b). More specifically, the\nattentive output reviewer is formulated as\n\n$$\\tilde{f}_t = \\mathrm{att}(H, f_{t-1}), \\qquad g_t(H, f_{t-1}) = f'_t(\\mathbf{0}, f_{t-1}) + W \\tilde{f}_t,$$\n\nwhere the attention mechanism att follows the definition in Eq. (1), 0 denotes a zero vector, W is a\nmodel parameter matrix, and f\u2032t is an LSTM unit at step t. We note that performing attention on top\nof an RNN unit is commonly used in sequence-to-sequence learning [1, 9, 12]. 
We apply a linear\ntransformation with a matrix W since the dimensions of f\u2032t(\u00b7,\u00b7) and \u02dcft can be different.\nWeight Tying We study two variants of weight tying for the reviewer module. Let wt denote the\nparameters for the unit f\u2032t. The first variant follows the common setting in RNNs, where weights\nare shared among all the units; i.e., w1 = \u00b7\u00b7\u00b7 = wTr. We also observe that the reviewer unit does\nnot have sequential input, so we experiment with the second variant where weights are untied; i.e.,\nwi \u2260 wj, \u2200i \u2260 j.\nThe cell state and hidden state of the first unit f\u20321 are initialized as the context vector c. The cell states\nand hidden states are passed through all the reviewer units in both cases of weight tying.\n\n3.3 Decoder\nLet F = {ft}t be the set of thought vectors output by the reviewer. The decoder is formulated as an\nLSTM network with attention on the thought vectors F (Cf. Figure 2c). Let st be the hidden state of\nthe t-th LSTM unit in the decoder. The decoder is formulated as follows:\n\n$$\\tilde{s}_t = \\mathrm{att}(F, s_{t-1}), \\qquad s_t = f''([\\tilde{s}_t; \\mathbf{y}_{t-1}], s_{t-1}), \\qquad y_t = \\arg\\max_y \\mathrm{softmax}_y(s_t), \\qquad (2)$$\n\nwhere [\u00b7;\u00b7] denotes the concatenation of two vectors, f\u2032\u2032 denotes the decoder LSTM, softmaxy is\nthe probability of word y given by a softmax layer, yt is the t-th decoded token, and the bold $\\mathbf{y}_t$ is the word\nembedding of yt. The attention mechanism att follows the definition in Eq. (1). The initial cell state\nand hidden state s0 of the decoder LSTM are both set to the review vector r = W\u2032[fTr ; c], where\nW\u2032 is a model parameter matrix.\n\n3.4 Discriminative Supervision\n\nIn conventional encoder-decoders, supervision is provided in a generative manner; i.e., the model\naims to maximize the conditional probability of generating the sequential output p(y|x). 
However,\ndiscriminative supervision has been shown to be useful in [5], where the model is guided to predict\ndiscriminative objectives, such as the word occurrences in the output y.\nWe argue that the review network provides a natural way of incorporating discriminative supervision\ninto the model. Here we take word occurrence prediction for example to describe how to incorporate\ndiscriminative supervision. As shown in the blue components in Figure 1b, we first apply a linear\nlayer on top of the thought vector to compute a score for each word at each review step. We then\napply a max-pooling layer over all the review units to extract the most salient signal for each word,\nand add a multi-label margin loss as discriminative supervision. Let si be the score of word i after\nthe max pooling layer, and W be the set of all words that occur in y. The discriminative loss can be\nwritten as\n\n$$L_d = \\frac{1}{Z} \\sum_{j \\in W} \\sum_{i \\neq j} \\max(0, 1 - (s_j - s_i)), \\qquad (3)$$\n\nwhere Z is a normalizer that counts all the valid i, j pairs. We note that when the discriminative\nsupervision is derived from the given data (i.e., predicting word occurrences in captions), we are not\nusing extra information.\n\n3.5 Training\n\nThe training loss for a single training instance (x, y) is defined as a weighted sum of the negative\nconditional log likelihood and the discriminative loss. Let Ty be the length of the output sequence y.\nThe loss can be written as\n\n$$L(x, y) = \\frac{1}{T_y} \\sum_{t=1}^{T_y} -\\log \\mathrm{softmax}_{y_t}(s_t) + \\lambda L_d,$$\n\nwhere the definition of softmaxy and st follows Eq. (2), and the formulation of Ld follows Eq. (3). \u03bb\nis a constant weighting factor. We adopt adaptive stochastic gradient descent (AdaGrad) [4] to train\nthe model in an end-to-end manner. 
The loss of a training batch is averaged over all instances in the\nbatch.\n\n3.6 Connection to Encoder-Decoders\n\nWe now show that our model can be reduced to the conventional (attentive) encoder-decoders in a\nspecial case. In attentive encoder-decoders, the decoder takes the context vector c and the set of\nencoder hidden states H = {ht}t as input, while in our review network, the input of the decoder is\ninstead the review vector r and the set of thought vectors F = {ft}t. To show that our model can be\nreduced to attentive encoder-decoders, we only need to construct a case where H = F and c = r.\nSince r = W(cid:48)[fTr ; c], it can be reduced to r = c with a speci\ufb01c setting of W(cid:48). We further set\nTr = Tx, and de\ufb01ne each reviewer unit as an identity mapping gt(H, ft\u22121) = ht, which satis\ufb01es the\nde\ufb01nition of both the attentive input reviewer and the attentive output reviewer with untied weights.\nWith the above setting, we have ht = ft,\u2200t = 1,\u00b7\u00b7\u00b7 , Tx; i.e., H = F . Thus our model can be\nreduced to attentive encoder-decoders in a special case. Similarly we can show that our model can\nbe reduced to vanilla encoder-decoders (without attention) by constructing a case where r = c and\nft = 0. Therefore, our model is more expressive than (attentive) encoder-decoders.\nThough we set Tr = Tx in the above construction, in practice, we set the number of review steps Tr\nto be much smaller compared to Tx, since we \ufb01nd that the review network can learn a more compact\nand effective representation.\n\n4 Experiments\n\nWe experiment with two datasets of different tasks, image captioning and source code captioning.\nSince these two tasks are quite different, we can use them to test the robustness and generalizability\nof our model.\n\n4.1\n\nImage Captioning\n\n4.1.1 Of\ufb02ine Evaluation\n\nWe evaluate our model on the MSCOCO benchmark dataset [2] for image captioning. 
The dataset\ncontains 123,000 images with at least 5 captions for each image. For offline evaluation, we use the\nsame data split as in [7, 20, 21], where we reserve 5,000 images each for development and test\nand use the rest for training. The models are evaluated using the official MSCOCO evaluation scripts.\nWe report three widely used automatic evaluation metrics, BLEU-4, METEOR, and CIDEr.\nWe remove all the non-alphabetic characters in the captions, transform all letters to lowercase, and\ntokenize the captions using white space. We replace all words occurring less than 5 times with an\nunknown token and obtain a vocabulary of 9,520 words. We truncate all the captions longer\nthan 30 tokens.\nWe set the number of review steps Tr = 8, the weighting factor \u03bb = 10.0, the dimension of word\nembeddings to be 100, the learning rate to be 1e\u22122, and the dimension of LSTM hidden states to\nbe 1,024. These hyperparameters are tuned on the development set. We also use early stopping\nstrategies to prevent overfitting. More specifically, we stop the training procedure when the BLEU-4\nscore on the development set reaches the maximum. We use an MLP with one hidden layer of size\n512 to define the function \u03b1(\u00b7,\u00b7) in the attention mechanism, and use an attentive input reviewer in\nour experiments to be consistent with visual attention models [20]. We use beam search with beam\nsize 3 for decoding. We guide the model to predict the words occurring in the caption through the\ndiscriminative supervision Ld without introducing extra information. We fix the parameters of the\nCNN encoders during training.\n\nTable 1: Comparison of model variants on MSCOCO dataset. Results are obtained with a single model using\nVGGNet. Scores in the brackets are without beam search. We use RNN-like tied weights for the review network\nunless otherwise indicated. \u201cDisc Sup\u201d means discriminative supervision.\n\nModel | BLEU-4 | METEOR | CIDEr\nAttentive Encoder-Decoder | 0.278 (0.255) | 0.229 (0.223) | 0.840 (0.793)\nReview Net | 0.282 (0.259) | 0.233 (0.227) | 0.852 (0.816)\nReview Net + Disc Sup | 0.287 (0.264) | 0.238 (0.232) | 0.879 (0.833)\nReview Net + Disc Sup + Untied Weights | 0.290 (0.268) | 0.237 (0.232) | 0.886 (0.852)\n\nTable 2: Comparison with state-of-the-art systems on the MSCOCO evaluation server. \u2020 indicates ensemble\nmodels. Feat. means using task-specific features or attributes. Fine. means using CNN fine-tuning.\n\nModel | BLEU-4 | METEOR | ROUGE-L | CIDEr | Fine. | Feat.\nAttention [20] | 0.537 | 0.322 | 0.654 | 0.893 | No | No\nMS Research [5] | 0.567 | 0.331 | 0.662 | 0.925 | No | Yes\nGoogle NIC [18]\u2020 | 0.587 | 0.346 | 0.682 | 0.946 | Yes | No\nSemantic Attention [21]\u2020 | 0.599 | 0.335 | 0.682 | 0.958 | No | Yes\nReview Net (this paper)\u2020 | 0.597 | 0.347 | 0.686 | 0.969 | No | No\n\nWe compare our model with encoder-decoders to study the effectiveness of the review network. We\nalso compare different variants of our model to evaluate the effects of different weight tying strategies\nand discriminative supervision. Results are reported in Table 1. All the results in Table 1 are obtained\nusing VGGNet [13] as encoders as described in Section 3.1.\nFrom Table 1, we can see that the review network can improve the performance over conventional\nattentive encoder-decoders consistently on all the three metrics. 
We also observe that adding discriminative\nsupervision can boost the performance, which demonstrates the effectiveness of incorporating\ndiscriminative supervision in an end-to-end manner. Untying the weights between the reviewer units\ncan further improve the performance. Our conjecture is that the models with untied weights are\nmore expressive than shared-weight models since each unit can have its own parametric function to\ncompute the thought vector. In addition to Table 1, our experiment shows that applying discriminative\nsupervision on attentive encoder-decoders can improve the CIDEr score from 0.793 to 0.811 without\nbeam search. We did experiments on the development set with Tr = 0, 4, 8, and 16. The performances\nwhen Tr = 4 and Tr = 16 are slightly worse than Tr = 8 (\u22120.003 in BLEU-4 and \u22120.01 in CIDEr).\nWe also experimented on the development set with \u03bb = 0, 5, 10, and 20, and \u03bb = 10 gives the best\nperformance.\n\n4.1.2 Online Evaluation on MSCOCO Server\n\nWe also compare our model with state-of-the-art systems on the MSCOCO evaluation server in Table\n2. Our submission uses Inception-v3 [16] as the encoder and is an ensemble of three identical models\nwith different random initialization. We take the output of the last convolutional layer (before pooling)\nas the encoder states. From Table 2, we can see that among state-of-the-art published systems, the\nreview network achieves the best performance for three out of four metrics (i.e., METEOR, ROUGE-L,\nand CIDEr), and has very close performance to Semantic Attention [21] on BLEU-4 score.\nThe Google NIC system [18] employs several tricks such as CNN fine-tuning and scheduled sampling\nand takes more than two weeks to train; the semantic attention system requires hand-engineering\ntask-specific features/attributes. 
Unlike these methods, our approach with the review network is a\ngeneric end-to-end encoder-decoder model and can be trained within six hours on a Titan X GPU.\n\n6\n\n\fFigure 3: Each row corresponds to a test image: the \ufb01rst is the original image with the caption output by our\nmodel, and the following three images are the visualized attention weights of the \ufb01rst three reviewer units. We\nalso list the top-5 words with highest scores for each unit. Colors indicate semantically similar words.\n\n4.1.3 Case Study and Visualization\n\nTo better understand the review network, we visualize the attention weights \u03b1 in the review network\nin Figure 3. The visualization is based on the review network with untied weights and discriminative\nsupervision. We also list the top-5 words with highest scores (computed based on the thought vectors)\nat each reviewer unit.\nWe \ufb01nd that the top words with highest scores can uncover the reasoning procedure underlying the\nreview network. For example, in the \ufb01rst image (a giraffe in a zoo), the \ufb01rst reviewer focuses on the\nmotion of the giraffe and the tree near it, the second reviewer analyzes the relative position between\nthe giraffe and the tree, and the third reviewer looks at the big picture and infers that the scene is\nin a zoo based on recognizing the fences and enclosures. All the above information is stored in the\nthought vectors and decoded as natural language by the decoder.\nDifferent from attentive encoder-decoders [20] that attend to a single object at a time during generation,\nit can be clearly seen from Figure 3 that the review network captures more global signals, usually\ncombining multiple objects into one thought, including objects not \ufb01nally shown in the caption\n(e.g., \u201ctraf\ufb01c light\u201d and \u201cmotorcycles\u201d). 
The thoughts are sometimes abstractive, such as motion\n(\u201cstanding\u201d), relative position (\u201cnear\u201d, \u201cby\u201d, \u201cup\u201d), quantity (\u201cbunch\u201d, \u201cgroup\u201d), and scene (\u201ccity\u201d,\n\u201czoo\u201d). Also, the order of review is not restricted by the order in natural language.\n\n4.2 Source Code Captioning\n\n4.2.1 Data and Settings\n\nThe task of source code captioning is to predict the code comment given the source code, which can\nbe framed under the problem of sequence-to-sequence learning. We experiment with a benchmark\ndataset for source code captioning, HabeasCorpus [11]. HabeasCorpus collects nine popular open-source\nJava code repositories, such as Apache Ant and Lucene. The dataset contains 6,734 Java\nsource code files with 7,903,872 source code tokens and 251,565 comment word tokens. We\nrandomly sample 10% of the files as the test set, 10% as the development set, and use the rest for\ntraining. We use the development set for early stopping and hyperparameter tuning.\n\nTable 3: Comparison of model variants on HabeasCorpus code captioning dataset. \u201cBidir\u201d indicates using\nbidirectional RNN encoders, \u201cLLH\u201d refers to log-likelihood, \u201cCS-k\u201d refers to top-k character savings.\n\nModel | LLH | CS-1 | CS-2 | CS-3 | CS-4 | CS-5\nLanguage Model | -5.34 | 0.2340 | 0.2763 | 0.3000 | 0.3153 | 0.3290\nEncoder-Decoder | -5.25 | 0.2535 | 0.2976 | 0.3201 | 0.3367 | 0.3507\nEncoder-Decoder (Bidir) | -5.19 | 0.2632 | 0.3068 | 0.3290 | 0.3442 | 0.3570\nAttentive Encoder-Decoder (Bidir) | -5.14 | 0.2716 | 0.3152 | 0.3364 | 0.3523 | 0.3651\nReview Net | -5.06 | 0.2889 | 0.3361 | 0.3579 | 0.3731 | 0.3840\n\nOur evaluation follows previous works on source code language modeling [10] and captioning [11].\nWe report the log-likelihood of generating the actual code captions based on the learned models. 
We\nalso evaluate the approaches from the perspective of code comment completion, where we compute\nthe percentage of characters that can be saved by applying the models to predict the next token. More\nspeci\ufb01cally, we use a metric of top-k character savings [11] (CS-k). Let n be the minimum number\nof pre\ufb01x characters needed to be \ufb01ltered such that the actual word ranks among the top-k based on\nthe given model. Let L be the length of the actual word. The number of saved characters is then\nL \u2212 n. We compute the average percentage of saved characters per comment to obtain the metric\nCS-k.\nWe follow the tokenization used in [11], where we transform camel case identi\ufb01ers into multiple\nseparate words (e.g., \u201cbinaryClassi\ufb01erEnsemble\u201d to \u201cbinary classi\ufb01er ensemble\u201d), and remove all\nnon-alphabetic characters. We truncate code sequences and comment sequences longer than 300\ntokens. We use an RNN encoder and an attentive output reviewer with tied weights. We set the\nnumber of review steps Tr = 8, the dimension of word embeddings to be 50, and the dimension of\nthe LSTM hidden states to be 256.\n\n4.2.2 Results\n\nWe report the log-likelihood and top-k character savings of different model variants in Table 3. The\nbaseline model \u201cLanguage Model\u201d is an LSTM decoder whose output is not sensitive to the input code\nsequence. A preliminary experiment showed that the LSTM decoder signi\ufb01cantly outperforms the N-\ngram models used in [11] (+3% in CS-2), so we use the LSTM decoder as a baseline for comparison.\nWe also compare with different variants of encoder-decoders, including incorporating bidirectional\nRNN encoders and attention mechanism. It can be seen from Table 3 that both bidirectional encoders\nand attention mechanism can improve over vanilla encoder-decoders. 
The review network outperforms\nattentive encoder-decoders consistently in all the metrics, which indicates that the review network is\neffective at learning useful representation.\n\n5 Conclusion\n\nWe present a novel architecture, the review network, to improve the encoder-decoder learning\nframework. The review network performs multiple review steps with attention on the encoder hidden\nstates, and computes a set of thought vectors that summarize the global information in the input. We\nempirically show consistent improvement over conventional encoder-decoders on the tasks of image\ncaptioning and source code captioning. In the future, it will be interesting to apply our model to more\ntasks that can be modeled under the encoder-decoder framework, such as machine translation and\ntext summarization.\nAcknowledgements This work was funded by the NSF under grants CCF-1414030 and IIS-1250956,\nGoogle, Disney Research, the ONR grant N000141512791, and the ADeLAIDE grant FA8750-16C-\n0130-001.\n\n8\n\n\fReferences\n[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\n\nlearning to align and translate. In ICLR, 2015.\n\n[2] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll\u00e1r,\nand C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv\npreprint arXiv:1504.00325, 2015.\n\n[3] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,\nHolger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-\ndecoder for statistical machine translation. In ACL, 2014.\n\n[4] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning\n\nand stochastic optimization. JMLR, 12:2121\u20132159, 2011.\n\n[5] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Doll\u00e1r, Jianfeng\nGao, Xiaodong He, Margaret Mitchell, John C Platt, et al. 
From captions to visual concepts and\nback. In CVPR, pages 1473\u20131482, 2015.\n\n[6] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint\n\narXiv:1603.08983, 2016.\n\n[7] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image\n\ndescriptions. In CVPR, pages 3128\u20133137, 2015.\n\n[8] Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter\nOndruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks\nfor natural language processing. In ICML, 2016.\n\n[9] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-\n\nbased neural machine translation. In ACL, 2015.\n\n[10] Chris J Maddison and Daniel Tarlow. Structured generative models of natural source code. In\n\nICML, 2014.\n\n[11] Dana Movshovitz-Attias and William W Cohen. Natural language models for predicting\n\nprogramming comments. In ACL, 2013.\n\n[12] Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive\n\nsentence summarization. In EMNLP, 2015.\n\n[13] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale\n\nimage recognition. In ICLR, 2015.\n\n[14] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In NIPS,\n\npages 2431\u20132439, 2015.\n\n[15] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural\n\nnetworks. In NIPS, pages 3104\u20133112, 2014.\n\n[16] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna.\nRethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567,\n2015.\n\n[17] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for\n\nsets. In ICLR, 2016.\n\n[18] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural\n\nimage caption generator. 
In CVPR, pages 3156\u20133164, 2015.\n\n[19] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In ICLR, 2015.\n[20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and\nYoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention.\nIn ICML, 2015.\n\n[21] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with\n\nsemantic attention. In CVPR, 2016.\n\n9\n\n\f", "award": [], "sourceid": 1235, "authors": [{"given_name": "Zhilin", "family_name": "Yang", "institution": "Carnegie Mellon University"}, {"given_name": "Ye", "family_name": "Yuan", "institution": "Carnegie Mellon University"}, {"given_name": "Yuexin", "family_name": "Wu", "institution": "Carnegie Mellon University"}, {"given_name": "William", "family_name": "Cohen", "institution": "Carnegie Mellon University"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "University of Toronto"}]}