{"title": "Skip-Thought Vectors", "book": "Advances in Neural Information Processing Systems", "page_first": 3294, "page_last": 3302, "abstract": "We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.", "full_text": "Skip-Thought Vectors\n\nRyan Kiros 1, Yukun Zhu 1, Ruslan Salakhutdinov 1,2, Richard S. Zemel 1,2\n\nAntonio Torralba 3, Raquel Urtasun 1, Sanja Fidler 1\n\nUniversity of Toronto 1\n\nCanadian Institute for Advanced Research 2\n\nMassachusetts Institute of Technology 3\n\nAbstract\n\nWe describe an approach for unsupervised learning of a generic, distributed sen-\ntence encoder. Using the continuity of text from books, we train an encoder-\ndecoder model that tries to reconstruct the surrounding sentences of an encoded\npassage. Sentences that share semantic and syntactic properties are thus mapped\nto similar vector representations. We next introduce a simple vocabulary expan-\nsion method to encode words that were not seen as part of training, allowing us\nto expand our vocabulary to a million words. 
After training our model, we ex-\ntract and evaluate our vectors with linear models on 8 tasks: semantic relatedness,\nparaphrase detection, image-sentence ranking, question-type classi\ufb01cation and 4\nbenchmark sentiment and subjectivity datasets. The end result is an off-the-shelf\nencoder that can produce highly generic sentence representations that are robust\nand perform well in practice.\n\n1\n\nIntroduction\n\nDeveloping learning algorithms for distributed compositional semantics of words has been a long-\nstanding open problem at the intersection of language understanding and machine learning. In recent\nyears, several approaches have been developed for learning composition operators that map word\nvectors to sentence vectors including recursive networks [1], recurrent networks [2], convolutional\nnetworks [3, 4] and recursive-convolutional methods [5, 6] among others. All of these methods\nproduce sentence representations that are passed to a supervised task and depend on a class label in\norder to backpropagate through the composition weights. Consequently, these methods learn high-\nquality sentence representations but are tuned only for their respective task. The paragraph vector\nof [7] is an alternative to the above models in that it can learn unsupervised sentence representations\nby introducing a distributed sentence indicator as part of a neural language model. The downside is\nat test time, inference needs to be performed to compute a new vector.\nIn this paper we abstract away from the composition methods themselves and consider an alterna-\ntive loss function that can be applied with any composition operator. We consider the following\nquestion: is there a task and a corresponding loss that will allow us to learn highly generic sentence\nrepresentations? We give evidence for this by proposing a model for learning high-quality sentence\nvectors without a particular supervised task in mind. 
Using word vector learning as inspiration, we\npropose an objective function that abstracts the skip-gram model of [8] to the sentence level. That\nis, instead of using a word to predict its surrounding context, we instead encode a sentence to predict\nthe sentences around it. Thus, any composition operator can be substituted as a sentence encoder\nand only the objective function becomes modified. Figure 1 illustrates the model. We call our model\nskip-thoughts and vectors induced by our model are called skip-thought vectors.\nOur model depends on having a training corpus of contiguous text. We chose to use a large collection\nof novels, namely the BookCorpus dataset [9] for training our models. These are free books written\nby yet unpublished authors. The dataset has books in 16 different genres, e.g., Romance (2,865\nbooks), Fantasy (1,479), Science fiction (786), Teen (430), etc. Table 1 highlights the summary\nstatistics of the book corpus. Along with narratives, books contain dialogue, emotion and a wide\nrange of interaction between characters. Furthermore, with a large enough collection the training\nset is not biased towards any particular domain or application. Table 2 shows nearest neighbours\n\nFigure 1: The skip-thoughts model. Given a tuple $(s_{i-1}, s_i, s_{i+1})$ of contiguous sentences, with $s_i$\nthe i-th sentence of a book, the sentence $s_i$ is encoded and tries to reconstruct the previous sentence\n$s_{i-1}$ and next sentence $s_{i+1}$. In this example, the input is the sentence triplet I got back home. I\ncould see the cat on the steps. This was strange. Unattached arrows are connected to the encoder\noutput. Colors indicate which components share parameters. \u27e8eos\u27e9 is the end of sentence token.\n\n# of books | # of sentences | # of words | # of unique words | mean # of words per sentence\n11,038 | 74,004,228 | 984,846,357 | 1,316,420 | 13\n\nTable 1: Summary statistics of the BookCorpus dataset [9]. 
We use this corpus to train our\nmodel.\n\nof sentences from a model trained on the BookCorpus dataset. These results show that skip-thought\nvectors learn to accurately capture semantics and syntax of the sentences they encode.\nWe evaluate our vectors in a newly proposed setting: after learning skip-thoughts, freeze the model\nand use the encoder as a generic feature extractor for arbitrary tasks. In our experiments we consider\n8 tasks: semantic-relatedness, paraphrase detection, image-sentence ranking and 5 standard\nclassification benchmarks. In these experiments, we extract skip-thought vectors and train linear\nmodels to evaluate the representations directly, without any additional fine-tuning. As it turns out,\nskip-thoughts yield generic representations that perform robustly across all tasks considered.\nOne difficulty that arises with such an experimental setup is being able to construct a large enough\nword vocabulary to encode arbitrary sentences. For example, a sentence from a Wikipedia article\nmight contain nouns that are highly unlikely to appear in our book vocabulary. We solve this problem\nby learning a mapping that transfers word representations from one model to another. Using pretrained\nword2vec representations learned with a continuous bag-of-words model [8], we learn a\nlinear mapping from a word in word2vec space to a word in the encoder\u2019s vocabulary space. The\nmapping is learned using all words that are shared between vocabularies. After training, any word\nthat appears in word2vec can then get a vector in the encoder word embedding space.\n\n2 Approach\n\n2.1 Inducing skip-thought vectors\n\nWe treat skip-thoughts in the framework of encoder-decoder models 1. That is, an encoder maps\nwords to a sentence vector and a decoder is used to generate the surrounding sentences. Encoder-decoder\nmodels have gained a lot of traction for neural machine translation. In this setting, an\nencoder is used to map e.g. 
an English sentence into a vector. The decoder then conditions on this\nvector to generate a translation for the source English sentence. Several choices of encoder-decoder\npairs have been explored, including ConvNet-RNN [10], RNN-RNN [11] and LSTM-LSTM [12].\nThe source sentence representation can also dynamically change through the use of an attention\nmechanism [13] to take into account only the relevant words for translation at any given time. In our\nmodel, we use an RNN encoder with GRU [14] activations and an RNN decoder with a conditional\nGRU. This model combination is nearly identical to the RNN encoder-decoder of [11] used in neural\nmachine translation. GRU has been shown to perform as well as LSTM [2] on sequence modelling\ntasks [14] while being conceptually simpler. GRU units have only 2 gates and do not require the use\nof a cell. While we use RNNs for our model, any encoder and decoder can be used so long as we\ncan backpropagate through it.\nAssume we are given a sentence tuple $(s_{i-1}, s_i, s_{i+1})$. Let $w_i^t$ denote the t-th word for sentence $s_i$\nand let $x_i^t$ denote its word embedding. We describe the model in three parts: the encoder, decoder\nand objective function.\nEncoder. Let $w_i^1, \\ldots, w_i^N$ be the words in sentence $s_i$ where N is the number of words in the\nsentence. At each time step, the encoder produces a hidden state $h_i^t$ which can be interpreted as the\nrepresentation of the sequence $w_i^1, \\ldots, w_i^t$. 
The hidden state $h_i^N$ thus represents the full sentence.\n\n1 A preliminary version of our model was developed in the context of a computer vision application [9].\n\nQuery and nearest sentence\nhe ran his hand inside his coat , double-checking that the unopened letter was still there .\nhe slipped his hand between his coat and his shirt , where the folded copies lay in a brown envelope .\nim sure youll have a glamorous evening , she said , giving an exaggerated wink .\nim really glad you came to the party tonight , he said , turning to her .\nalthough she could tell he had n\u2019t been too invested in any of their other chitchat , he seemed genuinely curious about this .\nalthough he had n\u2019t been following her career with a microscope , he \u2019d definitely taken notice of her appearances .\nan annoying buzz started to ring in my ears , becoming louder and louder as my vision began to swim .\na weighty pressure landed on my lungs and my vision blurred at the edges , threatening my consciousness altogether .\nif he had a weapon , he could maybe take out their last imp , and then beat up errol and vanessa .\nif he could ram them from behind , send them sailing over the far side of the levee , he had a chance of stopping them .\nthen , with a stroke of luck , they saw the pair head together towards the portaloos .\nthen , from out back of the house , they heard a horse scream probably in answer to a pair of sharp spurs digging deep into its flanks .\n\u201c i \u2019ll take care of it , \u201d goodman said , taking the phonebook .\n\u201c i \u2019ll do that , \u201d julia said , coming in .\nhe finished rolling up scrolls and , placing them to one side , began the more urgent task of finding ale and tankards .\nhe righted the table , set the candle on a piece of broken plate , and reached for his flint , steel , and tinder .\n\nTable 2: In each example, the first sentence is a query while the second sentence is its nearest\nneighbour. 
Nearest neighbours were scored by cosine similarity from a random sample of 500,000\nsentences from our corpus.\n\nTo encode a sentence, we iterate the following sequence of equations (dropping the subscript i):\n\n$$r^t = \\sigma(W_r x^t + U_r h^{t-1}) \\tag{1}$$\n$$z^t = \\sigma(W_z x^t + U_z h^{t-1}) \\tag{2}$$\n$$\\bar{h}^t = \\tanh(W x^t + U(r^t \\odot h^{t-1})) \\tag{3}$$\n$$h^t = (1 - z^t) \\odot h^{t-1} + z^t \\odot \\bar{h}^t \\tag{4}$$\n\nwhere $\\bar{h}^t$ is the proposed state update at time t, $z^t$ is the update gate, $r^t$ is the reset gate and $\\odot$ denotes\na component-wise product. Both gates take values between zero and one.\nDecoder. The decoder is a neural language model which conditions on the encoder output $h_i$. The\ncomputation is similar to that of the encoder except we introduce matrices $C_z$, $C_r$ and $C$ that are\nused to bias the update gate, reset gate and hidden state computation by the sentence vector. One\ndecoder is used for the next sentence $s_{i+1}$ while a second decoder is used for the previous sentence\n$s_{i-1}$. Separate parameters are used for each decoder with the exception of the vocabulary matrix V,\nwhich is the weight matrix connecting the decoder\u2019s hidden state for computing a distribution over\nwords. In what follows we describe the decoder for the next sentence $s_{i+1}$ although an analogous\ncomputation is used for the previous sentence $s_{i-1}$. Let $h_{i+1}^t$ denote the hidden state of the decoder\nat time t. 
Decoding involves iterating through the following sequence of equations (dropping the\nsubscript i + 1):\n\n$$r^t = \\sigma(W_r^d x^{t-1} + U_r^d h^{t-1} + C_r h_i) \\tag{5}$$\n$$z^t = \\sigma(W_z^d x^{t-1} + U_z^d h^{t-1} + C_z h_i) \\tag{6}$$\n$$\\bar{h}^t = \\tanh(W^d x^{t-1} + U^d(r^t \\odot h^{t-1}) + C h_i) \\tag{7}$$\n$$h_{i+1}^t = (1 - z^t) \\odot h^{t-1} + z^t \\odot \\bar{h}^t \\tag{8}$$\n\nGiven $h_{i+1}^t$, the probability of word $w_{i+1}^t$ given the previous t \u2212 1 words and the encoder vector is\n\n$$P(w_{i+1}^t \\mid w_{i+1}^{<t}, h_i) \\propto \\exp(v_{w_{i+1}^t} h_{i+1}^t) \\tag{9}$$\n\nwhere $v_{w_{i+1}^t}$ denotes the row of V corresponding to the word of $w_{i+1}^t$. An analogous computation\nis performed for the previous sentence $s_{i-1}$.\nObjective. Given a tuple $(s_{i-1}, s_i, s_{i+1})$, the objective optimized is the sum of the log-probabilities\nfor the forward and backward sentences conditioned on the encoder representation:\n\n$$\\sum_t \\log P(w_{i+1}^t \\mid w_{i+1}^{<t}, h_i) + \\sum_t \\log P(w_{i-1}^t \\mid w_{i-1}^{<t}, h_i) \\tag{10}$$\n\nThe total objective is the above summed over all such training tuples.\n\n2.2 Vocabulary expansion\n\nWe now describe how to expand our encoder\u2019s vocabulary to words it has not seen during training.\nSuppose we have a model that was trained to induce word representations, such as word2vec. Let\n$V_{w2v}$ denote the word embedding space of these word representations and let $V_{rnn}$ denote the RNN\nword embedding space. We assume the vocabulary of $V_{w2v}$ is much larger than that of $V_{rnn}$. Our\ngoal is to construct a mapping $f : V_{w2v} \\to V_{rnn}$ parameterized by a matrix W such that $v' = Wv$\nfor $v \\in V_{w2v}$ and $v' \\in V_{rnn}$. Inspired by [15], which learned linear mappings between translation\nword spaces, we solve an un-regularized L2 linear regression loss for the matrix W. 
Thus, any word\nfrom $V_{w2v}$ can now be mapped into $V_{rnn}$ for encoding sentences.\n\n3 Experiments\n\nIn our experiments, we evaluate the capability of our encoder as a generic feature extractor after\ntraining on the BookCorpus dataset. Our experimental setup on each task is as follows:\n\u2022 Using the learned encoder as a feature extractor, extract skip-thought vectors for all sentences.\n\u2022 If the task involves computing scores between pairs of sentences, compute component-wise features\nbetween pairs. This is described in more detail specifically for each experiment.\n\u2022 Train a linear classifier on top of the extracted features, with no additional fine-tuning or backpropagation\nthrough the skip-thoughts model.\n\nWe restrict ourselves to linear classifiers for two reasons. The first is to directly evaluate the representation\nquality of the computed vectors. It is possible that additional performance gains can be\nmade throughout our experiments with non-linear models but this falls out of scope of our goal. Furthermore,\nit allows us to better analyze the strengths and weaknesses of the learned representations.\nThe second reason is that reproducibility now becomes very straightforward.\n\n3.1 Details of training\n\nTo induce skip-thought vectors, we train two separate models on our book corpus. One is a unidirectional\nencoder with 2400 dimensions, which we subsequently refer to as uni-skip. The other is\na bidirectional model with forward and backward encoders of 1200 dimensions each. This model\ncontains two encoders with different parameters: one encoder is given the sentence in correct order,\nwhile the other is given the sentence in reverse. The outputs are then concatenated to form a\n2400 dimensional vector. We refer to this model as bi-skip. For training, we initialize all recurrent\nmatrices with orthogonal initialization [16]. 
Non-recurrent weights are initialized from a uniform\ndistribution in [-0.1,0.1]. Mini-batches of size 128 are used and gradients are clipped if the norm of\nthe parameter vector exceeds 10. We used the Adam algorithm [17] for optimization. Both models\nwere trained for roughly two weeks. As an additional experiment, we also report experimental\nresults using a combined model, consisting of the concatenation of the vectors from uni-skip and\nbi-skip, resulting in a 4800 dimensional vector. We refer to this model throughout as combine-skip.\nAfter our models are trained, we then employ vocabulary expansion to map word embeddings into\nthe RNN encoder space. The publicly available CBOW word vectors are used for this purpose 2.\nThe skip-thought models are trained with a vocabulary size of 20,000 words. After removing\nmultiple word examples from the CBOW model, this results in a vocabulary size of 930,911 words.\nThus even though our skip-thoughts model was trained with only 20,000 words, after vocabulary\nexpansion we can now successfully encode 930,911 possible words.\nSince our goal is to evaluate skip-thoughts as a general feature extractor, we keep text pre-processing\nto a minimum. When encoding new sentences, no additional preprocessing is done other than basic\ntokenization. This is done to test the robustness of our vectors. As an additional baseline, we also\nconsider the mean of the word vectors learned from the uni-skip model. We refer to this baseline as\nbow. This is to determine the effectiveness of a standard baseline trained on the BookCorpus.\n\n3.2 Semantic relatedness\n\nOur first experiment is on the SemEval 2014 Task 1: semantic relatedness SICK dataset [30]. Given\ntwo sentences, our goal is to produce a score of how semantically related these sentences are, based\non human generated scores. Each score is the average of 10 different human annotators. Scores\ntake values between 1 and 5. 
A score of 1 indicates that the sentence pair is not at all related, while\n\n2http://code.google.com/p/word2vec/\n\n4\n\n\fMethod\nIllinois-LH [18]\nUNAL-NLP [19]\nMeaning Factory [20]\nECNU [21]\nMean vectors [22]\nDT-RNN [23]\nSDT-RNN [23]\nLSTM [22]\nBidirectional LSTM [22]\nDependency Tree-LSTM [22]\nbow\nuni-skip\nbi-skip\ncombine-skip\ncombine-skip+COCO\n\nr\n\n0.7993\n0.8070\n0.8268\n0.8414\n0.7577\n0.7923\n0.7900\n0.8528\n0.8567\n0.8676\n0.7823\n0.8477\n0.8405\n0.8584\n0.8655\n\n\u03c1\n\n0.7538\n0.7489\n0.7721\n\n\u2013\n\n0.6738\n0.7319\n0.7304\n0.7911\n0.7966\n0.8083\n0.7235\n0.7780\n0.7696\n0.7916\n0.7995\n\nMSE\n0.3692\n0.3550\n0.3224\n\n\u2013\n\n0.4557\n0.3822\n0.3848\n0.2831\n0.2736\n0.2532\n0.3975\n0.2872\n0.2995\n0.2687\n0.2561\n\nMethod\nfeats [24]\nRAE+DP [24]\nRAE+feats [24]\nRAE+DP+feats [24]\nFHS [25]\nPE [26]\nWDDP [27]\nMTMETRICS [28]\nTF-KLD [29]\nbow\nuni-skip\nbi-skip\ncombine-skip\ncombine-skip + feats\n\nAcc\n73.2\n72.6\n74.2\n76.8\n75.0\n76.1\n75.6\n77.4\n80.4\n67.8\n73.0\n71.2\n73.0\n75.8\n\nF1\n\n83.6\n82.7\n82.7\n83.0\n84.1\n86.0\n80.3\n81.9\n81.2\n82.0\n83.0\n\nTable 3: Left: Test set results on the SICK semantic relatedness subtask. The evaluation metrics\nare Pearson\u2019s r, Spearman\u2019s \u03c1, and mean squared error. The \ufb01rst group of results are SemEval 2014\nsubmissions, while the second group are results reported by [22]. Right: Test set results on the\nMicrosoft Paraphrase Corpus. The evaluation metrics are classi\ufb01cation accuracy and F1 score. Top:\nrecursive autoencoder variants. Middle: the best published results on this dataset.\n\na score of 5 indicates they are highly related. The dataset comes with a prede\ufb01ned split of 4500\ntraining pairs, 500 development pairs and 4927 testing pairs. All sentences are derived from existing\nimage and video annotation datasets. 
The evaluation metrics are Pearson\u2019s r, Spearman\u2019s \u03c1, and\nmean squared error.\nGiven the difficulty of this task, many existing systems employ a large amount of feature engineering\nand additional resources. Thus, we test how well our learned representations fare against heavily engineered\npipelines. Recently, [22] showed that learning representations with LSTM or Tree-LSTM\nfor the task at hand is able to outperform these existing systems. We take this one step further\nand see how well our vectors learned from a completely different task are able to capture semantic\nrelatedness when only a linear model is used on top to predict scores.\nTo represent a sentence pair, we use two features. Given two skip-thought vectors u and v, we\ncompute their component-wise product u \u00b7 v and their absolute difference |u \u2212 v| and concatenate\nthem together. These two features were also used by [22]. To predict a score, we use the same\nsetup as [22]. Let $r^\\top = [1, \\ldots, 5]$ be an integer vector from 1 to 5. We compute a distribution p\nas a function of prediction scores y given by $p_i = y - \\lfloor y \\rfloor$ if $i = \\lfloor y \\rfloor + 1$, $p_i = \\lfloor y \\rfloor - y + 1$ if\n$i = \\lfloor y \\rfloor$ and 0 otherwise. These then become our targets for a logistic regression classifier. At test\ntime, given new sentence pairs we first compute targets $\\hat{p}$ and then compute the relatedness score as $r^\\top \\hat{p}$.\nAs an additional comparison, we also explored appending features derived from an image-sentence\nembedding model trained on COCO (see section 3.4). Given vectors u and v, we obtain vectors $u'$\nand $v'$ from the learned linear embedding model and compute features $u' \\cdot v'$ and $|u' - v'|$. These\nare then concatenated to the existing features.\nTable 3 (left) presents our results. 
First, we observe that our models are able to outperform all\nprevious systems from the SemEval 2014 competition. It highlights that skip-thought vectors learn\nrepresentations that are well suited for semantic relatedness. Our results are comparable to LSTMs\nwhose representations are trained from scratch on this task. Only the dependency tree-LSTM of [22]\nperforms better than our results. We note that the dependency tree-LSTM relies on parsers whose\ntraining data is very expensive to collect and does not exist for all languages. We also observe\nusing features learned from an image-sentence embedding model on COCO gives an additional\nperformance boost, resulting in a model that performs on par with the dependency tree-LSTM. To\nget a feel for the model outputs, Table 4 shows example cases of test set pairs. Our model is able to\naccurately predict relatedness on many challenging cases. On some examples, it fails to pick up on\nsmall distinctions that drastically change a sentence meaning, such as tricks on a motorcycle versus\ntricking a person on a motorcycle.\n\n3.3 Paraphrase detection\n\nThe next task we consider is paraphrase detection on the Microsoft Research Paraphrase Cor-\npus [31]. 
On this task, two sentences are given and one must predict whether or not they are\n\n5\n\n\fSentence 1\nA little girl is looking at a woman in costume\nA little girl is looking at a woman in costume\nA little girl is looking at a woman in costume\nA sea turtle is hunting for \ufb01sh\nA sea turtle is not hunting for \ufb01sh\nA man is driving a car\nThere is no man driving the car\nA large duck is \ufb02ying over a rocky stream\nA large duck is \ufb02ying over a rocky stream\nA person is performing acrobatics on a motorcycle\nA person is performing tricks on a motorcycle\nSomeone is pouring ingredients into a pot\nNobody is pouring ingredients into a pot\nSomeone is pouring ingredients into a pot\n\nSentence 2\nA young girl is looking at a woman in costume\nThe little girl is looking at a man in costume\nA little girl in costume looks like a woman\nA sea turtle is hunting for food\nA sea turtle is hunting for \ufb01sh\nThe car is being driven by a man\nA man is driving a car\nA duck, which is large, is \ufb02ying over a rocky stream\nA large stream is full of rocks, ducks and \ufb02ies\nA person is performing tricks on a motorcycle\nThe performer is tricking a person on a motorcycle\nSomeone is adding ingredients to a pot\nSomeone is pouring ingredients into a pot\nA man is removing vegetables from a pot\n\nGT\n4.7\n3.8\n2.9\n4.5\n3.4\n5\n3.6\n4.8\n2.7\n4.3\n2.6\n4.4\n3.5\n2.4\n\npred\n4.5\n4.0\n3.5\n4.5\n3.8\n4.9\n3.5\n4.9\n3.1\n4.4\n4.4\n4.0\n4.2\n3.6\n\nTable 4: Example predictions from the SICK test set. GT is the ground truth relatedness, scored\nbetween 1 and 5. 
The last few results show examples where slight changes in sentence structure\nresult in large changes in relatedness which our model was unable to score correctly.\n\nImage Search\n\nModel\nRandom Ranking\nDVSA [32]\nGMM+HGLMM [33]\nm-RNN [34]\nbow\nuni-skip\nbi-skip\ncombine-skip\n\nR@1 R@5 R@10 Med r R@1 R@5 R@10 Med r\n500\n0.1\n3\n38.4\n39.4\n4\n3\n41.0\n4\n33.6\n4\n30.6\n4\n32.7\n33.8\n4\n\n1.0\n74.8\n76.6\n77.0\n73.5\n71.7\n73.2\n74.6\n\n0.5\n60.2\n59.8\n42.2\n57.1\n56.4\n57.1\n60.0\n\n0.1\n27.4\n25.1\n29.0\n24.4\n22.7\n24.2\n25.9\n\nCOCO Retrieval\n\nImage Annotation\n\n0.6\n69.6\n67.9\n73.0\n65.8\n64.5\n67.3\n67.7\n\n1.1\n80.5\n80.9\n83.5\n79.7\n79.8\n79.6\n82.1\n\n631\n1\n2\n2\n3\n3\n3\n3\n\nTable 5: COCO test-set results for image-sentence retrieval experiments. R@K is Recall@K (high\nis good). Med r is the median rank (low is good).\n\nparaphrases. The training set consists of 4076 sentence pairs (2753 which are positive) and the\ntest set has 1725 pairs (1147 are positive). We compute a vector representing the pair of sentences\nin the same way as on the SICK dataset, using the component-wise product u \u00b7 v and their absolute\ndifference |u \u2212 v| which are then concatenated together. We then train logistic regression on top to\npredict whether the sentences are paraphrases. Cross-validation is used for tuning the L2 penalty.\nAs in the semantic relatedness task, paraphrase detection has largely been dominated by extensive\nfeature engineering, or a combination of feature engineering with semantic spaces. We report exper-\niments in two settings: one using the features as above and the other incorporating basic statistics\nbetween sentence pairs, the same features used by [24]. 
These are referred to as feats in our results.\nWe isolate the results and baselines used in [24] as well as the top published results on this task.\nTable 3 (right) presents our results, from which we can observe the following: (1) skip-thoughts\nalone outperform recursive nets with dynamic pooling when no hand-crafted features are used, (2)\nwhen other features are used, recursive nets with dynamic pooling works better, and (3) when skip-\nthoughts are combined with basic pairwise statistics, it becomes competitive with the state-of-the-art\nwhich incorporate much more complicated features and hand-engineering. This is a promising result\nas many of the sentence pairs have very \ufb01ne-grained details that signal if they are paraphrases.\n\n3.4\n\nImage-sentence ranking\n\nWe next consider the task of retrieving images and their sentence descriptions. For this experiment,\nwe use the Microsoft COCO dataset [35] which is the largest publicly available dataset of images\nwith high-quality sentence descriptions. Each image is annotated with 5 captions, each from dif-\nferent annotators. Following previous work, we consider two tasks: image annotation and image\nsearch. For image annotation, an image is presented and sentences are ranked based on how well\nthey describe the query image. The image search task is the reverse: given a caption, we retrieve\nimages that are a good \ufb01t to the query. The training set comes with over 80,000 images each with 5\ncaptions. For development and testing we use the same splits as [32]. The development and test sets\neach contain 1000 images and 5000 captions. Evaluation is performed using Recall@K, namely the\nmean number of images for which the correct caption is ranked within the top-K retrieved results\n\n6\n\n\f(and vice-versa for sentences). 
We also report the median rank of the closest ground truth result\nfrom the ranked list.\nThe best performing results on image-sentence ranking have all used RNNs for encoding sentences,\nwhere the sentence representation is learned jointly. Recently, [33] showed that by using Fisher\nvectors for representing sentences, linear CCA can be applied to obtain performance that is as strong\nas using RNNs for this task. Thus the method of [33] is a strong baseline to compare our sentence\nrepresentations with. For our experiments, we represent images using 4096-dimensional OxfordNet\nfeatures from their 19-layer model [36]. For sentences, we simply extract skip-thought vectors for\neach caption. The training objective we use is a pairwise ranking loss that has been previously\nused by many other methods. The only difference is the scores are computed using only linear\ntransformations of image and sentence inputs. The loss is given by:\n\n$$\\sum_x \\sum_k \\max\\{0, \\alpha - s(Ux, Vy) + s(Ux, Vy_k)\\} + \\sum_y \\sum_k \\max\\{0, \\alpha - s(Vy, Ux) + s(Vy, Ux_k)\\},$$\n\nwhere x is an image vector, y is the skip-thought vector for the groundtruth sentence, $y_k$ are vectors\nfor contrastive (incorrect) sentences and s(\u00b7,\u00b7) is the image-sentence score. Cosine similarity is\nused for scoring. The model parameters are {U, V} where U is the image embedding matrix and\nV is the sentence embedding matrix. In our experiments, we use a 1000 dimensional embedding,\nmargin \u03b1 = 0.2 and k = 50 contrastive terms. We trained for 15 epochs and saved our model\nanytime the performance improved on the development set.\nTable 5 illustrates our results on this task. Using skip-thought vectors for sentences, we get performance\nthat is on par with both [32] and [33] except for R@1 on image annotation, where other methods\nperform much better. 
Our results indicate that skip-thought vectors are representative enough\nto capture image descriptions without having to learn their representations from scratch. Combined\nwith the results of [33], it also highlights that simple, scalable embedding techniques perform very\nwell provided that high-quality image and sentence vectors are available.\n\n3.5 Classi\ufb01cation benchmarks\n\nFor our \ufb01nal quantitative experiments, we report results on several classi\ufb01cation benchmarks which\nare commonly used for evaluating sentence representation learning methods.\nWe use 5 datasets: movie review sentiment (MR), customer product reviews (CR), subjectiv-\nity/objectivity classi\ufb01cation (SUBJ), opinion polarity (MPQA) and question-type classi\ufb01cation\n(TREC). On all datasets, we simply extract skip-thought vectors and train a logistic regression clas-\nsi\ufb01er on top. 10-fold cross-validation is used for evaluation on the \ufb01rst 4 datasets, while TREC has\na pre-de\ufb01ned train/test split. We tune the L2 penality using cross-validation (and thus use a nested\ncross-validation for the \ufb01rst 4 datasets).\n\nMethod\nNB-SVM [37]\nMNB [37]\ncBoW [6]\nGrConv [6]\nRNN [6]\nBRNN [6]\nCNN [4]\nAdaSent [6]\nParagraph-vector [7]\nbow\nuni-skip\nbi-skip\ncombine-skip\ncombine-skip + NB\n\nMR\n79.4\n79.0\n77.2\n76.3\n77.2\n82.3\n81.5\n83.1\n74.8\n75.0\n75.5\n73.9\n76.5\n80.4\n\nCR\n81.8\n80.0\n79.9\n81.3\n82.3\n82.6\n85.0\n86.3\n78.1\n80.4\n79.3\n77.9\n80.1\n81.3\n\nSUBJ MPQA\n93.2\n93.6\n91.3\n89.5\n93.7\n94.2\n93.4\n95.5\n90.5\n91.2\n92.1\n92.5\n93.6\n93.6\n\n86.3\n86.3\n86.4\n84.5\n90.1\n90.3\n89.6\n93.3\n74.2\n87.0\n86.9\n83.3\n87.1\n87.5\n\nTREC\n\n87.3\n88.4\n90.2\n91.0\n93.6\n92.4\n91.8\n84.8\n91.4\n89.4\n92.2\n\nTable 6: Classi\ufb01cation accuracies on several standard bench-\nmarks. 
Results are grouped as follows: (a) bag-of-words models; (b) supervised compositional models; (c) Paragraph Vector\n(unsupervised learning of sentence representations); (d) ours.\nBest results overall are bold while best results outside of group\n(b) are underlined.\n\nFigure 2: t-SNE embeddings of skip-thought vectors on different datasets: (a) TREC, (b) SUBJ, (c) SICK. Points are colored based\non their labels (question type for TREC, subjectivity/objectivity for SUBJ). On the SICK dataset,\neach point represents a sentence pair and points are colored on a gradient based on their relatedness\nlabels. Results best seen in electronic form.\n\nOn these tasks, properly tuned bag-of-words models have been shown to perform exceptionally well. In particular, the NB-SVM of [37] is a fast and robust performer on these tasks. Skip-thought vectors potentially give an alternative to these baselines, being just as fast and easy to use. For an additional comparison, we also examine to what extent augmenting skip-thoughts with bigram Naive Bayes (NB) features improves performance 3.\n\n3 We use the code available at https://github.com/mesnilgr/nbsvm\n\nTable 6 presents our results. On most tasks, skip-thoughts perform about as well as the bag-of-words baselines but fail to improve over methods whose sentence representations are learned directly for the task at hand. This indicates that for tasks like sentiment classification, tuning the representations, even on small datasets, is likely to perform better than learning a generic unsupervised sentence vector on much bigger datasets. Finally, we observe that the skip-thoughts-NB combination is effective, particularly on MR. 
This results in a very strong new baseline for text classification:\ncombine skip-thoughts with bag-of-words and train a linear model.\n\n3.6 Visualizing skip-thoughts\n\nAs a final experiment, we applied t-SNE [38] to skip-thought vectors extracted from the TREC, SUBJ\nand SICK datasets; the visualizations are shown in Figure 2. For the SICK visualization, each\npoint represents a sentence pair, computed using the concatenation of the component-wise product and absolute\ndifference of features. Even without the use of relatedness labels, skip-thought vectors learn to\naccurately capture this property.\n\n4 Conclusion\n\nWe evaluated the effectiveness of skip-thought vectors as an off-the-shelf sentence representation\nwith linear classifiers across 8 tasks. Many of the methods we compare against were only evaluated\non 1 task. The fact that skip-thought vectors perform well on all tasks considered highlights the\nrobustness of our representations.\nWe believe our model for learning skip-thought vectors only scratches the surface of possible objectives.\nMany variations have yet to be explored, including (a) deep encoders and decoders, (b) larger\ncontext windows, (c) encoding and decoding paragraphs, (d) other encoders, such as convnets. It is\nlikely that more exploration of this space will result in even higher quality representations.\n\nAcknowledgments\nWe thank Geoffrey Hinton for suggesting the name skip-thoughts. We also thank Felix Hill, Kelvin\nXu, Kyunghyun Cho and Ilya Sutskever for valuable comments and discussion. This work was\nsupported by NSERC, Samsung, CIFAR, Google and ONR Grant N00014-14-1-0232.\n\nReferences\n[1] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.\n\n[2] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. 
Neural computation, 9(8):1735\u20131780, 1997.\n\n[3] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. ACL, 2014.\n\n[4] Yoon Kim. Convolutional neural networks for sentence classification. EMNLP, 2014.\n\n[5] Kyunghyun Cho, Bart van Merri\u00ebnboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. SSST-8, 2014.\n\n[6] Han Zhao, Zhengdong Lu, and Pascal Poupart. Self-adaptive hierarchical sentence model. IJCAI, 2015.\n\n[7] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. ICML, 2014.\n\n[8] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR, 2013.\n\n[9] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.\n\n[10] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP, pages 1700\u20131709, 2013.\n\n[11] Kyunghyun Cho, Bart van Merri\u00ebnboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014.\n\n[12] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.\n\n[13] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.\n\n[14] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS Deep Learning Workshop, 2014.\n\n[15] Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. 
arXiv preprint arXiv:1309.4168, 2013.\n\n[16] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, 2014.\n\n[17] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.\n\n[18] Alice Lai and Julia Hockenmaier. Illinois-lh: A denotational and distributional approach to semantics. SemEval 2014, 2014.\n\n[19] Sergio Jimenez, George Duenas, Julia Baquero, Alexander Gelbukh, Av Juan Dios B\u00e1tiz, and Av Mendiz\u00e1bal. Unal-nlp: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. SemEval 2014, 2014.\n\n[20] Johannes Bjerva, Johan Bos, Rob van der Goot, and Malvina Nissim. The meaning factory: Formal semantics for recognizing textual entailment and determining semantic similarity. SemEval 2014, page 642, 2014.\n\n[21] Jiang Zhao, Tian Tian Zhu, and Man Lan. Ecnu: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. SemEval 2014, 2014.\n\n[22] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. ACL, 2015.\n\n[23] Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.\n\n[24] Richard Socher, Eric H Huang, Jeffrey Pennington, Christopher D Manning, and Andrew Y Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS, 2011.\n\n[25] Andrew Finch, Young-Sook Hwang, and Eiichiro Sumita. Using machine translation evaluation techniques to determine sentence-level semantic equivalence. In IWP, 2005.\n\n[26] Dipanjan Das and Noah A Smith. Paraphrase identification as probabilistic quasi-synchronous recognition. 
In ACL, 2009.\n\n[27] Stephen Wan, Mark Dras, Robert Dale, and C\u00e9cile Paris. Using dependency-based features to take the \"para-farce\" out of paraphrase. In Proceedings of the Australasian Language Technology Workshop, 2006.\n\n[28] Nitin Madnani, Joel Tetreault, and Martin Chodorow. Re-examining machine translation metrics for paraphrase identification. In NAACL, 2012.\n\n[29] Yangfeng Ji and Jacob Eisenstein. Discriminative improvements to distributional sentence similarity. In EMNLP, pages 891\u2013896, 2013.\n\n[30] Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. SemEval-2014, 2014.\n\n[31] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th international conference on Computational Linguistics, 2004.\n\n[32] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.\n\n[33] Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. Associating neural word embeddings with deep image representations using fisher vectors. In CVPR, 2015.\n\n[34] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). ICLR, 2015.\n\n[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740\u2013755, 2014.\n\n[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.\n\n[37] Sida Wang and Christopher D Manning. Baselines and bigrams: Simple, good sentiment and topic classification. 
In ACL, 2012.\n\n[38] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008.", "award": [], "sourceid": 1826, "authors": [{"given_name": "Ryan", "family_name": "Kiros", "institution": "U. Toronto"}, {"given_name": "Yukun", "family_name": "Zhu", "institution": "University of Toronto"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "University of Toronto"}, {"given_name": "Richard", "family_name": "Zemel", "institution": "University of Toronto"}, {"given_name": "Raquel", "family_name": "Urtasun", "institution": "University of Toronto"}, {"given_name": "Antonio", "family_name": "Torralba", "institution": "MIT"}, {"given_name": "Sanja", "family_name": "Fidler", "institution": "University of Toronto"}]}