{"title": "Deep Fragment Embeddings for Bidirectional Image Sentence Mapping", "book": "Advances in Neural Information Processing Systems", "page_first": 1889, "page_last": 1897, "abstract": "We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.", "full_text": "Deep Fragment Embeddings for Bidirectional Image\n\nSentence Mapping\n\nAndrej Karpathy\n\nArmand Joulin\n\nDepartment of Computer Science, Stanford University, Stanford, CA 94305, USA\n\n{karpathy,ajoulin,feifeili}@cs.stanford.edu\n\nLi Fei-Fei\n\nAbstract\n\nWe introduce a model for bidirectional retrieval of images and sentences through\na deep, multi-modal embedding of visual and natural language data. Unlike pre-\nvious models that directly map images or sentences into a common embedding\nspace, our model works on a \ufb01ner level and embeds fragments of images (ob-\njects) and fragments of sentences (typed dependency tree relations) into a com-\nmon space. We then introduce a structured max-margin objective that allows our\nmodel to explicitly associate these fragments across modalities. 
Extensive exper-\nimental evaluation shows that reasoning on both the global level of images and\nsentences and the \ufb01ner level of their respective fragments improves performance\non image-sentence retrieval tasks. Additionally, our model provides interpretable\npredictions for the image-sentence retrieval task since the inferred inter-modal\nalignment of fragments is explicit.\n\nIntroduction\n\n1\nThere is signi\ufb01cant value in the ability to associate natural language descriptions with images. De-\nscribing the contents of images is useful for automated image captioning and conversely, the ability\nto retrieve images based on natural language queries has immediate image search applications. In\nparticular, in this work we are interested in training a model on a set of images and their associated\nnatural language descriptions such that we can later rank a \ufb01xed set of withheld sentences given an\nimage query, and vice versa.\nThis task is challenging because it requires detailed understanding of the content of images, sen-\ntences and their inter-modal correspondence. Consider an example sentence query, such as \u201cA dog\nwith a tennis ball is swimming in murky water\u201d (Figure 1). In order to successfully retrieve a corre-\nsponding image, we must accurately identify all entities, attributes and relationships present in the\nsentence and ground them appropriately to a complex visual scene.\nOur primary contribution is in formulating a structured, max-margin objective for a deep neural net-\nwork that learns to embed both visual and language data into a common, multimodal space. Unlike\nprevious work that embeds images and sentences, our model breaks down and embeds fragments of\nimages (objects) and fragments of sentences (dependency tree relations [1]) in a common embed-\nding space and explicitly reasons about their latent, inter-modal correspondences. Extensive empir-\nical evaluation validates our approach. 
In particular, we report dramatic improvements over state of\nthe art methods on image-sentence retrieval tasks on Pascal1K [2], Flickr8K [3] and Flickr30K [4]\ndatasets. We make our code publicly available.\n\nFigure 1: Our model takes a dataset of images and their sentence descriptions and learns to associate their fragments. In images, fragments correspond to object detections and scene context. In sentences, fragments consist of typed dependency tree [1] relations.\n\n2 Related Work\nImage Annotation and Image Search. There is a growing body of work that associates images and\nsentences. Some approaches focus on the direction of describing the contents of images, formulated\neither as a task of mapping images to a \ufb01xed set of sentences written by people [5, 6], or as a task of\nautomatically generating novel captions [7, 8, 9, 10, 11, 12]. More closely related to our motivation\nare methods that allow natural bidirectional mapping between the two modalities. Socher and\nFei-Fei [13] and Hodosh et al. [3] use Kernel Canonical Correlation Analysis to align images and\nsentences, but their method is not easily scalable since it relies on computing kernels quadratic in\nthe number of images and sentences. Farhadi et al. [5] learn a common meaning space, but their method\nis limited to representing both images and sentences with a single triplet of (object, action, scene).\nZitnick et al. [14] use a Conditional Random Field to reason about spatial relationships in cartoon\nscenes and their relation to natural language descriptions. Finally, joint models of language and\nperception have also been explored in robotics settings [15].\nMultimodal Representation Learning. Our approach falls into a general category of learning\nfrom multi-modal data. 
Several probabilistic models for representing joint multimodal probability\ndistributions over images and sentences have been developed, using Deep Boltzmann Machines [16],\nlog-bilinear models [17], and topic models [18, 19]. Ngiam et al. [20] described an autoencoder\nthat learns audio-video representations through a shared bottleneck layer. More closely related to\nour task and approach is the work of Frome et al.\n[21], who introduced a model that learns to\nmap images and words to a common semantic embedding with a ranking cost. Adopting a similar\napproach, Socher et al.\n[22] described a Dependency Tree Recursive Neural Network that puts\nentire sentences into correspondence with visual data. However, these methods reason about the\nimage only on the global level using a single, \ufb01xed-sized representation from the top layer of a\nConvolutional Neural Network as a description for the entire image. In contrast, our model explicitly\nreasons about objects that make up a complex scene.\nNeural Representations for Images and Natural Language. Our model is a neural network\nthat is connected to image pixels on one side and raw 1-of-k word representations on the other.\nThere have been multiple approaches for learning neural representations in these data domains. In\nComputer Vision, Convolutional Neural Networks (CNNs) [23] have recently been shown to learn\npowerful image representations that support state of the art image classi\ufb01cation [24, 25, 26] and\nobject detection [27, 28]. In language domain, several neural network models have been proposed\nto learn word/n-gram representations [29, 30, 31, 32, 33, 34], sentence representations [35] and\nparagraph/document representations [36].\n\n3 Proposed Model\nLearning and Inference. Our task is to retrieve relevant images given a sentence query, and con-\nversely, relevant sentences given an image query. 
We train our model on a set of N images and N\ncorresponding sentences that describe their content (Figure 2). Given this set of correspondences,\nwe learn the weights of a neural network with a structured loss to output a high score when a compatible\nimage-sentence pair is fed through the network, and a low score otherwise. Once the training is\ncomplete, all training data is discarded and the network is evaluated on a withheld set of images and\nsentences. The evaluation scores all image-sentence pairs in the test set, sorts the images/sentences\nin order of decreasing score and records the location of a ground truth result in the list.\nFragment Embeddings. Our core insight is that images are complex structures that are made\nup of multiple entities that the sentences make explicit references to. We capture this intuition\ndirectly in our model by breaking down both images and sentences into fragments and reasoning about\ntheir alignment. In particular, we propose to detect objects as image fragments and use sentence\ndependency tree relations [1] as sentence fragments (Figure 2).\nObjective. We will compute the representation of both image and sentence fragments with a neural\nnetwork, and interpret the top layer as high-dimensional vectors embedded in a common multi-modal\nspace. We will think of the inner product between these vectors as a fragment compatibility\nscore, and compute the global image-sentence score as a \ufb01xed function of the scores of their respective\nfragments. Intuitively, an image-sentence pair will obtain a high global score if the sentence\nfragments can each be con\ufb01dently matched to some fragment in the image. Finally, we will learn\nthe weights of the neural networks such that the true image-sentence pairs achieve a score higher\n(by a margin) than false image-sentence pairs.\n\nFigure 2: Computing the Fragment and image-sentence similarities. 
Left: CNN representations (green) of\ndetected objects are mapped to the fragment embedding space (blue, Section 3.2). Right: Dependency tree\nrelations in the sentence are embedded (Section 3.1). Our model interprets inner products (shown as boxes)\nbetween fragments as a similarity score. The alignment (shaded boxes) is latent and inferred by our model\n(Section 3.3.1). The image-sentence similarity is computed as a \ufb01xed function of the pairwise fragment scores.\n\nWe \ufb01rst describe the neural networks that compute the Image and Sentence Fragment embeddings.\nThen we discuss the objective function, which is composed of the two aforementioned objectives.\n3.1 Dependency Tree Relations as Sentence Fragments\nWe would like to extract and represent the set of visually identi\ufb01able entities described in a sentence.\nFor instance, using the example in Figure 2, we would like to identify the entities (dog, child)\nand characterise their attributes (black, young) and their pairwise interactions (chasing). Inspired\nby previous work [5, 22] we observe that a dependency tree of a sentence provides a rich set of\ntyped relationships that can serve this purpose more effectively than individual words or bigrams.\nWe discard the tree structure in favor of a simpler model and interpret each relation (edge) as an\nindividual sentence fragment (Figure 2, right shows 5 example dependency relations). Thus, we\nrepresent every word using a 1-of-k encoding vector w using a dictionary of 400,000 words and map\nevery dependency triplet (R, w_1, w_2) into the embedding space as follows:\n\n$$ s = f\\left( W_R \\begin{bmatrix} W_e w_1 \\\\ W_e w_2 \\end{bmatrix} + b_R \\right). \\tag{1} $$\n\nHere, W_e is a d \u00d7 400,000 matrix that encodes a 1-of-k vector into a d-dimensional word vector\nrepresentation (we use d = 200). We \ufb01x W_e to weights obtained through an unsupervised objective\ndescribed in Huang et al. [34]. 
Note that every relation R can have its own set of weights W_R and\nbiases b_R. We \ufb01x the element-wise nonlinearity f(.) to be the Recti\ufb01ed Linear Unit (ReLU), which\ncomputes x \u2192 max(0, x). The size of the embedded space is cross-validated, and we found that\nvalues of approximately 1000 generally work well.\n3.2 Object Detections as Image Fragments\nSimilar to sentences, we wish to extract and describe the set of entities that images are composed of.\nInspired by prior work [7], as a modeling assumption we observe that the subjects of most sentence\ndescriptions are attributes of objects and their context in a scene. This naturally motivates the use of\nobjects and the global context as the fragments of an image. In particular, we follow Girshick et al.\n[27] and detect objects in every image with a Region Convolutional Neural Network (RCNN). The\nCNN is pre-trained on ImageNet [37] and \ufb01netuned on the 200 classes of the ImageNet Detection\nChallenge [38]. We use the top 19 detected locations and the entire image as the image fragments\nand compute the embedding vectors based on the pixels I_b inside each bounding box as follows:\n\n$$ v = W_m [\\mathrm{CNN}_{\\theta_c}(I_b)] + b_m, \\tag{2} $$\n\nwhere CNN(I_b) takes the image inside a given bounding box and returns the 4096-dimensional\nactivations of the fully connected layer immediately before the classi\ufb01er. The CNN architecture is\nidentical to the one described in Girshick et al. [27]. It contains approximately 60 million parameters\n\u03b8_c and closely resembles the architecture of Krizhevsky et al. [25].\n3.3 Objective Function\nWe are now ready to formulate the objective function. Recall that we are given a training set of N\nimages and corresponding sentences. In the previous sections we described parameterized functions\nthat map every sentence and image to a set of fragment vectors {s} and {v}, respectively. All\nparameters of our model are contained in these two functions. 
As shown in Figure 2, our model\nthen interprets the inner product v_i^T s_j between an image fragment v_i and a sentence fragment s_j as\na similarity score, and computes the image-sentence similarity as a \ufb01xed function of the scores of\ntheir respective fragments.\n\nFigure 3: The two objectives for a batch of 2 examples. Left: Rows represent fragments v_i, columns s_j. Every square shows an ideal scenario of y_ij = sign(v_i^T s_j) in the MIL objective. Red boxes are y_ij = \u22121. Yellow indicates members of positive bags that happen to currently be y_ij = \u22121. Right: The scores are accumulated with Equation 6 into image-sentence score matrix S_kl.\n\nWe are motivated by two criteria in designing the objective function. First, the image-sentence\nsimilarities should be consistent with the ground truth correspondences. That is, corresponding\nimage-sentence pairs should have a higher score than all other image-sentence pairs. This will\nbe enforced by the Global Ranking Objective. Second, we introduce a Fragment Alignment\nObjective that explicitly learns the appearance of sentence fragments (such as \u201cblack dog\u201d) in the\nvisual domain. Our full objective is the sum of these two objectives and a regularization term:\n\n$$ C(\\theta) = C_F(\\theta) + \\beta C_G(\\theta) + \\alpha \\|\\theta\\|_2^2, \\tag{3} $$\n\nwhere \u03b8 is a shorthand for parameters of our neural network (\u03b8 = {W_e, W_R, b_R, W_m, b_m, \u03b8_c}) and\n\u03b1 and \u03b2 are hyperparameters that we cross-validate. We now describe both objectives in more detail.\n3.3.1 Fragment Alignment Objective\nThe Fragment Alignment Objective encodes the intuition that if a sentence contains a fragment\n(e.g. \u201cblue ball\u201d, Figure 3), at least one of the boxes in the corresponding image should have a high\nscore with this fragment, while all the other boxes in all the other images that have no mention of\n\u201cblue ball\u201d should have a low score. 
This assumption can be violated in multiple ways: a triplet\nmay not refer to anything visually identi\ufb01able in the image. The box that the triplet refers to may\nnot be detected by the RCNN. Lastly, other images may contain the described visual concept but\nits mention may be omitted in the associated sentence descriptions. Nonetheless, the assumption is\nstill satis\ufb01ed in many cases and can be used to formulate a cost function. Consider an (incomplete)\nFragment Alignment Objective that assumes a dense alignment between every corresponding image\nand sentence fragments:\n\n$$ C_0(\\theta) = \\sum_i \\sum_j \\max(0, 1 - y_{ij} v_i^T s_j). \\tag{4} $$\n\nHere, the sum is over all pairs of image and sentence fragments in the training set. The quantity v_i^T s_j\ncan be interpreted as the alignment score of visual fragment v_i and sentence fragment s_j. In this\nincomplete objective, we de\ufb01ne y_ij as +1 if fragments v_i and s_j occur together in a corresponding\nimage-sentence pair, and \u22121 otherwise. Intuitively, C_0(\u03b8) encourages scores in red regions of Figure\n3 to be less than \u22121 and scores along the block diagonal (green and yellow) to be more than +1.\nMultiple Instance Learning extension. The problem with the objective C_0(\u03b8) is that it assumes\ndense alignment between all pairs of fragments in every corresponding image-sentence pair. However,\nthis is hardly ever the case. For example, in Figure 3, the \u201cboy playing\u201d triplet refers to only\none of the three detections. We now describe a Multiple Instance Learning (MIL) [39] extension\nof the objective C_0 that attempts to infer the latent alignment between fragments in corresponding\nimage-sentence pairs. 
Concretely, for every triplet we put image fragments in the associated image\ninto a positive bag, while image fragments in every other image become negative examples.\nOur precise formulation is inspired by the mi-SVM [40], which is a simple and natural extension\nof a Support Vector Machine to a Multiple Instance Learning setting. Instead of treating the y_ij as\nconstants, we minimize over them by wrapping Equation 4 as follows:\n\n$$ C_F(\\theta) = \\min_{y_{ij}} C_0(\\theta) \\quad \\text{s.t.} \\quad \\sum_{i \\in p_j} \\frac{y_{ij} + 1}{2} \\geq 1 \\;\\; \\forall j, \\quad y_{ij} = -1 \\;\\; \\forall i, j \\text{ s.t. } m_v(i) \\neq m_s(j), \\quad y_{ij} \\in \\{-1, 1\\} \\tag{5} $$\n\nHere, we de\ufb01ne p_j to be the set of image fragments in the positive bag for sentence fragment j.\nm_v(i) and m_s(j) return the index of the image and sentence (an element of {1, . . . , N}) that the\nfragments v_i and s_j belong to. Note that the inequality simply states that at least one of the y_ij\nshould be positive for every sentence fragment j (i.e. at least one green box in every column in\nFigure 3). This objective cannot be solved ef\ufb01ciently [40] but a commonly used heuristic is to set\ny_ij = sign(v_i^T s_j). If the constraint is not satis\ufb01ed for any positive bag (i.e. all scores were below\nzero), the highest-scoring item in the positive bag is set to have a positive label.\n3.3.2 Global Ranking Objective\nRecall that the Global Ranking Objective ensures that the computed image-sentence similarities are\nconsistent with the ground truth annotation. First, we de\ufb01ne the image-sentence alignment score to\nbe the average thresholded score of their pairwise fragment scores:\n\n$$ S_{kl} = \\frac{1}{|g_k| (|g_l| + n)} \\sum_{i \\in g_k} \\sum_{j \\in g_l} \\max(0, v_i^T s_j). \\tag{6} $$\n\nHere, g_k is the set of image fragments in image k and g_l is the set of sentence fragments in sentence\nl. Both k, l range from 1, . . . , N. 
We truncate scores at zero because in the mi-SVM objective, scores\ngreater than 0 are considered correct alignments and scores less than 0 are considered to be incorrect\nalignments (i.e. false members of a positive bag). In practice, we found that it was helpful to add\na smoothing term n, since short sentences can otherwise have an advantage (we found that n = 5\nworks well and that results are not very sensitive to this setting). The Global Ranking Objective then becomes:\n\n$$ C_G(\\theta) = \\sum_k \\Big[ \\underbrace{\\sum_l \\max(0, S_{kl} - S_{kk} + \\Delta)}_{\\text{rank images}} + \\underbrace{\\sum_l \\max(0, S_{lk} - S_{kk} + \\Delta)}_{\\text{rank sentences}} \\Big]. \\tag{7} $$\n\nHere, \u2206 is a hyperparameter that we cross-validate. The objective stipulates that the score for true\nimage-sentence pairs S_kk should be higher than S_kl or S_lk for any l \u2260 k by at least a margin of \u2206.\n3.4 Optimization\nWe use Stochastic Gradient Descent (SGD) with mini-batches of 100, momentum of 0.9 and make\n20 epochs through the training data. The learning rate is cross-validated and annealed by a factor\nof 0.1 for the last two epochs. Since both Multiple Instance Learning and CNN \ufb01netuning bene\ufb01t\nfrom a good initialization, we run the \ufb01rst 10 epochs with the fragment alignment objective C_0\nand CNN weights \u03b8_c \ufb01xed. After 10 epochs, we switch to the full MIL objective C_F and begin\n\ufb01netuning the CNN. The word embedding matrix W_e is kept \ufb01xed due to over\ufb01tting concerns. Our\nimplementation runs at approximately 1 second per batch on a standard CPU workstation.\n4 Experiments\nDatasets. We evaluate our image-sentence retrieval performance on Pascal1K [2], Flickr8K [3] and\nFlickr30K [4] datasets. 
The datasets contain 1,000, 8,000 and 30,000 images respectively and each\nimage is annotated using Amazon Mechanical Turk with 5 independent sentences.\nSentence Data Preprocessing. We did not explicitly \ufb01lter, spellcheck or normalize any of the\nsentences for simplicity. We use the Stanford CoreNLP parser to compute the dependency trees\nfor every sentence. Since there are many possible relation types (as many as hundreds), due to\nover\ufb01tting concerns and practical considerations we remove all relation types that occur less than\n1% of the time in each dataset. In practice, this reduces the number of relations from 136 to 16 in\nPascal1K, 170 to 17 in Flickr8K, and from 212 to 21 in Flickr30K. Additionally, all words that are\nnot found in our dictionary of 400,000 words [34] are discarded.\nImage Data Preprocessing. We use the Caffe [41] implementation of the ImageNet Detection\nRCNN model [27] to detect objects in all images. On our machine with a Tesla K40 GPU,\nthe RCNN processes one image in approximately 25 seconds. We discard the predictions for\n200 ImageNet detection classes and only keep the 4096-D activations of the fully connected layer\nimmediately before the classi\ufb01er at all of the top 19 detected locations and from the entire image.\n\nPascal1K\nModel | Image Annotation: R@1 | R@5 | R@10 | Mean r | Image Search: R@1 | R@5 | R@10 | Mean r\nRandom Ranking | 4.0 | 9.0 | 12.0 | 71.0 | 1.6 | 5.2 | 10.6 | 50.0\nSocher et al. [22] | 23.0 | 45.0 | 63.0 | 16.9 | 16.4 | 46.6 | 65.6 | 12.5\nkCCA [22] | 21.0 | 47.0 | 61.0 | 18.0 | 16.4 | 41.4 | 58.0 | 15.9\nDeViSE [21] | 17.0 | 57.0 | 68.0 | 11.9 | 21.6 | 54.6 | 72.4 | 9.5\nSDT-RNN [22] | 25.0 | 56.0 | 70.0 | 13.4 | 25.4 | 65.2 | 84.4 | 7.0\nOur model | 39.0 | 68.0 | 79.0 | 10.5 | 23.6 | 65.2 | 79.8 | 7.6\n\nTable 1: Pascal1K ranking experiments. R@K is Recall@K (high is good). Mean r is the mean rank (low is\ngood). 
Note that the test set only consists of 100 images.\n\nFlickr8K\nModel | Image Annotation: R@1 | R@5 | R@10 | Med r | Image Search: R@1 | R@5 | R@10 | Med r\nRandom Ranking | 0.1 | 0.6 | 1.1 | 631 | 0.1 | 0.5 | 1.0 | 500\nSocher et al. [22] | 4.5 | 18.0 | 28.6 | 32 | 6.1 | 18.5 | 29.0 | 29\nDeViSE [21] | 4.8 | 16.5 | 27.3 | 28 | 5.9 | 20.1 | 29.6 | 29\nSDT-RNN [22] | 6.0 | 22.7 | 34.0 | 23 | 6.6 | 21.6 | 31.7 | 25\nFragment Alignment Objective | 7.2 | 21.9 | 31.8 | 25 | 5.9 | 20.0 | 30.3 | 26\nGlobal Ranking Objective | 5.8 | 21.8 | 34.8 | 20 | 7.5 | 23.4 | 35.0 | 21\n(\u2020) Fragment + Global | 12.5 | 29.4 | 43.8 | 14 | 8.6 | 26.7 | 38.7 | 17\n\u2020 \u2192 Images: Fullframe Only | 5.9 | 19.2 | 27.3 | 34 | 5.2 | 17.6 | 26.5 | 32\n\u2020 \u2192 Sentences: BOW | 9.1 | 25.9 | 40.7 | 17 | 6.9 | 22.4 | 34.0 | 23\n\u2020 \u2192 Sentences: Bigrams | 8.7 | 28.5 | 41.0 | 16 | 8.5 | 25.2 | 37.0 | 20\nOur model (\u2020 + MIL) | 12.6 | 32.9 | 44.0 | 14 | 9.7 | 29.6 | 42.5 | 15\n* Hodosh et al. [3] | 8.3 | 21.6 | 30.3 | 34 | 7.6 | 20.7 | 30.1 | 38\n* Our model (\u2020 + MIL) | 9.3 | 24.9 | 37.4 | 21 | 8.8 | 27.9 | 41.3 | 17\n\nTable 2: Flickr8K experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good).\nThe starred evaluation criterion (*) in [3] is slightly different since it discards 4,000 out of 5,000 test sentences.\n\nEvaluation Protocol for Bidirectional Retrieval. For Pascal1K we follow Socher et al. [22] and\nuse 800 images for training, 100 for validation and 100 for testing. For Flickr datasets we use\n1,000 images for validation, 1,000 for testing and the rest for training (consistent with [3]). We\ncompute the dense image-sentence similarity S_kl between every image-sentence pair in the test set\nand rank images and sentences in order of decreasing score. For both Image Annotation and Image\nSearch, we report the median rank of the closest ground truth result in the list, as well as Recall@K\nwhich computes the fraction of times the correct result was found among the top K items. When\ncomparing to Hodosh et al. 
[3] we closely follow their evaluation protocol and only keep a subset\nof N sentences out of total 5N (we use the \ufb01rst sentence out of every group of 5).\n\n4.1 Comparison Methods\nSDT-RNN. Socher et al. [22] embed a fullframe CNN representation with the sentence representation\nfrom a Semantic Dependency Tree Recursive Neural Network (SDT-RNN). Their loss matches\nour global ranking objective. We requested the source code of Socher et al. [22] and substituted the\nFlickr8K and Flickr30K datasets. To better understand the bene\ufb01ts of using our detection CNN and\nfor a fairer comparison we also train their method with our CNN features. Since we have multiple\nobjects per image, we average representations from all objects with detection con\ufb01dence above a\n(cross-validated) threshold. We refer to the exact method of Socher et al. [22] with a single fullframe\nCNN as \u201cSocher et al.\u201d, and to their method with our combined CNN features as \u201cSDT-RNN\u201d.\nDeViSE. The DeViSE [21] source code is not publicly available but their approach is a special case\nof our method with the following modi\ufb01cations: we use the average (L2-normalized) word vectors\nas a sentence fragment, the average CNN activation of all objects above a detection threshold (as\ndiscussed in the case of SDT-RNN) as an image fragment and only use the global ranking objective.\n\n4.2 Quantitative Evaluation\nOur model outperforms previous methods. Our full method consistently outperforms previous\nmethods on Flickr8K (Table 2) and Flickr30K (Table 3) datasets. On Pascal1K (Table 1) the\nSDT-RNN appears to be competitive on Image Search.\nFragment and Global Objectives are complementary. As seen in Tables 2 and 3, both objectives\nperform well independently, but bene\ufb01t from the combination. 
Note that the Global Objective\nperforms consistently better, possibly because it directly minimizes the evaluation criterion (ranking\ncost), while the Fragment Alignment Objective only does so indirectly.\n\nFlickr30K\nModel | Image Annotation: R@1 | R@5 | R@10 | Med r | Image Search: R@1 | R@5 | R@10 | Med r\nRandom Ranking | 0.1 | 0.6 | 1.1 | 631 | 0.1 | 0.5 | 1.0 | 500\nDeViSE [21] | 4.5 | 18.1 | 29.2 | 26 | 6.7 | 21.9 | 32.7 | 25\nSDT-RNN [22] | 9.6 | 29.8 | 41.1 | 16 | 8.9 | 29.8 | 41.1 | 16\nFragment Alignment Objective | 11 | 28.7 | 39.3 | 18 | 7.6 | 23.8 | 34.5 | 22\nGlobal Ranking Objective | 11.5 | 33.2 | 44.9 | 14 | 8.8 | 27.6 | 38.4 | 17\n(\u2020) Fragment + Global | 12.0 | 37.1 | 50.0 | 10 | 9.9 | 30.5 | 43.2 | 14\nOur model (\u2020 + MIL) | 14.2 | 37.7 | 51.3 | 10 | 10.2 | 30.8 | 44.2 | 14\nOur model + Finetune CNN | 16.4 | 40.2 | 54.7 | 8 | 10.3 | 31.4 | 44.5 | 13\n\nTable 3: Flickr30K experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good).\n\nFigure 4: Qualitative Image Annotation results. Below each image we show the top 5 sentences (among a set\nof 5,000 test sentences) in descending con\ufb01dence. We also show the triplets for the top sentence and connect\neach to the detections with the highest compatibility score (indicated by lines). The numbers next to each triplet\nindicate the matching fragment score. We color a sentence green if it is correct and red otherwise.\n\nExtracting object representations is important. Using only the global scene-level CNN representation\nas a single fragment for every image leads to a consistent drop in performance (Table 2), suggesting\nthat a single fullframe CNN alone is inadequate in effectively representing the images.\nDependency tree relations outperform BoW/bigram representations. We compare to a simpler\nBag of Words (BoW) baseline to understand the contribution of dependency relations.\nIn the BoW\nbaseline we iterate over words instead of dependency triplets when creating bags of sentence\nfragments (set w_1 = w_2 in Equation 1). 
As can be seen in the Table 2, this leads to a consistent drop\nin performance. This drop could be attributed to a difference between using one word or two words\nat a time, so we also compare to a bigram baseline where the words w1, w2 in Equation 1 refer to\nconsecutive words in a sentence, not nodes that share an edge in the dependency tree. Again, we\nobserve a consistent performance drop, which suggests that the dependency relations provide useful\nstructure that the neural network takes advantage of.\nFinetuning the CNN helps on Flickr30K. Our end-to-end Neural Network approach allows us to\nbackpropagate gradients all the way down to raw data (pixels or 1-of-k word encodings). In particu-\nlar, we observed additional improvements on the Flickr30K dataset (Table 3) when we \ufb01netune the\nCNN. Training the CNN improves the validation error for a while but the model eventually starts to\nover\ufb01t. Thus, we found it critical to keep track of the validation error and freeze the model before it\nover\ufb01ts. We were not able to improve the validation performance on Pascal1K and Flickr8K datasets\nand suspect that there is an insuf\ufb01cient amount of training data.\n4.3 Qualitative Experiments\nInterpretable Predictions. We show some example sentence retrieval results in Figure 4. The\nalignment in our model is explicitly inferred on the fragment level, which allows us to interpret the\nscores between images and sentences. 
For instance, in the last image it is apparent that the model\nretrieved the top sentence because it erroneously associated a mention of a blue person to the blue\n\ufb02ag on the bottom right of the image.\nFragment Alignment Objective trains attribute detectors.\nThe detection CNN is trained to\npredict one of 200 ImageNet Detection classes, so it is not clear if the representation is powerful\nenough to support learning of more complex attributes of the objects or generalize to novel classes.\nTo see whether our model successfully learns to predict sentence triplets, we \ufb01x a triplet vector and\nsearch for the highest scoring boxes in the test set. Qualitative results shown in Figure 5 suggest\nthat the model is indeed capable of generalizing to more \ufb01ne-grained subcategories (such as \u201cblack\ndog\u201d, \u201csoccer ball\u201d) and to out of sample classes such as \u201crocky terrain\u201d and \u201cjacket\u201d.\n\nFigure 5: We \ufb01x a triplet and retrieve the highest scoring image fragments in the test set. Note that ball, person\nand dog are ImageNet Detection classes but their visual properties (e.g. soccer, standing, snowboarding, black)\nare not. Jackets and rocky scenes are not ImageNet Detection classes. Find more in supplementary material.\n\nLimitations. Our model is subject to multiple limitations. From a modeling perspective, the use of\nedges from a dependency tree is simple, but not always appropriate. First, a single complex phrase\nthat describes a single visual entity can be split across multiple sentence fragments. For example,\n\u201cblack and white dog\u201d is parsed as two relations (CONJ, black, white) and (AMOD, white, dog).\nConversely, there are many dependency relations that don\u2019t have a clear grounding in the image (for\nexample \u201ceach other\u201d). Furthermore, phrases such as \u201cthree children playing\u201d that describe some\nparticular number of visual entities are not modeled. 
While we have shown that the relations give\nrise to more powerful representations than words or bigrams, a more careful treatment of sentence\nfragments will likely lead to further improvements. On the image side, the non-maximum suppres-\nsion in the RCNN can sometimes detect, for example, multiple people inside one person. Since the\nmodel does not take into account any spatial information associated with the detections, it is hard\nfor it to disambiguate between two distinct people or spurious detections of one person.\n5 Conclusions\nWe addressed the problem of bidirectional retrieval of images and sentences. Our neural network\nlearns a multi-modal embedding space for fragments of images and sentences and reasons about\ntheir latent, inter-modal alignment. We have shown that our model signi\ufb01cantly improves the re-\ntrieval performance on image sentence retrieval tasks compared to previous work. Our model also\nproduces interpretable predictions.\nIn future work we hope to develop better sentence fragment\nrepresentations, incorporate spatial reasoning, and move beyond bags of fragments.\nAcknowledgments. We thank Justin Johnson and Jon Krause for helpful comments and discussions.\nWe gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used\nfor this research. This research is supported by an ONR MURI grant, and NSF ISS-1115313.\nReferences\n[1] De Marneffe, M.C., MacCartney, B., Manning, C.D., et al.: Generating typed dependency parses from\n\nphrase structure parses. In: Proceedings of LREC. Volume 6. (2006) 449\u2013454\n\n[2] Rashtchian, C., Young, P., Hodosh, M., Hockenmaier, J.: Collecting image annotations using amazon\u2019s\nmechanical turk. 
In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language\nData with Amazon\u2019s Mechanical Turk, Association for Computational Linguistics (2010) 139\u2013147\n\n[3] Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and\nevaluation metrics. Journal of Artificial Intelligence Research (2013)\n\n[4] Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New\nsimilarity metrics for semantic inference over event descriptions. TACL (2014)\n\n[5] Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every\npicture tells a story: Generating sentences from images. In: ECCV. (2010)\n\n[6] Ordonez, V., Kulkarni, G., Berg, T.L.: Im2text: Describing images using 1 million captioned photographs.\nIn: NIPS. (2011)\n\n[7] Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Baby talk: Understanding\nand generating simple image descriptions. In: CVPR. (2011)\n\n[8] Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2t: Image parsing to text description. Proceedings\nof the IEEE 98(8) (2010) 1485\u20131508\n\n[9] Yang, Y., Teo, C.L., Daum\u00e9 III, H., Aloimonos, Y.: Corpus-guided sentence generation of natural images.\nIn: EMNLP. (2011)\n\n[10] Li, S., Kulkarni, G., Berg, T.L., Berg, A.C., Choi, Y.: Composing simple image descriptions using\nweb-scale n-grams. In: CoNLL. (2011)\n\n[11] Mitchell, M., Han, X., Dodge, J., Mensch, A., Goyal, A., Berg, A., Yamaguchi, K., Berg, T., Stratos,\nK., Daum\u00e9 III, H.: Midge: Generating image descriptions from computer vision detections. In: EACL.\n(2012)\n\n[12] Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., Choi, Y.: Collective generation of natural image\ndescriptions. In: ACL. 
(2012)\n\n[13] Socher, R., Fei-Fei, L.: Connecting modalities: Semi-supervised segmentation and annotation of images\nusing unaligned text corpora. In: CVPR. (2010)\n\n[14] Zitnick, C.L., Parikh, D., Vanderwende, L.: Learning the visual interpretation of sentences. ICCV (2013)\n\n[15] Matuszek*, C., FitzGerald*, N., Zettlemoyer, L., Bo, L., Fox, D.: A Joint Model of Language and\nPerception for Grounded Attribute Learning. In: Proc. of the 2012 International Conference on Machine\nLearning, Edinburgh, Scotland (June 2012)\n\n[16] Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep boltzmann machines. In: NIPS. (2012)\n\n[17] Kiros, R., Zemel, R.S., Salakhutdinov, R.: Multimodal neural language models. ICML (2014)\n\n[18] Jia, Y., Salzmann, M., Darrell, T.: Learning cross-modality similarity for multinomial data. In: ICCV.\n(2011)\n\n[19] Barnard, K., Duygulu, P., Forsyth, D., De Freitas, N., Blei, D.M., Jordan, M.I.: Matching words and\npictures. JMLR (2003)\n\n[20] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: ICML. (2011)\n\n[21] Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: Devise: A deep\nvisual-semantic embedding model. In: NIPS. (2013)\n\n[22] Socher, R., Karpathy, A., Le, Q.V., Manning, C.D., Ng, A.Y.: Grounded compositional semantics for\nfinding and describing images with sentences. TACL (2014)\n\n[23] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition.\nProceedings of the IEEE 86(11) (1998) 2278\u20132324\n\n[24] Le, Q.V.: Building high-level features using large scale unsupervised learning. 
In: Acoustics, Speech and\nSignal Processing (ICASSP), 2013 IEEE International Conference on, IEEE (2013) 8595\u20138598\n\n[25] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural\nnetworks. In: NIPS. (2012)\n\n[26] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional neural networks. arXiv preprint\narXiv:1311.2901 (2013)\n\n[27] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and\nsemantic segmentation. In: CVPR. (2014)\n\n[28] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition,\nlocalization and detection using convolutional networks. In: ICLR. (2014)\n\n[29] Bengio, Y., Schwenk, H., Sen\u00e9cal, J.S., Morin, F., Gauvain, J.L.: Neural probabilistic language models.\nIn: Innovations in Machine Learning. Springer (2006)\n\n[30] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. In: ICML. (2007)\n\n[31] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and\nphrases and their compositionality. In: NIPS. (2013)\n\n[32] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for\nsemi-supervised learning. In: ACL. (2010)\n\n[33] Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks\nwith multitask learning. In: ICML. (2008)\n\n[34] Huang, E.H., Socher, R., Manning, C.D., Ng, A.Y.: Improving word representations via global context\nand multiple word prototypes. In: ACL. (2012)\n\n[35] Socher, R., Lin, C.C., Manning, C., Ng, A.Y.: Parsing natural scenes and natural language with recursive\nneural networks. In: ICML. (2011)\n\n[36] Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. 
ICML (2014)\n\n[37] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image\ndatabase. In: CVPR. (2009)\n\n[38] Russakovsky, O., Deng, J., Krause, J., Berg, A., Fei-Fei, L.: Large scale visual recognition challenge\n2013. http://image-net.org/challenges/LSVRC/2013/ (2013)\n\n[39] Chen, Y., Bi, J., Wang, J.Z.: Miles: Multiple-instance learning via embedded instance selection. CVPR\n28(12) (2006)\n\n[40] Andrews, S., Hofmann, T., Tsochantaridis, I.: Multiple instance learning with generalized support vector\nmachines. In: AAAI/IAAI. (2002) 943\u2013944\n\n[41] Jia, Y.: Caffe: An open source convolutional architecture for fast feature embedding.\nhttp://caffe.berkeleyvision.org/ (2013)", "award": [], "sourceid": 1029, "authors": [{"given_name": "Andrej", "family_name": "Karpathy", "institution": "Stanford"}, {"given_name": "Armand", "family_name": "Joulin", "institution": "Stanford University"}, {"given_name": "Li", "family_name": "Fei-Fei", "institution": "Stanford U"}]}