{"title": "A Neural Compositional Paradigm for Image Captioning", "book": "Advances in Neural Information Processing Systems", "page_first": 658, "page_last": 668, "abstract": "Mainstream captioning models often follow a sequential structure to generate cap-\ntions, leading to issues such as introduction of irrelevant semantics, lack of diversity in the generated captions, and inadequate generalization performance. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption based on a recursive compositional procedure in a bottom-up manner. Compared to conventional ones, our paradigm better preserves the semantic content through an explicit factorization of semantics and syntax. By using the compositional generation procedure, caption construction follows a recursive structure, which naturally fits the properties of human language. Moreover, the proposed compositional procedure requires less data to train, generalizes better, and yields more diverse captions.", "full_text": "A Neural Compositional Paradigm\n\nfor Image Captioning\n\nBo Dai 1\n\nSanja Fidler 2,3,4\n\nDahua Lin 1\n\n1 CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong\n\n2 University of Toronto\n\n3 Vector Institute\n\nbdai@ie.cuhk.edu.hk\n\nfidler@cs.toronto.edu\n\n4 NVIDIA\ndhlin@ie.cuhk.edu.hk\n\nAbstract\n\nMainstream captioning models often follow a sequential structure to generate cap-\ntions, leading to issues such as introduction of irrelevant semantics, lack of diversity\nin the generated captions, and inadequate generalization performance. In this paper,\nwe present an alternative paradigm for image captioning, which factorizes the\ncaptioning procedure into two stages: (1) extracting an explicit semantic represen-\ntation from the given image; and (2) constructing the caption based on a recursive\ncompositional procedure in a bottom-up manner. Compared to conventional ones,\nour paradigm better preserves the semantic content through an explicit factorization\nof semantics and syntax. By using the compositional generation procedure, caption\nconstruction follows a recursive structure, which naturally \ufb01ts the properties of\nhuman language. Moreover, the proposed compositional procedure requires less\ndata to train, generalizes better, and yields more diverse captions.\n\n1\n\nIntroduction\n\nImage captioning, the task to generate short descriptions for given images, has received increasing\nattention in recent years. State-of-the-art models [1, 2, 3, 4] mostly adopt the encoder-decoder\nparadigm [3], where the content of the given image is \ufb01rst encoded via a convolutional network into\na feature vector, which is then decoded into a caption via a recurrent network. In particular, the\nwords in the caption are produced in a sequential manner \u2013 the choice of each word depends on both\nthe preceding word and the image feature. Despite its simplicity and the effectiveness shown on\nvarious benchmarks [5, 6], the sequential model has a fundamental problem. Speci\ufb01cally, it could not\nre\ufb02ect the inherent hierarchical structures of natural languages [7, 8] in image captioning and other\ngeneration tasks, although it could implicitly capture such structures in tasks taking the complete\nsentences as input, e.g. 
As a result, sequential models have several significant drawbacks. First, they rely excessively on n-gram statistics rather than on the hierarchical dependencies among the words in a caption. Second, such models usually favor the frequent n-grams [11] of the training set, which, as shown in Figure 1, may lead to captions that are syntactically correct but semantically wrong, containing semantic concepts that are irrelevant to the conditioned image. Third, the entanglement of syntactic rules and semantics obscures the dependency structure and makes it difficult for sequential models to generalize.\nTo tackle these issues, we propose a new paradigm for image captioning, where the extraction of semantics (i.e. what to say) and the construction of syntactically correct captions (i.e. how to say it) are decomposed into two stages. Specifically, it derives an explicit representation of the semantic content of the given image, which comprises a set of noun-phrases, e.g. a white cat, a cloudy sky or two men. With these noun-phrases as the basis, it then proceeds to construct the caption through recursive composition until a complete caption is obtained. In particular, at each step of the composition, a higher-level phrase is formed by joining two selected sub-phrases via a connecting phrase. It is noteworthy that the compositional procedure described above is not a hand-crafted algorithm. Instead, it consists of two parametric modular networks: a connecting module for phrase composition and an evaluation module for deciding the completeness of phrases.\n\nFigure 1: Three test images from MS-COCO [5] with captions generated by the neural image captioner [3]; each contains the n-gram \u201cbuilding with a clock\u201d, which appears frequently in the training set but is not semantically correct for these images. (Generated captions: \u201ca large building with a clock tower\u201d; \u201ca building with a clock on the side of it\u201d; \u201ca building with a clock on the side of it\u201d.)\n\nThe proposed paradigm has several key advantages over conventional captioning models: (1) The factorization of semantics and syntax not only better preserves the semantic content of the given image but also makes caption generation easier to interpret and control. (2) The recursive composition procedure naturally reflects the inherent structure of natural language and allows the hierarchical dependencies among words and phrases to be captured. Through a series of ablative studies, we show that the proposed paradigm can effectively increase the diversity of the generated captions while preserving semantic correctness. It also generalizes better to new data and maintains reasonably good performance when the amount of available training data is small.\n\n2 Related Work\n\nThe literature on image captioning is vast, and interest has grown further in the neural network era. The early approaches were bottom-up and detection based: a set of visual concepts such as objects and attributes was extracted from images [12, 13]. These concepts were then assembled into captions by filling the blanks of pre-defined templates [13, 14] or learned templates [15], or served as anchors to retrieve the most similar captions from the training set [16, 12].\nRecent works on image captioning adopt an alternative paradigm, which applies convolutional neural networks [17] for image representation, followed by recurrent neural networks [18] for caption generation. 
Specifically, Vinyals et al. [3] proposed the neural image captioner, which represents the input image with a single feature vector and uses an LSTM [18] conditioned on this vector to generate words one by one. Xu et al. [4] extended this work by representing the input image with a set of feature vectors and applying an attention mechanism over these vectors at every time step of the recurrent decoder, in order to extract the most relevant image information. Lu et al. [1] adjusted the attention computation to also attend to the already generated text. Anderson et al. [2] added an additional LSTM to better control the attention computation. Dai et al. [19] reformulated the latent states as 2D maps to better capture the semantic information in the input image. Some recent approaches directly extract phrases or semantic words from the input image. Yao et al. [20] predicted the occurrences of frequent training words, and fed the predictions into the LSTM as an additional feature vector. Tan et al. [21] treated noun-phrases as hyper-words and added them to the vocabulary, so that the decoder can produce a full noun-phrase in one time step instead of a single word. In [22], the authors proposed a hierarchical approach where one LSTM decides which phrases to produce, while a second-level LSTM produces the words of each phrase.\nDespite these improvements in model architecture, all of these approaches generate captions sequentially. This tends to favor frequent n-grams [11, 23], leading to issues such as incorrect semantic coverage and lack of diversity. On the contrary, our proposed paradigm proceeds in a bottom-up manner, representing the input image with a set of noun-phrases and then constructing captions via a recursive composition procedure. With such an explicit disentanglement of semantics and syntax, the recursive composition procedure preserves semantics more effectively, requires less data to learn, and also leads to more diverse captions.\nConceptually related to ours is the work of Kuznetsova et al. [24], which mines four types of phrases, including noun-phrases, from the training captions, and generates captions by selecting one phrase from each category and composing them via dynamic programming. Since their composition procedure is not recursive, it can only generate captions containing a single object, limiting the versatility of the resulting descriptions. In our work, any number of phrases can be composed, and we exploit powerful neural networks to learn plausible compositions.\n\nFigure 2: An overview of the proposed compositional paradigm: (a) explicit representation of semantics; (b) CompCap, the compositional caption construction. A set of noun-phrases is first extracted from the input image, serving as the initial pool of phrases for the compositional generation procedure. The procedure then recursively uses a connecting module to compose two phrases from the pool into a longer phrase, until an evaluation module determines that a complete caption has been obtained.\n\n
3 Compositional Captioning\n\nThe structure of natural language is inherently hierarchical [8, 7], and the typical parse of a sentence takes the form of a tree [25, 26, 27]. Hence, it is natural to produce captions following such a hierarchical structure. Specifically, we propose a two-stage framework for image captioning, as shown in Figure 2. Given an image, we first derive a set of noun-phrases as an explicit semantic representation. We then construct the caption in a bottom-up manner, via a recursive compositional procedure which we refer to as CompCap. This procedure can be considered the inverse of sentence parsing. Unlike mainstream captioning models, which rely primarily on the n-gram statistics of consecutive words, CompCap can take into account the non-sequential dependencies among the words and phrases of a sentence. In what follows, we present these two stages in more detail.\n\n3.1 Explicit Representation of Semantics\n\nConventional captioning methods usually encode the content of the given image into feature vectors, which are often difficult to interpret. In our framework, we represent the image semantics explicitly by a set of noun-phrases, e.g. \u201ca black cat\u201d, \u201ca cloudy sky\u201d and \u201ctwo boys\u201d. These noun-phrases capture not only the object categories but also the associated attributes.\nNext, we briefly introduce how we extract such noun-phrases from the input image. It is worth noting that extracting such an explicit representation is essentially a visual understanding task. While more sophisticated techniques could be applied, such as object detection [28] and attribute recognition [29], we present our approach here in order to complete the paradigm.\nIn our study, we found that the number of distinct noun-phrases in a dataset is significantly smaller than the number of images. For example, MS-COCO [5] contains 120K images but only about 3K distinct noun-phrases in the associated captions. Given this observation, it is reasonable to formalize noun-phrase extraction as a multi-label classification problem.\nSpecifically, we derive a list of distinct noun-phrases {NP_1, NP_2, ..., NP_K} from the training captions by parsing the captions and selecting the noun-phrases that occur more than 50 times. We treat each selected noun-phrase as a class. Given an image I, we first extract the visual feature v via a Convolutional Neural Network as v = CNN(I), and further encode it via two fully-connected layers as x = F(v). We then perform binary classification for each noun-phrase NP_k as S_C(NP_k | I) = \u03c3(w_k^T x), where w_k is the weight vector corresponding to the class NP_k and \u03c3 denotes the sigmoid function.\nGiven the scores {S_C(NP_k | I)}_k for the individual noun-phrases, we represent the input image by the n noun-phrases with the top scores. Since the selected noun-phrases may contain semantically similar concepts, we further prune this set through Semantic Non-Maximum Suppression, where only those noun-phrases whose scores are the maximum among similar phrases are retained.\n\n
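To make this selection step concrete, here is a minimal Python sketch, assuming the classifier scores S_C(NP_k | I) have already been computed for one image; the word-overlap similarity used to decide which phrases count as similar is our own illustrative choice, since the paper does not specify the exact similarity measure.\n\ndef jaccard(p, q):\n    # Word-overlap similarity between two noun-phrases (an illustrative choice).\n    a, b = set(p.split()), set(q.split())\n    return len(a & b) / len(a | b)\n\ndef semantic_nms(scored_phrases, n=7, sim_threshold=0.5):\n    # Greedily keep at most n noun-phrases in descending score order; a phrase\n    # is kept only if no already-kept phrase is similar to it, i.e. its score\n    # is the maximum among similar phrases.\n    ranked = sorted(scored_phrases.items(), key=lambda kv: -kv[1])\n    kept = []\n    for phrase, score in ranked:\n        if all(jaccard(phrase, k) < sim_threshold for k, _ in kept):\n            kept.append((phrase, score))\n        if len(kept) == n:\n            break\n    return kept\n\n# Hypothetical classifier outputs S_C(NP_k | I) for one image:\nscores = {'a white cat': 0.93, 'a cat': 0.88, 'a cloudy sky': 0.71, 'two men': 0.40}\nprint(semantic_nms(scores, n=3))  # 'a cat' is suppressed by 'a white cat'\n\n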
3.2 Recursive Composition of Captions\n\nStarting with a set of noun-phrases, we construct the caption through a recursive compositional procedure called CompCap. We first provide an overview, and then describe each component in the following paragraphs.\nAt each step, CompCap maintains a phrase pool P and scans all ordered pairs of phrases from P. For each ordered pair P^(l) and P^(r), a Connecting Module (C-Module) is applied to generate a sequence of words, denoted P^(m), that connects the two phrases in a plausible way. This yields a longer phrase of the form P^(l) \u2295 P^(m) \u2295 P^(r), where \u2295 denotes sequence concatenation. The C-Module also computes a score for P^(l) \u2295 P^(m) \u2295 P^(r). Among all phrases that can be composed from the scanned pairs, we choose the one with the maximum connecting score as the new phrase P_new. (A parametric module could also be used to determine P_new.)\nSubsequently, we apply an Evaluation Module (E-Module) to assess whether P_new is a complete caption. If P_new is determined to be complete, we take it as the resulting caption; otherwise, we update the pool P by replacing the constituents P^(l) and P^(r) with P_new, and invoke the pair selection and connection process again on the updated pool. The procedure continues until a complete caption is obtained or only a single phrase remains in P; a sketch of this loop is given below.\nWe next introduce the connecting module and the evaluation module, respectively.\n\n
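The following Python sketch summarizes the greedy procedure just described. It assumes the trained modules are available as callables: connect(left, right) stands in for the C-Module, returning the best connecting phrase and its score, and is_complete(phrase) for the E-Module described below; both names, and the completeness threshold, are ours rather than the paper's.\n\ndef compcap(noun_phrases, connect, is_complete, threshold=0.5):\n    # noun_phrases: dict mapping each initial phrase to its score S_C(P | I).\n    pool = dict(noun_phrases)\n    while len(pool) > 1:\n        best = None\n        # Scan all ordered pairs and keep the highest-scoring composition.\n        for left in pool:\n            for right in pool:\n                if left == right:\n                    continue\n                middle, s_mid = connect(left, right)      # C-Module\n                score = pool[left] + pool[right] + s_mid  # Eq. (2) below\n                phrase = ' '.join(p for p in (left, middle, right) if p)\n                if best is None or score > best[0]:\n                    best = (score, phrase, left, right)\n        score, phrase, left, right = best\n        if is_complete(phrase) > threshold:               # E-Module\n            return phrase\n        del pool[left], pool[right]                       # update the pool\n        pool[phrase] = score\n    return next(iter(pool))\n\nBeam search replaces the single best candidate with the top-k compositions at each step, as discussed under Extensions below.\n\n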
The Connecting Module. The Connecting Module (C-Module) aims to select a connecting phrase P^(m) given the left and right phrases P^(l) and P^(r), and to evaluate the connecting score S(P^(m) | P^(l), P^(r), I). While this task is closely related to filling in the blanks of captions [30], we empirically found that the conventional approach of decoding the intermediate words with an LSTM fails here. One possible reason is that the inputs in [30] are always the prefix and suffix of a complete caption, whereas the C-Module mainly deals with incomplete phrases, which constitute a significantly larger space. In this work, we adopt an alternative strategy: we treat the generation of connecting phrases as a classification problem. This is motivated by the observation that the number of distinct connecting phrases is actually limited in the proposed paradigm, since semantic words such as nouns and adjectives are not involved in them. For example, MS-COCO [5] yields over 1 million training samples for the connecting module, which contain only about 1,000 distinct connecting phrases.\nSpecifically, we mine a set of distinct connecting sequences {P^(m)_1, ..., P^(m)_L} from the training captions and treat them as different classes. This can be done by walking along the parsing trees of the captions. We then define the connecting module as a classifier, which takes the left and right phrases P^(l) and P^(r) as input and outputs a normalized score S(P^(m)_j | P^(l), P^(r), I) for each j \u2208 {1, ..., L}.\nIn particular, we adopt a two-level LSTM model [2] to encode P^(l) and P^(r), as shown in Figure 3. Here, x_t is the word embedding of the t-th word, and v and {u_1, ..., u_M} are, respectively, the global and regional image features extracted from a Convolutional Neural Network. In this model, the low-level LSTM controls the attention while interacting with the visual features, and the high-level LSTM drives the evolution of the encoded state. The encoders for P^(l) and P^(r) share the same structure but have different parameters, as a phrase should be encoded differently depending on its place in the ordered pair. Their encodings, denoted z^(l) and z^(r), go through two fully-connected layers followed by a softmax layer, as\n\nS(P^(m)_j | P^(l), P^(r), I) = Softmax(W_combine \u00b7 (W_l \u00b7 z^(l) + W_r \u00b7 z^(r)))|_j,  \u2200 j = 1, ..., L.  (1)\n\nFigure 3: The two-level LSTM used to encode phrases in the connecting and evaluation modules; (a) the structure of the phrase encoder, (b) its update formulas:\n\nh^(att)_0 = h^(lan)_0 = 0,\nh^(att)_t = LSTM(x_t, v, h^(lan)_{t-1}, h^(att)_{t-1}),\na_t = Attention(h^(att)_t, u_1, ..., u_M),\nh^(lan)_t = LSTM(a_t, h^(att)_t, h^(lan)_{t-1}),\nz = h^(lan)_T.\n\nThe values of the softmax output, i.e. S(P^(m)_j | P^(l), P^(r), I), are then used as the connecting scores, and the connecting phrase that yields the highest connecting score is chosen to connect P^(l) and P^(r). Since not every pair P^(l) and P^(r) can be connected into a longer phrase, in practice a virtual connecting phrase P^(m)_neg is added to serve as a negative class.\nBased on the C-Module, we compute the score of a phrase as follows. For each noun-phrase P in the initial set, we set its score to the binary classification score S_C(P | I) obtained in the phrase-from-image stage. For each longer phrase produced via the C-Module, the score is computed as\n\nS(P^(l) \u2295 P^(m) \u2295 P^(r) | I) = S(P^(l) | I) + S(P^(r) | I) + S(P^(m) | P^(l), P^(r), I).  (2)\n\nThe Evaluation Module. The Evaluation Module (E-Module) determines whether a phrase is a complete caption. Specifically, given an input phrase P, the E-Module encodes it into a vector z_e, using a two-level LSTM model as described above, and then evaluates the probability of P being a complete caption as\n\nPr(P is complete) = \u03c3(w_cp^T z_e).  (3)\n\nIt is worth noting that properties other than completeness could also be checked by the E-Module, e.g. using a caption evaluator [11] to check the quality of the captions.\n\n
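As a concrete reference, here is a compact PyTorch sketch of the two modules, under stated simplifications: the two-level attentional encoder of Figure 3 is collapsed into a single LSTM over word embeddings (so the image features v and u_1, ..., u_M are omitted), and all layer sizes are illustrative.\n\nimport torch\nimport torch.nn as nn\n\nclass PhraseEncoder(nn.Module):\n    # Simplified stand-in for the two-level LSTM encoder of Figure 3.\n    def __init__(self, vocab_size, dim=512):\n        super().__init__()\n        self.embed = nn.Embedding(vocab_size, dim)\n        self.lstm = nn.LSTM(dim, dim, batch_first=True)\n\n    def forward(self, word_ids):            # word_ids: (batch, T)\n        h, _ = self.lstm(self.embed(word_ids))\n        return h[:, -1]                     # z: last hidden state, (batch, dim)\n\nclass CModule(nn.Module):\n    # Eq. (1): a classifier over the L mined connecting phrases,\n    # plus one virtual negative class for non-connectable pairs.\n    def __init__(self, vocab_size, num_connectors, dim=512):\n        super().__init__()\n        self.enc_l = PhraseEncoder(vocab_size, dim)  # separate parameters\n        self.enc_r = PhraseEncoder(vocab_size, dim)  # for the left/right roles\n        self.w_l = nn.Linear(dim, dim)\n        self.w_r = nn.Linear(dim, dim)\n        self.w_combine = nn.Linear(dim, num_connectors + 1)\n\n    def forward(self, left_ids, right_ids):\n        z_l, z_r = self.enc_l(left_ids), self.enc_r(right_ids)\n        return torch.softmax(self.w_combine(self.w_l(z_l) + self.w_r(z_r)), dim=-1)\n\nclass EModule(nn.Module):\n    # Eq. (3): Pr(P is complete) = sigma(w_cp^T z_e).\n    def __init__(self, vocab_size, dim=512):\n        super().__init__()\n        self.enc = PhraseEncoder(vocab_size, dim)\n        self.w_cp = nn.Linear(dim, 1)\n\n    def forward(self, phrase_ids):\n        return torch.sigmoid(self.w_cp(self.enc(phrase_ids))).squeeze(-1)\n\nBoth modules are trained as standard classifiers (cross-entropy over connecting phrases for the C-Module, binary cross-entropy for the E-Module), matching the training setup described in Section 4.1.\n\n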
Extensions. Instead of following the greedy search strategy described above, the framework can be extended to generate diverse captions for a given image, via beam search or probabilistic sampling. In particular, we can retain multiple ordered pairs at each step and multiple connecting sequences for each retained pair. In this way, we form multiple beams for beam search, and thus avoid getting stuck in local minima. Another possibility is to generate diverse captions via probabilistic sampling, e.g. sampling a subset of the ordered pairs for pair selection instead of using all of them, or sampling the connecting sequences according to their normalized scores instead of always choosing the one with the highest score.\nThe framework can also be extended to incorporate user preferences or other conditions, as it consists of operations that are interpretable and controllable. For example, one can influence the resulting captions by filtering the initial noun-phrases or modulating their scores. Such control is much easier to implement on an explicit representation, i.e. a set of noun-phrases, than on an encoded feature vector. We show examples in the experimental section.\n\n4 Experiments\n\n4.1 Experiment Settings\n\nAll experiments are conducted on MS-COCO [5] and Flickr30k [6], which contain 123,287 and 31,783 images respectively, each with 5 ground-truth captions.\n\nTable 1: Results of different methods on MS-COCO [5] (offline test set) and Flickr30k [6]. Results of CompCap using ground-truth noun-phrases and composing orders are shown in the last two rows. Metrics: CIDEr (CD), BLEU-4 (B4), ROUGE (RG), METEOR (MT), SPICE (SP).\n\nMethod | COCO-offline: CD / B4 / RG / MT / SP | Flickr30k: CD / B4 / RG / MT / SP\nNIC [3] | 92.6 / 30.2 / 52.3 / 24.3 / 17.4 | 40.7 / 19.9 / 42.9 / 18.0 / 12.0\nAdapAtt [1] | 97.0 / 31.2 / 53.0 / 25.0 / 18.1 | 48.2 / 23.3 / 45.5 / 19.3 / 13.4\nTopDown [2] | 101.1 / 32.4 / 53.8 / 25.7 / 18.7 | 49.8 / 23.7 / 45.6 / 19.7 / 13.8\nLSTM-A5 [20] | 96.6 / 31.2 / 53.0 / 24.9 / 18.0 | 43.7 / 20.4 / 43.8 / 18.2 / 12.2\nCompCap + Prednp | 86.2 / 25.1 / 47.8 / 24.3 / 19.9 | 42.0 / 16.4 / 39.4 / 19.0 / 14.9\nCompCap + GTnp | 122.2 / 42.8 / 55.3 / 33.6 / 36.8 | 89.7 / 37.8 / 50.5 / 28.7 / 31.9\nCompCap + GTnp + GTorder | 182.6 / 64.1 / 82.4 / 45.1 / 33.8 | 132.8 / 54.9 / 77.1 / 39.6 / 29.8\n\nWe follow the splits in [31] for both datasets. In both datasets, the vocabulary is obtained by lowercasing all words and removing words that contain non-alphabetic characters or appear fewer than 5 times. The removed words are replaced with a special token UNK, resulting in a vocabulary of size 9,487 for MS-COCO and 7,000 for Flickr30k. In addition, training captions are truncated to at most 18 words. To collect training data for the connecting module and the evaluation module, we further parse the ground-truth captions into trees using the Stanford CoreNLP toolkit [32].\nIn all experiments, the C-Module and E-Module are trained separately, as two standard classification tasks. Consequently, the recursive compositional procedure is modularized, which makes it less sensitive to training statistics such as the composing order, and helps it generalize better. At test time, each step of the procedure requires two forward passes (one per module); we empirically found that a complete caption generally takes 2 or 3 steps to obtain.\nSeveral representative methods are compared with CompCap: 1) the Neural Image Captioner (NIC) [3], which is the backbone of state-of-the-art captioning models; and 2) AdapAtt [1] and 3) TopDown [2], which apply attention mechanisms and obtain state-of-the-art performance. While all of these baselines encode images as semantic feature vectors, we also compare CompCap with 4) LSTM-A5 [20], which predicts the occurrence of semantic concepts as additional visual features. Accordingly, besides being used to extract the noun-phrases fed into CompCap, the predictions of the noun-phrase classifiers also serve as additional features for LSTM-A5.\nTo ensure a fair comparison, we re-implemented all methods and trained them with the same hyperparameters. Specifically, we use ResNet-152 [17] pretrained on ImageNet [33] to extract image features, where the activations of the last convolutional layer and the last fully-connected layer are used as the regional and global feature vectors, respectively. During training, we fix ResNet-152 without finetuning, and set the learning rate to 0.0001 for all methods.\n\n
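As a concrete companion to the data preparation described above, here is a small sketch of the vocabulary construction and caption preprocessing (the function and variable names are ours):\n\nfrom collections import Counter\n\nMAX_LEN, MIN_COUNT = 18, 5  # truncation length and frequency threshold from above\n\ndef build_vocab(captions):\n    # Lowercase everything; keep words that are purely alphabetic\n    # and appear at least MIN_COUNT times.\n    counts = Counter(w for c in captions for w in c.lower().split())\n    return {w for w, n in counts.items() if w.isalpha() and n >= MIN_COUNT}\n\ndef preprocess(caption, vocab):\n    # Replace out-of-vocabulary words with UNK and truncate to MAX_LEN words.\n    words = [w if w in vocab else 'UNK' for w in caption.lower().split()]\n    return words[:MAX_LEN]\n\n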
When testing, for all methods we select the parameters that obtain the best performance on the validation set to generate captions. Beam search of size 3 is used for the baselines. For CompCap, we empirically select the n = 7 noun-phrases with the top scores to represent the input image, a trade-off between semantics and syntax, as shown in Figure 8. Beam search of size 3 is used for pair selection, while no beam search is used for connecting-phrase selection.\n\n4.2 Experiment Results\n\nGeneral Comparison. We compare the quality of the generated captions on the offline test set of MS-COCO and the test set of Flickr30k, in terms of SPICE (SP) [34], CIDEr (CD) [35], BLEU-4 (B4) [36], ROUGE (RG) [37], and METEOR (MT) [38]. As shown in Table 1, among all methods, CompCap with predicted noun-phrases obtains the best results under the SPICE metric, which has a higher correlation with human judgements [34], but is inferior to the baselines in terms of CIDEr, BLEU-4, ROUGE and METEOR. These results reflect the respective properties of sequential and compositional generation. Specifically, while SPICE focuses on semantic analysis, metrics including CIDEr, BLEU-4, ROUGE and METEOR are known to favor frequent training n-grams [11], which are more likely to appear when following a sequential generation procedure. The compositional generation procedure, on the contrary, preserves semantic content more effectively, but may produce more n-grams that are not observed in the training set.\n\nFigure 4: Performance curves ((a) SPICE, (b) CIDEr) of different methods trained on MS-COCO with less data. Unlike the baselines, CompCap obtains stable results as the ratio of used data decreases.\n\nFigure 5: Generalization ability of different methods, trained on one dataset and tested on the other ((a) SPICE and (b) CIDEr for COCO -> Flickr30k; (c) SPICE and (d) CIDEr for Flickr30k -> COCO). Compared to the baselines, CompCap generalizes better across datasets.\n\nAn ablation study on the components of the proposed compositional paradigm is shown in the last three rows of Table 1. In particular, when we represent the input image with the ground-truth noun-phrases collected from the 5 associated captions, all metrics improve significantly. This indicates that CompCap effectively preserves semantic content: the better the semantic understanding of the input image, the better the captions CompCap can generate. Moreover, we also randomly picked one ground-truth caption and followed its composing order to integrate its noun-phrases into a complete caption, so that CompCap only accounts for connecting-phrase selection. In this setting, all metrics except SPICE improve further, which is reasonable: only a subset of the ground-truth noun-phrases is used, and frequent training n-grams are more likely to appear when following a ground-truth composing order.\nGeneralization Analysis. Since the proposed compositional paradigm disentangles semantics and syntax into two stages, and CompCap mainly accounts for composing semantics into a syntactically correct caption, CompCap should handle out-of-domain semantic content well and require less data to learn. To verify this hypothesis, we conducted two studies.\n
In the first experiment, we controlled the ratio of data used to train the baselines and the modules of CompCap, while keeping the noun-phrase classifiers trained on the full data. The resulting curves in terms of SPICE and CIDEr are shown in Figure 4; the other metrics follow similar trends. Compared to the baselines, CompCap is stable and learns how to compose captions even when only 1% of the data is used.\nIn the second study, we trained the baselines and CompCap on MS-COCO/Flickr30k and tested them on Flickr30k/MS-COCO. Again, the noun-phrase classifiers are trained with in-domain data. The results in terms of SPICE and CIDEr are shown in Figure 5, where significant drops are observed for the baselines. On the contrary, CompCap obtains competitive results whether trained with in-domain or out-of-domain data, which suggests the benefit of disentangling semantics and syntax: the distribution of semantics often varies from dataset to dataset, while the distribution of syntax is relatively stable across datasets.\n\nTable 2: Diversity of the generated captions, measured from various aspects; CompCap generates more diverse captions.\n\nMetric | NIC [3] | AdapAtt [1] | TopDown [2] | LSTM-A5 [20] | CompCap\nNovel Caption Ratio | 44.53% | 49.34% | 45.05% | 50.06% | 90.48%\nUnique Caption Ratio | 55.05% | 59.14% | 61.58% | 62.61% | 83.86%\nDiversity (Dataset) | 7.69 | 7.86 | 7.99 | 7.77 | 9.85\nDiversity (Image) | 2.25 | 3.61 | 2.30 | 3.70 | 5.57\nVocabulary Usage | 6.75% | 7.22% | 7.97% | 8.14% | 9.18%\n\nFigure 6: Images with diverse captions generated by CompCap. In the first two rows, captions are generated with the same noun-phrases but different composing orders (e.g., from {a table, a couch, a living room}: \u201ca table and a couch in a living room\u201d, \u201ca table in a living room next to a couch\u201d, \u201ca living room with a table and a couch\u201d). In the last row, captions are generated with different sets of noun-phrases (e.g., {a city, a street, a bus} yields \u201ca city with a bus driving down a street\u201d).\n\nDiversity Analysis. One important property of CompCap is its ability to generate diverse captions, which can be obtained by varying the involved noun-phrases or the composing order. To analyze the diversity of the generated captions, we computed five metrics that evaluate the degree of diversity from various aspects, shown in Table 2. First, we computed the ratio of novel captions and the ratio of unique captions [39], which respectively measure the percentage of generated captions that are not observed in the training set, and the percentage of distinct captions among all generated captions. We further computed the percentage of words in the vocabulary that are actually used to generate captions, referred to as the vocabulary usage. Finally, we quantify the diversity of a set of captions by averaging their pairwise edit distances, which yields two additional metrics. When only a single caption is generated for each image, we report the average distance over captions of different images, defined as the diversity at the dataset level. If multiple captions are generated for each image, we compute the average distance over captions of the same image, followed by an average over all images; this is reported as the diversity at the image level. The former measures how diverse the captions are across different images, and the latter how diverse the captions are for a single image. In practice, for each method we use the 5 top-scoring captions from beam search to compute the diversity at the image level.\n\n
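A small sketch of these pairwise-distance measures follows; we assume the edit distance is computed between word sequences, as the paper does not state whether it operates on words or characters.\n\nfrom itertools import combinations\n\ndef edit_distance(a, b):\n    # Levenshtein distance between two word sequences, single-row DP.\n    dp = list(range(len(b) + 1))\n    for i, wa in enumerate(a, 1):\n        prev, dp[0] = dp[0], i\n        for j, wb in enumerate(b, 1):\n            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))\n    return dp[-1]\n\ndef mean_pairwise(captions):\n    # Average edit distance over all unordered caption pairs (needs >= 2 captions).\n    pairs = list(combinations([c.split() for c in captions], 2))\n    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)\n\ndef diversity_dataset(one_caption_per_image):\n    return mean_pairwise(one_caption_per_image)\n\ndef diversity_image(captions_per_image):\n    # Average within each image, then average over images.\n    return sum(mean_pairwise(caps) for caps in captions_per_image) / len(captions_per_image)\n\n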
CompCap obtains the best results on all five metrics, which suggests that the captions it generates are both diverse and novel. We further show qualitative samples in Figure 6, where captions are generated following different composing orders or using different noun-phrases.\nError Analysis. We include several failure cases in Figure 7, which exhibit errors similar to those in Figure 1. However, the causes are fundamentally different. Errors in captions generated by CompCap mainly come from misunderstanding of the input visual content, which could be fixed by applying more sophisticated techniques in the noun-phrase extraction stage. Favoring frequent n-grams, by contrast, is an intrinsic property of sequential models: even with a perfect understanding of the visual content, sequential models may still generate captions containing incorrect but frequent n-grams.\n\nFigure 7: Some failure cases, with errors highlighted by underlines. The first two cases are related to errors in the first stage (i.e. semantic extraction), and the last two to the second stage (i.e. caption construction). (Generated captions: \u201ca bed and a television in a bedroom\u201d; \u201ca bike and a car is on a parking meter\u201d; \u201ca man is eating food while wearing a white hat\u201d; \u201ca train on the train tracks is next to a building\u201d.)\n\nFigure 8: As the maximum number of noun-phrases increases ((a) SPICE, (b) CIDEr), SPICE improves but CIDEr decreases, indicating that although introducing more noun-phrases can lead to semantically richer captions, it may hurt syntactic correctness.\n\n5 Conclusion\n\nIn this paper, we propose a novel paradigm for image captioning. While typical existing approaches encode images as feature vectors and generate captions sequentially, the proposed method generates captions in a compositional manner. In particular, our approach factorizes the captioning procedure into two stages. In the first stage, an explicit representation of the input image, consisting of noun-phrases, is extracted. In the second stage, a recursive compositional procedure assembles the extracted noun-phrases into a caption. As a result, caption generation follows a hierarchical structure, which naturally fits the properties of human language. 
On two datasets, the proposed compositional procedure is shown to preserve semantics more effectively, require less data to train, generalize better across datasets, and yield more diverse captions.\n\nAcknowledgement. This work is partially supported by the Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626) and the General Research Fund (GRF) of Hong Kong (No. 14236516).\n\nReferences\n[1] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. arXiv preprint arXiv:1612.01887, 2016.\n[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998, 2017.\n[3] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156\u20133164, 2015.\n[4] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pages 77\u201381, 2015.\n[5] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740\u2013755. Springer, 2014.\n[6] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67\u201378, 2014.\n[7] Christopher D. Manning and Hinrich Sch\u00fctze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.\n[8] Andrew Carnie. Syntax: A Generative Introduction. John Wiley & Sons, 2013.\n[9] Danqi Chen and Christopher Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740\u2013750, 2014.\n[10] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101, 2016.\n[11] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional GAN. In Proceedings of the IEEE International Conference on Computer Vision, 2017.\n[12] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15\u201329. Springer, 2010.\n[13] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891\u20132903, 2013.\n[14] Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. Composing simple image descriptions using web-scale n-grams. 
In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 220\u2013228. Association for Computational Linguistics, 2011.\n[15] Dahua Lin, Sanja Fidler, Chen Kong, and Raquel Urtasun. Generating multi-sentence lingual descriptions of indoor scenes. In BMVC, 2015.\n[16] Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, and C. Lawrence Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015.\n[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.\n[18] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n[19] Bo Dai, Deming Ye, and Dahua Lin. Rethinking the form of latent states in image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 282\u2013298, 2018.\n[20] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. Boosting image captioning with attributes. arXiv preprint arXiv:1611.01646, 2016.\n[21] Ying Hua Tan and Chee Seng Chan. phi-LSTM: A phrase-based hierarchical LSTM model for image captioning. In Asian Conference on Computer Vision, pages 101\u2013117. Springer, 2016.\n[22] Huan Ling and Sanja Fidler. Teaching machines to describe images via natural language feedback. In NIPS, 2017.\n[23] Bo Dai and Dahua Lin. Contrastive learning for image captioning. In Advances in Neural Information Processing Systems, pages 898\u2013907, 2017.\n[24] Polina Kuznetsova, Vicente Ordonez, Tamara Berg, and Yejin Choi. TreeTalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(1):351\u2013362, 2014.\n[25] Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 2003.\n[26] Slav Petrov and Dan Klein. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404\u2013411, 2007.\n[27] Richard Socher, John Bauer, Christopher D. Manning, et al. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 455\u2013465, 2013.\n[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91\u201399, 2015.\n[29] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition (CVPR 2009), pages 1778\u20131785. IEEE, 2009.\n[30] Licheng Yu, Eunbyung Park, Alexander C. Berg, and Tamara L. Berg. Visual Madlibs: Fill in the blank description generation and question answering. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 2461\u20132469. IEEE, 2015.\n[31] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128\u20133137, 2015.\n[32] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55\u201360, 2014.\n[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211\u2013252, 2015.\n[34] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382\u2013398. Springer, 2016.\n[35] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566\u20134575, 2015.\n[36] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311\u2013318. Association for Computational Linguistics, 2002.\n[37] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8. Barcelona, Spain, 2004.\n[38] Michael Denkowski and Alon Lavie. Meteor Universal: Language specific translation evaluation for any target language. ACL 2014, page 376, 2014.\n[39] Yufei Wang, Zhe Lin, Xiaohui Shen, Scott Cohen, and Garrison W. Cottrell. Skeleton key: Image captioning by skeleton-attribute decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7272\u20137281, 2017.\n", "award": [], "sourceid": 378, "authors": [{"given_name": "Bo", "family_name": "Dai", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Sanja", "family_name": "Fidler", "institution": "University of Toronto"}, {"given_name": "Dahua", "family_name": "Lin", "institution": "The Chinese University of Hong Kong"}]}