{"title": "Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model", "book": "Advances in Neural Information Processing Systems", "page_first": 314, "page_last": 324, "abstract": "We present a novel training framework for neural sequence models, particularly for grounded dialog generation. The standard training paradigm for these models is maximum likelihood estimation (MLE), or minimizing the cross-entropy of the human responses. Across a variety of domains, a recurring problem with MLE-trained generative neural dialog models (G) is that they tend to produce 'safe' and generic responses (\"I don't know\", \"I can't tell\"). In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts in terms of automatic metrics, diversity, and informativeness of the responses. However, D is not useful in practice since it cannot be deployed to have real conversations with users. Our work aims to achieve the best of both worlds -- the practical usefulness of G and the strong performance of D -- via knowledge transfer from D to G. Our primary contribution is an end-to-end trainable generative visual dialog model, where G receives gradients from D as a perceptual (not adversarial) loss of the sequence sampled from G. We leverage the recently proposed Gumbel-Softmax (GS) approximation to the discrete distribution -- specifically, an RNN is augmented with a sequence of GS samplers, which, coupled with the straight-through gradient estimator, enables end-to-end differentiability. We also introduce a stronger encoder for visual dialog, and employ a self-attention mechanism for answer encoding along with a metric learning loss to aid D in better capturing semantic similarities in answer responses. Overall, our proposed model outperforms state-of-the-art on the VisDial dataset by a significant margin (2.67% on recall@10). 
The source code can be downloaded from https://github.com/jiasenlu/visDial.pytorch", "full_text": "Best of Both Worlds: Transferring Knowledge from\n\nDiscriminative Learning to a Generative Visual\n\nDialog Model\n\nJiasen Lu1\u2217, Anitha Kannan2\u2217, Jianwei Yang1, Devi Parikh3,1, Dhruv Batra3,1\n\n1 Georgia Institute of Technology, 2 Curai, 3 Facebook AI Research\n\n{jiasenlu, jw2yang, parikh, dbatra}@gatech.edu\n\nAbstract\n\nWe present a novel training framework for neural sequence models, particularly\nfor grounded dialog generation. The standard training paradigm for these models\nis maximum likelihood estimation (MLE), or minimizing the cross-entropy of the\nhuman responses. Across a variety of domains, a recurring problem with MLE\ntrained generative neural dialog models (G) is that they tend to produce \u2018safe\u2019\nand generic responses (\u2018I don\u2019t know\u2019, \u2018I can\u2019t tell\u2019). In contrast, discriminative\ndialog models (D) that are trained to rank a list of candidate human responses\noutperform their generative counterparts; in terms of automatic metrics, diversity,\nand informativeness of the responses. However, D is not useful in practice since it\ncan not be deployed to have real conversations with users.\nOur work aims to achieve the best of both worlds \u2013 the practical usefulness of\nG and the strong performance of D \u2013 via knowledge transfer from D to G. Our\nprimary contribution is an end-to-end trainable generative visual dialog model,\nwhere G receives gradients from D as a perceptual (not adversarial) loss of the se-\nquence sampled from G. We leverage the recently proposed Gumbel-Softmax (GS)\napproximation to the discrete distribution \u2013 speci\ufb01cally, a RNN augmented with a\nsequence of GS samplers, coupled with the straight-through gradient estimator to\nenable end-to-end differentiability. 
We also introduce a stronger encoder for visual\ndialog, and employ a self-attention mechanism for answer encoding along with a\nmetric learning loss to aid D in better capturing semantic similarities in answer\nresponses. Overall, our proposed model outperforms state-of-the-art on the VisDial\ndataset by a signi\ufb01cant margin (2.67% on recall@10).\n\n1\n\nIntroduction\n\nOne fundamental goal of arti\ufb01cial intelligence (AI) is the development of perceptually-grounded\ndialog agents \u2013 speci\ufb01cally, agents that can perceive or understand their environment (through vision,\naudio, or other sensors), and communicate their understanding with humans or other agents in natural\nlanguage. Over the last few years, neural sequence models (e.g. [47, 44, 46]) have emerged as the\ndominant paradigm across a variety of setting and datasets \u2013 from text-only dialog [44, 40, 23, 3] to\nmore recently, visual dialog [7, 9, 8, 33, 45], where an agent must answer a sequence of questions\ngrounded in an image, requiring it to reason about both visual content and the dialog history.\nThe standard training paradigm for neural dialog models is maximum likelihood estimation (MLE)\nor equivalently, minimizing the cross-entropy (under the model) of a \u2018ground-truth\u2019 human response.\nAcross a variety of domains, a recurring problem with MLE trained neural dialog models is that they\ntend to produce \u2018safe\u2019 generic responses, such as \u2018Not sure\u2019 or \u2018I don\u2019t know\u2019 in text-only dialog [23],\nand \u2018I can\u2019t see\u2019 or \u2018I can\u2019t tell\u2019 in visual dialog [7, 8]. One reason for this emergent behavior is that\n\n\u2217Work was done while at Facebook AI Research.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fthe space of possible next utterances in a dialog is highly multi-modal (there are many possible paths\na dialog may take in the future). 
In the face of such highly multi-modal output distributions, models\n\u2018game\u2019 MLE by latching on to the head of the distribution or the frequent responses, which by nature\ntend to be generic and widely applicable. Such safe generic responses break the \ufb02ow of a dialog and\ntend to disengage the human conversing with the agent, ultimately rendering the agent useless. It is\nclear that novel training paradigms are needed; that is the focus of this paper.\nOne promising alternative to MLE training proposed by recent work [36, 27] is sequence-level\ntraining of neural sequence models, speci\ufb01cally, using reinforcement learning to optimize task-\nspeci\ufb01c sequence metrics such as BLEU [34], ROUGE [24], CIDEr [48]. Unfortunately, in the case\nof dialog, all existing automatic metrics correlate poorly with human judgment [26], which renders\nthis alternative infeasible for dialog models.\nIn this paper, inspired by the success of adversarial training [16], we propose to train a generative\nvisual dialog model (G) to produce sequences that score highly under a discriminative visual dialog\nmodel (D). A discriminative dialog model receives as input a candidate list of possible responses and\nlearns to sort this list from the training dataset. The generative dialog model (G) aims to produce a\nsequence that D will rank the highest in the list, as shown in Fig. 1.\nNote that while our proposed approach is inspired by adversarial training, there are a number of\nsubtle but crucial differences over generative adversarial networks (GANs). Unlike traditional GANs,\none novelty in our setup is that our discriminator receives a list of candidate responses and explicitly\nlearns to reason about similarities and differences across candidates. In this process, D learns a\ntask-dependent perceptual similarity [12, 19, 15] and learns to recognize multiple correct responses\nin the feature space. For example, as shown in Fig. 
1 right, given the image, dialog history, and\nquestion \u2018Do you see any bird?\u2019, besides the ground-truth answer \u2018No, I do not\u2019, D can also assign\nhigh scores to other options that are valid responses to the question, including the one generated by\nG: \u2018Not that I can see\u2019. The interaction between responses is captured via the similarity between the\nlearned embeddings. This similarity gives an additional signal that G can leverage in addition to the\nMLE loss. In that sense, our proposed approach may be viewed as an instance of \u2018knowledge transfer\u2019\n[17, 5] from D to G. We employ a metric-learning loss function and a self-attention answer encoding\nmechanism for D that makes it particularly conducive to this knowledge transfer by encouraging\nperceptually meaningful similarities to emerge. This is especially fruitful since prior work has\ndemonstrated that discriminative dialog models signi\ufb01cantly outperform their generative counterparts,\nbut are not as useful since they necessarily need a list of candidate responses to rank, which is only\navailable in a dialog dataset, not in real conversations with a user. In that context, our work aims to\nachieve the best of both worlds \u2013 the practical usefulness of G and the strong performance of D \u2013 via\nthis knowledge transfer.\nOur primary technical contribution is an end-to-end trainable generative visual dialog model, where\nthe generator receives gradients from the discriminator loss of the sequence sampled from G. Note\nthat this is challenging because the output of G is a sequence of discrete symbols, which na\u00efvely is not\namenable to gradient-based training. 
We propose to leverage the recently proposed Gumbel-Softmax\n(GS) approximation to the discrete distribution [18, 30] \u2013 speci\ufb01cally, a Recurrent Neural Network\n(RNN) augmented with a sequence of GS samplers, which when coupled with the straight-through\ngradient estimator [2, 18] enables end-to-end differentiability.\nOur results show that our \u2018knowledge transfer\u2019 approach is indeed successful. Speci\ufb01cally, our\ndiscriminator-trained G outperforms the MLE-trained G by 1.7% on recall@5 on the VisDial dataset,\nessentially improving over state-of-the-art [7] by 2.43% recall@5 and 2.67% recall@10. Moreover,\nour generative model produces more diverse and informative responses (see Table 3).\nAs a side contribution speci\ufb01c to this application, we introduce a novel encoder for neural visual\ndialog models, which maintains two separate memory banks \u2013 one for visual memory (where do\nwe look in the image?) and another for textual memory (what facts do we know from the dialog\nhistory?), and outperforms the encoders used in prior work.\n\n2 Related Work\n\nGANs for sequence generation. Generative Adversarial Networks (GANs) [16] have shown to be\neffective models for a wide range of applications involving continuous variables (e.g. images) c.f\n[10, 35, 22, 55]. More recently, they have also been used for discrete output spaces such as language\ngeneration \u2013 e.g. image captioning [6, 41], dialog generation [23], or text generation [53] \u2013 by either\nviewing the generative model as a stochastic parametrized policy that is updated using REINFORCE\n\n2\n\n\fFigure 1: Model architecture of the proposed model. Given the image, history, and question, the discriminator\nreceives as additional input a candidate list of possible responses and learns to sort this list. The generator aims\nto produce a sequence that discriminator will rank the highest in the list. The right most block is D\u2019s score for\ndifferent candidate answers. 
Note that the multiple plausible responses all score high. Image from the COCO\ndataset [25].\n\nwith the discriminator providing the reward [53, 6, 41, 23], or (closer to our approach) through\ncontinuous relaxation of discrete variables through Gumbel-Softmax to enable backpropagating the\nresponse from the discriminator [21, 41].\nThere are a few subtle but signi\ufb01cant differences w.r.t. to our application, motivation, and approach.\nIn these prior works, both the discriminator and the generator are trained in tandem, and from scratch.\nThe goal of the discriminator in those settings has primarily been to discriminate \u2018fake\u2019 samples\n(i.e. generator\u2019s outputs) from \u2018real\u2019 samples (i.e. from training data). In contrast, we would like\nto transfer knowledge from the discriminator to the generator. We start with pre-trained D and G\nmodels suited for the task, and then transfer knowledge from D to G to further improve G, while\nkeeping D \ufb01xed. As we show in our experiments, this procedure results in G producing diverse\nsamples that are close in the embedding space to the ground truth, due to perceptual similarity learned\nin D. One can also draw connections between our work and Energy Based GAN (EBGAN) [54] \u2013\nwithout the adversarial training aspect. The \u201cenergy\u201d in our case is a deep metric-learning based\nscoring mechanism, instantiated in the visual dialog application.\nModeling image and text attention. Models for tasks at the intersection of vision and language\n\u2013 e.g., image captioning [11, 13, 20, 49], visual question answering [1, 14, 31, 37], visual dialog\n[7, 9, 8, 45, 33] \u2013 typically involve attention mechanisms. For image captioning, this may be attending\nto relevant regions in the image [49, 51, 28]. 
For VQA, this may be attending to relevant image\nregions alone [4, 50, 52] or co-attending to image regions and question words/phrases [29].\nIn the context of visual dialog, [7] uses attention to identify utterances in the dialog history that may\nbe useful for answering the current question. However, when modeling the image, the entire image\nembedding is used to obtain the answer. In contrast, our proposed encoder HCIAE (Section 4.1)\nlocalizes the region in the image that can help reliably answer the question. In particular, in addition\nto the history and the question guiding the image attention, our visual dialog encoder also reasons\nabout the history when identifying relevant regions of the image. This allows the model to implicitly\nresolve co-references in the text and ground them back in the image.\n\n3 Preliminaries: Visual Dialog\n\nWe begin by formally describing the visual dialog task setup as introduced by Das et al. [7]. The\nmachine learning task is as follows. A visual dialog model is given as input an image I, caption c\ndescribing the image, a dialog history till round t \u2212 1, $H = (\\underbrace{c}_{H_0}, \\underbrace{(q_1, a_1)}_{H_1}, \\ldots, \\underbrace{(q_{t-1}, a_{t-1})}_{H_{t-1}})$, and\nthe followup question qt at round t. The visual dialog agent needs to return a valid response to the\nquestion.\nGiven the problem setup, there are two broad classes of methods \u2013 generative and discriminative\nmodels. Generative models for visual dialog are trained by maximizing the log-likelihood of the\nground truth answer sequence $a^{gt}_t \\in A_t$ given the encoded representation of the input (I, H, qt).\nOn the other hand, discriminative models receive both an encoding of the input (I, H, qt) and as\nadditional input a list of 100 candidate answers $A_t = \\{a^{(1)}_t, \\ldots, a^{(100)}_t\\}$. These models effectively\nlearn to sort the list. Thus, by design, they cannot be used at test time without a list of candidates\navailable.\n\n4 Approach: Backprop Through Discriminative Losses for Generative Training\n\nIn this section, we describe our approach to transfer knowledge from a discriminative visual dialog\nmodel (D) to generative visual dialog model (G). Fig. 1 (a) shows the overview of our approach.\nGiven the input image I, dialog history H, and question qt, the encoder converts the inputs into a\njoint representation et. The generator G takes et as input, and produces a distribution over answer\nsequences via a recurrent neural network (speci\ufb01cally an LSTM). 
At each word in the answer\nsequence, we use a Gumbel-Softmax sampler S to sample the answer token from that distribution.\nThe discriminator D in its standard form takes et, ground-truth answer $a^{gt}_t$ and N \u2212 1 \u201cnegative\u201d\nanswers $\\{a^-_{t,i}\\}_{i=1}^{N-1}$ as input, and learns an embedding space such that $\\mathrm{similarity}(e_t, f(a^{gt}_t)) > \\mathrm{similarity}(e_t, f(a^-_{t,\\cdot}))$, where f (\u00b7) is the embedding function. When we enable the communication\nbetween D and G, we feed the sampled answer \u02c6at into the discriminator, and optimize the generator G\nto produce samples that get higher scores in D\u2019s metric space.\nWe now describe each component of our approach in detail.\n\n4.1 History-Conditioned Image Attentive Encoder (HCIAE)\nAn important characteristic in dialogs is the use of co-reference to avoid repeating entities that can\nbe contextually resolved. In fact, in the VisDial dataset [7] nearly all (98%) dialogs involve at least\none pronoun. This means that for a model to correctly answer a question, it would require a reliable\nmechanism for co-reference resolution.\nA common approach is to use an encoder architecture with an attention mechanism that implicitly\nperforms co-reference resolution by identifying the portion of the dialog history that can help in\nanswering the current question [7, 38, 39, 32], while using a holistic representation for the image.\nIntuitively, one would also expect that the answer is localized to regions in the image, and is\nconsistent with the attended history.\nWith this motivation, we propose a novel encoder architecture (called HCIAE) shown in Fig. 2. Our\nencoder \ufb01rst uses the current question to attend to the exchanges in the history, and then uses the\nquestion and attended history to attend to the image, so as to obtain the \ufb01nal encoding.\nSpeci\ufb01cally, we use the spatial image features $V \\in \\mathbb{R}^{d \\times k}$ from a convolution layer of a CNN. 
qt is\nencoded with an LSTM to get a vector $m^q_t \\in \\mathbb{R}^d$. Simultaneously, each previous round of history\n(H0, . . . , Ht\u22121) is encoded separately with another\nLSTM as $M^h_t \\in \\mathbb{R}^{d \\times t}$. Conditioned on the question\nembedding, the model attends to the history. The attended\nrepresentation of the history and the question\nembedding are concatenated, and used as input to\nattend to the image:\n\n$z^h_t = w_a^{\\top} \\tanh(W_h M^h_t + (W_q m^q_t) \\mathbf{1}^{\\top})$ (1)\n$\\alpha^h_t = \\mathrm{softmax}(z^h_t)$ (2)\n\nwhere $\\mathbf{1} \\in \\mathbb{R}^t$ is a vector with all elements set to 1.\n$W_h, W_q \\in \\mathbb{R}^{t \\times d}$ and $w_a \\in \\mathbb{R}^k$ are parameters to be learned. $\\alpha \\in \\mathbb{R}^k$ is the attention weight over\nhistory. The attended history feature $\\hat{m}^h_t$ is a convex combination of columns of $M_t$, weighted\nappropriately by the elements of $\\alpha^h_t$. We further concatenate $m^q_t$ and $\\hat{m}^h_t$ as the query vector and get\nthe attended image feature $\\hat{v}_t$ in a similar manner. Subsequently, all three components are used to\nobtain the \ufb01nal embedding et:\n\n$e_t = \\tanh(W_e[m^q_t, \\hat{m}^h_t, \\hat{v}_t])$ (3)\n\nwhere $W_e \\in \\mathbb{R}^{d \\times 3d}$ is a weight matrix and [\u00b7] is the concatenation operation.\n\nFigure 2: Structure of the proposed encoder.\n\n4.2 Discriminator Loss\nDiscriminative visual dialog models produce a distribution over the candidate answer list At and\nmaximize the log-likelihood of the correct option $a^{gt}_t$. The loss function for D needs to be conducive\nto knowledge transfer. 
In particular, it needs to encourage perceptually meaningful similarities.\nTherefore, we use a metric-learning multi-class N-pair loss [43] de\ufb01ned as:\n\n$\\mathcal{L}_D = \\mathcal{L}_{n\\text{-pair}}\\big(\\{e_t, a^{gt}_t, \\{a^-_{t,i}\\}_{i=1}^{N-1}\\}, f\\big) = \\overbrace{\\log\\Big(1 + \\sum_{i=1}^{N} \\exp\\big(\\underbrace{e_t^{\\top} f(a^-_{t,i}) - e_t^{\\top} f(a^{gt}_t)}_{\\text{score margin}}\\big)\\Big)}^{\\text{logistic loss}}$ (4)\n\nwhere f is an attention based LSTM encoder for the answer. This attention can help the discriminator\nbetter deal with paraphrases across answers. The attention weight is learnt through a 1-layer MLP\nover LSTM output at each time step. The N-pair loss objective encourages learning a space in which\nthe ground truth answer is scored higher than other options, and at the same time, encourages options\nsimilar to ground truth answers to score better than dissimilar ones. This means that, unlike the\nmulticlass logistic loss, the options that are correct but different from the ground truth option may not be\noverly penalized, and thus can be useful in providing a reliable signal to the generator. See Fig. 1 for\nan example. Following [43], we regularize the L2 norm of the embedding vectors to be small.\n\n4.3 Discriminant Perceptual Loss and Knowledge Transfer from D to G\nAt a high-level, our approach for transferring knowledge from D to G is as follows: G repeatedly\nqueries D with answers \u02c6at that it generates for an input embedding et to get feedback and update\nitself. In each such update, G\u2019s goal is to update its parameters to try and have \u02c6at score higher than\nthe correct answer, $a^{gt}_t$, under D\u2019s learned embedding and scoring function. 
Formally, the perceptual\nloss that G aims to optimize is given by:\n\n$\\mathcal{L}_G = \\mathcal{L}_{1\\text{-pair}}\\big(\\{e_t, \\hat{a}_t, a^{gt}_t\\}, f\\big) = \\log\\Big(1 + \\exp\\big(e_t^{\\top} f(a^{gt}_t) - e_t^{\\top} f(\\hat{a}_t)\\big)\\Big)$ (5)\n\nwhere f is the embedding function learned by the discriminator as in (4). Intuitively, updating\ngenerator parameters to minimize $\\mathcal{L}_G$ can be interpreted as learning to produce an answer sequence\n\u02c6at that \u2018fools\u2019 the discriminator into believing that this answer should score higher than the human\nresponse $a^{gt}_t$ under the discriminator\u2019s learned embedding f (\u00b7) and scoring function.\nWhile it is straightforward to sample an answer \u02c6at from the generator and perform a forward pass\nthrough the discriminator, na\u00efvely, it is not possible to backpropagate the gradients to the generator\nparameters since sampling discrete symbols results in zero gradients w.r.t. the generator parameters.\nTo overcome this, we leverage the recently introduced continuous relaxation of the categorical\ndistribution \u2013 the Gumbel-softmax distribution or the Concrete distribution [18, 30].\nAt an intuitive level, the Gumbel-Softmax (GS) approximation uses the so called \u2018Gumbel-Max trick\u2019\nto reparametrize sampling from a categorical distribution and replaces argmax with softmax to obtain\na continuous relaxation of the discrete random variable. Formally, let x denote a K-ary categorical\nrandom variable with parameters denoted by (p1, . . . , pK), or x \u223c Cat(p). Let $(g_i)_1^K$ denote K\nIID samples from the standard Gumbel distribution, $g_i \\sim F(g) = e^{-e^{-g}}$. Now, a sample from the\nConcrete distribution can be produced via the following transformation:\n\n$y_i = \\frac{e^{(\\log p_i + g_i)/\\tau}}{\\sum_{j=1}^{K} e^{(\\log p_j + g_j)/\\tau}}$ \u2200i \u2208 {1, . . . 
, K}\n\n(6)\n\nwhere \u03c4 is a temperature parameter that control how close samples y from this Concrete distribution\napproximate the one-hot encoding of the categorical variable x.\nAs illustrated in Fig. 1, we augment the LSTM in G with a sequence of GS samplers. Speci\ufb01cally,\nat each position in the answer sequence, we use a GS sampler to sample an answer token from\nthat conditional distribution. When coupled with the straight-through gradient estimator [2, 18]\nthis enables end-to-end differentiability. Speci\ufb01cally, during the forward pass we discretize the GS\nsamples into discrete samples, and in the backward pass use the continuous relaxation to compute\ngradients. In our experiments, we held the temperature parameter \ufb01xed at 0.5.\n\n5\n\n\f5 Experiments\nDataset and Setup. We evaluate our proposed approach on the VisDial dataset [7], which was\ncollected by Das et al. by pairing two subjects on Amazon Mechanical Turk to chat about an image.\nOne person was assigned the role of a \u2018questioner\u2019 and the other of \u2018answerer\u2019. One worker (the\nquestioner) sees only a single line of text describing an image (caption from COCO [25]); the image\nremains hidden to the questioner. Their task is to ask questions about this hidden image to \u201cimagine\nthe scene better\u201d. The second worker (the answerer) sees the image and caption and answers the\nquestions. The two workers take turns asking and answering questions for 10 rounds. We perform\nexperiments on VisDial v0.9 (the latest available release) containing 83k dialogs on COCO-train and\n40k on COCO-val images, for a total of 1.2M dialog question-answer pairs. We split the 83k into 82k\nfor train, 1k for val, and use the 40k as test, in a manner consistent with [7]. The caption is\nconsidered to be the \ufb01rst round in the dialog history.\nEvaluation Protocol. 
Following the evaluation protocol established in [7], we use a retrieval setting\nto evaluate the responses at each round in the dialog. Speci\ufb01cally, every question in VisDial is\ncoupled with a list of 100 candidate answer options, which the models are asked to sort for evaluation\npurposes. D uses its score to rank these answer options, and G uses the log-likelihood of these\noptions for ranking. Models are evaluated on standard retrieval metrics \u2013 (1) mean rank, (2) recall\n@k, and (3) mean reciprocal rank (MRR) \u2013 of the human response in the returned sorted list.\nPre-processing. We truncate captions/questions/answers longer than 24/16/8 words respectively. We\nthen build a vocabulary of words that occur at least 5 times in train, resulting in 8964 words.\nTraining Details In our experiments, all 3 LSTMs are single layer with 512d hidden state. We use\nVGG-19 [42] to get the representation of image. We \ufb01rst rescale the images to be 224 \u00d7 224 pixels,\nand take the output of last pooling layer (512 \u00d7 7 \u00d7 7) as image feature. We use the Adam optimizer\nwith a base learning rate of 4e-4. We pre-train G using standard MLE for 20 epochs, and D with\nsupervised training based on Eq (4) for 30 epochs. Following [43], we regularize the L2 norm of the\nembedding vectors to be small. Subsequently, we train G with LG + \u03b1LM LE, which is a combination\nof discriminative perceptual loss and MLE loss. We set \u03b1 to be 0.5. We found that including LM LE\n(with teacher-forcing) is important for encouraging G to generate grammatically correct responses.\n\n5.1 Results and Analysis\nBaselines. We compare our proposed techniques to the current state-of-art generative and discriminat-\nive models developed in [7]. Speci\ufb01cally, [7] introduced 3 encoding architectures \u2013 Late Fusion (LF),\nHierarchical Recurrent Encoder (HRE), Memory Network (MN) \u2013 each trained with a generative\n(-G) and discriminative (-D) decoder. 
We compare to all 6 models.\nOur approaches. We present a few variants of our approach to systematically study the individual\ncontributions of our training procedure, novel encoder (HCIAE), self-attentive answer encoding\n(ATT), and metric-loss (NP).\n\n\u2022 HCIAE-G-MLE is a generative model with our proposed encoder trained under the MLE ob-\njective. Comparing this variant to the generative baselines from [7] establishes the improvement\ndue to our encoder (HCIAE).\n\n\u2022 HCIAE-G-DIS is a generative model with our proposed encoder trained under the mixed MLE\nand discriminator loss (knowledge transfer). This forms our best generative model. Comparing\nthis model to HCIAE-G-MLE establishes the improvement due to our discriminative training.\n\u2022 HCIAE-D-MLE is a discriminative model with our proposed encoder, trained under the stand-\nard discriminative cross-entropy loss. The answer candidates are encoded using an LSTM\n(no attention). Comparing this variant to the discriminative baselines from [7] establishes the\nimprovement due to our encoder (HCIAE) in the discriminative setting.\n\n\u2022 HCIAE-D-NP is a discriminative model with our proposed encoder, trained under the n-pair\ndiscriminative loss (as described in Section 4.2). The answer candidates are encoded using an\nLSTM (no attention). Comparing this variant to HCIAE-D-MLE establishes the improvement\ndue to the n-pair loss.\n\n\u2022 HCIAE-D-NP-ATT is a discriminative model with our proposed encoder, trained under the\nn-pair discriminative loss (as described in Section 4.2), and using the self-attentive answer\nencoding. Comparing this variant to HCIAE-D-NP establishes the improvement due to the\nself-attention mechanism while encoding the answers.\n\n6\n\n\fTable 1: Results (generative) on VisDial dataset. 
\u201cMRR\u201d\nis mean reciprocal rank and \u201cMean\u201d is mean rank.\nModel MRR R@1 R@5 R@10 Mean\nLF-G [7] 0.5199 41.83 61.78 67.59 17.07\nHREA-G [7] 0.5242 42.28 62.33 68.17 16.79\nMN-G [7] 0.5259 42.29 62.85 68.88 17.06\nHCIAE-G-MLE 0.5386 44.06 63.55 69.24 16.01\nHCIAE-G-DIS 0.5467 44.35 65.28 71.55 14.23\n\nTable 2: Results (discriminative) on VisDial dataset.\nModel MRR R@1 R@5 R@10 Mean\nLF-D [7] 0.5807 43.82 74.68 84.07 5.78\nHREA-D [7] 0.5868 44.82 74.81 84.36 5.66\nMN-D [7] 0.5965 45.55 76.22 85.37 5.46\nHCIAE-D-MLE 0.6140 47.73 77.50 86.35 5.15\nHCIAE-D-NP 0.6182 47.98 78.35 87.16 4.92\nHCIAE-D-NP-ATT 0.6222 48.48 78.75 87.59 4.81\n\nResults. Tables 1 and 2 present results for all our models and baselines in generative and discriminative\nsettings. The key observations are:\n\n1. Main Results for HCIAE-G-DIS: Our \ufb01nal generative model with all \u2018bells and whistles\u2019,\nHCIAE-G-DIS, uniformly performs the best under all the metrics, outperforming the\nprevious state-of-art model MN-G by 2.43% on R@5. This shows the importance of the\nknowledge transfer from the discriminator and the bene\ufb01t from our encoder architecture.\n\n2. Knowledge transfer vs. encoder for G: To understand the relative importance of the proposed\nhistory conditioned image attentive encoder (HCIAE) and the knowledge transfer, we\ncompared the performance of HCIAE-G-DIS with HCIAE-G-MLE, which uses our proposed\nencoder but without any feedback from the discriminator. This comparison highlights\ntwo points: \ufb01rst, HCIAE-G-MLE improves R@5 by 0.7% over the current state-of-art\ngenerative method (MN-G), con\ufb01rming the bene\ufb01ts of our encoder. Secondly, and importantly, its\nperformance is lower than HCIAE-G-DIS by 1.7% on R@5, con\ufb01rming that the modi\ufb01cations\nto encoder alone will not be suf\ufb01cient to gain improvements in answer generation;\nknowledge transfer from D greatly improves G.\n\n3. Metric loss vs. 
self-attentive answer encoding: In the purely discriminative setting, our\n\ufb01nal discriminative model (HCIAE-D-NP-ATT) also beats the performance of the corresponding\nstate-of-art models [7] by 2.53% on R@5. The n-pair loss used in the discriminator\nis not only helpful for knowledge transfer but it also improves the performance of the\ndiscriminator by 0.85% on R@5 (compare HCIAE-D-NP to HCIAE-D-MLE). The answer\nattention mechanism yields an additional, albeit\nsmall, gain of 0.4% on R@5 in discriminator performance (compare HCIAE-D-NP to\nHCIAE-D-NP-ATT).\n\n5.2 Does updating discriminator help?\nRecall that our model training happens as follows: we independently train the generative model\nHCIAE-G-MLE and the discriminative model HCIAE-D-NP-ATT. With HCIAE-G-MLE as the\ninitialization, the generative model is updated based on the feedback from HCIAE-D-NP-ATT and\nthis results in our \ufb01nal HCIAE-G-DIS.\nWe performed two further experiments to answer the following questions:\n\n\u2022 What happens if we continue training HCIAE-D-NP-ATT in an adversarial setting? In particular,\nwe continue training by maximizing the score of the ground truth answer $a^{gt}_t$ and\nminimizing the score of the generated answer \u02c6at, effectively setting up an adversarial training\nregime LD = \u2212LG. The resulting discriminator HCIAE-GAN1 has a signi\ufb01cant drop in\nperformance, as can be seen in Table 4 (32.97% R@5). This is perhaps expected because\nHCIAE-GAN1 updates its parameters based on only two answers, the ground truth and the\ngenerated sample (which is likely to be similar to ground truth). This wrecks the structure that\nHCIAE-D-NP-ATT had previously learned by leveraging additional incorrect options.\n\n\u2022 What happens if we continue structure-preserving training of HCIAE-D-NP-ATT? 
In addition\nto providing HCIAE-D-NP-ATT samples from G as fake answers, we also include incorrect\noptions as negative answers so that the structure learned by the discriminator is preserved.\nHCIAE-D-NP-ATT continues to train under loss LD. In this case (HCIAE-GAN2 in Table 4),\nwe find that there is a small improvement in the performance of G. The additional computational\noverhead of training the discriminator outweighs the performance improvement. Also note that\nHCIAE-D-NP-ATT itself gets worse at the dialog task.\n\nTable 3: Qualitative comparison. \u201cOurs\u201d are samples from the G-DIS model with different Gumbel noise z. Images\nfrom the COCO dataset [25].\n\nQ: Is it a home or restaurant?\nA: I think restaurant.\nG-MLE: I can\u2019t tell.\nOurs (z1): Hard to say.\nOurs (z2): It looks like a restaurant.\nOurs (z3): I can\u2019t tell because it is too close.\n\nQ: Can you see his face?\nA: I am not sure.\nG-MLE: Yes.\nOurs (z1): I can only see the back of his body.\nOurs (z2): No.\nOurs (z3): No, he\u2019s too far away.\n\nQ: How old does the man seem to be?\nA: 20\u2019s.\nG-MLE: Late teens.\nOurs (z1): On his 20s.\nOurs (z2): In his twenties.\nOurs (z3): Mid 20\u2019s.\n\nQ: Can you see broccoli?\nA: Yes, 3 larger pieces and some small.\nG-MLE: No, just the broccoli.\nOurs (z1): I can see broccoli, slightly butter.\nOurs (z2): Yes, there is broccoli.\nOurs (z3): Yes, broccoli is green.\n\nTable 4: Adversarial training results on VisDial dataset.\nModel Discriminative (MRR R@1 R@5 R@10 Mean) Generative (MRR R@1 R@5 R@10 Mean)\nHCIAE-D-NP-ATT 0.6222 48.48 78.75 87.59 4.81 - - - - -\nHCIAE-G-DIS - - - - - 0.5467 44.35 65.28 71.55 14.23\nHCIAE-GAN1 0.2177 8.82 32.97 52.14 18.53 0.5298 43.12 62.74 68.58 16.25\nHCIAE-GAN2 0.6050 46.20 77.92 87.20 4.97 0.5459 44.33 65.05 71.40 14.34\n\nOne might wonder, why not train a GAN for visual 
dialog? Formulating the task in a GAN setting\nwould involve G and D training in tandem, with D providing feedback as to whether a response that\nG generates is real or fake. We found this to be a particularly unstable setting, for two main reasons.\nFirst, consider the case when the ground truth answer and the generated answer are the same. This\nhappens for answers that are typically short or \u2018cryptic\u2019 (e.g. \u2018yes\u2019). In this case, D cannot train itself\nor provide feedback, as the answer is labeled both positive and negative. Second, in cases where the\nground truth answer is descriptive but the generator provides a short answer, D can quickly become\npowerful enough to discard generated samples as fake. In this case, D is not able to provide any\ninformation to G to get better at the task. Our experience suggests that the discriminator, if one were\nto consider a \u2018GANs for visual dialog\u2019 setting, cannot merely be focused on differentiating fake\nfrom real. It needs to be able to score similarity between the ground truth and other answers. Such a\nscoring mechanism provides more reliable feedback to G. In fact, as we show in the previous two\nresults, a pre-trained D that captures this structure is the key ingredient in sharing knowledge with G.\nThe adversarial training of D is not central.\n\n5.3 Qualitative Comparison\nIn Table 3 we present a couple of qualitative examples that compare the responses generated by\nG-MLE and G-DIS. G-MLE predominantly produces \u2018safe\u2019 and less informative answers, such as\n\u2018Yes\u2019 or \u2018I can\u2019t tell\u2019. In contrast, our proposed model G-DIS does so less frequently, and often\ngenerates more diverse yet informative responses.\n\n6 Conclusion\nGenerative models for (visual) dialog are typically trained with an MLE objective. As a result, they\ntend to latch on to safe and generic responses. 
Discriminative (or retrieval) models, on the other hand,\nhave been shown to significantly outperform their generative counterparts. However, discriminative\nmodels cannot be deployed as dialog agents with a real user, where canned candidate responses\nare not available. In this work, we propose transferring knowledge from a powerful discriminative\nvisual dialog model to a generative model. We leverage the Gumbel-Softmax (GS) approximation to\nthe discrete distribution \u2013 specifically, an RNN augmented with a sequence of GS samplers, coupled\nwith a straight-through (ST) gradient estimator for end-to-end differentiability. We also propose a novel visual dialog\nencoder that reasons about image attention informed by the history of the dialog, and employ a\nmetric learning loss along with a self-attentive answer encoding to enable the discriminator to\nlearn meaningful structure in dialog responses. The result is a generative visual dialog model that\nsignificantly outperforms the state of the art.\n\nReferences\n[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick,\n\nand Devi Parikh. Vqa: Visual question answering. In ICCV, 2015.\n\n[2] Yoshua Bengio, Nicholas L\u00e9onard, and Aaron C. Courville. Estimating or propagating gradients through\n\nstochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.\n\n[3] Antoine Bordes and Jason Weston. Learning end-to-end goal-oriented dialog. arXiv preprint\n\narXiv:1605.07683, 2016.\n\n[4] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. Abc-cnn: An attention\nbased convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960,\n2015.\n\n[5] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer.\n\narXiv preprint arXiv:1511.05641, 2015.\n\n[6] Bo Dai, Dahua Lin, Raquel Urtasun, and Sanja Fidler. 
Towards diverse and natural image descriptions via\n\na conditional gan. arXiv preprint arXiv:1703.06029, 2017.\n\n[7] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos\u00e9 M.F. Moura, Devi Parikh,\n\nand Dhruv Batra. Visual dialog. In CVPR, 2017.\n\n[8] Abhishek Das, Satwik Kottur, Jos\u00e9 MF Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual\n\ndialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585, 2017.\n\n[9] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville.\nGuesswhat?! visual object discovery through multi-modal dialogue. arXiv preprint arXiv:1611.08481,\n2016.\n\n[10] Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus. Deep generative image models\n\nusing a laplacian pyramid of adversarial networks. NIPS, 2015.\n\n[11] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan,\nKate Saenko, and Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition\nand Description. In CVPR, 2015.\n\n[12] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on\n\ndeep networks. In NIPS, 2016.\n\n[13] Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Kumar Srivastava, Li Deng, Piotr Doll\u00e1r, Jianfeng\nGao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. From\nCaptions to Visual Concepts and Back. In CVPR, 2015.\n\n[14] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a\n\nmachine? dataset and methods for multilingual image question answering. In NIPS, 2015.\n\n[15] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint\n\narXiv:1508.06576, 2015.\n\n[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron\n\nCourville, and Yoshua Bengio. 
Generative adversarial nets. In NIPS, 2014.\n\n[17] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv\n\npreprint arXiv:1503.02531, 2015.\n\n[18] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv\n\npreprint arXiv:1611.01144, 2016.\n\n[19] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and\n\nsuper-resolution. In ECCV, 2016.\n\n[20] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In\n\nCVPR, 2015.\n\n[21] Matt J. Kusner and Jos\u00e9 Miguel Hern\u00e1ndez-Lobato. Gans for sequences of discrete elements with the\n\ngumbel-softmax distribution. CoRR, abs/1611.04051, 2016.\n\n[22] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes\nTotz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative\nadversarial network. CoRR, abs/1609.04802, 2016.\n\n[23] Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue\n\ngeneration. arXiv preprint arXiv:1701.06547, 2017.\n\n[24] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In ACL 2004 Workshop, 2004.\n[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r,\n\nand C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.\n\n[26] Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau.\nHow not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for\ndialogue response generation. arXiv preprint arXiv:1603.08023, 2016.\n\n[27] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Optimization of image description\n\nmetrics using policy gradient methods. 
arXiv preprint arXiv:1612.00370, 2016.\n\n[28] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention\n\nvia a visual sentinel for image captioning. In CVPR, 2016.\n\n[29] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for\n\nvisual question answering. In NIPS, 2016.\n\n[30] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of\n\ndiscrete random variables. arXiv preprint arXiv:1611.00712, 2016.\n\n[31] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to\n\nanswering questions about images. In ICCV, 2015.\n\n[32] Hongyuan Mei, Mohit Bansal, and Matthew R Walter. Coherent dialogue with attention-based language\n\nmodels. arXiv preprint arXiv:1611.06997, 2016.\n\n[33] Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios P Spithourakis,\nand Lucy Vanderwende. Image-grounded conversations: Multimodal context for natural question and\nresponse generation. arXiv preprint arXiv:1701.08251, 2017.\n\n[34] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation\n\nof machine translation. In ACL, 2002.\n\n[35] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep\n\nconvolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.\n\n[36] Marc\u2019Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with\n\nrecurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.\n\n[37] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering.\n\nIn NIPS, 2015.\n\n[38] Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building\nend-to-end dialogue systems using generative hierarchical neural network models. 
arXiv preprint\narXiv:1507.04808, 2015.\n\n[39] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville,\nand Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv\npreprint arXiv:1605.06069, 2016.\n\n[40] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and\nYoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI,\n2017.\n\n[41] Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. Speaking the\nsame language: Matching machine to human captions by adversarial training. CoRR, abs/1703.10476,\n2017.\n\n[42] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.\n\narXiv preprint arXiv:1409.1556, 2014.\n\n[43] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.\n[44] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun\nNie, Jianfeng Gao, and Bill Dolan. A neural network approach to context-sensitive generation of\nconversational responses. arXiv preprint arXiv:1506.06714, 2015.\n\n[45] Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. End-to-end\noptimization of goal-driven and visually grounded dialogue systems. arXiv preprint arXiv:1703.05423,\n2017.\n\n[46] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In NIPS, 2015.\n[47] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In\n\nNIPS, 2014.\n\n[48] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description\n\nevaluation. In CVPR, 2015.\n\n[49] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image\n\ncaption generator. 
In CVPR, 2015.\n\n[50] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for\n\nvisual question answering. In ECCV, 2016.\n\n[51] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S.\nZemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention.\nCoRR, abs/1502.03044, 2015.\n\n[52] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image\n\nquestion answering. In CVPR, 2016.\n\n[53] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with\n\npolicy gradient. AAAI, 2017.\n\n[54] Junbo Jake Zhao, Micha\u00ebl Mathieu, and Yann LeCun. Energy-based generative adversarial network. CoRR,\n\nabs/1609.03126, 2016.\n\n[55] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using\n\ncycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.", "award": [], "sourceid": 249, "authors": [{"given_name": "Jiasen", "family_name": "Lu", "institution": "Georgia Tech"}, {"given_name": "Anitha", "family_name": "Kannan", "institution": null}, {"given_name": "Jianwei", "family_name": "Yang", "institution": "Georgia Tech"}, {"given_name": "Devi", "family_name": "Parikh", "institution": "Georgia Tech / Facebook AI Research (FAIR)"}, {"given_name": "Dhruv", "family_name": "Batra", "institution": null}]}