{"title": "Cross-Modal Learning with Adversarial Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 10792, "page_last": 10802, "abstract": "With the rapid developments of deep neural networks, numerous deep cross-modal analysis methods have been presented and are being applied in widespread real-world applications, including healthcare and safety-critical environments. However, the recent studies on robustness and stability of deep neural networks show that a microscopic modification, known as adversarial sample, which is even imperceptible to humans, can easily fool a well-performed deep neural network and brings a new obstacle to deep cross-modal correlation exploring. In this paper, we propose a novel Cross-Modal correlation Learning with Adversarial samples, namely CMLA, which for the first time presents the existence of adversarial samples in cross-modal data. Moreover, we provide a simple yet effective adversarial sample learning method, where inter- and intra- modality similarity regularizations across different modalities are simultaneously integrated into the learning of adversarial samples. Finally, our proposed CMLA is demonstrated to be highly effective in cross-modal hashing based retrieval. Extensive experiments on two cross-modal benchmark datasets show that the adversarial examples produced by our CMLA are efficient in fooling a target deep cross-modal hashing network. 
On the other hand, such adversarial examples can significantly strengthen the robustness of the target network by conducting adversarial training.", "full_text": "Cross-Modal Learning with Adversarial Samples

Chao Li1,2  Cheng Deng1,*  De Xie1  Shangqian Gao2  Wei Liu3,*
1School of Electronic Engineering, Xidian University, Xi'an, Shaanxi, China
2Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA
3Tencent AI Lab, China
{chaolee.xd, chdeng.xd, xiede.xd}@gmail.com, shg84@pitt.edu, wl2223@columbia.edu

Abstract

With the rapid development of deep neural networks, numerous deep cross-modal analysis methods have been presented and applied in widespread real-world applications, including healthcare and safety-critical environments. However, recent studies on the robustness and stability of deep neural networks show that a microscopic modification, known as an adversarial sample, which is even imperceptible to humans, can easily fool a well-performing deep neural network, bringing a new obstacle to deep cross-modal correlation learning. In this paper, we propose a novel Cross-Modal correlation Learning with Adversarial samples, namely CMLA, which for the first time demonstrates the existence of adversarial samples in cross-modal data. Moreover, we provide a simple yet effective adversarial sample learning method, where inter- and intra-modality similarity regularizations across different modalities are simultaneously integrated into the learning of adversarial samples. Finally, our proposed CMLA is demonstrated to be highly effective in cross-modal hashing based retrieval. Extensive experiments on two cross-modal benchmark datasets show that the adversarial examples produced by our CMLA are effective in fooling a target deep cross-modal hashing network. 
On the other hand, such adversarial examples can significantly strengthen the robustness of the target network by conducting adversarial training.

1 Introduction

Cross-modal learning, such as cross-modal retrieval, enables a user to retrieve what he/she prefers in one modality (e.g., image) that is relevant to a given query in another (e.g., text). However, due to the drastic growth of multimedia, learning from such tremendous amounts of multimedia data poses a new challenge. The recent success of deep learning and its role in cross-modal learning seem to obviate concerns about performance in both accuracy and speed: deep neural networks are exploited to map data samples of different modalities into compact hash codes, and fast bitwise XOR operations are used to perform retrieval. Extensive efforts [31, 32, 30, 29, 39, 28, 44, 33, 23, 47, 5, 18, 21, 24, 27, 40, 12, 13, 49] have been made and have achieved remarkable retrieval accuracy.
Current studies [17, 20, 37] show that deep networks are vulnerable to purposeful input samples, namely adversarial samples. These samples can easily fool a well-performing deep learning model by adding only a small perturbation that is even imperceptible to humans. Moreover, adversarial samples have been observed in a wide range of areas, such as image classification [36], object recognition [9], object detection [45], speech recognition [2], etc. 
However, the potential risks of deep neural networks being vulnerable to adversarial samples in cross-modal learning have not been delineated.
In this paper, we take cross-modal hashing retrieval as a representative case of cross-modal learning, where the search space can roughly be divided into four parts: T2T, I2I, I2T/T2I, and NR, as shown in

*Corresponding authors.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: (a) The search space in cross-modal retrieval, which can be divided into four parts: I2I (query is image, database is image), T2T (query is text, database is text), I2T (query is image, database is text)/T2I (query is text, database is image), and NR (not relevant samples in two modalities). (b) Adversarial sample learning. (c) Adversarial training.

Fig. 1. NR means that the returned samples are not relevant to the query. In the T2T space, taking a text as a query, only semantically similar texts can be successfully retrieved, and vice versa in I2I. Different from T2T and I2I, I2T/T2I is the main concern of the cross-modal retrieval community, which focuses on information retrieval across different modalities. Existing deep cross-modal hashing works uniformly build the relationships across different modalities by constructing multi-layer neural networks and learn hash codes via a similarity-metric objective. However, due to the lack of sufficient training data points and the neglect of the robustness of their hashing networks, the search space of these methods cannot be sufficiently explored, so the retrieval performance can easily be compromised. Inspired by adversarial training [42], which successfully increases robustness by augmenting training data with adversarial samples, we propose to explore and utilize such adversarial samples to construct a more robust search space in I2T/T2I. 
Different from the tasks mentioned above, in cross-modal retrieval adversarial samples generated from the four different spaces, i.e., T2T, I2I, T2I/I2T, and NR, have different attacking capacities. An ideal, deceptive adversarial sample for the cross-modal setting should not only mount an effective attack on cross-modal retrieval but also preserve retrieval performance on par with a clean sample when executing single-modal retrieval. To be specific, given a cross-modal system, adversarial samples that cause errors in both single-modal and cross-modal retrieval are suspicious and can easily be detected. On the contrary, adversarial samples that cause errors merely in cross-modal retrieval while remaining correct in single-modal retrieval are much harder to discover and are thus more deceptive.
In this paper, we propose a novel Cross-Modal Learning with Adversarial samples, namely CMLA, which can improve the robustness of cross-modal learning, such as a deep cross-modal hashing network. As such, accurate connections across different modalities can be bridged and a more robust cross-modal retrieval system can be established. The highlights of our work can be summarized as follows:

• We propose a simple yet effective cross-modal learning method by exploring cross-modal adversarial samples, where an adversarial sample is defined in two aspects: two perturbations for the different modalities are learned to fool a deep cross-modal network, while the perturbation on each modality does not impact the performance within its own modality.
• A novel cross-modal adversarial sample learning algorithm is presented. To learn cross-modal adversarial samples with high attacking capability, we propose to decrease the inter-modality similarity and simultaneously keep the intra-modality similarity in one optimization.
• We additionally apply the proposed CMLA to cross-modal hash learning. 
Experiments on two widely used cross-modal retrieval benchmarks show the effectiveness of our CMLA in attacking a target retrieval system and further improving its robustness.

2 Related Works

Cross-modal hashing methods focus on building the correlation between different modalities and learning reliable hash codes, and can be mainly categorized into two settings: data-independent hashing methods and data-dependent ones. In data-independent hashing methods, hash codes are learned based on random projections, e.g., locality-sensitive hashing (LSH) [16].

Figure 2: The pipeline of our proposed CMLA for cross-modal hash learning.

Compared with data-dependent methods, data-independent methods always require long bits to encode multi-modal data. Thus, most currently proposed methods are data-dependent ones, which can learn compact hash codes using partial data as a training set. Collective Matrix Factorization Hashing (CMFH) [14] learns hash codes for different modalities by executing collective matrix factorization of different views. Bronstein et al. [3] presented a cross-modal hashing method that preserves intra-class similarity via eigen-decomposition and boosting. Semantics-preserving hashing (SePH) [28] produces unified binary codes by modeling an affinity matrix as a probability distribution while minimizing the Kullback-Leibler divergence. Most of these methods, which depend on hand-crafted features, lack the capacity to fully exploit heterogeneous relations across modalities. 
Recently, models based\non deep neural networks can easily access more discriminative features, leading to a boost in the\ndevelopment of deep cross-modal hashing techniques [6, 15, 23, 43, 26, 11]. Deep cross-modal\nhashing (DCMH) [23] utilizes a deep neural network to perform feature extraction and hash code\nlearning from scratch. Pairwise relationship guided deep hashing (PRDH) [48] integrates different\npairwise constraints to purify the similarities of hash codes from inter-modality and intra-modality.\nSelf-supervised adversarial hashing (SSAH) [25] introduces a label network into the hash learning\nprocess, which facilitates hash code generation.\nDespite the outstanding performance and remarkable achievements, the DNN systems have recently\nbeen shown to be vulnerable to adversarial attacks. Szegedy et al. [42] \ufb01rst showed that a well-\ndesigned small perturbation on images can fool state-of-the-art deep neural networks with high\nprobability. In the following, a series of research efforts on attack [41, 46, 34] and defense [7, 35, 38]\nhave been presented. Fast gradient sign method (FGSM) [17], which uses a sign function on gradients\nfor the inputs to learn adversarial examples, is one of the representative ef\ufb01cient attack algorithms.\nFor the defense methods, adversarial training [42, 17] is adopted to augment their training data of the\nclassi\ufb01er by learning adversarial samples. Distillation technique [38] was presented to reduce the\nmagnitude of the gradients used for adversarial sample creation and also to increase the minimum\nfeature numbers to be modi\ufb01ed. 
However, concerning large-scale cross-modal retrieval, although plenty of deep cross-modal hashing networks have been constructed, there has been no attempt to address the security of DNNs in deep cross-modal hashing models.

3 Cross-Modal Learning with Adversarial Samples

3.1 Problem Definition

Given a cross-modal benchmark O = {o_i}_{i=1}^{N} with N data points, where o_i = (o^v_i, o^t_i, o^l_i), o^v_i and o^t_i are collected in pairs, respectively denoting the image and the textual description of the i-th data point, and o^l_i is a multi-label annotation assigned to o_i. Let S denote a pairwise similarity matrix that describes the semantic similarity between each pair of data points, where S_ij = 1 means that o_i and o_j are semantically similar, and otherwise S_ij = 0. In a multi-label setting, when o_i and o_j share at least one label, S_ij = 1; otherwise S_ij = 0. The main task of cross-modal hashing is to learn two hash functions H*, * ∈ {v, t}, which build cross-modal correlations and generate hash codes B* ∈ {−1, 1}^K for cross-modal data, where K is the code length, and H* are usually learned by deep neural networks in deep cross-modal hashing. We additionally define the outputs of the hash layers as H^v and H^t for image and text, respectively. Binary hash codes B* are generated by applying a sign function to H*:

B* = sign(H*), * ∈ {v, t}.   (1)

For deep hashing networks H*(o*, θ*), let θ* be the network parameters and J(θ^v, θ^t, o^v, o^t) be the loss function, respectively. 
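For illustration, the definitions above can be sketched in a few lines of NumPy (a toy example of ours with hypothetical values, not the authors' released code): the multi-label similarity matrix S follows the shared-label rule, and Eq. (1) binarizes the hash-layer outputs.

```python
import numpy as np

# Toy multi-label annotations for 4 data points over 3 labels
# (hypothetical values, for illustration only).
labels = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
])

# Multi-label rule: S_ij = 1 iff o_i and o_j share at least one label.
S = (labels @ labels.T > 0).astype(int)

# Continuous hash-layer outputs H* are binarized to B* in {-1, +1}^K (Eq. 1).
# np.where is used instead of np.sign so that zeros map to +1, keeping the
# codes strictly binary.
H_v = np.array([[0.7, -0.2, 0.1],
                [-0.5, 0.3, -0.9]])
B_v = np.where(H_v >= 0, 1, -1)
```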
The aim of a cross-modal adversarial attack is to find the minimum perturbations δ^v and δ^t that cause a change in retrieval accuracy. Formally,

Δ(o*, H*) := min_{δ*} ||δ*||_p,  s.t.  max_{δ*} D(H*(o* + δ*; θ*), H*(o*; θ*)),  ||δ*||_p ≤ ε,  * ∈ {v, t},   (2)

where the hash codes H* are generated from the hash layer H* of a deep network with parameters θ*, D(·, ·) is a distance measure, and ||·||_p with p ∈ {1, 2, ∞} denotes the Lp norm, measuring the distance between the learned adversarial sample and the original sample.

3.2 Proposed CMLA

The overall flowchart of the proposed CMLA model is illustrated in Fig. 2. The target deep cross-modal hashing network consists of a CNN-based network and a multilayer perceptron network, each followed by a hash layer that outputs hash codes for the corresponding modality. In addition, two regularizations, namely an inter-modal similarity regularization and an intra-modal similarity regularization, are combined to optimize the learned adversarial samples.
For a better understanding, taking an image data point o^v for example, we intend to generate an adversarial sample ô^v by learning a small perturbation δ^v, where ô^v = o^v + δ^v. In this way, when ô^v is fed into the target deep cross-modal hashing network, semantically irrelevant results in the text modality should be returned. To achieve this goal, the original text information is first fed into the deep hashing network to generate regular hash codes H^t = H_t(o^t, θ^t). The correlation between the two modalities built during cross-modal hash code generation is treated as a supervision signal to learn the optimal perturbation for each modality. 
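As a concrete reading of Eq. (2), for binary codes in {−1, +1}^K the distance D(·, ·) can be taken as the Hamming distance, which for such codes reduces to a simple inner-product expression. The sketch below (our illustration with hypothetical toy codes, not the authors' code) makes this explicit.

```python
import numpy as np

def hamming_distance(b1, b2):
    # For codes in {-1, +1}^K: d_H(b1, b2) = (K - <b1, b2>) / 2.
    b1, b2 = np.asarray(b1), np.asarray(b2)
    return int((b1.size - b1 @ b2) // 2)

# Hypothetical 8-bit codes of a clean query and its adversarial counterpart:
# the attack in Eq. (2) seeks a small perturbation delta* that pushes the
# code of o* + delta* far from the code of o*.
clean    = np.array([ 1, -1,  1,  1, -1, -1,  1, -1])
attacked = np.array([-1, -1,  1, -1, -1,  1,  1, -1])
```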
With H^t, adversarial sample learning can be transferred to a problem of maximizing the Hamming distance between cross-modal hash codes. This can be solved by introducing an inter-modal similarity regularization, with which δ^v will be optimized by maximizing the Hamming distance between Ĥ^v and H^t. We formulate this inter-modal similarity loss function as:

J^v_inter = min_{δ^v} Σ_{i,j=1}^{N} ( log(1 + e^{−Γ_ij}) + S_ij Γ_ij ) + Σ_{i=1}^{N} ||ô^v_i − o^v_i||_p,  where Γ_ij = (1/2)(Ĥ^v_i)(H^t_j)^⊤.   (3)

Moreover, compared with single-modal retrieval, a notable difference of cross-modal learning is that the latter not only builds correlations across modalities but also retains the correlations within each modality. Thus, during adversarial sample learning for cross-modal data, the correlation within an individual modality should be kept, which means that the learned cross-modal perturbation cannot change the intra-modal similarity relationship. 
To learn this perturbation, an additional intra-modal similarity regularization function is adopted in our CMLA, which can be written as:

J^v_intra = min_{δ^v} Σ_{i,j=1}^{N} ( log(1 + e^{Θ_ij}) − S_ij Θ_ij ),  where Θ_ij = (1/2)(Ĥ^v_i)(H^v_j)^⊤.   (4)

Algorithm 1 Cross-Modal correlation Learning with Adversarial samples (CMLA).
Input: target deep cross-modal hashing networks H*(o*, θ*), * ∈ {v, t}, and a cross-modal dataset with N data points: {image, text, and label};
Output: optimal perturbations δ^v, δ^t;
1  Maximum iteration = Tmax, Batch_Size = 128, n = ⌈N/128⌉;
2  for j = 1, j ≤ n do
3      initialize iter = 0;
4      while iter ≤ Tmax do
5          compute H^v = H_v(o^v, θ^v), H^t = H_t(o^t, θ^t);
6          if not converged then
               update δ^v and δ^t:
               δ^v = arg min_{δ^v} J^v(o^v, θ^v, H^v, H^t);
               δ^t = arg min_{δ^t} J^t(o^t, θ^t, H^v, H^t);
7          end
8      end
9  end
10 return δ^v and δ^t.

Therefore, the objective function of our CMLA for image-modality adversarial sample learning is formulated as:

J^v = min_{δ^v} [ α Σ_{i,j=1}^{N} ( log(1 + e^{−Γ_ij}) + S_ij Γ_ij ) + β Σ_{i,j=1}^{N} ( log(1 + e^{Θ_ij}) − S_ij Θ_ij ) + γ Σ_{i=1}^{N} ||ô^v_i − o^v_i||_p ],   (5)

where Γ_ij = (1/2)(Ĥ^v_i)(H^t_j)^⊤, Θ_ij = (1/2)(Ĥ^v_i)(H^v_j)^⊤, and α, β, and γ are hyper-parameters. 
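To make the objective concrete, the following NumPy sketch (our illustration on toy arrays; the actual method optimizes δ^v through the network with the Adam optimizer, as stated in the implementation details) evaluates the image-modality loss of Eq. (5):

```python
import numpy as np

def cmla_image_loss(H_v_adv, H_v, H_t, S, o_v_adv, o_v,
                    alpha=1.0, beta=1.0, gamma=1.0, p=2):
    # Inter-modal term (Eq. 3): for similar pairs (S_ij = 1) the loss
    # decreases as Gamma_ij becomes negative, pushing adversarial image
    # codes away from the corresponding text codes.
    Gamma = 0.5 * H_v_adv @ H_t.T
    inter = np.sum(np.log1p(np.exp(-Gamma)) + S * Gamma)
    # Intra-modal term (Eq. 4): keeps the similarity structure within
    # the image modality intact.
    Theta = 0.5 * H_v_adv @ H_v.T
    intra = np.sum(np.log1p(np.exp(Theta)) - S * Theta)
    # Perturbation-magnitude penalty (Lp norm per sample).
    diff = (o_v_adv - o_v).reshape(len(o_v), -1)
    dist = np.sum(np.linalg.norm(diff, ord=p, axis=1))
    return alpha * inter + beta * intra + gamma * dist
```

Minimizing the inter-modal term drives the adversarial image codes away from the text codes of semantically similar pairs, while the intra-modal term and the norm penalty keep the sample deceptive and small.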
In a similar way, adversarial samples for the text modality can be learned, where the objective function is written as:

J^t = min_{δ^t} [ λ Σ_{i,j=1}^{N} ( log(1 + e^{−Υ_ij}) + S_ij Υ_ij ) + ξ Σ_{i,j=1}^{N} ( log(1 + e^{Ψ_ij}) − S_ij Ψ_ij ) + η Σ_{i=1}^{N} ||ô^t_i − o^t_i||_p ],   (6)

where Υ_ij = (1/2)(Ĥ^t_i)(H^v_j)^⊤, Ψ_ij = (1/2)(Ĥ^t_i)(H^t_j)^⊤, and λ, ξ, and η are hyper-parameters. To solve the problems in Eqs. (5) and (6), we fix the network parameters and optimize δ^v and δ^t. Considering the structural difference between perturbations of image and text, CMLA learns different perturbations for the two modalities, with δ^v and δ^t being updated iteratively. Algorithm 1 summarizes the learning procedure of the proposed CMLA.

4 Experiments

4.1 Experimental Setup

Extensive experiments on two benchmarks, MIRFlickr-25K [22] and NUS-WIDE [10], are conducted to evaluate the performance of our proposed CMLA on two state-of-the-art deep cross-modal hashing networks as well as their variations.
MIRFlickr-25K [22] is collected from Flickr and contains 25,000 images. Each image is labeled with an associated text description. 20,015 image-text pairs are selected in our experiments, and each image-text pair is annotated with at least one of 24 unique labels. For the text modality, each text is represented by a 1,386-dimensional bag-of-words vector.
NUS-WIDE [10] is a public web image dataset containing 269,648 web images. 81 ground-truth concepts have been annotated for retrieval evaluation. 
After pruning the data points that have no label or text information, a subset of 190,421 image-text pairs belonging to the 21 most-frequent concepts is selected as the dataset in our experiments. We use a 1,000-dimensional bag-of-words vector to represent each text data point.

Table 1: Comparison in terms of MAP scores of two retrieval tasks on MIRFlickr-25K and NUS-WIDE datasets with different lengths of hash codes.

Image Query v.s. Text Database
                 MIRFlickr-25K                 NUS-WIDE
  Method      16     32     48     64      16     32     48     64
  DCMH      0.736  0.749  0.756  0.761   0.595  0.607  0.620  0.641
  DCMH+     0.805  0.816  0.825  0.828   0.658  0.679  0.686  0.683
  SSAH      0.797  0.805  0.807  0.807   0.645  0.660  0.670  0.672
  SSAH+     0.804  0.815  0.826  0.829   0.660  0.675  0.690  0.694

Text Query v.s. Image Database
  DCMH      0.796  0.797  0.804  0.806   0.601  0.614  0.623  0.645
  DCMH+     0.810  0.820  0.820  0.819   0.679  0.691  0.693  0.690
  SSAH      0.798  0.805  0.807  0.804   0.661  0.677  0.681  0.684
  SSAH+     0.808  0.809  0.814  0.815   0.671  0.685  0.693  0.697

Evaluations. In order to evaluate the performance of the proposed CMLA, we follow previous works [28, 4, 5] and adopt three commonly used evaluation criteria in cross-modal retrieval: Mean Average Precision (MAP), which measures the accuracy of the Hamming distances; the precision-recall curve (PR curve), which measures the accuracy of hash lookups; and the Precision@1000 curve, which evaluates the precision of the top 1,000 retrieved results. The distortion D between the original modality data o* and the distorted modality data ô* is measured by D = sqrt( Σ (ô* − o*)² / M ), * ∈ {v, t}. 
M is set as 150,528 (224 × 224 × 3) for the image modality, while for the text modality we set M as 1,380 and 1,000 for MIRFlickr-25K and NUS-WIDE, respectively, depending on their dimensions.
Baselines. DCMH [23] and SSAH [25], two representative deep cross-modal hashing networks, are selected as the target cross-modal hashing models. Following the settings in [23] and [25], we retrain DCMH and SSAH. Moreover, in order to evaluate cross-network transfer, we additionally construct two improved versions, DCMH+ and SSAH+, which are built by replacing the vgg-f [8] network with the ResNet50 [19] network.
Implementation Details. Our proposed CMLA is implemented in TensorFlow [1] and runs on a server with two NVIDIA Tesla P40 GPUs, each with 24GB of graphics memory. All images are resized to 224×224×3 before being used as inputs. In adversarial sample learning, we use the Adam optimizer with initial learning rates of 0.5 and 0.002 for the image and text modalities, respectively, and train each sample for Tmax iterations. All hyper-parameters α, β, λ, ξ, γ, and η are set to 1 empirically. The mini-batch size is fixed at 128. εv is set as 8 for the image modality, and εt is set as 0.01 for the text modality. After an adversarial sample is generated, we clip the image into [0, 255] and the text into [0, 1], respectively. The results reported in our experiments are all averages over 10 runs.

4.2 Results

For MIRFlickr-25K, 2,000 data points are randomly selected as a query set, 10,000 data points are used as a training set to train the target retrieval network model, and the remainder is kept as the retrieval database. 5,000 data points from the training set are further sampled to learn adversarial samples. 
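The distortion measure and the clipping step from the implementation details above can be sketched as follows (our illustration, assuming the root-mean-square reading of D; the ranges [0, 255] and [0, 1] follow the text):

```python
import numpy as np

def distortion(o_adv, o, M):
    # Root-mean-square distortion between distorted and original data:
    # D = sqrt( sum((o_adv - o)^2) / M ), with M the data dimensionality.
    return float(np.sqrt(np.sum((np.asarray(o_adv) - np.asarray(o)) ** 2) / M))

def clip_adversarial(o, delta, low, high):
    # Keep the adversarial sample inside the valid data range, e.g.
    # [0, 255] for images and [0, 1] for bag-of-words text vectors.
    return np.clip(np.asarray(o) + np.asarray(delta), low, high)
```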
For NUS-WIDE, we randomly sample 2,100 data points as a query set and 10,500 data points as a training set. Similarly, 5,000 data points from the training set are sampled to learn adversarial samples. The source codes of DCMH and SSAH are provided by their authors. Moreover, the two variations DCMH+ and SSAH+ are constructed by replacing the vgg-f network with the ResNet50 network. All models are retrained from scratch, and their performances are shown in Table 1. It can be seen that the target networks DCMH and SSAH achieve performances similar to those in their original papers and achieve more promising results after being equipped with ResNet50, which has more layers than vgg-f.

Table 2: Comparison in terms of MAP scores and distortions (D) of two retrieval tasks on MIRFlickr-25K and NUS-WIDE datasets with 32-bit code length. Each cell reports MAP (D).

Image Query v.s. Text Database
                  Iteration      100            200            500
  MIRFlickr-25K   DCMH       0.526 (0.031)  0.499 (0.026)  0.457 (0.025)
                  DCMH+      0.679 (0.034)  0.671 (0.028)  0.665 (0.020)
                  SSAH       0.609 (0.033)  0.583 (0.031)  0.578 (0.028)
                  SSAH+      0.681 (0.038)  0.699 (0.032)  0.674 (0.023)
  NUS-WIDE        DCMH       0.579 (0.039)  0.563 (0.023)  0.521 (0.019)
                  DCMH+      0.591 (0.025)  0.543 (0.026)  0.502 (0.024)
                  SSAH       0.631 (0.041)  0.599 (0.038)  0.554 (0.029)
                  SSAH+      0.587 (0.032)  0.534 (0.029)  0.460 (0.026)

Text Query v.s. Image Database
  MIRFlickr-25K   DCMH       0.523 (0.037)  0.447 (0.035)  0.371 (0.030)
                  DCMH+      0.603 (0.031)  0.595 (0.025)  0.589 (0.023)
                  SSAH       0.628 (0.035)  0.549 (0.031)  0.533 (0.027)
                  SSAH+      0.611 (0.021)  0.605 (0.019)  0.593 (0.017)
  NUS-WIDE        DCMH       0.615 (0.048)  0.587 (0.027)  0.561 (0.019)
                  DCMH+      0.523 (0.025)  0.474 (0.023)  0.427 (0.019)
                  SSAH       0.619 (0.037)  0.577 (0.033)  0.564 (0.021)
                  SSAH+      0.501 (0.042)  0.454 (0.035)  0.351 (0.017)

Figure 3: PR and Precision@1000 curves (panels (a)-(h)) evaluated on MIRFlickr-25K and NUS-WIDE datasets with 32 bits.

Given a target deep cross-modal hashing network, to evaluate the attacking performance of the proposed cross-modal adversarial sample learning method, we first fix the network parameters and execute CMLA to generate an adversarial sample for each query data point. Then, we use each adversarial query sample to retrieve data points across modalities, with the same evaluation settings as in Table 1. Taking hash code learning with a code length of 32 bits as an example, we train each sample for different numbers of iterations, from 100 to 500. The results are shown in Table 2, where we provide MAP values and distortions (D) to show the relationship between retrieval performance and distortion as the number of training iterations grows. Compared with the results reported in Table 1, it is obvious that: (1) the performances of both DCMH and SSAH are severely decreased by adding only a small distortion to the original modality data; (2) with the growth of training iterations, CMLA can simultaneously maintain a high attacking performance and continuously reduce the magnitude of the learned disturbance. The same trends are shown in Fig. 3, where PR curves and Precision@1000 curves demonstrate the effectiveness of our proposed CMLA. Some adversarial samples and the corresponding original data points are also given in Fig. 4. 
The images listed above are the original images, and their corresponding adversarial samples are shown below them; the difference is nearly imperceptible to human eyes. For the text modality, we show the learned adversarial samples, which are constructed by mixing the learned distortions with the original bag-of-words vectors. Compared with the image modality, the text adversarial samples have relatively large distortions due to the discrete nature of their representation.

Figure 4: Adversarial samples of different 
modalities learned by the proposed CMLA.

(a) Adversarial Image  (b) Adversarial Text  (c) Adversarial Image  (d) Adversarial Text

Figure 5: Ablation studies of intra-modal similarity evaluated on MIRFlickr-25K and NUS-WIDE datasets with 32 bits.

4.3 Further Analysis

As mentioned above, there is a main difference between cross-modal retrieval and single-modal retrieval, which only retrieves data points within an identical modality. A well-designed cross-modal retrieval system can retrieve both same- and different-modality data points with high accuracy. Therefore, a more deceptive cross-modal adversarial sample should not only fool a cross-modal retrieval system but also simultaneously maintain high performance in single-modal retrieval. For a detailed elaboration, an ablation study is conducted by cutting off the intra-modal loss in our CMLA. We take SSAH as an example. Two single-modal retrieval tasks, I2I and T2T, are respectively executed on MIRFlickr-25K and NUS-WIDE. Fig. 5 shows the retrieval results, from which it is obvious that the adversarial samples learned without the intra-modal similarity constraint on MIRFlickr-25K achieve 15.7% and 9.3% performance drops on I2T and T2I compared with those learned with this constraint, but the retrieval performances within a single modality also obviously decrease, from 0.674 (0.549) to 0.420 (0.426) on I2I (T2T). Such adversarial samples, which have lower performance in single-modal retrieval, can easily be detected by a single-modal retrieval verification before being fed into a cross-modal retrieval system. When equipped with our intra-modal similarity constraint, the variance of CMLA is further constrained and the model is forced to learn more deceptive adversarial samples. 
In this way, CMLA can fool a cross-modal retrieval system without being detected by a single-modal retrieval verification.
The main goal of our adversarial sample learning is to improve the robustness of a deep cross-modal hashing network. Thus, we additionally learn adversarial samples from the training set and then conduct adversarial training by combining these adversarial samples with the training set. The retrieval performances of the retrained deep cross-modal hashing networks under identical attacks

Table 3: Comparison in terms of MAP scores on MIRFlickr-25K and NUS-WIDE datasets with different lengths of hash codes. 
All networks are evaluated after adversarial training.

Task             Method   MIRFlickr-25K                NUS-WIDE
                          16     32     48     64      16     32     48     64
Image Query      DCMH     0.711  0.723  0.735  0.759   0.578  0.587  0.611  0.628
v.s.             DCMH+    0.779  0.781  0.803  0.801   0.647  0.649  0.666  0.677
Text Database    SSAH     0.783  0.784  0.788  0.785   0.615  0.640  0.658  0.661
                 SSAH+    0.784  0.788  0.787  0.789   0.621  0.635  0.662  0.670
Text Query       DCMH     0.771  0.775  0.782  0.786   0.610  0.603  0.611  0.620
v.s.             DCMH+    0.793  0.801  0.803  0.799   0.655  0.673  0.675  0.676
Image Database   SSAH     0.790  0.788  0.792  0.791   0.638  0.659  0.660  0.664
                 SSAH+    0.789  0.789  0.794  0.793   0.641  0.664  0.668  0.667

of adversarial query samples are shown in Table 3. Comparing Table 1, Table 2, and Table 3, we can see that each deep cross-modal hashing network achieves a significant performance increase after adversarial training. Thus, the proposed CMLA can effectively learn adversarial samples, and in turn the learned adversarial samples can be leveraged to improve the robustness of the targeted deep cross-modal hashing network.

5 Conclusions

This paper presents a novel Cross-Modal Learning method with Adversarial samples, dubbed CMLA. First, we demonstrated the existence of adversarial samples across two different modalities. Second, we proposed an effective adversarial sample learning method that simultaneously maximizes inter-modality similarity and minimizes intra-modality similarity. Third, we verified the proposed CMLA on a cross-modal hashing retrieval task with extensive experiments. As our main purpose is to build an accurate relationship across modalities and to improve the robustness of a target retrieval system, additional adversarial training was performed on the targeted cross-modal hashing network.
The experiments on two widely-used cross-modal retrieval datasets show the high attack efficiency of our proposed adversarial sample learning method. Moreover, these adversarial samples can in turn further improve the robustness of existing deep cross-modal hashing networks, achieving state-of-the-art performance on cross-modal retrieval tasks.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (61572388), the National Key Research and Development Program of China (2017YFE0104100), and the Key R&D Program (Key Industry Innovation Chain) of Shaanxi under Grant 2018ZDXM-GY-176.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, pages 265–283, 2016.

[2] Giuseppe Ateniese, Giovanni Felici, Luigi V Mancini, Angelo Spognardi, Antonio Villani, and Domenico Vitali. Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. arXiv preprint arXiv:1306.4447, 2013.

[3] Michael M. Bronstein, Alexander M. Bronstein, Fabrice Michel, and Nikos Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In CVPR, pages 3594–3601, 2010.

[4] Juan C Caicedo and Svetlana Lazebnik. Active object localization with deep reinforcement learning. In CVPR, pages 2488–2496, 2015.

[5] Yue Cao, Bin Liu, Mingsheng Long, and Jianmin Wang. Cross-modal hamming hashing. In ECCV, pages 207–223, 2018.

[6] Yue Cao, Mingsheng Long, Jianmin Wang, Qiang Yang, and Philip S. Yu. Deep visual-semantic hashing for cross-modal retrieval. In KDD, pages 1445–1454, 2016.

[7] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods.
In ACM AISEC, pages 3–14. ACM, 2017.

[8] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.

[9] Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng Polo Chau. ShapeShifter: Robust physical adversarial attack on Faster R-CNN object detector. In ECML, pages 52–68. Springer, 2018.

[10] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR, page 48, 2009.

[11] Cheng Deng, Zhaojia Chen, Xianglong Liu, Xinbo Gao, and Dacheng Tao. Triplet-based deep hashing network for cross-modal retrieval. IEEE Transactions on Image Processing, 27(8):3893–3903, 2018.

[12] Cheng Deng, Erkun Yang, Tongliang Liu, Jie Li, Wei Liu, and Dacheng Tao. Unsupervised semantic-preserving adversarial hashing for image search. IEEE Transactions on Image Processing, 28(8):4032–4044, 2019.

[13] Cheng Deng, Erkun Yang, Tongliang Liu, and Dacheng Tao. Two-stream deep hashing with class-specific centers for supervised image search. IEEE Transactions on Neural Networks and Learning Systems, 2019.

[14] Guiguang Ding, Yuchen Guo, and Jile Zhou. Collective matrix factorization hashing for multimodal data. In CVPR, pages 2083–2090, 2014.

[15] Venice Erin Liong, Jiwen Lu, Yap-Peng Tan, and Jie Zhou. Cross-modal deep variational hashing. In ICCV, pages 4077–4085, 2017.

[16] Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.

[17] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[18] Jiuxiang Gu, Jianfei Cai, Shafiq R Joty, Li Niu, and Gang Wang.
Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR, pages 7181–7189, 2018.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[20] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[21] Yan Huang, Qi Wu, Chunfeng Song, and Liang Wang. Learning semantic concepts and order for image and sentence matching. In CVPR, pages 6163–6171, 2018.

[22] Mark J Huiskes and Michael S Lew. The MIR Flickr retrieval evaluation. In MIR, pages 39–43, 2008.

[23] Qing-Yuan Jiang and Wu-Jun Li. Deep cross-modal hashing. In CVPR, pages 3232–3240, 2017.

[24] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In ECCV, pages 201–216, 2018.

[25] Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, and Dacheng Tao. Self-supervised adversarial hashing networks for cross-modal retrieval. In CVPR, pages 4242–4251, 2018.

[26] Chao Li, Cheng Deng, Lei Wang, De Xie, and Xianglong Liu. Coupled CycleGAN: Unsupervised hashing network for cross-modal retrieval. In AAAI, 2019.

[27] Yeqing Li, Wei Liu, and Junzhou Huang. Sub-selective quantization for learning binary codes in large-scale image search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1526–1532, 2018.

[28] Zijia Lin, Guiguang Ding, Mingqing Hu, and Jianmin Wang. Semantics-preserving hashing for cross-view retrieval. In CVPR, pages 3864–3872, 2015.

[29] Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang. Discrete graph hashing. In NIPS, pages 3419–3427, 2014.

[30] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081.
IEEE, 2012.

[31] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In ICML, pages 1–8, 2011.

[32] Wei Liu, Jun Wang, Yadong Mu, Sanjiv Kumar, and Shih-Fu Chang. Compact hyperplane hashing with bilinear functions. In ICML, pages 467–474, 2012.

[33] Wei Liu and Tongtao Zhang. Multimedia hashing and networking. IEEE MultiMedia, 23(3):75–79, 2016.

[34] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[35] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.

[36] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In CVPR, pages 2574–2582, 2016.

[37] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.

[38] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In SP, pages 582–597. IEEE, 2016.

[39] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. Supervised discrete hashing. In CVPR, pages 37–45, 2015.

[40] Yuming Shen, Li Liu, and Ling Shao. Unsupervised binary representation learning with deep variational networks. International Journal of Computer Vision, pages 1–15, 2019.

[41] Mengying Sun, Fengyi Tang, Jinfeng Yi, Fei Wang, and Jiayu Zhou. Identify susceptible locations in medical records via adversarial attacks on deep predictive models.
In ACM SIGKDD, pages 793–801. ACM, 2018.

[42] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[43] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. Adversarial cross-modal retrieval. In ACM MM, pages 154–162, 2017.

[44] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. Learning to hash for indexing big data—a survey. Proceedings of the IEEE, 104(1):34–57, 2015.

[45] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. Adversarial examples for semantic segmentation and object detection. In ICCV, pages 1369–1378, 2017.

[46] Xiaojun Xu, Xinyun Chen, Chang Liu, Anna Rohrbach, Trevor Darrell, and Dawn Song. Can you fool AI with adversarial examples on a visual Turing test? arXiv preprint arXiv:1709.08693, 2017.

[47] Erkun Yang, Cheng Deng, Chao Li, Wei Liu, Jie Li, and Dacheng Tao. Shared predictive cross-modal deep quantization. IEEE Transactions on Neural Networks and Learning Systems, (99):1–12, 2018.

[48] Erkun Yang, Cheng Deng, Wei Liu, Xianglong Liu, Dacheng Tao, and Xinbo Gao. Pairwise relationship guided deep hashing for cross-modal retrieval. In AAAI, pages 1618–1625, 2017.

[49] Erkun Yang, Tongliang Liu, Cheng Deng, Wei Liu, and Dacheng Tao. DistillHash: Unsupervised deep hashing by distilling data pairs.
In CVPR, pages 2946–2955, 2019.