{"title": "Domain-Invariant Projection Learning for Zero-Shot Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1019, "page_last": 1030, "abstract": "Zero-shot learning (ZSL) aims to recognize unseen object classes without any training samples, which can be regarded as a form of transfer learning from seen classes to unseen ones. This is made possible by learning a projection between a feature space and a semantic space (e.g. attribute space). Key to ZSL is thus to learn a projection function that is robust against the often large domain gap between the seen and unseen classes. In this paper, we propose a novel ZSL model termed domain-invariant projection learning (DIPL). Our model has two novel components: (1) A domain-invariant feature self-reconstruction task is introduced to the seen/unseen class data, resulting in a simple linear formulation that casts ZSL into a min-min optimization problem. Solving the problem is non-trivial, and a novel iterative algorithm is formulated as the solver, with rigorous theoretic algorithm analysis provided. (2) To further align the two domains via the learned projection, shared semantic structure among seen and unseen classes is explored via forming superclasses in the semantic space. 
Extensive experiments show that our model outperforms the state-of-the-art alternatives by significant margins.", "full_text": "Domain-Invariant Projection Learning\n\nfor Zero-Shot Recognition\n\nAn Zhao1,\u2022 Mingyu Ding1,\u2022 Jiechao Guan1,\u2022 Zhiwu Lu1,\u2217 Tao Xiang2,3 Ji-Rong Wen1\n\n1Beijing Key Laboratory of Big Data Management and Analysis Methods\nSchool of Information, Renmin University of China, Beijing 100872, China\n2School of EECS, Queen Mary University of London, London E1 4NS, U.K.\n\n3Samsung AI Centre, Cambridge, U.K.\n\nzhiwu.lu@gmail.com\n\u2022 Equal contribution\n\nt.xiang@qmul.ac.uk\n\u2217 Corresponding author\n\nAbstract\n\nZero-shot learning (ZSL) aims to recognize unseen object classes without any\ntraining samples, which can be regarded as a form of transfer learning from seen\nclasses to unseen ones. This is made possible by learning a projection between\na feature space and a semantic space (e.g. attribute space). Key to ZSL is thus\nto learn a projection function that is robust against the often large domain gap\nbetween the seen and unseen classes. In this paper, we propose a novel ZSL model\ntermed domain-invariant projection learning (DIPL). Our model has two novel\ncomponents: (1) A domain-invariant feature self-reconstruction task is introduced\nto the seen/unseen class data, resulting in a simple linear formulation that casts\nZSL into a min-min optimization problem. Solving the problem is non-trivial,\nand a novel iterative algorithm is formulated as the solver, with rigorous theoretic\nalgorithm analysis provided. (2) To further align the two domains via the learned\nprojection, shared semantic structure among seen and unseen classes is explored\nvia forming superclasses in the semantic space. 
Extensive experiments show that our model outperforms the state-of-the-art alternatives by significant margins.

1 Introduction

The recent focus in object recognition has been on large-scale recognition problems such as the ImageNet ILSVRC challenge [47]. Since the latest deep neural network (DNN) based models [49, 53, 12, 19] are reported to achieve super-human performance on the ILSVRC 1K recognition task, a question arises: are we close to solving the large-scale recognition problem? The answer clearly depends on how large the scale is: 1) There are approximately 8.7 million animal species on earth; in that context, the ILSVRC 1K recognition task is nowhere near large-scale; 2) Most existing object recognition models (particularly the DNN based ones) require hundreds of image samples to be collected from each object class, but many object classes are rare and it is impossible to collect sufficient training samples for some of the rare classes even with help from social media platforms (e.g., most beetle species have never been photographed by amateurs). Therefore, there is still a long way to go before a computer vision model can recognize all object categories.

One approach to overcoming the above challenge is zero-shot learning (ZSL) [48, 25, 46, 50, 8, 69, 10, 1, 61]. ZSL aims to recognize a new/unseen class without any training samples from the class. All existing ZSL models assume that each class name is embedded in a semantic space, such as an attribute space [22, 25] or word vector space [14, 54]. Given a set of seen class samples, the visual features are first extracted, typically using a DNN pretrained on ImageNet. With the visual feature representation of the images and the semantic representation of the class names, the next task is to learn a joint embedding space using the seen class data.
In such a space, both feature and semantic representations are projected so that they can be directly compared. Once the projection functions are learned, they are applied to the unseen test images and unseen class names, and the final recognition is conducted by a simple nearest-neighbour search for the class name closest to each test image.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

One of the biggest challenges in ZSL is the domain gap between the seen and unseen classes. As mentioned above, the projection functions learned from the seen classes with labelled data are applied to the unseen class data in ZSL. However, the unseen classes are often visually very different from the seen ones, so the domain gap between the seen and unseen class domains can be large. Consequently, the same projection function may not be able to project an unseen class image close to its corresponding class name in the joint embedding space for correct recognition. To tackle the projection domain shift [15, 23, 45] caused by the domain gap, a number of ZSL models resort to transductive learning [69, 18, 64, 26, 58, 65] in order to narrow the domain gap using the unlabelled unseen class samples. However, without any labels, the unseen class data has limited effect in overcoming the domain gap under existing transductive ZSL models.

In this paper, we propose a novel ZSL model termed domain-invariant projection learning (DIPL). Our model is based on transductive learning but differs significantly from existing models in two aspects. First, we introduce a domain-invariant task, namely visual feature self-reconstruction. Specifically, after projecting a feature vector representing the object's visual appearance into a semantic embedding space, it should be possible to project it back in the reverse direction to reconstruct the original feature vector (see the explanation in Sec. 3.2).
By imposing such forward and reverse projection learning on the seen/unseen class data, our DIPL model takes a simple linear formulation that casts ZSL into a min-min optimization problem. Solving the problem is non-trivial. A novel iterative algorithm is thus developed as the solver, followed by rigorous theoretic algorithm analysis. Note that the proposed algorithm could potentially be used for solving other vision problems that involve min-min optimization. Second, we align the two domains by exploiting shared superclasses. The idea is simple: although the seen and unseen classes are different, they sit in an object taxonomy whose root node is 'object'. Tracing towards the root, the classes in the two domains will share the same ancestors or superclasses. In this work, we take a data-driven approach without the need for a manually defined taxonomy. Concretely, the superclasses are generated automatically by k-means clustering in the semantic space, and then act as a bridge to align the two domains using our DIPL model.

Our contributions are: (1) A novel transductive ZSL model is proposed which aligns the seen and unseen class domains using domain-invariant feature self-reconstruction and shared superclasses for cross-domain alignment. (2) We formulate ZSL as a min-min optimization problem with a simple linear formulation that can be solved by a novel iterative algorithm, which could potentially be used for solving other vision problems that involve min-min optimization. (3) We provide rigorous theoretic analysis for the proposed algorithm. Extensive experiments show that the proposed model yields state-of-the-art results. The improvements over alternative ZSL models are especially significant under the more challenging pure and generalized ZSL settings.

2 Related Work

Semantic Space. Various semantic spaces have been used as representations of class names for ZSL.
The attribute space [67, 61] is the most widely used. However, for large-scale problems, annotating attributes for each class becomes very difficult. Recently, the semantic word vector space has become popular, especially for large-scale problems [14], since no manually defined ontology is required and any class name can be represented as a word vector for free. In addition, in [2], a manually defined object taxonomy was also used to form the semantic space for ZSL. In this paper, although we also leverage superclasses for ZSL, we take a data-driven approach based on k-means clustering without the need for a manually defined taxonomy.

Projection Learning. Depending on how the projection function is established, existing ZSL models can be organized into three groups: (1) The first group learns a projection function from a visual feature space to a semantic space (i.e. in a forward projection direction) by employing conventional regression/ranking models [25, 2] or deep neural network regression/ranking models [54, 14, 44, 4]. (2) The second group chooses the reverse projection direction [50, 23, 51, 66], i.e. from the semantic space to the feature space, to alleviate the hubness problem suffered by nearest neighbour search in a high-dimensional space [42]. (3) The third group learns an intermediate space as the embedding space, onto which both the feature space and the semantic space are projected [31, 68, 8]. As a combination of the first and second groups, our DIPL model integrates both forward and reverse projections for ZSL. More importantly, different from existing projection learning models, our model is also formulated for transductive learning and for ZSL with superclasses to address the domain gap problem. Note that our transductive formulation is non-trivial, and a novel iterative algorithm is formulated as the solver, with rigorous theoretic algorithm analysis provided.

Transductive ZSL.
Transductive ZSL is proposed to tackle the projection domain shift [15, 23, 45] caused by the domain gap, by learning with not only the training set of labelled seen class data but also the test set of unlabelled unseen class data. According to whether the predicted labels of the test images are iteratively used for model learning, existing transductive ZSL models fall into two categories: (1) The first category [15, 17, 26, 45, 64] first constructs a graph in the semantic space and then transfers it to the test set by label propagation. A variant is the structured prediction model [69], which employs a Gaussian parametrization of the label predictions in the unseen class domain. (2) The second category [18, 23, 27, 51, 58, 65] uses the predicted labels of the unseen class data in an iterative model update/adaptation process, as in self-training [62, 63]. Our DIPL model can be considered as a combination of these two categories of transductive ZSL models.

ZSL with Superclasses. Little attention has been paid to ZSL with superclasses. Two exceptions are: 1) [20], which learns the relation between attributes and superclasses for semantic embedding; 2) [39], which uses a taxonomy to define the semantic representation of each object class. Note that these two methods have the limitation that a manually defined taxonomy must be provided in advance. In this paper, our method is more flexible, generating the superclasses with k-means clustering.

3 Methodology

3.1 Problem Definition

Let $S = \{s_1, ..., s_p\}$ denote a set of seen classes and $U = \{u_1, ..., u_q\}$ denote a set of unseen classes, where $p$ and $q$ are the total numbers of seen and unseen classes, respectively. These two sets of classes are disjoint, i.e. $S \cap U = \emptyset$. Similarly, $Y_s = [y^{(s)}_1, ..., y^{(s)}_p] \in R^{k \times p}$ and $Y_u = [y^{(u)}_1, ..., y^{(u)}_q] \in R^{k \times q}$ denote the corresponding seen and unseen class semantic representations (e.g. $k$-dimensional attribute vectors). We are given a set of labelled training images $D_s = \{(x^{(s)}_i, l^{(s)}_i, y^{(s)}_{l^{(s)}_i}) : i = 1, ..., N_s\}$, where $x^{(s)}_i \in R^{d \times 1}$ is the $d$-dimensional visual feature vector of the $i$-th image in the training set, $l^{(s)}_i \in \{1, ..., p\}$ is the label of $x^{(s)}_i$ according to $S$, $y^{(s)}_{l^{(s)}_i}$ is the semantic representation of $x^{(s)}_i$, and $N_s$ denotes the total number of labelled images. Let $D_u = \{(x^{(u)}_i, l^{(u)}_i, y^{(u)}_{l^{(u)}_i}) : i = 1, ..., N_u\}$ denote a set of unlabelled test images, where $x^{(u)}_i \in R^{d \times 1}$ is the $d$-dimensional visual feature vector of the $i$-th image in the test set, $l^{(u)}_i \in \{1, ..., q\}$ is the unknown label of $x^{(u)}_i$ according to $U$, $y^{(u)}_{l^{(u)}_i}$ is the unknown semantic representation of $x^{(u)}_i$, and $N_u$ denotes the total number of unlabelled images. The goal of zero-shot learning is to predict the labels of the test images by learning a classifier $f : X_u \rightarrow U$, where $X_u = \{x^{(u)}_i : i = 1, ..., N_u\}$.

3.2 Model Formulation

As mentioned above, our DIPL model integrates both forward and reverse projections for ZSL, so that a feature vector representing the visual appearance of an object is projected into a semantic space and back to reconstruct itself. Such a self-reconstruction task can help narrow the domain gap (see more explanation below).
Specifically, assuming that the forward and reverse projections have the same importance for ZSL, our DIPL model solves the following optimization problem:

$$\min_{W} \Big\{ \sum_{i=1}^{N_s} \big( \|W^T x^{(s)}_i - y^{(s)}_{l^{(s)}_i}\|_2^2 + \|x^{(s)}_i - W y^{(s)}_{l^{(s)}_i}\|_2^2 \big) + \lambda \|W\|_F^2 + \gamma \sum_{i=1}^{N_u} \min_{j} \big( \|W^T x^{(u)}_i - y^{(u)}_j\|_2^2 + \|x^{(u)}_i - W y^{(u)}_j\|_2^2 \big) \Big\}, \quad (1)$$

where $W \in R^{d \times k}$ is a projection matrix from the semantic space to the feature space, and $\lambda, \gamma$ are the regularization parameters. The first term of Eq. (1) integrates the losses of the forward and reverse projections between the feature and semantic representations of the seen class samples.

Our motivation can be explained as follows: (1) Adding the losses of the forward and reverse projections imposes a self-reconstruction constraint on our regression model, similar to that used in an autoencoder [5, 24]. This is motivated by the fact that adding an autoencoder-style self-reconstruction task can improve a model's generalization ability, as demonstrated in many other problems [30, 3]. In our ZSL problem, this improved generalization ability makes the learned regression model more applicable to the unseen class domain. (2) We also apply a similar loss function to the unlabelled unseen class samples (i.e. the third term of Eq. (1)), so that for each unseen class image, its nearest unseen class is found and their distance in the embedding space is minimized.
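To make the formulation concrete, here is a minimal NumPy sketch that evaluates the objective of Eq. (1) on random data. This is a toy illustration only; the array names (`W`, `Xs`, `Ys_lab`, `Xu`, `Yu`) and the tiny dimensions are our own, not from the paper's implementation:

```python
import numpy as np

def dipl_objective(W, Xs, Ys_lab, Xu, Yu, gam=1.0, lam=0.01):
    """Evaluate the DIPL objective of Eq. (1).

    W:      d x k projection matrix (semantic -> feature space)
    Xs:     d x Ns seen-class features; Ys_lab: k x Ns matching prototypes
    Xu:     d x Nu unseen-class features; Yu: k x q unseen-class prototypes
    """
    # First term: forward + reverse projection losses on the seen data.
    seen = np.sum((W.T @ Xs - Ys_lab) ** 2) + np.sum((Xs - W @ Ys_lab) ** 2)
    # Third term: each unseen sample pays the loss of its best prototype.
    fwd = np.sum(((W.T @ Xu)[:, :, None] - Yu[:, None, :]) ** 2, axis=0)
    rev = np.sum((Xu[:, :, None] - (W @ Yu)[:, None, :]) ** 2, axis=0)
    f = fwd + rev                        # Nu x q matrix of per-pair losses
    return seen + lam * np.sum(W ** 2) + gam * f.min(axis=1).sum()

rng = np.random.default_rng(0)
d, k, Ns, Nu, q = 6, 4, 10, 8, 3
W = rng.standard_normal((d, k))
obj = dipl_objective(W, rng.standard_normal((d, Ns)), rng.standard_normal((k, Ns)),
                     rng.standard_normal((d, Nu)), rng.standard_normal((k, q)))
```

The inner `min` over prototypes is what makes Eq. (1) a min-min problem: the unseen labels are optimized jointly with $W$.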
This induces a transductive learning formulation in our model that enables the exploitation of the unlabelled unseen class data for narrowing the domain gap. In summary, the combination of the auxiliary self-reconstruction task and the transductive learning formulation distinguishes our model from existing ones and explains its superior performance. In particular, the generalized ZSL results in Table 3(b) show that our model produces the smallest gap between the seen and unseen class accuracies, whilst existing ZSL models heavily favor one over the other. More importantly, although our model takes only a simple linear formulation, it is clearly shown to outperform existing nonlinear autoencoder-based ZSL models (including transductive ones) [59, 38] (see Table 2).

3.3 Optimization

Since the third term of the objective function in Eq. (1) is a sum of minimums, solving the optimization problem in Eq. (1) is non-trivial. In the following, we formulate our solver as a novel iterative gradient-based algorithm. Note that conventional alternating optimization algorithms (like k-means) have been employed for solving this type of min-min optimization problem in many existing transductive ZSL models [51, 58, 65]. However, our optimization algorithm is clearly shown to yield better results than these conventional optimization algorithms (see Table 2). This is also where our main contribution lies.

Given the projection matrix $W^{(t)}$ at iteration $t$ during model learning, we define the loss function $f^{(t)}_i = [f^{(t)}_{i1}, ..., f^{(t)}_{iq}]^T$ for the test image $x^{(u)}_i$ ($i = 1, ..., N_u$), where $f^{(t)}_{ij} = \|W^{(t)T} x^{(u)}_i - y^{(u)}_j\|_2^2 + \|x^{(u)}_i - W^{(t)} y^{(u)}_j\|_2^2$ ($j = 1, ..., q$).
For the minimum function $\min f^{(t)}_i$, we define its gradient $\eta^{(t)}_i = [\eta^{(t)}_{i1}, ..., \eta^{(t)}_{iq}]^T$ with respect to $f^{(t)}_i$ as follows:

$$\eta^{(t)}_{ij} = \begin{cases} 1/n^{(t)}_i, & \text{if } f^{(t)}_{ij} = \min f^{(t)}_i \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where $n^{(t)}_i$ is the number of $f^{(t)}_{ij}$ ($j = 1, ..., q$) being equal to $\min f^{(t)}_i$. Taking the Taylor expansion, we have the following approximation:

$$\min_j \big( \|W^{(t+1)T} x^{(u)}_i - y^{(u)}_j\|_2^2 + \|x^{(u)}_i - W^{(t+1)} y^{(u)}_j\|_2^2 \big) = \min f^{(t+1)}_i \approx \min f^{(t)}_i + \eta^{(t)T}_i (f^{(t+1)}_i - f^{(t)}_i) = \eta^{(t)T}_i f^{(t+1)}_i. \quad (3)$$

The objective function in Eq. (1) at iteration $t+1$ can thus be estimated as:

$$F(W^{(t+1)}) = \sum_{i=1}^{N_s} \big( \|W^{(t+1)T} x^{(s)}_i - y^{(s)}_{l^{(s)}_i}\|_2^2 + \|x^{(s)}_i - W^{(t+1)} y^{(s)}_{l^{(s)}_i}\|_2^2 \big) + \gamma \sum_{i=1}^{N_u} \eta^{(t)T}_i f^{(t+1)}_i + \lambda \|W^{(t+1)}\|_F^2. \quad (4)$$

Setting $\partial F(W^{(t+1)}) / \partial W^{(t+1)} = 0$, we obtain a linear equation as follows:

$$A^{(t)} W^{(t+1)} + W^{(t+1)} B^{(t)} = C^{(t)}, \quad (5)$$

where $A^{(t)} = \sum_{i=1}^{N_s} x^{(s)}_i x^{(s)T}_i + \gamma \sum_{i=1}^{N_u} x^{(u)}_i x^{(u)T}_i + \lambda I$, $B^{(t)} = \sum_{i=1}^{N_s} y^{(s)}_{l^{(s)}_i} y^{(s)T}_{l^{(s)}_i} + \gamma \sum_{i=1}^{N_u} \sum_{j=1}^{q} \eta^{(t)}_{ij} y^{(u)}_j y^{(u)T}_j$, and $C^{(t)} = 2 \sum_{i=1}^{N_s} x^{(s)}_i y^{(s)T}_{l^{(s)}_i} + 2\gamma \sum_{i=1}^{N_u} \sum_{j=1}^{q} \eta^{(t)}_{ij} x^{(u)}_i y^{(u)T}_j$. Let $\alpha_t = \gamma/(1+\gamma) \in (0, 1)$ and $\beta = \lambda/(1+\gamma)$. In this paper, we empirically set $\alpha_t = 0.99^t \alpha$ ($\alpha_0 = \alpha \in (0, 1)$) and $\beta = 0.01$ in all experiments. We thus have:

$$\hat{A}^{(t)} = (1-\alpha_t) \sum_{i=1}^{N_s} x^{(s)}_i x^{(s)T}_i + \alpha_t \sum_{i=1}^{N_u} x^{(u)}_i x^{(u)T}_i + \beta I, \quad (6)$$

$$\hat{B}^{(t)} = (1-\alpha_t) \sum_{i=1}^{N_s} y^{(s)}_{l^{(s)}_i} y^{(s)T}_{l^{(s)}_i} + \alpha_t \sum_{i=1}^{N_u} \sum_{j=1}^{q} \eta^{(t)}_{ij} y^{(u)}_j y^{(u)T}_j, \quad (7)$$

$$\hat{C}^{(t)} = 2(1-\alpha_t) \sum_{i=1}^{N_s} x^{(s)}_i y^{(s)T}_{l^{(s)}_i} + 2\alpha_t \sum_{i=1}^{N_u} \sum_{j=1}^{q} \eta^{(t)}_{ij} x^{(u)}_i y^{(u)T}_j. \quad (8)$$

The linear equation in Eq. (5) is then reformulated as follows:

$$\hat{A}^{(t)} W^{(t+1)} + W^{(t+1)} \hat{B}^{(t)} = \hat{C}^{(t)}, \quad (9)$$

which is a Sylvester equation and can be solved efficiently by the Bartels-Stewart algorithm [6].

Algorithm 1 Domain-Invariant Projection Learning
Input: training and test sets $D_s$, $X_u$; semantic prototypes $Y_s$, $Y_u$; parameter $\alpha$
Output: $W^*$
1. Initialize $W^{(0)}$ with our DIPL model ($\alpha = 0$) at $t = 0$;
repeat
  2. Set $\alpha_t = 0.99^t \alpha$;
  3. With the learned projection matrix $W^{(t)}$, compute the gradient $\eta^{(t)}_{ij}$ with Eq. (2);
  4. Compute $\hat{A}^{(t)}$, $\hat{B}^{(t)}$, and $\hat{C}^{(t)}$ with Eqs. (6)-(8), and update $W^{(t+1)}$ by solving Eq. (9);
  5. Set $t = t + 1$;
until a stopping criterion is met
6. $W^* = W^{(t)}$.

The DIPL algorithm is given in Algorithm 1, with rigorous theoretic algorithm analysis in the suppl. material. Note that any ZSL model can be used to obtain the initial projection matrix $W^{(0)}$; in this paper, we choose our DIPL model with $\alpha = 0$ for this initialization. Once learned, given the optimal projection matrix $W^*$ found by our DIPL algorithm, we predict the label of a test image $x^{(u)}_i$ as:

$$l^{(u)}_i = \arg\min_j \big( \|W^{*T} x^{(u)}_i - y^{(u)}_j\|_2^2 + \|x^{(u)}_i - W^* y^{(u)}_j\|_2^2 \big).$$

We provide the time complexity analysis for Algorithm 1 as follows.
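As a concrete reference for the quantities $\eta^{(t)}$, $\hat{A}^{(t)}$, $\hat{B}^{(t)}$, and $\hat{C}^{(t)}$ whose costs are analyzed next, one iteration (steps 3-4 of Algorithm 1) might be sketched as below. This is only an illustration under our own naming assumptions; for compactness it solves the Sylvester equation of Eq. (9) through a dense Kronecker system instead of the Bartels-Stewart algorithm used in the paper, which is reasonable only for small d and k:

```python
import numpy as np

def losses(W, Xu, Yu):
    # f[i, j] = ||W^T x_i - y_j||^2 + ||x_i - W y_j||^2 for each test image i.
    fwd = np.sum(((W.T @ Xu)[:, :, None] - Yu[:, None, :]) ** 2, axis=0)
    rev = np.sum((Xu[:, :, None] - (W @ Yu)[:, None, :]) ** 2, axis=0)
    return fwd + rev                              # Nu x q

def eta_matrix(f):
    # Eq. (2): uniform weight 1/n_i over the minimizing prototypes of sample i.
    is_min = np.isclose(f, f.min(axis=1, keepdims=True))
    return is_min / is_min.sum(axis=1, keepdims=True)

def dipl_update(W, Xs, Ys_lab, Xu, Yu, alpha_t, beta=0.01):
    # Steps 3-4 of Algorithm 1: eta from Eq. (2), then Eqs. (6)-(9).
    eta = eta_matrix(losses(W, Xu, Yu))
    d, k = W.shape
    A = (1 - alpha_t) * Xs @ Xs.T + alpha_t * Xu @ Xu.T + beta * np.eye(d)
    B = (1 - alpha_t) * Ys_lab @ Ys_lab.T \
        + alpha_t * (Yu * eta.sum(axis=0)) @ Yu.T
    C = 2 * (1 - alpha_t) * Xs @ Ys_lab.T + 2 * alpha_t * Xu @ eta @ Yu.T
    # Solve the Sylvester equation A W + W B = C via its Kronecker form
    # (I (x) A + B^T (x) I) vec(W) = vec(C), with column-major vec.
    M = np.kron(np.eye(k), A) + np.kron(B.T, np.eye(d))
    return np.linalg.solve(M, C.T.ravel()).reshape(k, d).T
```

For the dimensions used in the paper, a dedicated Sylvester solver (e.g. Bartels-Stewart) keeps the per-iteration cost at $O(d^3 + k^3)$, whereas the Kronecker system above costs $O(d^3 k^3)$ and is shown only for clarity.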
The computation of $[\eta^{(t)}_{ij}]_{N_u \times q}$, $\hat{A}^{(t)}$, $\hat{B}^{(t)}$, and $\hat{C}^{(t)}$ has a time complexity of $O(qN_u)$, $O(d^2(N_s + N_u))$, $O(k^2 N_s + k^2 N_u)$, and $O(dkN_s + dkN_u)$, respectively. Here, the sparsity of $[\eta^{(t)}_{ij}]$ is used to reduce the cost of computing $\hat{B}^{(t)}$ and $\hat{C}^{(t)}$. Moreover, given $\hat{A}^{(t)} \in R^{d \times d}$ and $\hat{B}^{(t)} \in R^{k \times k}$, the time complexity of solving Eq. (9) is $O(d^3 + k^3)$. To sum up, one iteration has a linear time complexity of $O(qN_u + (d^2 + dk + k^2)(N_s + N_u))$ ($d, k, q \ll N_s + N_u$) with respect to the data size $N_s + N_u$. Since Algorithm 1 is shown to converge very quickly ($t \le 5$), it is efficient even for large-scale ZSL problems.

3.4 ZSL with Superclasses

We finally apply our DIPL algorithm to ZSL with superclasses. This is motivated by the fact that there exist unseen/seen classes that fall into the same superclass, i.e., the unseen class samples become 'seen' at the superclass level and are thus easier to recognize. Specifically, our DIPL model is employed for ZSL with superclasses as follows: 1) Group all unseen and seen class prototypes $[Y_s, Y_u]$ into $r$ clusters by k-means clustering and represent the superclass prototypes with the cluster centers $Z = [z_1, ..., z_r]$; 2) Run our DIPL algorithm over the superclasses by replacing the original semantic prototypes $[Y_s, Y_u]$ with the superclass prototypes $Z$; 3) Predict the top 5 superclass labels of each unlabelled unseen sample $x^{(u)}_i$ and then generate the set of the most probable unseen class labels $N(x^{(u)}_i)$ for $x^{(u)}_i$ according to the k-means clustering results; 4) Run our DIPL algorithm over the original semantic prototypes $[Y_s, Y_u]$, computing $\eta^{(t)}_{ij}$ with the constraint $j \in N(x^{(u)}_i)$.

Table 1: Five benchmark datasets used for performance evaluation.
Notations: 'SS' – semantic space, 'SS-D' – the dimension of the semantic space, 'A' – attribute, and 'W' – word vector. The two splits of the SUN dataset are separated by '|'.

Dataset   # images   SS   SS-D    # seen/unseen
AwA       30,475     A    85      40/10
CUB       11,788     A    312     150/50
aPY       15,339     A    64      20/12
SUN       14,340     A    102     707/10|645/72
ImNet     218,000    W    1,000   1,000/360

4 Experiments

4.1 Datasets and Settings

Datasets. Five widely used benchmark datasets are selected in this paper. Four of them are of medium size: Animals with Attributes (AwA) [25], CUB-200-2011 Birds (CUB) [56], aPascal&Yahoo (aPY) [13], and SUN Attribute (SUN) [41]. One large-scale dataset is ILSVRC2012/2010 [47] (ImNet), where the 1,000 classes of ILSVRC2012 are used as seen classes and 360 classes of ILSVRC2010 (not included in ILSVRC2012) are used as unseen classes, as in [16]. The details of these benchmark datasets are given in Table 1.

Semantic Spaces. Two types of semantic spaces are considered for ZSL: attributes are employed to form the semantic space for the four medium-scale datasets, while word vectors are used as the semantic representation for the large-scale ImNet dataset. In this paper, we train a skip-gram text model on a corpus of 4.6M Wikipedia documents to obtain the word2vec [37] word vectors.

Visual Spaces. All recent ZSL models use visual features extracted by CNN models [53, 55, 19] pre-trained on the 1K classes of ILSVRC 2012 [47]. In this paper, we extract the visual features with pre-trained GoogLeNet [55]. Note that the same visual features (GoogLeNet) are used for most compared methods throughout this paper. The only exception is Table 2, where although most results were obtained with GoogLeNet features, a number of more recent ZSL models used VGG19 [53] and ResNet101 [19] features.
Without the source code of these models, we cannot report their results with the same GoogLeNet features. However, as demonstrated in [28], the VGG19 and ResNet101 features typically lead to better performance on the ZSL task than the GoogLeNet features. Since our model does not use stronger features, the comparisons in Table 2 are still fair.

ZSL Settings. (1) Standard ZSL: This setting is widely used in previous works [2, 44]. The seen/unseen class splits of the five datasets are presented in Table 1. (2) Pure ZSL: A new 'pure' ZSL setting [61, 29] was recently proposed to overcome a weakness of the standard setting. More concretely, most recent ZSL models extract visual features using CNN models pretrained on the ImageNet ILSVRC2012 1K classes, but the unseen classes in the standard splits overlap with the 1K ImageNet classes, so the zero-shot rule is violated. Under the pure setting, the overlapping ImageNet classes are removed from the test set of unseen classes to form new benchmark ZSL dataset splits. (3) Generalized ZSL: The third ZSL setting, which has emerged recently [43, 7], is the generalized setting, under which the test set contains data samples from both seen and unseen classes. This setting is clearly more reflective of real-world application scenarios.

Evaluation Metrics. (1) Standard and Pure ZSL: For the four medium-scale datasets, we compute the multi-way classification accuracy as in previous works. For the large-scale ImNet dataset, the flat hit@5 accuracy is computed over all test samples as in [16]. (2) Generalized ZSL: Three metrics are defined: 1) accs – the accuracy of classifying the data samples from the seen classes to all the classes (both seen and unseen); 2) accu – the accuracy of classifying the data samples from the unseen classes to all the classes; 3) HM – the harmonic mean of accs and accu.

Parameter Settings.
Our full DIPL model (including superclasses) has only two free parameters to tune: $\alpha \in (0, 1)$ (see Step 2 in Algorithm 1) and $r$ (the number of superclasses used in Sec. 3.4). As in [50, 24], the parameters are selected by class-wise cross-validation on the training set.

Compared Methods. A wide range of existing ZSL models are selected for performance comparison. Under each ZSL setting, we focus on the recent and representative ZSL models that have achieved the state-of-the-art results.

Table 2: Comparative accuracies (%) under the standard ZSL setting. For SUN, the results are obtained for the 707/10 and 645/72 splits, separated by '|'. For ImNet, the hit@5 accuracy is used for evaluation. Visual features: G – GoogLeNet [55]; V – VGG19 [53]; R – ResNet101 [19].

Model            Features  Trans.?  AwA   CUB   aPY   SUN          ImNet
RPL [50]         G         N        80.4  52.4  48.8  84.5| –      –
SSE [67]         V         N        76.3  30.4  46.2  82.5| –      –
SJE [2]          G         N        73.9  51.7  –     – |56.1      –
JLSE [68]        V         N        80.5  42.1  50.4  83.8| –      –
SynC [8]         G         N        72.9  54.7  –     – |62.7      –
SAE [24]         G         N        84.7  61.4  55.4  91.5|65.2    27.2
LAD [21]         V         N        82.5  56.6  53.7  85.0| –      –
EXEM [9]         G         N        77.2  59.8  –     – |69.6      24.7
SCoRe [39]       V         N        82.8  59.5  –     – | –        –
LESD [11]        V/G       N        82.8  56.2  58.8  88.3| –      –
CVA [38]         V/R       N        85.8  54.3  –     88.5| –      –
f-CLSWGAN [60]   R         N        69.9  61.5  –     – |62.1      –
SS-Voc [16]      V         Y        78.3  –     –     – | –        16.8
SP-ZSR [69]      V         Y        92.1  55.3  69.7  89.5| –      –
SSZSL [51]       V         Y        88.6  58.8  49.9  86.2| –      –
DSRL [64]        V         Y        87.2  57.1  56.3  85.4| –      –
TSTD [65]        V         Y        90.3  58.2  –     – | –        –
BiDiLEL [58]     V/G       Y        95.0  62.8  –     – | –        –
DMaP [28]        V+G+R     Y        90.5  67.7  –     – | –        –
VZSL [59]        V         Y        94.8  66.5  –     87.8| –      23.1
Full DIPL (our)  G         Y        96.1  68.2  87.8  93.5|70.0    31.7

Table 3: (a) Comparative accuracies (%) under the pure ZSL setting (as in [61, 29]). (b) Comparative results (%) of generalized ZSL (as in [10]). For the SUN dataset, only the 645/72 split is used.

(a)
Model            AwA   CUB   aPY   SUN
DeViSE [14]      54.2  52.0  39.8  56.5
ConSE [40]       45.6  34.3  26.9  38.8
SSE [67]         60.1  43.9  34.0  51.5
SJE [2]          65.6  53.9  32.9  53.7
ALE [1]          59.9  54.9  39.7  58.1
SynC [8]         54.0  55.6  23.9  56.3
CLN+KRR [29]     68.2  58.1  44.8  60.0
Full DIPL (our)  85.6  65.4  69.6  67.9

(b)
                  AwA               CUB
Model             accs  accu  HM    accs  accu  HM
DAP [25]          77.9  2.4   4.7   55.1  4.0   7.5
IAP [25]          76.8  1.7   3.3   69.4  1.0   2.0
ConSE [40]        75.9  9.5   16.9  69.9  1.8   3.5
APD [43]          43.2  61.7  50.8  23.4  39.9  29.5
GAN [7]           81.3  32.3  46.2  72.0  26.9  39.2
SAE [24]          67.6  43.3  52.8  36.1  28.0  31.5
Full DIPL (our)   83.7  68.9  75.6  44.8  41.7  43.2

4.2 Comparative Results

Standard ZSL. The comparative results under the standard ZSL setting are shown in Table 2. For a comprehensive comparison, both transductive and non-transductive state-of-the-art ZSL models are included. It can be seen that: (1) Our model performs the best on all five datasets, validating that the combination of domain-invariant feature self-reconstruction and shared superclasses for cross-domain alignment is indeed effective for learning a domain-invariant projection. (2) For the four medium-scale datasets, the improvements obtained by our model over the strongest competitor range from 0.4% to 18.1%. This actually creates new baselines in the area of ZSL, given that most of the compared models take far more complicated nonlinear formulations and some of them even combine two or more feature/semantic spaces.
(3) For the large-scale ImNet dataset, our model achieves a 4.5% improvement over the state-of-the-art SAE [24], showing its scalability to large-scale problems.

Pure ZSL. Taking the same 'pure' ZSL setting as in [61, 29], we remove the overlapping ImageNet ILSVRC2012 1K classes from the test set of unseen classes for the four medium-scale datasets. The comparative results in Table 3(a) show that, as expected, under this stricter ZSL setting, all ZSL models suffer from performance degradation. However, the performance of our model drops the least among all ZSL models, and the improvement over the strongest competitor becomes more significant on each of the four datasets. This provides further evidence that our model tends to learn a domain-invariant projection even under this stricter ZSL setting.

Figure 1: (a) Ablation study results on the four medium-scale datasets under the pure ZSL setting. (b) Convergence analysis of our DIPL algorithm on the CUB dataset under the pure ZSL setting.

Generalized ZSL. We follow the same generalized ZSL setting as [10]. Specifically, we hold out 20% of the data samples from the seen classes and mix them with the data samples from the unseen classes. The comparative results on AwA and CUB are presented in Table 3(b), where our model is compared with six other ZSL alternatives. We have the following observations: (1) Different ZSL models have different trade-offs between the seen and unseen class accuracies, and the overall performance is thus best measured by HM. (2) Our model clearly performs the best on the two datasets, and its advantage over the other competitors is even more significant in this more challenging setting. (3) Our model produces the smallest gap between the seen and unseen class accuracies, whilst existing ZSL models heavily favor one over the other.
This means that our model has the strongest generalization ability under this more realistic ZSL setting.

4.3 Further Evaluations

Ablation Study. Our full DIPL model can be simplified as follows: (1) When the superclasses are not used for ZSL, our full DIPL model degrades to the original DIPL model proposed in Sec. 3.3, denoted as DIPL1. (2) For α = 0, the DIPL1 model further degrades to an inductive ZSL model (including both forward and reverse projections), denoted as DIPL0. (3) When the forward projection is not considered for ZSL, the DIPL0 model finally degrades to the original reverse projection learning model [50], denoted as RPL. To evaluate the contributions of the main components of our full DIPL model, we compare it with the simplified versions RPL, DIPL0, and DIPL1 under the same pure ZSL setting. The ablation study results in Figure 1(a) show that: (1) The transductive learning induced by our DIPL1 model yields significant improvements (see DIPL1 vs. DIPL0), ranging from 5% to 30%. (2) Our enhanced ZSL method using superclasses achieves about 1–2% further gains (see Full DIPL vs. DIPL1), validating its effectiveness. This is still impressive given that the DIPL1 model has already achieved state-of-the-art results. More results of ZSL with superclasses are provided in Table 4. (3) The combination of both forward and reverse projections is also important for ZSL (see DIPL0 vs. RPL), resulting in 2–4% improvements.

Convergence Analysis. To further analyze the convergence of Algorithm 1, we define three baseline projection matrices based on the DIPL0 model: (1) W_all, learned by DIPL0 using the whole dataset (with all samples labelled); (2) W_tr, learned by DIPL0 using only the training set; (3) W_te, learned by DIPL0 using only the test set (but labelled). Let W_our be the matrix learned by our DIPL model using the test set (unlabelled) together with the training set.
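The curves in Figure 1(b) plot the relative matrix distance ||W − W_all||²_F / ||W_all||²_F over iterations. A minimal NumPy sketch of this measure follows; the matrices below are random stand-ins, since the real W_all, W_tr, W_te, and W_our come from training the models on the corresponding data splits:

```python
import numpy as np

def relative_dw(W, W_all):
    """Relative distance ||W - W_all||_F^2 / ||W_all||_F^2, the quantity
    plotted against iterations in Figure 1(b)."""
    return np.linalg.norm(W - W_all, 'fro') ** 2 / np.linalg.norm(W_all, 'fro') ** 2

# Example with stand-in matrices (shapes are arbitrary here):
rng = np.random.default_rng(0)
W_all = rng.normal(size=(100, 85))
W_our = W_all + 0.1 * rng.normal(size=W_all.shape)  # a nearby projection
print(relative_dw(W_our, W_all))  # small value => close to the upper bound
```

A smaller value means the learned projection is closer to the best possible one, W_all.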
We can directly compare W_our, W_tr, and W_te to W_all by computing the matrix distances among these matrices. Note that W_all is considered to be the best possible projection matrix (upper bound). The results in Figure 1(b) show that: (1) Our DIPL algorithm converges very quickly (≤5 iterations). (2) W_our gets closer to W_all with more iterations and is the closest to W_all at convergence, i.e., our model can narrow the domain gap by not overfitting to the training domain.

Qualitative Results. We present the qualitative results of superclass generation on the ImNet dataset in Table 4. The superclasses are generated by k-means clustering (with r = 500 clusters) on all seen/unseen class prototypes.

Table 4: Examples of the superclasses generated by k-means clustering on the ImNet dataset.

Superclasses | Seen/unseen classes within a superclass
ID: 1        | seen: unicycle; unseen: hard hat
ID: 2        | seen: freight car; unseen: ferris wheel
ID: 3        | seen: hair slide; unseen: nail polish
ID: 4        | seen: ox; seen: bison
ID: 5        | unseen: coffee bean; unseen: Arabian coffee; unseen: cacao

We have the following observations: (1) There indeed exist superclasses that consist of semantically related seen and unseen classes, which means that the unseen class samples can become 'seen' in ZSL with superclasses and thus become easier to recognize. (2) When only unseen (or only seen) classes are included in a superclass, they are still semantically related, which can serve as context to improve label prediction.

4.4 Discussions

We have reformulated our model with soft assignment (e.g. using a softmax loss), but the results are clearly worse.
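To make the min-min (hard-assignment) structure concrete, here is a toy sketch of such an alternation: fix the label assignment and solve a ridge-style linear system for the projection, then reassign each unlabelled sample to its best-reconstructing unseen prototype. All names and the ridge-regression updates below are illustrative simplifications, not the exact DIPL objective:

```python
import numpy as np

def minmin_sketch(X_tr, S_tr, X_te, P_un, lam=1.0, iters=5):
    # Toy min-min alternation (illustrative only, NOT the exact DIPL updates).
    # X_tr: (n_tr, d) seen features;   S_tr: (n_tr, k) their semantic vectors
    # X_te: (n_te, d) unseen features; P_un: (c_u, k) unseen-class prototypes
    k = S_tr.shape[1]

    def ridge(S, X):
        # Closed-form linear solve: W = (S^T S + lam*I)^(-1) S^T X
        return np.linalg.solve(S.T @ S + lam * np.eye(k), S.T @ X)

    W = ridge(S_tr, X_tr)                    # initialise from seen data only
    y = None
    for _ in range(iters):
        # min over labels: assign each test sample to the prototype whose
        # reconstruction P_un[c] @ W is closest in feature space
        recon = P_un @ W                     # (c_u, d)
        err = ((X_te[:, None, :] - recon[None, :, :]) ** 2).sum(-1)
        y = err.argmin(axis=1)
        # min over W: re-solve the linear system on seen + pseudo-labelled data
        W = ridge(np.vstack([S_tr, P_un[y]]), np.vstack([X_tr, X_te]))
    return W, y
```

Each half-step has a closed-form solution, which is the property the min-min formulation exploits.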
One of the possible reasons lies in the solver: with the min-min formulation, the problem can be solved explicitly via a linear equation at each iteration (see Eq. (9)). In contrast, the nonlinear min-softmax problem is harder to solve, and the standard gradient-based solver lacks the nice convergence property of the min-min formulation.
Among the five datasets used in our experiments, our DIPL model achieves only small improvements on the CUB dataset. Unlike the other four datasets, CUB is much more fine-grained: all classes are sub-species of birds. As a result, the unseen classes of CUB are very similar to each other, and it becomes hard for our DIPL model to find the best unseen class label for an unlabelled unseen class sample during training. This shortcoming can potentially be overcome by generalized competitive learning [33, 52, 35, 32]: examining the second most likely unseen class label and forcing the projection to distinguish the best and second-best ones, essentially pushing the unseen classes further apart after projection.
Note that our DIPL model is essentially a bidirectional one, with an autoencoder-style self-reconstruction task involved. Although only evaluated on ZSL, our bidirectional model can be applied to other problem settings where a mapping between a feature space and a semantic space is required. For example, in our ongoing research, our model has been generalized to social image classification [36] (where social tags form the semantic space) and cross-modal image retrieval [34] (where texts form the semantic space). In both, we find that a bidirectional model is clearly better than a one-directional one.
We also note that bidirectional models have recently found success in problems involving an identity space, e.g., face recognition and person re-identification [57].

5 Conclusion

In this paper, we have proposed a domain-invariant projection learning (DIPL) model for zero-shot recognition. A novel iterative algorithm has been developed for model optimization, followed by rigorous theoretical analysis. Our model has also been extended to ZSL with superclasses. Extensive experiments on five benchmark datasets show that our DIPL model yields state-of-the-art results under all three ZSL settings. It is worth pointing out that the proposed optimization algorithm is by no means restricted to the ZSL problem: many other vision problems need to deal with a min-min problem, and thus our gradient-based formulation can be applied similarly. Our current efforts thus include its generalization to a wider range of vision problems (e.g. social image classification, cross-modal image retrieval, and person re-identification).

Acknowledgements

This work was partially supported by National Natural Science Foundation of China (61573363), the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (15XNLQ01), and European Research Council FP7 Project SUNNY (313243).

References

[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. TPAMI, 38(7):1425–1438, 2016.

[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, pages 2927–2936, 2015.

[3] S. C. AP, S. Lauly, H. Larochelle, M. Khapra, B. Ravindran, V. C. Raykar, and A. Saha. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems, pages 1853–1861, 2014.

[4] L. J. Ba, K. Swersky, S.
Fidler, and R. Salakhutdinov. Predicting deep zero-shot convolutional neural\n\nnetworks using textual descriptions. In ICCV, pages 4247\u20134255, 2015.\n\n[5] P. Baldi. Autoencoders, unsupervised learning, and deep architectures. In ICML Workshop on Unsupervised\n\nand Transfer Learning, pages 37\u201349, 2012.\n\n[6] R. Bartels and G. Stewart. Solution of the matrix equation AX + XB = C. Communications of the ACM,\n\n15(9):820\u2013826, 1972.\n\n[7] M. Bucher, S. Herbin, and F. Jurie. Generating visual representations for zero-shot classi\ufb01cation. In ICCV\nWorkshops: Transferring and Adapting Source Knowledge in Computer Vision, pages 2666\u20132673, 2017.\n\n[8] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classi\ufb01ers for zero-shot learning. In CVPR,\n\npages 5327\u20135336, 2016.\n\n[9] S. Changpinyo, W.-L. Chao, and F. Sha. Predicting visual exemplars of unseen classes for zero-shot\n\nlearning. In ICCV, pages 3476\u20133485, 2017.\n\n[10] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot\n\nlearning for object recognition in the wild. In ECCV, pages 52\u201368, 2016.\n\n[11] Z. Ding, M. Shao, and Y. Fu. Low-rank embedded ensemble semantic dictionary for zero-shot learning. In\n\nCVPR, pages 2050\u20132058, 2017.\n\n[12] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep\n\nconvolutional activation feature for generic visual recognition. In ICML, pages 647\u2013655, 2014.\n\n[13] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, pages\n\n1778\u20131785, 2009.\n\n[14] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. DeViSE: A\ndeep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages\n2121\u20132129, 2013.\n\n[15] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. 
TPAMI, 37(11):2332–2345, 2015.

[16] Y. Fu and L. Sigal. Semi-supervised vocabulary-informed learning. In CVPR, pages 5337–5346, 2016.

[17] Z. Fu, T. Xiang, E. Kodirov, and S. Gong. Zero-shot object recognition by semantic manifold distance. In CVPR, pages 2635–2644, 2015.

[18] Y. Guo, G. Ding, X. Jin, and J. Wang. Transductive zero-shot recognition via shared model space learning. In AAAI, pages 3494–3500, 2016.

[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[20] S. J. Hwang and L. Sigal. A unified semantic embedding: Relating taxonomies and attributes. In Advances in Neural Information Processing Systems, pages 271–279, 2014.

[21] H. Jiang, R. Wang, S. Shan, Y. Yang, and X. Chen. Learning discriminative latent attributes for zero-shot classification. In ICCV, pages 4223–4232, 2017.

[22] P. Kankuekul, A. Kawewong, S. Tangruamsub, and O. Hasegawa. Online incremental attribute-based zero-shot learning. In CVPR, pages 3657–3664, 2012.

[23] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In ICCV, pages 2452–2460, 2015.

[24] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In CVPR, pages 3174–3183, 2017.

[25] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3):453–465, 2014.

[26] A. Li, Z. Lu, L. Wang, T. Xiang, and J.-R. Wen. Zero-shot scene classification for high spatial resolution remote sensing images. IEEE Trans. Geoscience and Remote Sensing, 55(7):4157–4167, 2017.

[27] X. Li, Y. Guo, and D. Schuurmans. Semi-supervised zero-shot classification with label representation learning. In ICCV, pages 4211–4219, 2015.

[28] Y. Li, D. Wang, H. Hu, Y. Lin, and Y. Zhuang.
Zero-shot recognition using dual visual-semantic mapping paths. In CVPR, pages 3279–3287, 2017.

[29] T. Long, X. Xu, F. Shen, L. Liu, N. Xie, and Y. Yang. Zero-shot learning via discriminative representation extraction. Pattern Recognition Letters, 109:27–34, 2018.

[30] X. Lu, Y. Tsao, S. Matsuda, and C. Hori. Speech enhancement based on deep denoising autoencoder. In Interspeech, pages 436–440, 2013.

[31] Y. Lu. Unsupervised learning on neural network outputs: with application in zero-shot learning. arXiv preprint arXiv:1506.00990, 2015.

[32] Z. Lu. An iterative algorithm for entropy regularized likelihood learning on Gaussian mixture with automatic model selection. Neurocomputing, 69(13-15):1674–1677, 2006.

[33] Z. Lu and H. H. Ip. Generalized competitive learning of Gaussian mixture models. IEEE Trans. Systems, Man, and Cybernetics, Part B, 39(4):901–909, 2009.

[34] Z. Lu and Y. Peng. Unified constraint propagation on multi-view data. In AAAI, pages 640–646, 2013.

[35] Z. Lu, Y. Peng, and H. H. Ip. Image categorization via robust pLSA. Pattern Recognition Letters, 31(1):36–43, 2010.

[36] Z. Lu, L. Wang, and J.-R. Wen. Direct semantic analysis for social image classification. In AAAI, pages 1258–1264, 2014.

[37] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[38] A. Mishra, M. Reddy, A. Mittal, and H. A. Murthy. A generative model for zero shot learning using conditional variational autoencoders. arXiv preprint arXiv:1709.00663, 2017.

[39] P. Morgado and N. Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, pages 6060–6069, 2017.

[40] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean.
Zero-shot learning by convex combination of semantic embeddings. In International Conference on Learning Representations (ICLR), 2014.

[41] G. Patterson, C. Xu, H. Su, and J. Hays. The SUN attribute database: Beyond categories for deeper scene understanding. IJCV, 108(1):59–81, 2014.

[42] M. Radovanović, A. Nanopoulos, and M. Ivanović. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11(9):2487–2531, 2010.

[43] S. Rahman, S. H. Khan, and F. Porikli. A unified approach for conventional zero-shot, generalized zero-shot and few-shot learning. arXiv preprint arXiv:1706.08653, 2017.

[44] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, pages 49–58, 2016.

[45] M. Rohrbach, S. Ebert, and B. Schiele. Transfer learning in a transductive setting. In Advances in Neural Information Processing Systems, pages 46–54, 2013.

[46] B. Romera-Paredes and P. H. S. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152–2161, 2015.

[47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.

[48] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult. Toward open set recognition. TPAMI, 35(7):1757–1772, 2013.

[49] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.

[50] Y. Shigeto, I. Suzuki, K. Hara, M. Shimbo, and Y. Matsumoto. Ridge regression, hubness, and zero-shot learning.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases,\npages 135\u2013151, 2015.\n\n[51] S. M. Shojaee and M. S. Baghshah. Semi-supervised zero-shot learning by a clustering-based approach.\n\narXiv preprint arXiv:1605.09016, 2016.\n\n[52] T. C. Silva and L. Zhao. Stochastic competitive learning in complex networks. IEEE Trans. Neural\n\nNetworks and Learning Systems, 23(3):385\u2013398, 2012.\n\n[53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.\n\narXiv preprint arXiv:1409.1556, 2014.\n\n[54] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In\n\nAdvances in Neural Information Processing Systems, pages 935\u2013943, 2013.\n\n[55] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.\n\nGoing deeper with convolutions. In CVPR, pages 1\u20139, 2015.\n\n[56] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset.\n\nTechnical Report CNS-TR-2011-001, California Institute of Technology, 2011.\n\n[57] H. Wang, X. Zhu, S. Gong, and T. Xiang. Person re-identi\ufb01cation in identity regression space. IJCV, pages\n\n1\u201323, 2018.\n\n[58] Q. Wang and K. Chen. Zero-shot visual recognition via bidirectional latent embedding. IJCV, 124(3):356\u2013\n\n383, 2017.\n\n[59] W. Wang, Y. Pu, V. K. Verma, K. Fan, Y. Zhang, C. Chen, P. Rai, and L. Carin. Zero-shot learning via\n\nclass-conditioned deep generative models. In AAAI, pages 4211\u20134218, 2018.\n\n[60] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In CVPR,\n\npages 5542\u20135551, 2018.\n\n[61] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning - the good, the bad and the ugly. In CVPR, pages\n\n4582\u20134591, 2017.\n\n[62] X. Xu, T. Hospedales, and S. Gong. Semantic embedding space for zero-shot action recognition. 
In IEEE International Conference on Image Processing (ICIP), pages 63–67, 2015.

[63] X. Xu, T. Hospedales, and S. Gong. Transductive zero-shot action recognition by word-vector embedding. IJCV, 123(3):309–333, 2017.

[64] M. Ye and Y. Guo. Zero-shot classification with discriminative semantic representation learning. In CVPR, pages 7140–7148, 2017.

[65] Y. Yu, Z. Ji, X. Li, J. Guo, Z. Zhang, H. Ling, and F. Wu. Transductive zero-shot learning with a self-training dictionary approach. arXiv preprint arXiv:1703.08893, 2017.

[66] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, pages 2021–2030, 2017.

[67] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, pages 4166–4174, 2015.

[68] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In CVPR, pages 6034–6042, 2016.

[69] Z. Zhang and V. Saligrama. Zero-shot recognition via structured prediction. In ECCV, pages 533–548, 2016.